Failover/Failure Scenarios for MetroCluster
I’m not going to re-invent the wheel here. These failure scenarios are all pretty self-explanatory and can be found in TR-3788.pdf. There are far more scenarios in that document, but here I’ll cover some of the most common types.
Scenario: Loss of power to disk shelf
Expected behaviour: The relevant disks go offline and the plex is broken. There’s no disruption to data availability to hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi Server. When the shelf is powered back on, the plexes resync automatically (a quick way to check on this from the controller is shown below).
Impact on data availability: None
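If you want to keep an eye on that resync from the storage side, the plex state can be checked from the controller console. This is only a rough sketch, assuming a 7-Mode fabric MetroCluster as covered in TR-3788, with aggr0 used as a placeholder aggregate name:

    ctrl1> aggr status aggr0 -v     (the aggregate typically shows something like "mirror degraded" or "resyncing" while one plex is down)
    ctrl1> aggr status -r aggr0     (lists both plexes and their RAID groups so you can see which plex is affected)

Once the resync completes the aggregate should report as mirrored again.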
Scenario: Loss of one link in one disk loop
Expected behaviour: A notification appears on the controller advising that the disks are only accessible via one switch. There’s no disruption to data availability to hosts running HA or FT, and no change is detected by the ESXi Server. When the connection is restored, an alert on the controller advises that connectivity across both switches is back (see the path check below).
Impact on data availability: None
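To confirm how the disks are being seen while the loop is degraded, the path information is available from the console. A rough example on a 7-Mode controller, with the usual caveat that the exact output varies by Data ONTAP version:

    ctrl1> storage show disk -p     (lists the primary and secondary paths for each disk; single-pathed disks stand out here)
    ctrl1> cf status                (confirms the pair itself is still healthy while the loop is down)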
Scenario: Failure and Failback of Storage Controller
Expected behaviour on failover: There’s no disruption to data availability to hosts running HA or FT and no interruption to VMs running on the ESXi Server. The partner node reports an outage. There’s a momentary pause in disk activity while the datastore connectivity (iSCSI, NFS, FC) is refreshed as the connection moves via the other controller. After the takeover has completed, normal activity resumes.
Expected behaviour on failback: There’s no disruption to data availability to hosts running HA or FT and no interruption to VMs running on the ESXi Server. There’s a momentary pause in disk activity while the datastore connectivity (iSCSI, NFS, FC) is refreshed as the connection is enabled on the original controller again. After the giveback has completed, normal activity resumes (the basic commands involved are sketched below).
Impact on data availability: None
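For reference, the takeover and giveback on a 7-Mode pair are driven with the cf commands below. On a real controller failure the takeover happens automatically; cf takeover is how you drive it for a planned test. This is just a minimal sketch with placeholder node names, not the full procedure:

    ctrl1> cf status                (check the partner is up and takeover is possible)
    ctrl1> cf takeover              (ctrl1 takes over ctrl2's storage and identity)
    ... wait for ctrl2 to reach the "waiting for giveback" state ...
    ctrl1> cf giveback              (hand ctrl2's resources back once it is healthy again)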
Scenario: Mirrored storage network isolation (Cluster Interconnects down)
Expected behaviour: There’s no disruption to data availability to hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi Server. A “VIA interconnect is down” alert appears on the controllers (the interconnect state can also be checked manually, as shown below).
Impact on data availability: None
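The interconnect state can be checked from either controller. Treat this as a rough pointer rather than the exact output for your release; on 7-Mode systems something along these lines reports the VIA/FC-VI interconnect status alongside the partner status:

    ctrl1> cf status                (reports the partner state and flags the interconnect as down while this condition persists)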
Scenario: Total ESXi host failure on one site
Expected behaviour: FT VMs fail over automatically to their secondary VMs on hosts at the remaining site, and HA restarts the affected VMs on hosts at the secondary site. When the ESXi hosts come back online the VMs can be migrated back manually, or this happens automatically depending on the DRS group rules.
Impact on data availability: None
Scenario: Total Network Isolation on ESXi hosts and loss of a hard drive
Expected behaviour: The relevant disks go offline and the plex is broken. FT VMs fail over automatically to their secondary VMs on hosts at the remaining site, and HA restarts the affected VMs on hosts at the secondary site. When the ESXi hosts come back online the VMs can be migrated back manually, or this happens automatically depending on the DRS group rules. Once the storage shelves are replaced, the plexes resync automatically.
Impact on data availability: None
Scenario: Loss of one Fabric Interconnect switch
Expected behaviour: The controller displays a message that some disks are connected via one switch and that the cluster interconnects are down. There’s no change to the ESXi servers or VMs. When the switch comes back online the controllers display a message that the fabric interconnects are back online.
Impact on data availability: None
Scenario: Failure of entire Data Center
Expected behaviour: Chaos!!! Not really. If you’re looking at DR testing, or actually need to perform a failover, check out another blog series I did on MetroCluster failover. If the failure is in Site 1, all ESXi hosts there will show as offline or not responding. VMware HA will kick in and restart all of the affected VMs at the other site. Alerts will appear on the controller in Site 2 that the Site 1 controller and fabric interconnects are offline and that paths to remote storage are down. The plexes are broken and the mirrored plex for Site 1 becomes writeable. There is a pause on disk access while the datastore links are refreshed. Once the failover is performed it takes some time for the plexes to sync and so on, but once that completes the entire environment will be running from the ESXi servers in Site 2. All Site 1 ESXi servers will still appear offline.
Once the issues in Site 1 have been resolved, the interconnects are back online and the remote storage can be reached, the plex in Site 2 will automatically resync with its mirrored plex from Site 1. The ESXi hosts will appear back online and the VMs will migrate back automatically if DRS rules are in place for that; otherwise they can be migrated back manually. The mirrored plex for Site 1, now running from and owned by Site 2, then needs to be resynced back to the primary plex in Site 1, and this is a manual command. Finally, a giveback command needs to be run to make the plex in Site 1 the primary again and re-enable the mirror (a rough sketch of these commands follows below). N.B. This scenario can cause the plex numbers to change on resync.
Impact on data availability: None
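To put some rough commands around that recovery, here is a minimal sketch for a 7-Mode fabric MetroCluster. The aggregate and node names are placeholders, and the exact procedure for your release should come from TR-3788 or the NetApp documentation rather than this outline:

    site2> cf forcetakeover -d           (declares the site disaster; Site 1's mirrored plexes become writeable at Site 2)
    ... Site 1 repaired, interconnects and remote storage reachable again ...
    site2> aggr status -r                (identify the split, out-of-date copies of the Site 1 aggregates)
    site2> aggr mirror aggr0 -v aggr0(1) (the manual resync mentioned above; rejoin the split aggregate to its mirror)
    site2> cf giveback                   (hand ownership back so Site 1 runs from its own plexes again)

The aggr mirror step is where the plex renumbering mentioned above comes from, so don’t be surprised if the plex names look different afterwards.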
There are also a number of scenarios for rolling failures, but I’m not going to go into those here. MetroCluster is designed to handle all types of failures, so it’s no surprise that if it can handle the above scenarios it can also take care of rolling failures.