This is Part 2 of a three-part post about the MetroCluster failover procedure. This section covers the execution of the test itself: the pre-shutdown health checks, the controlled shutdown of Site A, and the forced takeover. The giveback process and a final review follow in Part 3. The other sections of this blog post can be found here:

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3

The planning and environment checks have taken place and now it's execution day. I'll go through how the test cases were followed during the testing itself. Please note that Site A (SiteA) is the site where the shutdown takes place, and Site B (SiteB) is the failover site for the purpose of this test.

Test Case 1 – Virtual Infrastructure Health Check

This is a health check of all the major components before beginning the physical shutdown.

1.1 – Log into Cisco UCS Manager on both sites using an admin account.

1.2 – Select the Servers tab and expand Servers -> Service Profiles -> root -> Sub-Organizations -> <SiteName>. A list of the blades installed in the relevant environment will appear here.

1.3 – Verify the Overall Status. All blades should appear with an OK status. Carry on to the next step.

1.4 – Log into vCenter using the desktop client or web client. Select the vCenter server name at the top of the tree, select Alarms in the right-hand pane and select Triggered Alarms. No alarms should appear.

1.5 – Verify all ESX hosts are online and not in maintenance mode

1.6 – Log into NetApp OnCommand System Manager. Select the SiteA controller and open the application.

1.7 – Expand SiteA/SiteB and expand both controllers, then select Storage -> Volumes. Verify that all volumes are online.

1.8 – Launch the Fabric MetroCluster Data Collector (FMC_DC) and verify that the configured node is OK. The pre-configured FMC_DC object should return green – this means that all links are healthy and takeover can be initiated.

[Screenshot: FMC_DC MetroCluster status]

1.9 – Log onto both storage controllers (SiteA & SiteB) via SSH and run the following commands to ensure there are no alerts/faults before continuing:

#sysconfig -a

#sysconfig -r

#aggr status -v

#vol status -f

#storage show fault

No errors or warnings should appear. If errors do appear, do not continue with the next test case until they are resolved or the vendor confirms the issue can be ignored.
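
If SSH key authentication is set up to both controllers, a small loop can collect all five outputs in one pass for review. This is a minimal sketch; the hostnames and the root account are placeholders for this environment:

#!/bin/sh
# Sketch: gather pre-failover health output from both controllers.
# Assumes SSH key auth; SiteA/SiteB stand in for the real hostnames.
for FILER in SiteA SiteB; do
  for CMD in 'sysconfig -a' 'sysconfig -r' 'aggr status -v' 'vol status -f' 'storage show fault'; do
    echo "### $FILER: $CMD"
    ssh "root@$FILER" "$CMD"
  done
done | tee prefailover_health.txt
# Review the capture (or grep it for 'failed'/'broken') before moving on.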

1.10 – On the storage controllers, run the command below to check that FSID changes are disabled. This stops the volumes from being re-ID'd on failover, which can cause numerous problems on giveback. This is also a recommended practice in the NetApp documentation:

#options cf.takeover

SiteB> options cf.takeover
cf.takeover.change_fsid off
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure on
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics (same value in local+partner recommended)
cf.takeover.on_panic on
cf.takeover.on_reboot on
cf.takeover.on_short_uptime on
cf.takeover.use_mcrc_file off (value might be overwritten in takeover)

Verify that cf.takeover.change_fsid is off. If it isn't, run the command below to turn the option off:

#options cf.takeover.change_fsid off
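
To check and, where needed, fix this on both controllers in one go, something like the following works (a sketch; hostnames and SSH key auth are assumed, as above):

# Sketch: enforce cf.takeover.change_fsid=off on both controllers.
for FILER in SiteA SiteB; do
  CUR=$(ssh "root@$FILER" 'options cf.takeover.change_fsid' | awk '{print $2}')
  echo "$FILER: cf.takeover.change_fsid = $CUR"
  if [ "$CUR" != "off" ]; then
    ssh "root@$FILER" 'options cf.takeover.change_fsid off'
  fi
done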

1.11 – Run the command below and confirm that both the local and partner node IP addresses for the Service Processor are listed:

#cf hw_assist status

1.12 – Run the following commands to verify that controller failover can take place and that the partner is up (a combined check over SSH is sketched after the commands):

#cf partner

#cf status
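
Run from both sides in one loop, this and the previous step look like the sketch below (same SSH assumptions as before). On each controller, cf status should report that the partner is up and takeover is enabled:

# Sketch: confirm HA/failover readiness from both controllers.
for FILER in SiteA SiteB; do
  echo "### $FILER"
  ssh "root@$FILER" 'cf hw_assist status'
  ssh "root@$FILER" 'cf partner'
  ssh "root@$FILER" 'cf status'
done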


Test Case 2 – ESXi host and UCS blade shutdown

This test case covers shutting down the ESXi hosts so that the VMs are handled by vSphere HA in vCenter and restarted on the SiteB ESXi hosts.

2.1 – Log into Cisco UCS Manager on SiteA using an admin account.

2.2 – Select the Equipment tab, expand Equipment -> Chassis -> Chassis 1 -> Servers. Right-click on each server and select Shutdown Server.

[Screenshot: UCS Manager – Shutdown Server]


Leave the defaults ticked and click OK.

[Screenshot: UCS blade graceful shutdown prompt]

Perform the above for Chassis 2 also.

[Screenshot: UCS blade server alerts]

2.3 – Log into vCenter using the desktop client or web client. Select Home -> Inventory -> Hosts & Clusters. Verify that all SiteA hosts are shut down.

2.4 – Select the Production datacenter and expand each of the clusters. Select each ESX host with SiteB in its name, select the Summary tab and verify the host's Memory and CPU resources. This is to check that, in the event of a DR, there are enough resources available on both sites to handle a failure of the other site. As per VMware best practice, a stretched cluster should have its HA admission control set to reserve 50% of resources, so that either site can carry the full workload on its own.

2.5 – Select each SiteB ESXi host, click on the Virtual Machines tab and verify that no VMs are offline.


Test Case 3 – NetApp Cabinet Shutdown

3.1 – Log onto both SiteA & SiteB via SSH

3.2 – Run the 'lun show' command and take note of the LUNs

3.3 – Run the 'vol status' command and take note of the volumes (a capture sketch follows below)
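
A convenient way to keep this record is to capture both outputs to files, which can then be diffed against the post-takeover output in step 3.14 (a sketch; hostnames and SSH key auth are assumed):

# Sketch: capture the pre-shutdown LUN and volume state for later comparison.
STAMP=$(date +%Y%m%d_%H%M%S)
for FILER in SiteA SiteB; do
  ssh "root@$FILER" 'lun show'   > "${FILER}_luns_${STAMP}.txt"
  ssh "root@$FILER" 'vol status' > "${FILER}_vols_${STAMP}.txt"
done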

3.4 – Physically pull the cable from the Service Processor on the storage controller in SiteA. Run the command below afterwards:

#cf hw_assist status

3.5 – Log onto both SiteA_MDS1 & SiteA_MDS2 via SSH

3.6 – Run 'show int brief' on both switches and verify the port for the Inter-Switch Link (ISL) is online

[Screenshot: show interface brief output]

3.7 – Physically pull the cable from the Inter-Switch Link interface on both switches

3.8 – Run 'show int brief' on both switches and verify the port for the Inter-Switch Link is offline (a scripted check is sketched below)

[Screenshot: ISL port offline / FC trunking]
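
Rather than reading the whole interface table by eye, the ISL port state can be pulled out directly. This is a sketch only; the switch names, admin account and the ISL port fc1/1 are placeholders for this environment:

# Sketch: report the ISL port state on both SiteA MDS switches.
for SWITCH in SiteA_MDS1 SiteA_MDS2; do
  echo "### $SWITCH"
  ssh "admin@$SWITCH" 'show interface brief' | grep 'fc1/1'
done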

At this point you will see a huge number of alerts about disks being missing/offline. There will also be an alert showing that the partner mailbox is offline and failover cannot take place.

[SiteB:raid.mirror.snapResExpanded:notice]: Aggregate Snapshot copy reserve in SyncMirror aggregate 'aggr0' is increased from 5% to 10% while the mirror is degraded or resyncing. Aggregate Snapshot copy reserve will be reverted to the old value when resync is complete.
 [SiteB:raid.mirror.snapResExpanded:notice]: Aggregate Snapshot copy reserve in SyncMirror aggregate 'aggr1' is increased from 5% to 23% while the mirror is degraded or resyncing. Aggregate Snapshot copy reserve will be reverted to the old value when resync is complete.
 [SiteB:ha.takeoverImpDegraded:warning]: Takeover of the partner node is impossible due to lack of connectivity to the partner mailbox disks.
 [SiteB:monitor.globalStatus.critical:CRITICAL]: Controller failover of SiteA is not possible: partner mailbox disks not accessible or invalid.
 [SiteB:callhome.disks.missing:warning]: Call home for MULTIPLE DISKS MISSING
 [SiteB:callhome.partner.down:CRITICAL]: Call home for PARTNER DOWN, TAKEOVER IMPOSSIBLE

3.9 – On the busbar (power management connection) for the NetApp rack cabinet in SiteA, switch off both buses. This shuts down the entire rack to simulate a site failure: all disk shelves, the storage controller, the MDS switches and the ATTO bridges will power off immediately.

3.10 – Go back to the SiteB console. Run the following commands:

#cf partner

#cf status

Verify the partner is down:

SiteB > cf partner
 SiteA
 SiteB > cf status
 SiteA may be down, takeover disabled because of reason (partner mailbox disks not accessible or invalid)
 SiteB has disabled takeover by SiteA (interconnect error)
 VIA Interconnect is down (link 0 down, link 1 down).
 The DR partner site might be dead.
 To take it over, power it down or isolate it as described in the Data Protection Guide, and then use cf forcetakeover -d.

3.11 – Run a continuous ping to SiteA. Once failover to SiteB has taken effect, this address will respond to pings again (a timestamped variant is sketched below).
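
To make the outage window measurable rather than eyeballed, timestamp each reply; the gap between consecutive timestamps in the log is the failover time recorded in step 3.13. A sketch, with SiteA standing in for the controller's management address:

# Sketch: timestamped ping log for measuring the failover window.
ping SiteA | while read LINE; do
  echo "$(date '+%H:%M:%S') $LINE"
done | tee sitea_ping.log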

3.12 – Run the command 'cf forcetakeover -d'. Watch the console output on the SiteB FAS6250 during takeover

cf forcetakeover -d

SiteB> cf forcetakeover -d
 Following the command, mirrored volumes will be split and
 Clients of the partner controller may experience data loss because of disaster.
 Prior to issuing this command, the partner controller should be powered off.
 If the partner controller is operational or if it becomes operational at any time
 while this controller is running in takeover mode, your filesystems may be destroyed.
 Do you wish to continue [y/n] ?? y
 cf: forcetakeover -d initiated by operator
 SiteB> [SiteB:cf.misc.operatorDisasterTakeover:notice]: Failover monitor: forcetakeover -d initiated by operator
 [SiteB:cf.fsm.takeover.disaster:info]: Failover monitor: takeover attempted after 'cf forcetakeover -d' command.
 [SiteB:cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
 [SiteB:cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
 [SiteB:fmmb.lock.disk.remove:info]: Disk ?.? removed from partner mailbox set.
 [SiteB:fmmb.current.lock.disk:info]: Disk SiteB_MDS2:DiskNumber is a partner HA mailbox disk.
 [SiteB:fmmb.current.lock.disk:info]: Disk SiteB_MDS1:DiskNumber is a partner HA mailbox disk.
 [SiteB:fmmb.instStat.change:info]: normal mailbox instance on partner side.
 [SiteB:cf.partner.nvram.state:info]: Partner mailbox was stale. Partner NVRAM might not be synchronized and some data may be lost.
 [SiteB:coredump.host.spare.none:info]: No sparecore disk was found for host 1.
 [SiteA:raid.vol.mirror.degraded:error]: Aggregate partner:aggr0 is mirrored and one plex has failed. It is no longer protected by mirroring.
 [SiteB:callhome.syncm.plex:CRITICAL]: Call home for SYNCMIRROR PLEX FAILED
 [SiteA:raid.vol.mirror.degraded:error]: Aggregate partner:aggr1 is mirrored and one plex has failed. It is no longer protected by mirroring.
 [SiteB:callhome.syncm.plex:CRITICAL]: Call home for SYNCMIRROR PLEX FAILED
 [SiteB:raid.vol.reparity.issue:warning]: Aggregate partner:aggr1_sas has invalid NVRAM contents.
 [SiteB:raid.vol.reparity.issue:warning]: Aggregate partner:aggr0 has invalid NVRAM contents.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'eb0560d3-2a9c-11e3-adc0-123478563412' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'eb0560d3-2a9c-11e3-adc0-123478563412' was built in 43 msec, after scanning 12 inodes and restarting 13 times with a final result of success.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr1' UUID '8e41d051-2b61-11e3-8457-123478563412' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
 [SiteB:raid.config.check.failedPlex:error]: Plex partner:/aggr1/plex0 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr1/plex0 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr0/plex2 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex partner:/aggr0/plex0 has failed.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr1_sas' UUID '8e41d051-2b61-11e3-8457-123478563412' was built in 524 msec, after scanning 195 inodes and restarting 24 times with a final result of success.
 [SiteA:wafl.takeover.nvram.warn:error]: WAFL takeover: Last few seconds of updates to partner may be lost
 [SiteB:wafl.replay.done:info]: WAFL log replay completed, 0 seconds
 [SiteA:raid.mirror.snapResExpanded:notice]: Aggregate Snapshot copy reserve in SyncMirror aggregate 'partner:aggr1' is increased from 5% to 30% while the mirror is degraded or resyncing. Aggregate Snapshot copy reserve will be reverted to the old value when resync is complete.
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME' has been set temporarily offline
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME' has been set temporarily offline
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME' has been set temporarily offline
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME' has been set temporarily offline
 ..........................................................
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME' has been set temporarily offline
 [SiteA:wafl.vvol.offline:info]: Volume 'VOLUME_vol' has been set temporarily offline
 [SiteA:export.update.fsid:info]: Updating the old FSID: 1249426690 with new FSID: 1249426690. Start time: 2329006904.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr1' UUID 'cc806a24-c46b-11e4-a27a-123478563412' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr1' UUID 'cc806a24-c46b-11e4-a27a-123478563412' was built in 198 msec, after scanning 195 inodes and restarting 18 times with a final result of success.
 [SiteA:raid.mirror.snapResReverted:notice]: Aggregate Snapshot copy reserve in SyncMirror aggregate 'partner:aggr1' was reverted from 30% back to 5%.
 [SiteA:wafl.vvol.offline:info]: Volume 'vol0' has been set temporarily offline
 [SiteA:export.update.fsid:info]: Updating the old FSID: 1341325062 with new FSID: 1341325062. Start time: 2329016861.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'd27a9508-c46b-11e4-a27a-123478563412' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
 [SiteA:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID 'd27a9508-c46b-11e4-a27a-123478563412' was built in 26 msec, after scanning 12 inodes and restarting 14 times with a final result of success.
 [SiteA:raid.fm.disasterSummary:info]: RAID disaster takeover summary: number of partner volumes and aggregates=2, number of rewrite-fsid partner volumes and aggregates=0, number of out-of-date partner volumes and aggregates=0, number of ignored partner volumes and aggregates=0, number of local volumes and aggregates=0, number of out-of-date local volumes and aggregates=0
 [SiteA:fcp.service.startup:info]: FCP service startup
 [SiteA:vdisk.onlineComplete:info]: Partner LUN(s) online completed.
 [SiteA:httpd.config.mime.missing:warning]: /etc/httpd.mimetypes.sample file is missing.
 [SiteA:httpd.config.mime.missing:warning]: /etc/httpd.mimetypes file is missing.
 [SiteA:httpd.config.mime.missing:warning]: /etc/httpd.mimetypes.sample file is missing.
 [SiteA/SiteB: proto_init03:info]: Vfiler discovery complete
 [SiteA:raid.rg.reparity.start:notice]: /aggr0/plex2/rg0: starting parity recomputation
 [SiteA:raid.rg.reparity.start:notice]: /aggr1/plex2/rg0: starting parity recomputation
 [SiteA:raid.rg.reparity.start:notice]: /aggr1/plex2/rg3: starting parity recomputation
 [SiteA:raid.rg.reparity.start:notice]: /aggr1/plex2/rg1: starting parity recomputation
 [SiteA:cifs.startup.partner.succeeded:info]: CIFS: CIFS partner server is running.
 [SiteB:cf.rsrc.transitTime:notice]: Top Takeover transit times raid_disaster=22139, raid=1980, wafl=1979 {paggrs_to_done=1243, prvol_to_done=401, part_vols_mnt_end=347, pvvols_to_done=335, prvol_mnt_end=14, verify_names=0, destroy_vvol=0}, wafl_replay=742 {replay_log=579, mark_replaying=163, init=0, catalog_init=0, replay_log_missing=0, nvfail=0, partner_log=0, enable_log=0}, registry_postrc_phase1=618, raid_replay=541, rc=453 {always_do_just_after_etc_rc=61, hostname=59, ifc
 [SiteB:callhome.sfo.takeover.m.dr:warning]: Call home for CONTROLLER TAKEOVER COMPLETE MANUAL(DR)
 [SiteB:callhome.reboot.takeover:error]: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)
 [SiteB:cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed
 [SiteB:cf.fm.takeoverDuration:info]: Failover monitor: takeover duration time is 30 seconds.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr1/plex0 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr0/plex2 has failed.
 [SiteA:nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.
 [SiteB:sis.changelog.full:warning]: SIS change logging metafile for volume partner:pkvoranfs02 is full.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr1/plex0 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr0/plex2 has failed.
 [SiteB:callhome.performance.snap:info]: Call home for PERFORMANCE SNAPSHOT
 [SiteB:raid.rg.reparity.done:notice]: /aggr0/plex2/rg0: parity recomputation completed in 2:13.55
 [SiteA:cifs.ldap.address.invalid:info]: Could not read the saved AD Lightweight Directory Access Protocol(LDAP) server information. The system will get the AD information by doing AD discovery.
 [SiteA:auth.ldap.trace.LDAPConnection.statusMsg:info]: AUTH: TraceLDAPServer- Starting AD LDAP server address discovery for DOMAIN
 [SiteA:auth.ldap.trace.LDAPConnection.statusMsg:info]: AUTH: TraceLDAPServer- Found 4 AD LDAP server addresses using DNS site query (domain).
 [SiteA:auth.ldap.trace.LDAPConnection.statusMsg:info]: AUTH: TraceLDAPServer- Found 21 AD LDAP server addresses using generic DNS query.
 [SiteA:auth.ldap.trace.LDAPConnection.statusMsg:info]: AUTH: TraceLDAPServer- AD LDAP server address discovery for DOMAIN complete. 21 unique addresses found.
 [SiteA:raid.rg.reparity.start:notice]: /aggr1/plex2/rg2: starting parity recomputation
 [SiteB:raid.rg.reparity.done:notice]: /aggr1/plex2/rg1: parity recomputation completed in 3:04.53
 [SiteB:raid.rg.reparity.done:notice]: /aggr1/plex2/rg0: parity recomputation completed in 3:12.08
 [SiteB:raid.rg.reparity.done:notice]: /aggr1/plex2/rg3: parity recomputation completed in 3:15.66
 [SiteB:raid.rg.reparity.done:notice]: /aggr1/plex2/rg2: parity recomputation completed in 1:47.84
 cf forcetakeover -d Sat Mar 7 13:00:00 EST [SiteB:kern.uptime.filer:info]: 1:00pm up 76 days, 16:12 7739705244 NFS ops, 579060671 CIFS ops, 0 HTTP ops, 122981079 FCP ops, 0 iSCSI ops
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr1_sas/plex0 has failed.
 [SiteB:raid.config.check.failedPlex:error]: Plex local:/aggr0/plex2 has failed.
 [SiteB:raid.vol.unprotected.remotesyncmirror:error]: Aggregate local:aggr1 is mirrored and one plex is not online. The volume will not be available if a takeover occurs and the online plex is not accessible to the partner node.
 [SiteB:raid.vol.unprotected.remotesyncmirror:error]: Aggregate local:aggr0 is mirrored and one plex is not online. The volume will not be available if a takeover occurs and the online plex is not accessible to the partner node.
 [SiteB:cf.ic.hourlyRVnag:error]: Cluster Interconnect sessions with partner have been DOWN for 24 minute(s)
 [SiteB:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #0 has been down for 24 minutes
 [SiteB:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #1 has been down for 24 minutes

3.13 – Record the time taken to fail over.

3.14 – Once takeover has completed, run the following commands and verify that the volumes are online and that there are no faults on the disks, controller, etc. (a quick scripted check follows the list):

#sysconfig -a

#sysconfig -r

#aggr status -v

#vol status -f

#storage show fault

#disk show -p
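
A quick first pass on the surviving controller is to filter for anything that is not online (a sketch; the header lines will also show up in the filtered output, and it still pays to read the full command output afterwards):

# Sketch: flag volumes/aggregates that are not online after takeover.
ssh root@SiteB 'vol status'  | grep -v online
ssh root@SiteB 'aggr status' | grep -v online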

3.15 – Log into vCenter via the desktop client or web client and verify the VMs are online within the virtual datacenter. Check the status of the SiteA-related NFS volumes.


Next we’ll look at the giveback procedure in NetApp 7-Mode MetroCluster Disaster Recovery – Part 3

