This is the last post in the three-part series on the MetroCluster failover procedure. This part covers the giveback process and closes with a final review. The other parts of this blog post can be found here:

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2

Test Case 4 – Virtual Infrastructure Health Check

This test case covers a virtual infrastructure health check, not only to gain insight into the current status of the system but also to compare against the outcomes from test case 1.

4.1 – Log into vCenter using the desktop client or web client. Expand the virtual data center and verify that all SiteB ESXi hosts are online

4.2 – Log onto NetApp OnCommand System Manager. Select the primary storage controller and open the application

4.3 – Expand SiteA/SiteB, expand the primary storage controller, and select Storage and then Volumes. All volumes should appear with an online status
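
If you prefer the command line, the same check can be done from an SSH session on the controller: 'vol status' lists every volume and its state, and anything not showing online needs investigating. The volume names and layout below are purely illustrative:

#vol status
         Volume State           Status            Options
           vol0 online          raid_dp, flex     root
      VOLUME_01 online          raid_dp, flex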

4.4 – Log into SolarWinds. Check the events from the last two hours and take note of any devices in the Node List which are currently red

Test Case 5 – NetApp Controller GiveBack Procedure

The focus of this test case is the giveback procedure. This is the most critical part of the entire DR test: the plexes on both sites need to resynchronize before the giveback command is finally executed. If anything goes wrong, this is the point where it can bring your SiteA controller to its knees. Even if you do manage to cripple a controller (as I did in the past), you can rest assured that your data is still alive and well at SiteB and will remain there until you get SiteA back online, which will most likely require the assistance of NetApp Support.

5.1 – Contact NetApp support on the support case opened earlier and advise them of the current status of the environment. Follow their steps to return the system to its original operating state; this is NetApp's recommended practice. The steps below should be carried out under NetApp Support supervision. You can perform the giveback yourself, but I would recommend getting another set of eyes on things first.
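
Before going any further it is also worth confirming from the SiteB console that the system is still in takeover. On a 7-Mode pair 'cf status' reports this directly; the output below is roughly what to expect:

SiteB(takeover)> cf status
SiteB has taken over SiteA.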

5.2 – Before commencing with the giveback, ensure that the following devices are powered off:

  • Storage controller
  • Disk Shelves
  • MDS Switches
  • ATTO Bridges

These need to be powered off correctly before the busbar for the cabinet is powered back on

5.3 – Turn on Busbar power for the NetApp cabinet in SiteA data center

5.4 – Power on the ATTO bridges and then power on the storage shelves. Take note of any disks that fail to come back online and open a call with NetApp support to get them replaced

5.5 – Plug the Inter-Switch Link (ISL) cable back into its interface on the MDS switches

5.6 – Power on MDS Switches

5.7 – Log onto both SiteA_MDS1 and SiteA_MDS2 via SSH

5.8 – Run 'show int brief' on both switches and confirm that the Inter-Switch Link connection is active.
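
The ISL ports should come back up in trunking (TE) mode. The output of 'show interface brief' can be narrowed down with a pipe filter; the interface number and values below are just placeholders for whichever ports carry the ISL in your fabric:

SiteA_MDS1# show interface brief | include trunking
fc1/14     10    E       on      trunking         swl    TE      8    --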

5.9 – Power on the storage controller

Note: This was recommended by NetApp support at the time. Normally this would not happen until after step 5.13
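
For reference, powering the controller on while its partner is still in takeover should not cause it to boot all the way up; its console normally just sits waiting for the giveback, with a message along these lines:

Waiting for giveback...(Press Ctrl-C to abort wait)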

5.10 – Log onto SiteA controller via SSH

5.11 – Recreate the mirrors between the sites

#aggr mirror aggr0 -v aggr0(1)
#aggr mirror aggr1 -v aggr1(1)
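
The '(1)' suffix is how Data ONTAP names the split-off copy of an aggregate once the SiteA disks reappear, so it is worth running 'aggr status' first to confirm the exact names before mirroring. The listing should show both the live aggregate and its stale counterpart; the names and states below are illustrative only:

#aggr status
           Aggr State           Status
          aggr0 online          raid_dp, aggr, mirror degraded
       aggr0(1) failed          raid_dp, aggr, out-of-date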

5.12 – Verify that the plexes are synchronizing now that the disk shelves in SiteA are online.

Run 'aggr status -v'

Plex /aggr1/plex4 (online, normal, resyncing 99% completed, pool0)
 RAID group /aggr1/plex4/rg0 (normal, block checksums)

Run the command ‘aggr status -r mir’ to get a view of how the mirror synchronization is doing.

Also run the following commands to verify that the system is healthy and that no broken or failed disks are being reported:

#sysconfig -a

#sysconfig -r

#vol status -f

5.13 – Once resynchronization between the sites has completed (expected to take approximately one hour), plug the Service Processor cable back in.

5.14 – Go back to the SSH console on SiteB and run the command 'cf giveback'

Initially I got an error saying that I couldn't perform a giveback because CIFS users had files open. To resolve this I had to connect to SiteA via SSH, run the 'cifs terminate' command and then re-run the giveback command.
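
If you hit the same issue, 'cifs sessions' shows who is still connected, and 'cifs terminate' accepts a timeout in minutes so connected users get a warning before their sessions are dropped. The five-minute value here is just an example:

SiteA> cifs sessions
SiteA> cifs terminate -t 5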

SiteB(takeover)> cf giveback
 please make sure you have rejoined your aggregates before giveback.
 Do you wish to continue [y/n] ?? y
 SiteB(takeover)> [SiteB:cf.misc.operatorGiveback:info]: Failover monitor: giveback initiated by operator
 [SiteB:cf.fm.givebackStarted:notice]: Failover monitor: giveback started.
 [SiteA:fcp:sis.op.stopped:error]: SIS operation for /vol/VOLUME has stopped
 [SiteA:fcp:sis.cfg.setFailed:error]: Saving SIS volume configuration for volume /vol/VOLUME: Volume is offline
 [SiteA:fcp.service.shutdown:info]: FCP service shutdown
 Sat Mar 7 16:56:54 EST [SiteB:cf.rsrc.transitTime:notice]: Top Giveback transit times wafl=6826 {finish=4624, sync_clean=1919, forget=280, vol_refs=3, mark_abort=0, wait_offline=0, wait_create=0, abort_scans=0, drain_msgs=0, zombie_wait=0}, wafl_gb_sync=2044, snmp_giveback=1092, raid=255, registry_giveback=70, sanown_replay=35, snapmirror=31, nfsd=19, nlm=18, lock manager=17
 [SiteB:cf.fm.givebackComplete:notice]: Failover monitor: giveback completed
 [SiteB:cf.fm.givebackDuration:notice]: Failover monitor: giveback duration time is 11 seconds.
 [SiteB:cf.fsm.stateTransit:info]: Failover monitor: TAKEOVER --> UP
 [SiteB:cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of SiteA disabled (partner mailbox disks not accessible or invalid).
 [SiteB:cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of SiteB by SiteA  disabled (unsynchronized log).
 [SiteB:callhome.sfo.giveback:info]: Call home for CONTROLLER GIVEBACK COMPLETE
 [SiteB:cf.fsm.backupMailboxOk:notice]: Failover monitor: backup mailbox OK
 [SiteB:cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of SiteA disabled (waiting for partner to recover).
 [SiteB:rv.connection.torndown:info]: cfo_rv2 is torn down on NIC 1
 [SiteB:scsitarget.vtic.down:notice]: The VTIC is down.
 [SiteB:rv.connection.torndown:info]: cfo_rv is torn down on NIC 0
 [SiteB:rv.connection.established:info]: cfo_rv is connected on NIC 0
 [SiteB:scsitarget.vtic.up:notice]: The VTIC is up.
 [SiteB:cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding
 [SiteB:scsitarget.vtic.down:notice]: The VTIC is down.
 [SiteB:rv.connection.torndown:info]: cfo_rv is torn down on NIC 0
 [SiteB:rv.connection.established:info]: cfo_rv is connected on NIC 0
 [SiteB:scsitarget.vtic.up:notice]: The VTIC is up.
 [SiteB:cf.fsm.partnerOk:notice]: Failover monitor: partner ok
 [SiteB:cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of SiteA disabled (partner booting).
 [SiteB:cf.ic.xferTimedOut:error]: WAFL interconnect transfer timed out
 [SiteB:scsitarget.vtic.down:notice]: The VTIC is down.
 [SiteB:rv.connection.torndown:info]: cfo_rv is torn down on NIC 0
 [SiteB:rv.connection.established:info]: cfo_rv is connected on NIC 0
 [SiteB:scsitarget.vtic.up:notice]: The VTIC is up.
 [SiteB:cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of SiteA disabled (unsynchronized log).
 [SiteB:rv.connection.established:info]: cfo_rv2 is connected on NIC 1
 [SiteB:cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of SiteA enabled
 [SiteB:cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of SiteB by SiteA enabled
 [SiteB:cf.hwassist.recvKeepAlive:info]: hw_assist: Received hw_assist KeepAlive alert from partner(SiteA ).
 [SiteB:monitor.globalStatus.ok:info]: The system's global status is normal.

SiteB>

Watch the console output on the SiteB FAS6250 during the giveback and record the time taken to complete.

5.15 – Run these commands to confirm that the giveback has completed cleanly and no failures have been detected

#sysconfig -a

#sysconfig -r

#vol status -f
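
It is also worth confirming that the pair itself is healthy again. Once the giveback has completed and the logs have resynchronized, 'cf status' on SiteB should report the partner as up (roughly as below), and 'aggr status -v' should show both plexes of each aggregate online with the aggregates reported as mirrored:

SiteB> cf status
Cluster enabled, SiteA is up.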

5.16 – Log into Cisco UCS Manager on both sites using the admin account.

5.17 – Select the Equipment tab and expand Equipment -> Chassis -> Chassis 1 -> Servers. Right-click on each server and select Power On Server. Do the same for Chassis 2

UCS Blade warnings during failover

5.18 – Log into vCenter using the desktop client or web client. Select Home -> Inventory -> Hosts & Clusters. Verify that all SiteA hosts are online

5.19 – Manually migrate the VMs back across to their original location

And with that, you have now successfully run a simulated DR test for MetroCluster.

There was one more step carried out after step 5 which I won't cover here; it was just a follow-up infrastructure health check. Another aspect I haven't covered is the application testing carried out during the failover window. These tests were to ensure that the critical applications were still online and working as expected while running from a single controller. As it was a very specific application test relating to Documentum, I have decided not to detail it here. During the failover window I would recommend running as many application tests as possible.

FINAL NOTE:

This DR test only covered the failover from one site to the other, not vice-versa. A decision was made to test just one-way failover as the opposite site had been failed over as part of a previous DR test. While that test had a number of issues, it still succeeded in meeting its RPO and RTO and was used as the foundation for the test covered in the steps above. The only way to really perform a DR test for MetroCluster is to put your system into a proper failure scenario; failing it component by component will just break your MetroCluster. It's designed to handle component failures, but like any equipment it has a limit. MetroCluster by its very nature is a disaster-avoidance infrastructure: as components fail or are taken offline, the smarts in the controllers handle this with grace. It's so good at handling component failures that it actually needs to have the power pulled in one site in order to run a DR test. I've always wanted to pull the power on a storage unit but was always worried about bringing it back online afterwards. I was still worried during this test, but it all proved to be okay in the end.

The actual failover, excluding the initial health checks, took just 34 minutes from when the 'cf forcetakeover -d' command was run. During this process the IP address for the shut-down controller came online and all the volumes in the remote-site plexes were converted into read/write volumes. The failback took a total of 57 minutes, made up of 45 minutes to resync the aggregate mirrors and just 12 minutes to complete the 'cf giveback' process. There was no data loss during this period and the effective RPO was so small it was immeasurable. Considering I was working against an RPO of 24 hours and an RTO of 2 days (which may seem generous, but not when the whole infrastructure is taken into account), the outcomes from the test really showed the management team that the money spent on highly available stretched infrastructure was worth it.

The last thing to remind you of before wrapping up is to make sure that the SP is disconnected from the site you are going to shut down, otherwise you will see some unexpected issues. In the event of a real DR the hardware assist would kick in and the failover process wouldn't be as manually intensive. Also, if we're honest, you'll be on the phone to NetApp support getting assistance on what steps to perform next anyway. I don't know how useful this procedure will be to anyone else, but when I was looking for information on how to perform a MetroCluster failover there was a serious scarcity of it. Hopefully somebody else performing a similar test will find it useful and it will help them avoid the mistakes I made in my first DR test. NetApp MetroCluster is a really cool piece of technology and if you can get your hands on it then I'd recommend it, if for no other reason than to know that your data is safe.

 

4 thoughts on “NetApp 7-Mode MetroCluster Disaster Recovery – Part 3”

  1. Hi,

    Your DR test is well written but you're missing a step! When you power on the disaster site, the resync is only for the aggregates of the surviving site. In order to resync the aggregates of the disaster site, you must mirror them again with the command “fasSiteB/FasSiteA>aggr mirror aggr0 -v aggr0(1)” (for example, where aggr0 is the aggregate name), and only after the resync is complete for the aggregates of both sites can you power on the node and initiate the giveback.

    • Hi Xam,
      Thanks for pointing that out. I just checked back on my documentation and I did have the mirror commands there, so I've updated the steps in the post to show that. Regarding powering on the storage controller: I agree that it should happen after the mirrors have resynchronized, but during my DR test I had NetApp support on the line and they advised powering on the controller at the point I've specified in the steps. The documentation from NetApp does say to do it after the resync has completed, and I've added a note to the article about this. Thanks for reading the article and providing feedback.
