NetApp 7-Mode MetroCluster Disaster Recovery – Part 3

This is the last part of the 3-part post about the MetroCluster failover procedure. This section covers the giveback process and a note about the final review. The other sections of this blog post can be found here:

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2

Test Case 4 – Virtual Infrastructure Health Check

This test case covers a virtual infrastructure system check, not only to get an insight into the current status of the system but also to compare against the outcomes from Test Case 1.

4.1 – Log into vCenter using the desktop client or web client. Expand the virtual data center and verify that all SiteB ESXi hosts are online.
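
As a quick cross-check outside the client, the same host status can be pulled with a short script. Below is a minimal pyVmomi sketch; the vCenter address and credentials are placeholders, and the unverified SSL context is only appropriate for a lab exercise like this.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholders - substitute your own vCenter address and credentials
    ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            # Expect 'connected' and not in maintenance mode for every SiteB host
            print(host.name, host.runtime.connectionState, host.runtime.inMaintenanceMode)
    finally:
        Disconnect(si)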

4.2 – Log onto NetApp OnCommand System Manager. Select the primary storage controller and open the application.

4.3 – Expand SiteA/SiteB, expand the primary storage controller and select Storage, then Volumes. All volumes should appear with an online status.
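
The same check can also be run from the 7-Mode CLI with 'vol status', which should report every volume as online. Here is a rough paramiko sketch for scripting it; the controller hostname and credentials are placeholders.

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Placeholder controller address and credentials
    client.connect("sitea-controller.example.local", username="root", password="password")
    stdin, stdout, stderr = client.exec_command("vol status")  # 7-Mode volume status
    for line in stdout.read().decode().splitlines():
        print(line)  # each volume should show 'online'
    client.close()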

4.4 – Log into SolarWinds. Check the events from the last two hours and take note of any devices in the Node List that are currently red.
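
If you prefer to pull this from the SolarWinds Information Service rather than the web console, something like the sketch below works with the orionsdk package. The server name, credentials and the Status = 2 (down) code are assumptions to confirm against your own Orion installation.

    import requests
    from orionsdk import SwisClient

    requests.packages.urllib3.disable_warnings()  # lab only: self-signed SWIS certificate
    # Placeholder SolarWinds server and credentials
    swis = SwisClient("solarwinds.example.local", "admin", "password")
    down = swis.query("SELECT Caption, Status FROM Orion.Nodes WHERE Status = 2")
    for node in down["results"]:
        print(node["Caption"], "is down")  # these are the 'red' nodes in the Node List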

NetApp 7-Mode MetroCluster Disaster Recovery – Part 2

This is part 2 of the 3-part post about the MetroCluster failover procedure. This section covers the physical kit shutdown and the failover process. The other sections of this blog post can be found here:

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3

The planning and environment checks have taken place and now it's execution day. I'll go through here how the test cases were followed during the testing itself. Please note that Site A (SiteA) is the site where the shutdown takes place and Site B (SiteB) is the failover site for the purpose of this test.

Test Case 1 – Virtual Infrastructure Health Check

This is a health check of all the major components before beginning the physical shutdown.

1.1 – Log into Cisco UCS Manager on both sites using an admin account.

1.2 – Select the Servers tab and expand Servers -> Service Profiles -> root -> Sub-Organizations -> <SiteName>. The list of blades installed in the relevant environment will appear here.

1.3 – Verify the Overall Status. All blades should appear with an OK status. Carry on to the next step.
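
The Overall Status check can also be scripted against UCS Manager with the Cisco ucsmsdk Python module. This is only a sketch; the UCSM address and credentials are placeholders, and reading oper_state on the service profiles is my assumption for what maps to the Overall Status column in the GUI.

    from ucsmsdk.ucshandle import UcsHandle

    # Placeholder UCS Manager address and credentials
    handle = UcsHandle("ucsm.example.local", "admin", "password")
    handle.login()
    try:
        for sp in handle.query_classid("LsServer"):  # service profiles
            print(sp.dn, sp.oper_state)  # expect 'ok' for every blade
    finally:
        handle.logout()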

1.4 – Log into vCenter using the desktop client or web client. Select the vCenter server name at the top of the tree, select Alarms in the right-hand pane and select Triggered Alarms. No alarms should appear.
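
For a scripted version of the triggered-alarms check, pyVmomi exposes the alarm state on the vCenter root folder. Again, the vCenter address and credentials below are placeholders and the unverified SSL context is lab-only.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect

    ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)
    try:
        alarms = si.RetrieveContent().rootFolder.triggeredAlarmState
        if not alarms:
            print("No triggered alarms")
        for state in alarms:
            print(state.entity.name, state.alarm.info.name, state.overallStatus)
    finally:
        Disconnect(si)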

1.5 – Verify that all ESXi hosts are online and not in maintenance mode.

1.7 – Log onto NetApp OnCommand System Manager. Select the SiteA controller and open the application.

1.8 – Expand SiteA/SiteB, expand both controllers and select Storage, then Volumes. Verify that all volumes are online.

1.9 – Launch Fabric MetroCluster Data Collector (FMC_DC) and verify that the configured node is OK. The pre-configured FMC_DC object should return green, which means that all links are healthy and a takeover can be initiated.

NetApp 7-Mode MetroCluster Disaster Recovery – Part 1

Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR test. After my previous outing, which can be read in its full gory detail in another blog post, I was suitably apprehensive about performing the test once again. Following the last test I worked with NetApp Support to find the root cause of the DR failure. The final synopsis is that it was due to the Service Processor being online while the DR site was down, which caused hardware-assisted takeover support to kick in automatically. This meant that a takeover was already running when the 'cf forcetakeover -d' command was issued. If the Service Processor is online for even a fraction of a second longer than the controller, it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue; they performed multiple tests both with the Service Processor connected and disconnected, and only the tests with the Service Processor disconnected were successful. However, it wasn't just the Service Processor: the DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You'll be in a world of pain if you do.

I had intended for this to be just one blog post, but it escalated quickly and had to be broken up. The first post covers the overview of the steps followed and the health checks carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail about the giveback process and some things that were noted during the DR test. To access the other parts of the post quickly you can use the links below.

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3
