Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR test. After my previous outing, which can be read in all its gory detail in another blog post, I was suitably apprehensive about performing the test again. Following the last test I worked with NetApp Support to find the root cause of the DR failure. The final synopsis is that the Service Processor was still online while the DR site was down, which caused hardware support to kick in automatically. This meant that a takeover was already running when the ‘cf forcetakeover -d’ command was issued. If the Service Processor is online for even a fraction of a second longer than the controller, it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue, and they performed multiple tests both with the Service Processor connected and disconnected. Only the tests with the Service Processor disconnected were successful. However, it wasn’t just the Service Processor. The DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You’ll be in a world of pain if you do.
I had intended this to be a single blog post, but it escalated quickly and had to be broken out into three parts. This first post covers the overview of the steps followed and the health checks carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail on the giveback process and some things that were noted during the DR test. To access the other parts of the post quickly you can use the links below.