post

NetApp 7-Mode MetroCluster Disaster Recovery – Part 1

Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR Test. After my previous outing which can be read in its full gory details on another blog post I was suitably apprehensive about performing the test once again. Following the last test I worked with NetApp Support to find a root cause of the DR failure. The final synopsis is that it was due to the Service Processor being online while the DR site was down which caused hardware support to kick in automatically. This meant that a takeover was already running when the ‘cf forcetakeover -d’ command was issues. If the Service Processor is online for even a fraction of a second longer than the controller is it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue and they performed multiple tests both with the Service Processor connected and disconnect. Only those tests that had the Service Processor disconnected were successful. However it wasn’t just the Service Processor. The DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You’ll be in a world of pain if you do.

I had intended on this being just one blog post but it escalated quickly and had to be broken out. The first post is around the overview of steps followed and the health check steps carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail around the giveback process and some things that were noted during the DR test. To access the other parts of the post quickly you can use the links below.

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3

Read More

post

SRM 5.1 Failover Test

Over the weekend I had to run a failover test for an application within SRM. As SRM can only replicate down to the datastore level and not the VM level this meant doing a full test failover of all VMs but ensuring beforehand that all protected VMs in the Protection Group were set to Isolated Network on the recovery site. This ensure that even though all VMs would be started in the recovery site they would not be accessible on the network and therefore not cause any conflicts. The main concern, outside of a VM not connecting to the isolated network, was that the VM being tested and the application that sits on it are running on Windows 2000. Yes, that’s not a typo the server is running Windows 2000. The application is from back around that period as well so if it drops and can’t be recovered then it’s a massive headache.

Failover Test:

 Step 1: Power down the production VM

SRM steps shutdown server

Step 2: Perform Test Recovery

Go to Recovery Plans -> Protection Groups and select Test

SRM Protection Group Test

When the prompt comes to begin the test verify the direction of the recovery, from the protected site to the recovery site. Enable the Replicate recent changes to recovery site. In most cases you will be already running synchronous writes between the sites and the data will just about be up to date anyway. It is recommended however to perform a recent change replication anyway to make sure that all data is up to date.

SRM Test Recover Plan

 

Click Next and then click Start to confirm the test recovery

SRM Test Recovery Plan Complete Read More