
NetApp 7-Mode MetroCluster Disaster Recovery – Part 1

Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR test. After my previous outing, which can be read in all its gory detail in another blog post, I was suitably apprehensive about performing the test again. Following the last test I worked with NetApp Support to find the root cause of the DR failure. The final synopsis is that the Service Processor was still online while the DR site was down, which caused hardware-assisted takeover to kick in automatically. This meant that a takeover was already running when the ‘cf forcetakeover -d’ command was issued. If the Service Processor stays online for even a fraction of a second longer than its controller, it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue; they performed multiple tests, both with the Service Processor connected and disconnected, and only the tests with the Service Processor disconnected were successful. However, it wasn’t just the Service Processor. The DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You’ll be in a world of pain if you do.
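The lesson above boils down to a short pre-check sequence. As a sketch only (assuming console access to the surviving controller; your SP setup and prompts may differ), the 7-Mode commands I would now run before forcing a takeover look like this:

```
fas1> cf status
# Confirm the cluster is healthy and no takeover is already in progress.

fas1> sp status
# Check the Service Processor. The partner's SP must be fully powered off
# or physically disconnected before the site is downed -- if it outlives
# its controller by even a moment it can trigger its own takeover and
# race the manual one.

fas1> cf forcetakeover -d
# Only issue this once the DR site's controllers AND Service Processors
# are confirmed down.
```

The key design point is ordering: the SP is an out-of-band device with its own power and network, so “the site is down” does not automatically mean the SP is down.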

I had intended this to be just one blog post, but it quickly escalated and had to be broken up. This first post covers an overview of the steps followed and the health checks carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail on the giveback process and some things that were noted during the DR test. To access the other parts quickly you can use the links below.

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3



NetApp Metrocluster DR test – tears, fears and joys

Edit: I have since completed a successful MetroCluster failover test and it is documented in full in the following blog posts:

  1. NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
  2. NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
  3. NetApp 7-Mode MetroCluster Disaster Recovery – Part 3


Just before the Xmas break I had to perform a MetroCluster DR test. I don’t know why all DR tests need to be done just before a holiday period; it always seems to happen that way. Actually, I do know, but it doesn’t make the run-up to the holidays any more comfortable for an IT engineer. Before I began the DR test I had a fairly OK knowledge of how MetroCluster worked; afterwards it was definitely vastly improved. If you want to learn how MetroCluster works and how it can be fixed, I’d recommend breaking your environment and working with NetApp Support to fix it again. Make sure to put aside quite a bit of time so that you can get everything working again and your learning experience will be complete. (You may have problems convincing your boss to let you break your environment, though.) I hadn’t worked with MetroCluster before, so while I understood how it worked and what it could do, I really didn’t understand the ins and outs and how it differs from a normal 7-Mode HA cluster. The short version is that it’s not all that different, but it is far more complex when it comes to takeovers and givebacks, and just a bit more sensitive too. During my DR test I went from a test to an actual DR and infrastructure fix. Despite the problems we faced, the data management and availability were rock solid and we suffered absolutely no data loss.

I won’t go deeply into what MetroCluster is and how it works here (I may cover that in a separate blog post), but the key thing to be aware of is that each mirrored aggregate is made up of two plexes, and SyncMirror ensures that every write to the primary plex is synchronously written to the secondary plex, so that all data exists in two places. SyncMirror differs from SnapMirror in that it synchronises entire aggregates, whereas SnapMirror operates at the volume level. MetroCluster itself is classed as a disaster-avoidance system and satisfies this by keeping multiple copies of synchronised data on different sites. The MetroCluster in our environment is part of a larger FlexPod environment which includes fully redundant Cisco Nexus switches, Cisco UCS chassis and blades, and a back-end dark-fibre network between sites. A 10,000-foot view of the environment looks something like the diagram below, and I think you can agree that there are a lot of moving parts here.

[Diagram: MetroCluster infrastructure overview]
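To make the plex layout above a little more concrete, here is roughly what a mirrored aggregate looks like in 7-Mode via `aggr status -r` (the hostname, aggregate name and output are paraphrased and abridged, not taken from a real system):

```
fas1> aggr status aggr0 -r
Aggregate aggr0 (online, raid_dp, mirrored)
  Plex /aggr0/plex0 (online, normal, active, pool0)
    RAID group /aggr0/plex0/rg0 (normal)
      ...
  Plex /aggr0/plex1 (online, normal, active, pool1)
    RAID group /aggr0/plex1/rg0 (normal)
      ...
```

Each plex is a complete copy of the aggregate built from a separate disk pool (typically one pool per site), which is what lets the surviving site carry on with its local plex when the other site is lost.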

So what exactly happened during the DR test, and why did I lose a storage controller? Well, I can put my hand up for my part in the failure; the other part comes down to NetApp documentation, which is not clear as to what operations need to be performed and in what sequence. As I’ll discuss later, there’s quite a need for them to update their documentation on MetroCluster testing. The main purpose of this post is not to criticise MetroCluster but to highlight the mistakes I made, how they contributed to the failure that occurred, and what I would do differently in the future. It’s more of a warning for others not to make the same mistakes I did. Truth be told, if the failure had occurred on something that wasn’t a MetroCluster I might have been in real trouble and, worst of all, lost some production data. Thankfully MetroCluster jumped to my rescue on this front.

Podcast review – NetApp Communities Podcast

Over the past couple of months I’ve been getting more and more into tech podcasts. I had listened to one or two from time to time but never really kept up with any of them. Since I’ve been taking public transport more often (as well as getting tired of listening to the same music on Spotify), I’ve really enjoyed passing the time listening to some tuned-in tech-heads going into what’s new on the market, deep-dive technical discussions and general chit-chat about the state of IT and what lies ahead. Some of the podcasts have been insightful and extremely educational. I’ve gone through a few different podcasts and have settled on the list below as my staple diet. I’m going to review each of them and highlight what I like about them.

  • NetApp Communities Podcast
  • Packet Pushers Podcast
  • In Tech We Trust Podcast
  • Cisco Champion Radio
  • ProfessionalVMware – vBrownBag

So, the NetApp Communities Podcast. It runs on an almost weekly basis and is primarily hosted by Nick Howell (@datacenterdude), with extra insight from Glenn Sizemore (@glnsize) and Pete Flecha (@pedroarrow). I haven’t listened to every single episode, but I believe the format has changed in recent times to revolve around a core group of guys, with SMEs from various product teams or vendors joining in to discuss some of the offerings of NetApp and its partner vendors, and to look at the general IT landscape. For example, recently after Cloud OnTap was announced at NetApp Insight, chief architect Kevin Hill joined the guys for a discussion of its features and capabilities and how it extends the NetApp portfolio. It gave greater detail on what Cloud OnTap could achieve, which was not necessarily obvious from the released documentation. Another great part of the Communities Podcast is the recaps of the large vendor conferences such as Insight, VMworld and TechEd. It’s not always possible to stay on top of all the events out there and the amount of information and news they produce, so it’s great to have a digestible, bite-sized nugget that can be consumed quickly and keeps you up to date with the latest announcements.

There’s a great dynamic between the presenters, and one of the things I really like is the complete lack of ego these guys have. I’ve listened to a number of podcasts where the speakers have some sort of God complex and there’s a bit of a pissing contest, not only between the presenters but also between the various vendors. There’s no attacking vendors on this podcast, and that’s another thing I really like. There’s a definite focus on NetApp products and services, and the guys do a great job of providing in-depth detail on these as well as highlighting the positioning of the products. But there is no attacking other vendors. It’s about highlighting what’s good about NetApp and its integration with other products and vendors, and that’s it. Don’t get me wrong, there are quite a few ‘NetApp are great’ moments, but I get the feeling it’s more out of the pride these guys have in their jobs and in the company they work for. There’s an obvious love affair here between NetApp and its employees. These guys sound happy, they sound like they love their jobs, and it really shows.

The production quality of the podcast is top-class. Even at the recap events, which are sometimes recorded live on the conference floor, the sound quality is good. There’s the odd time where it isn’t fantastic, but those are few and far between. On the whole it’s a great recording, and rarely do you have to suffer through listening to someone talking inside a tin can.

This is a great podcast for anyone interested in NetApp, and a good one for anyone who’s not. Give it a try in whatever podcast app you use, or you can access the episodes over on datacenterdude podcasts.