I do realise that comparing VMTurbo to Tesla is a bit much but it’s really not all that far from the truth. When Tesla began designing their electric cars, electric cars were unfashionable and some previous manufacturers had produced some real pieces of crap, so most people were just thinking why bother. Why waste time on something that’s not going to sell? There are already enough electric cars on the market from more traditional and trusted manufacturers. And this is where the link to VMTurbo comes in. VMTurbo has entered an already saturated marketplace with another monitoring tool. As with Tesla, however, they have come to market with a product that does things very differently and has rocked the status quo. VMTurbo is not just a monitoring tool but an analysis appliance that provides realtime recommendations and updates to reduce the number of alerts within virtual infrastructures. It works towards keeping a desired state throughout the environment by supplying applications with the resources they require and ensuring that available resources are consumed efficiently. Two companies in different technological spaces are upsetting the market by thinking outside the box and finding novel solutions. If only VMTurbo had an Insane Mode like Tesla…
Over the past 10 years what has really changed with monitoring solutions? Well, not much. We now get more data and information to process but no real help in working out how to fix issues. The most extensive monitoring solution I’ve used is Microsoft SCOM. It’s just immense and it covers everything from the application down to the virtual layer (with some assistance from 3rd party products for VMware like Veeam MP for VMware), but it’s just too big and time consuming, and the recommendations provided in the alerts really just point back to KB articles. And this has primarily been my concern with these tools. They are designed to provide as much information as possible, to the point where the administrator/operator gets overloaded and doesn’t know where to begin to fix the problem. Ideally these tools should analyse that data and provide actionable and automated recommendations so that they can intelligently keep your environment running efficiently. This would free admins from going through reams of data and allow them to work on adding value to the business rather than being stuck down some IT rabbit hole. VMware vCOPs does provide analysis and reports on anomalies and issues that cause peaks outside of thresholds and baselines, but it doesn’t clearly identify what caused them or even whether they are worth investigating. I’ve spent countless hours tracing back alerts from vCOPs only to find that everything was ok within the environment and it was just a different workload temporarily running on the VM that caused the anomalies to trigger. And as for capacity planning, well that’s just another massive pain in the ass. vCOPs does however provide fairly decent capacity planning in comparison to most other tools but it’s still clumsy and limited regarding customisation.
It’s a bit like trying to reason with a 2 year old: you think you’re making progress but you eventually realise that it’s not going to do exactly as you wanted and will still throw your iPhone down the toilet anyway. (pesky kids :-)) vCOPs lets you add hosts/VMs etc. but it just feels clunky and makes it difficult to factor in all aspects of your infrastructure.
Over the weekend I had to run a failover test for an application within SRM. As SRM can only replicate down to the datastore level and not the VM level, this meant doing a full test failover of all VMs but ensuring beforehand that all protected VMs in the Protection Group were set to Isolated Network on the recovery site. This ensures that even though all VMs would be started in the recovery site they would not be accessible on the network and therefore would not cause any conflicts. The main concern, outside of a VM not connecting to the isolated network, was that the VM being tested and the application that sits on it are running on Windows 2000. Yes, that’s not a typo, the server is running Windows 2000. The application is from back around that period as well, so if it drops and can’t be recovered then it’s a massive headache.
Step 1: Power down the production VM
Step 2: Perform Test Recovery
Go to Recovery Plans -> Protection Groups and select Test
When the prompt to begin the test appears, verify the direction of the recovery, from the protected site to the recovery site. Enable the Replicate recent changes to recovery site option. In most cases you will already be running synchronous writes between the sites and the data will be just about up to date anyway. It is still recommended to replicate recent changes to make sure that all data is current.
Click Next and then click Start to confirm the test recovery
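The pre-check mentioned earlier, making sure every protected VM is mapped to the isolated network on the recovery site before the test is started, can be sketched as a small script. This is only an illustration under stated assumptions: the VM names, the network labels and the inventory dictionary are made up, and the script does not call any SRM or vSphere API. You would need to populate the mapping yourself, for example from an export of your recovery plan settings.

```python
# Hypothetical pre-flight check before an SRM test recovery.
# Nothing here talks to SRM or vCenter; the inventory is hand-built sample data.

ISOLATED_NETWORK = "Isolated Network"  # assumed label of the test network


def vms_not_isolated(recovery_networks):
    """Return the sorted names of VMs that are not fully isolated.

    recovery_networks maps a VM name to the list of recovery-site networks
    its vNICs are mapped to. A VM is flagged if any vNIC is mapped to a
    network other than the isolated one, or if it has no mapping at all
    (so that it gets reviewed by hand).
    """
    offenders = []
    for vm, networks in recovery_networks.items():
        if not networks or any(net != ISOLATED_NETWORK for net in networks):
            offenders.append(vm)
    return sorted(offenders)


if __name__ == "__main__":
    # Sample data only; names and networks are invented for the example.
    inventory = {
        "legacy-win2000-app": ["Isolated Network"],
        "file-server-01": ["Isolated Network", "Isolated Network"],
        "web-01": ["Prod-VLAN-120"],
    }
    bad = vms_not_isolated(inventory)
    if bad:
        print("Do NOT start the test, fix these VMs first:", ", ".join(bad))
    else:
        print("All protected VMs are isolated, OK to run the test recovery.")
```

Flagging VMs with no mapping at all is a deliberate choice in this sketch: when a test failover starts every VM in the Protection Group, it seems safer to review an unknown than to assume it is isolated.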
Just before the Xmas break I had to perform a MetroCluster DR test. I don’t know why all DR tests need to be done just before a holiday period, it always seems to happen that way. Actually I do know, but it doesn’t make the run up to the holidays any more comfortable for an IT engineer. Before I began the DR test I had a fairly OK knowledge of how MetroCluster worked, and it has definitely improved vastly since. If you want to learn how MetroCluster works and how it can be fixed, I’d recommend breaking your environment and working with NetApp support to fix it again. Make sure to put aside quite a bit of time so that you can get everything working again and your learning experience will be complete. (You may have problems convincing your boss to let you break your environment though.) I hadn’t worked with MetroCluster before, so while I understood how it worked and what it could do, I really didn’t understand the ins-and-outs and how it differs from a normal 7-Mode HA cluster. The short version is that it’s not all that different, but it is far more complex when it comes to takeovers and givebacks and just a bit more sensitive too. During my DR test I went from a test to an actual DR and infrastructure fix. Despite the problems we faced, the data management and availability were rock solid and we suffered absolutely no data loss.
I won’t go deeply into what MetroCluster is and how it works here, I may cover that in a separate blog post, but the key thing to be aware of is that the aggregates are classed as Plexes and use SyncMirror to ensure that all writes in a primary Plex get synchronously written to the secondary Plex so that all data exists in two places. SyncMirror differs from SnapMirror in that it synchronises at the aggregate level, whereas SnapMirror works at the volume level. MetroCluster itself is classed as a disaster avoidance system and satisfies this by having multiple copies of synchronised data on different sites. The MetroCluster in our environment is part of a larger Flexpod environment which includes fully redundant Cisco Nexus switches, Cisco UCS chassis and blades and a back-end dark fibre network between sites. A 10,000 foot view of the environment looks something like the below diagram and I think you can agree that there are a lot of moving parts here.
So what exactly happened during the DR test and why did I lose a storage controller? Well, I can put my hand up for my part in the failure, and the other part comes down to NetApp documentation, which is not clear about which operations need to be performed and in what sequence. As I’ll discuss later, there’s quite a need for them to update their documentation on MetroCluster testing. The main purpose of this post is not to criticise MetroCluster but to highlight the mistakes that I made, how they contributed to the failure that occurred and what I would do differently in the future. It’s more of a warning for others not to make the same mistakes I did. Truth be told, if the failure had occurred on something that wasn’t a MetroCluster I may have been in real trouble and, worst of all, lost some production data. Thankfully MetroCluster jumped to my rescue on this front.
I don’t know how this happens but for some reason I end up spending quite a bit of my time trying to get Trend solutions to work. And most of the time it’s in a scenario that hasn’t been covered in the knowledge base articles. I’ve recently been working on a project to create a virtualized test and development environment on Flexpod which involved placing a copy of production behind a firewall. This means duplicate IPs and server names, but a further problem is that the OfficeScan server requires two vNICs, which isn’t really a setup that Trend advise. This problem delayed the project by almost two weeks as I tried numerous fixes, then waited on assistance from support that wasn’t up to scratch, and finally got a Trend employee who really knew his stuff to help. The configuration I required wasn’t something he had seen in the past but it was definitely something he’d like to see working, so we spent a few days trying out different methods to get things working, and the steps below are how it was finally fixed.
The aim was to install OfficeScan Server on a VM in a DMZ with two vNICs, one with external access to the corporate network and one with internal access via static routes to servers within the DMZ or cell. Only two VMs have ‘egress’ connections to the corporate network, and these run through an ASA firewall. All other VMs are hidden within the cell in their own domain, do not have internet access and exist across multiple vLANs, and each server VM also has multiple vNICs.
Because the VMs in the cell are a test and development environment for production-based SCADA systems, the cell has to sit behind a firewall: the SCADA teams requested that the test and dev environment have the same IP addresses and machine names as their prod environment. Yes, I know that’s crazy and I have brought it up numerous times that this should not be done but I’ve been overruled so I’ve just had to deal with it.
This is not a configuration that Trend recommend for OfficeScan; they recommend another product for this type of DMZ based protection. We required the OfficeScan server to be able to communicate on two different vNICs, vLANs and IP addresses on the same ports. This is also not something that Trend has documentation on. In our environment we run a centralised Control Manager that manages the licenses for the different OfficeScan servers deployed within the environment. In most instances the different OfficeScan servers sit in different domains. The Control Manager server can only see the newly deployed OfficeScan server on its egress connection, in this case we can say 18.104.22.168. However, all VMs within the cell can only see the OfficeScan server on the internal facing address 22.214.171.124. Trend does have an IPTemplates fix on their knowledge base to allow for multiple vNICs on a client but this doesn’t work on the server side. The issue we faced is that when agents were installed on clients they were being denied on the firewall from reaching the egress connection of 126.96.36.199, and rightly so. Traffic allowed in to the egress vNIC from the Control Manager server is on port 8080, while the internal client VMs need to communicate with the OfficeScan server on a five-digit port (e.g. 19099) and on 8080, so allowing internal VMs to hit that egress connection would create a major security risk.
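To make the traffic requirements above easier to follow, here is a toy model of which flows should be permitted. It is a sketch for reasoning only, not OfficeScan or ASA configuration: the zone names (`control_manager`, `cell_client`), the interface labels (`egress`, `internal`) and the `is_allowed` helper are all hypothetical, and 19099 is simply the example client port mentioned above.

```python
# Toy model of the permitted flows to the dual-homed OfficeScan server.
# All names are illustrative; nothing here configures a firewall or OfficeScan.

SERVER_PORT = 8080    # port used by both Control Manager and client traffic
CLIENT_PORT = 19099   # example five-digit port from the text; yours will differ

# (source zone, server interface, destination port) tuples that should pass.
ALLOWED_FLOWS = {
    ("control_manager", "egress", SERVER_PORT),
    ("cell_client", "internal", SERVER_PORT),
    ("cell_client", "internal", CLIENT_PORT),
}


def is_allowed(source, interface, port):
    """Return True if this flow matches the intended design."""
    return (source, interface, port) in ALLOWED_FLOWS


if __name__ == "__main__":
    checks = [
        ("control_manager", "egress", SERVER_PORT),
        ("cell_client", "internal", CLIENT_PORT),
        ("cell_client", "egress", SERVER_PORT),  # the flow the firewall denied
    ]
    for source, interface, port in checks:
        verdict = "allow" if is_allowed(source, interface, port) else "deny"
        print(f"{source} -> {interface}:{port} = {verdict}")
```

The third check is the case that caused the problem: agents in the cell trying to reach the server on its egress interface. The model denies it, which is the behaviour the firewall should keep, so the fix has to come from the OfficeScan side rather than from opening the firewall.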
The required fix involved modifying the OfficeScan files so that it could communicate with the Control Manager server for updates and licensing but also allow the client VMs in the cell to communicate on the internal facing IP address/FQDN.
I originally came to the In Tech We Trust podcast as it was mentioned by Nick Howell over on the NetApp Communities Podcast so I thought I’d check it out. In Tech We Trust is a relatively new podcast that is hosted and run by some podcasting stalwarts from the IT industry in Nigel Poulton, Hans De Leenheer, Marc Farley, Gabriel Chapman and Rick Vanover. These guys have some serious IT backgrounds and knowledge to bring to the table, and if you want to know what the trends are in IT, what the top topics to look at are etc. then this is the podcast to listen to. In some recent podcasts they’ve discussed Docker, OpenStack, Mesosphere and VSAN vs VSA. That’s just the technology aspect. For me the most interesting area that the guys cover is the positioning of these products and where vendors are positioning themselves in the market. All of the presenters have an understanding of the business side of IT and provide really good insight into topics like ‘what the Cisco divorce from EMC means’ or ‘why IBM are selling off their server platform to Lenovo, can Lenovo make a play into storage?’. In order to understand the industry fully and also understand which trends are likely to become mainstream there’s a need for general IT workers to be across the business aspects of IT and not just the technical side.
There’s a great rapport between all the presenters. Nigel Poulton is the primary host and keeps the discussions moving throughout the podcast. However, the responsibility of main presenter moves around between the presenters from time to time, and if someone drops off or can’t make the podcast then it can still go ahead as there’s another host available. The core presenters remain the same but depending on their availability some presenters can’t make it all the time (I’m looking at you Rick Vanover!!). I think the discussions when all presenters are online are more in-depth and varied than when it’s just one or two presenters. This doesn’t diminish the podcasts where there are just a few online though. There are times when the discussion gets heated but there are no wallflowers here, everyone gets stuck in and gives their opinion and there are no ‘yes-men’, which I think adds the most value to the podcast. It’s a discussion, not a mouth-piece for how great/crap a specific product is.
The production quality of the podcast is good. There are times when people drop off or sound like they’re talking from the back of a cave, but considering there can be up to 6 presenters on the same podcast at once this is to be expected. Despite this criticism I have to admit that it rarely happens. The vast majority of the time there are no issues and given that all the presenters are in different time-zones it’s amazing that they are able to produce a podcast at all. There’s some serious commitment from everyone to be able to churn out one podcast a week.
If you’re going to listen to just one podcast to get a feel for the interactions between the guys then I’d recommend Episode 15 – Our First Annual Christmas Quiz. I was listening to this in bed last night and couldn’t stop laughing. My wife thought I was having some sort of spasm attack. I had to reassure her that a joke about 69 being your favourite number is funny no matter what age you are. Some of the other podcasts to check out are Episode 16 – 5 Trends in 2014, Episode 11 – Archiving Fibre Channel Connections to a Titsup Cloud, Episode 9 – Startups vs Big Companies, Episode 6 – Cisco Divorces EMC, Episode 5 – VSANs vs VSAs and Episode 1 – Cisco Wars and EMC Shipping Alpha Code.
If you have some free time definitely catch up on an In Tech We Trust podcast, I really don’t think you’ll regret it. You can find the podcasts either over at the iTunes store or on PodBean.
I had a bit of an unexpected ego-boost this morning on my first day back in the office after the Xmas break. I received an email from IT Central Station about the list of top reviewers for 2014 and I’m number 10 on the list. This was completely unexpected. Normally I don’t write product reviews on third party sites, actually this was the first time, so it’s good to hear that the review was useful for other people. You can read the review here: IT Central Station – Veeam Backup Review
Like a lot of other people I’m working out my priorities for the coming year and contributing more to the community is on that list. I’m going to try to add more reviews on third party sites and frequent community boards more often as a contributor rather than just a lurker. Bring on 2015, getting top 10 on IT Central Station is definitely a good start.