VMware ESXi 5.5 – Unable to Consolidate virtual machine disk files

I’ve been working on an issue over the past couple of days where a backup kept failing. The problem was isolated to the fact that the VM had a warning saying its disks needed to be consolidated. Nothing major, or so I thought. I had a look at the datastore where the VM resides and found 185 snapshot VMDK files. Well, that can’t be right! So I did a bit of investigation and found a number of VMware KB articles on the problem. The simplest option is to follow KB 2003638 and run a consolidation by going to Snapshot -> Consolidate.
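With 185 snapshot files in one folder, counting them by eye in the datastore browser gets tedious. The following is a minimal Python sketch, not a VMware tool. It assumes you already have a plain list of file names from the VM's folder (for example, saved from a directory listing of the datastore). It also assumes the usual naming convention, where each snapshot adds a descriptor called `<disk>-000001.vmdk`, `<disk>-000002.vmdk` and so on, alongside a data file typically ending in `-delta.vmdk` or `-sesparse.vmdk`. Only the descriptors are counted, so each snapshot disk is counted once. The function name `count_snapshot_disks` and the sample file names are hypothetical.

```python
import re

# Assumed naming convention: snapshot descriptors end in "-NNNNNN.vmdk" (six digits).
# Data files such as "-000001-delta.vmdk" or "-000001-sesparse.vmdk" do not match,
# and neither do base disks ("app01.vmdk") or their "-flat.vmdk" data files.
DELTA_DESCRIPTOR = re.compile(r"-\d{6}\.vmdk$")


def count_snapshot_disks(filenames):
    """Return how many snapshot delta descriptor files appear in a folder listing."""
    return sum(1 for name in filenames if DELTA_DESCRIPTOR.search(name))


if __name__ == "__main__":
    # Hypothetical listing of a VM folder with two snapshot levels.
    listing = [
        "app01.vmx",
        "app01.vmdk",
        "app01-flat.vmdk",
        "app01-000001.vmdk",
        "app01-000001-delta.vmdk",
        "app01-000002.vmdk",
        "app01-000002-delta.vmdk",
    ]
    print(count_snapshot_disks(listing))  # 2
```

This only checks names, so treat the result as a rough count. A base disk whose own name happens to end in six digits (for example `db-202401.vmdk`) would be miscounted.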

You’ll then be prompted to confirm that the redo logs should be consolidated. Select Yes.

At this point it looked as if the consolidation was going to work, but at about 20% it failed. The resulting error showed that the file was locked.

There are a number of recommendations for removing the lock on the file. One is to migrate the VM to another host with vMotion (or move its disks with Storage vMotion). Unfortunately, both of these were standalone ESXi hosts with no vMotion network or capabilities, so that couldn’t be done. Some people recommend rebooting the ESXi host to release the lock, but as mentioned above there was no vMotion network to move the VMs off first, and these hosts run production manufacturing systems that cannot just be rebooted at random. Waiting on downtime approval would take too long. The next step was to restart the management agents on the ESXi host. This was done by connecting to the ESXi host via SSH and running the following commands:

SRM 5.1 Failover Test

Over the weekend I had to run a failover test for an application within SRM. As SRM can only replicate down to the datastore level and not the VM level, this meant doing a full test failover of all VMs, after first ensuring that every protected VM in the Protection Group was set to use an Isolated Network on the recovery site. This ensured that even though all VMs would be started in the recovery site, they would not be accessible on the network and therefore would not cause any conflicts. Apart from the risk of a VM not connecting to the isolated network, the main concern was that the VM being tested, and the application on it, run on Windows 2000. Yes, that’s not a typo: the server is running Windows 2000. The application dates from around that period as well, so if it drops and can’t be recovered, it’s a massive headache.
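Checking every protected VM's test network by hand is easy to get wrong when a protection group holds many VMs. As an illustration only, the pre-check can be expressed as a small Python function over a hypothetical export that maps each VM name to the network it will be attached to during a test recovery. The function name `vms_not_isolated`, the mapping and the network names are all made up for this sketch; this is not an SRM API.

```python
def vms_not_isolated(test_network_map, isolated_networks):
    """Return the protected VMs whose test-recovery network is not isolated.

    test_network_map: hypothetical dict of VM name -> network used during a test.
    isolated_networks: set of network names treated as isolated.
    """
    return sorted(
        vm
        for vm, network in test_network_map.items()
        if network not in isolated_networks
    )


if __name__ == "__main__":
    # Hypothetical protection group contents.
    mapping = {
        "win2k-app": "Isolated-Test",
        "sql01": "Isolated-Test",
        "web01": "Production-VLAN10",
    }
    problems = vms_not_isolated(mapping, {"Isolated-Test"})
    if problems:
        print("Not isolated:", ", ".join(problems))
    else:
        print("All protected VMs use an isolated test network")
```

Returning a sorted list rather than a yes/no answer means the check tells you exactly which VMs to fix before starting the test.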

Failover Test:

Step 1: Power down the production VM

Step 2: Perform Test Recovery

Go to Recovery Plans -> Protection Groups and select Test.

When prompted to begin the test, verify the direction of the recovery (from the protected site to the recovery site) and enable Replicate recent changes to recovery site. In most cases you will already be running synchronous writes between the sites, so the data will be more or less up to date anyway. It is still recommended to replicate recent changes to make sure all data is up to date.

Click Next and then click Start to confirm the test recovery.