I’ve been working on an issue over the past couple of days where a backup has constantly been failing. The problem was isolated to a warning on the VM that its disks needed to be consolidated. Nothing major, or so I thought. I had a look at the datastore where the VM resides and it had 185 snapshot vmdk disks. Well, that can’t be right! So I did a bit of investigation and found a number of VMware KB articles around the problem. The basic option is to follow KB 2003638 and just run a basic consolidation by going to Snapshot -> Consolidate.
You’ll then be prompted to select Yes/No as you’ll have to consolidate the Redo logs. Select Yes.
At this point it looked as if the consolidation was going to work, but at about 20% it failed. The error showed that the file was locked.
There are a number of recommendations around what can be done to remove the lock on the file. One is to run a vMotion/svMotion in VMware to another host. Unfortunately, as these are both standalone ESXi hosts with no vMotion network or capabilities, that couldn’t be done. Some people recommend rebooting the ESXi host to release the lock, but besides there being no vMotion network, these hosts run production manufacturing systems and cannot just be randomly rebooted. Waiting on a downtime approval would take too long. The next step was to restart the management agents on the ESXi host. This was done by connecting to the ESXi host via SSH and running the following commands:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
This caused the host to be unmanageable for a brief moment. I re-ran the consolidation task I had tried earlier but got the same error message. Next I started to go through the KB article KB 10051 – Investigating virtual machine file locks on ESXi/ESX. This was a good article up to a point, but it could be clearer. I connected to the ESXi host via SSH. As I already knew which VM was affected, I moved straight to locating the lock and removing it. Instead of using /var/log/messages as mentioned in the KB article, I opened the VM’s vmware.log in vi. From there I ran a search for “lock”. What I found was that the 1-000830-delta.vmdk was locked.
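The log search can also be done without opening vi. The snippet below is a minimal sketch, assuming it is run from the VM’s directory on the datastore; the function name find_lock_lines is made up for illustration and is not part of ESXi.

```shell
# Print every line of a log file that mentions "lock", with its line number.
# find_lock_lines is a hypothetical helper name, not an ESXi command.
find_lock_lines() {
    # $1 = log file to search; defaults to vmware.log in the current directory
    # -i ignores case, -n prefixes each match with its line number
    grep -in 'lock' "${1:-vmware.log}"
}

# Example, from inside the VM's directory on the host:
#   find_lock_lines vmware.log
```

The matching lines should name the vmdk that could not be opened, which is the file to pass to vmkfstools -D in the next step.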
Based on the disk number it was possible to run the command ‘vmkfstools -D <vm_name>_1-000830-delta.vmdk’, which returned the MAC address of the host holding the read-only (RO) lock. I didn’t capture this at the time, but it will look similar to the output shown in the VMware KB article:
[root@test-esx1 testvm]# vmkfstools -D test-000008-delta.vmdk
Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998] <-------------- MAC address of lock owner
RO Owner[0] HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc <------------------------------ MAC address of read-only lock owner
Addr <4, 80, 160>, gen 33179, links 1, type reg, flags 0, uid 0, gid 0, mode 100600
len 738242560, nb 353 tbz 0, cow 0, zla 3, bs 2097152
With the MAC address you can check in vCenter to see which vNIC and vSwitch the MAC address was assigned to. In my case it was the ESXi management vNIC.
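In the vmkfstools -D output, the MAC address is the last 12 hex digits of the owner string (the part after the final hyphen). As a small sketch, the conversion to the usual colon-separated form could look like this; owner_to_mac is a hypothetical helper name, and the owner value is the one from the KB sample output above.

```shell
# Convert the last field of a lock-owner string into a colon-separated MAC.
# owner_to_mac is a hypothetical helper name, not an ESXi command.
owner_to_mac() {
    # ${1##*-} drops everything up to the last hyphen, leaving 12 hex digits;
    # sed then inserts a colon after every pair and trims the trailing one
    printf '%s\n' "${1##*-}" | sed 's/../&:/g; s/:$//'
}

owner_to_mac 4f284470-4991d61b-4b28-001a64c335dc
# prints 00:1a:64:c3:35:dc
```

If vCenter is not to hand, the result can probably also be compared against the MAC addresses listed by esxcli network nic list on each host.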
From here I deviated from the KB article. I ran the following command:
# lsof | grep <vm_name>_1-000830-delta.vmdk
This returned two processes.
9778541 vpxa-worker 11 51 /vmfs/volumes/5356c1f8-55d703b6-d4b5-b83861d73252/<VM_Name>/<name_of_locked_file>
1053523 vpxa-worker 2 8 /vmfs/volumes/5356c1f8-55d703b6-d4b5-b83861d73252/<VM_Name>/<name_of_locked_file>
I killed the older job as that was the one causing the lock.
kill 1053523
Now that the lock was removed I was able to re-run the Consolidate Snapshots and it ran successfully.
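For reference, the lsof step can be narrowed down to just the process IDs and names holding a given file. This is only a sketch: lock_holders is a made-up helper name, and it assumes the lsof output has the ID in the first column and the process name in the second, as in the two lines shown above.

```shell
# List the ID and name of every process whose lsof line mentions a file.
# lock_holders is a hypothetical helper name; it reads lsof output on stdin.
lock_holders() {
    # $1 = file name to look for (plain substring match, not a regex)
    awk -v f="$1" 'index($0, f) { print $1, $2 }'
}

# Example, on the ESXi host:
#   lsof | lock_holders <vm_name>_1-000830-delta.vmdk
```

Deciding which of the listed processes to kill is still a manual call. In my case it was the older vpxa-worker.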
Caveat:
Not all vmdks were removed as part of the consolidation. Some remained, and this was due to those disks being mounted via hot-add to the backup server.
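To see which delta disks are still left behind after a consolidation, something like the following could be run over SSH. It is a rough sketch: remaining_deltas is a hypothetical helper name, and it assumes the snapshot extents follow the -delta.vmdk naming seen in this post.

```shell
# List snapshot delta extents still present in a VM directory.
# remaining_deltas is a hypothetical helper name, not an ESXi command.
remaining_deltas() {
    # $1 = VM directory on the datastore; defaults to the current directory
    dir="${1:-.}"
    # "|| true" keeps the exit status clean when nothing matches
    ls "$dir" | grep -- '-delta\.vmdk$' || true
}

# Example, on the ESXi host (pipe to wc -l for a count):
#   remaining_deltas /vmfs/volumes/<datastore>/<VM_Name> | wc -l
```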
Cause of the issue:
So why did this happen? The backup software was failing to clean up a snapshot, which then had a knock-on effect on each backup after that, causing a number of new disks to be created. The hot-add feature of the backup software meant that the backup server VM had a lock on one of the vmdks. The lock wasn’t released after a failed backup at some time in the past, and every time a new backup was taken the number of snapshot disks just kept growing. Consolidating the snapshots actually caused the backup server to shut down, and it could not be powered back on again until all hot-add disks had been removed from it. I chose not to delete them from disk and will perform that cleanup as a manual task.
We’ve run into this issue several times with NetVault leaving orphaned snapshots behind when using the NetVault VMware Plug-in to back up VMs.
The solution to fix this is a little simpler than the above.
All you need to do is connect over SSH to the host that’s currently running the VM in question. Once connected, run /etc/init.d/vpxa restart. This will restart the worker process that has a lock on your VM.
Go back into your vSphere client -> right-click the VM with the disk consolidation error, select Snapshot -> Take Snapshot, give it a name and select ‘quiesce file system’ if VMware Tools is installed. Once the snapshot is completed, right-click on the VM again, go into Snapshot -> Snapshot Manager and select ‘Delete All…’. This will consolidate all snapshots (both visible and hidden ones) back into a single disk. Let the process finish, and your disk consolidation error is gone.
You can verify the process worked by going into Edit Settings on the problematic VM and looking at the hard disk(s) assigned. They should be back to their original names rather than something like -000002.vmdk.
More info about this process here: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002310
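The Edit Settings check can also be approximated from the shell by looking at the disk entries in the VM’s .vmx file. This is a sketch under assumptions: snapshot_disks_in_vmx is a made-up helper name, and it relies on snapshot disks having a six-digit suffix such as -000002.vmdk, as described above.

```shell
# Print .vmx disk entries that still point at a snapshot disk.
# snapshot_disks_in_vmx is a hypothetical helper name, not an ESXi command.
snapshot_disks_in_vmx() {
    # $1 = path to the VM's .vmx file
    # keep fileName lines whose value ends in -NNNNNN.vmdk (six digits)
    grep -i 'fileName' "$1" | grep -E -- '-[0-9]{6}\.vmdk"' || true
}

# Example, on the ESXi host:
#   snapshot_disks_in_vmx /vmfs/volumes/<datastore>/<VM_Name>/<VM_Name>.vmx
```

No output suggests every disk entry is back on its base vmdk.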
Hi Tim. Thanks for the feedback. The method I provided is just one way to get the consolidation to work. I did try the process provided in your link but unfortunately it didn’t work for me at the time. That’s not to say it wouldn’t work in a future situation.
After writing this I found an easy fix to a similar issue I was having with consolidation. The backup proxy server was using HotAdd for its backups and the snapshot was locked by the proxy server. After removing it from the inventory in the settings for the proxy server, the consolidation worked.
I think that with so many moving parts within the VMware environment there’s no definitive single way to complete the consolidation task, as each environment is different and involves different tools and configurations.
Thanks again for the additional instructions and for posting.