The other day, one of my coworkers was upgrading a cluster from ESX 4.0 to ESXi 4.1. He started with a host that was already empty (no DRS; it's a manually load-balanced cluster), and after that upgrade completed, he began evacuating the other two ESX 4.0 hosts onto the newly installed ESXi 4.1 host.
While attempting the vMotion evacuation (having extra licenses can be great for swing ops like this), the vMotions kept failing with: "Source detected that the destination failed to resume". A quick bit of googling brought us to VMware kb 1006052.
If you take a moment to read the kb article, you'll see that the issue revolves around one host in a cluster having different UUIDs for its datastores than the other host(s) in the cluster. Running vdf -h via ssh on the ESX host and df -h via remote troubleshooting mode on the ESXi host (vdf isn't a command in ESXi) showed us that one of the NFS datastores on the ESX host didn't match the UUID of the same datastore on the ESXi host.
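If you're checking more than a couple of datastores, eyeballing vdf/df output gets error-prone. Here's a rough sketch of the comparison: the datastore names and UUIDs below are made up for illustration, and on real hosts you'd capture the listings with something like `ssh root@host 'ls -l /vmfs/volumes'`, where each friendly name is a symlink to its UUID.

```shell
# Sample volume listings from the two hosts (hypothetical names/UUIDs;
# on a real host, capture with: ls -l /vmfs/volumes).
ESX_VOLUMES='nfs-datastore01 -> 7fa1b2c3-d4e5f6a7
vmfs-local01 -> 4a1b2c3d-e4f5a6b7-c8d9-001122334455'

ESXI_VOLUMES='nfs-datastore01 -> 1bc2d3e4-f5a6b7c8
vmfs-local01 -> 4a1b2c3d-e4f5a6b7-c8d9-001122334455'

# Pull the UUID behind a given datastore name out of a listing.
uuid_for() {
    printf '%s\n' "$1" | awk -v ds="$2" '$1 == ds { print $3 }'
}

# Flag any datastore whose UUID differs between the two hosts.
for ds in nfs-datastore01 vmfs-local01; do
    a=$(uuid_for "$ESX_VOLUMES" "$ds")
    b=$(uuid_for "$ESXI_VOLUMES" "$ds")
    if [ "$a" != "$b" ]; then
        echo "MISMATCH: $ds ($a vs $b)"
    else
        echo "OK: $ds"
    fi
done
```

Any line flagged MISMATCH is a datastore that will trip the "destination failed to resume" error when a vMotion touches a VM living on it.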
Now, the kb suggests 2 possible solutions: 1.) unmount and remount the datastore with the incorrect information, or 2.) perform a Storage vMotion. Since taking production VMs down during business hours was out of the question, we Storage vMotioned the affected VMs to a datastore with a UUID that all hosts agreed on, and then vMotioned the VMs off the host in order to prepare it for the ESXi upgrade.
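For the record, option 1 on an ESX 4.x host comes down to an esxcfg-nas delete/add cycle so the host re-derives the datastore UUID. The datastore name, filer hostname, and export path below are hypothetical, and this sketch runs in dry-run mode (it only prints the commands) since you'd want every VM off the datastore before actually doing this:

```shell
# Dry-run sketch of the unmount/remount fix for an NFS datastore.
# DS_NAME, NFS_HOST, and NFS_PATH are made-up examples.
DRY_RUN=1
DS_NAME="nfs-datastore01"
NFS_HOST="filer01.example.com"
NFS_PATH="/vol/vmware"

# Print the command when DRY_RUN=1; execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# Unmount the datastore (all VMs on it must be off or moved first)...
run esxcfg-nas -d "$DS_NAME"
# ...then re-add it, which forces the host to pick up a fresh UUID.
run esxcfg-nas -a -o "$NFS_HOST" -s "$NFS_PATH" "$DS_NAME"
```

Flip DRY_RUN to 0 on a host you actually mean to fix; option 2 needs nothing more than the Migrate wizard in the vSphere Client.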
On another non-production cluster, I ran into issues with the first option that forced me to remove (but not delete) the VMDKs from each VM, and then re-add them. That sort of situation is a complete PITA, even with a small number of VMs. My advice is to Storage vMotion what you can.
If anyone happens to come across an explanation for why this happens, please leave a comment or shoot me a tweet.