Recently I've been working on a brand-new vSphere 6 environment, and we wanted to test a new feature: vMotion of clustered guest VMs with RDM disks.
We created and configured a Windows Server Failover Cluster (WSFC) solution in accordance with VMware's best practices and then vMotioned it to another host. As expected, it worked!
Then we decided to change the multipath policy of the RDM disk to Round Robin, which has been supported since vSphere 5.5. When we vMotioned the clustered guest VM in this specific scenario, we hit disk corruption on the WSFC solution.
Checking the guest OS, you would see a few events:
Source: NTFS
Event ID: 57
Message: The system failed to flush data to the transaction log. Corruption may occur.
At this point the RDM (Raw Device Mapping) disk was no longer online.
Source: FailoverClustering
Event ID: 1066
Message: Cluster disk resource indicates corruption for volume
Trying to bring the disk online again is useless. The disk is fully corrupted by now.
Source: NTFS
Event ID: 55
Message: The file system structure on the disk is corrupted and unusable. Please run the chkdsk utility on the volume.
We had no luck repairing the disk, even after running chkdsk, and we had to restore the data from a backup.
Well, after some extensive troubleshooting, VMware confirmed that it was a bug triggered when these two things are used in combination:
- vMotion of the guest VM with RDM.
- RDM disk configured with Round Robin as Path Selection Policy.
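To see whether a host is exposed to the second condition, you can look at the Path Selection Policy reported for each device by `esxcli storage nmp device list`. Here is a minimal Python sketch that scans a saved copy of that listing for devices using Round Robin (`VMW_PSP_RR`). The function name, the device IDs, and the trimmed sample listing are made up for illustration; real output has more fields per device, so treat this as a starting point rather than a finished tool.

```python
def devices_using_round_robin(nmp_listing: str) -> list[str]:
    """Return device IDs whose Path Selection Policy is VMW_PSP_RR."""
    round_robin = []
    current_device = None
    for line in nmp_listing.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line[0].isspace():
            # Unindented lines start a new device block (e.g. "naa.6000...").
            current_device = line.strip()
        elif line.strip().startswith("Path Selection Policy:"):
            policy = line.split(":", 1)[1].strip()
            if policy == "VMW_PSP_RR" and current_device is not None:
                round_robin.append(current_device)
    return round_robin


# Hypothetical, trimmed listing used only to exercise the parser.
SAMPLE_LISTING = """\
naa.600000000000000000000000000000a1
   Device Display Name: Example LUN A
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: (example)
naa.600000000000000000000000000000b2
   Device Display Name: Example LUN B
   Path Selection Policy: VMW_PSP_FIXED
"""

print(devices_using_round_robin(SAMPLE_LISTING))
```

Any device this reports that also backs an RDM on a clustered guest VM is a candidate for the problem described here.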
Luckily, VMware has already released a hotfix for this bug, on March 4th.
If you are running an environment like this, I strongly recommend installing the hotfix as soon as possible.
If for some reason you cannot install the hotfix, there's a workaround that will prevent the disk corruption: just change your multipath policy to Fixed.
That's it. The bug only occurs when the policy is set to Round Robin.
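If you have several RDM devices to switch, it can help to generate the commands instead of typing them. Here is a minimal Python sketch that builds the `esxcli storage nmp device set` command line for the Fixed policy (`VMW_PSP_FIXED`). The helper name and the device ID are hypothetical placeholders; check the exact syntax against the esxcli reference for your ESXi build before running anything on a host.

```python
import shlex


def fixed_policy_command(device_id: str) -> str:
    """Build the esxcli command that sets a device's PSP to Fixed."""
    args = [
        "esxcli", "storage", "nmp", "device", "set",
        "--device", device_id,
        "--psp", "VMW_PSP_FIXED",
    ]
    # Quote each argument so the line is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical device ID; replace with the IDs of your own RDM LUNs.
for device in ["naa.600000000000000000000000000000a1"]:
    print(fixed_policy_command(device))
```

The script only prints the commands, so you can review them before applying the change on each host that can see the RDM.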