Wednesday, August 10, 2016

RDM Disk corruption on Microsoft Failover Cluster


Recently I’ve been working on a brand-new vSphere 6 environment, and we wanted to test a new feature: vMotion of clustered guest VMs with RDM disks.

We created and configured a Windows Server Failover Cluster (WSFC) solution in accordance with VMware's best practices and then vMotioned it to another host. As expected, it worked!

Then we decided to change the multipathing policy of the RDM disk to Round Robin, which has been supported since vSphere 5.5. When we vMotioned the clustered guest VM in this specific scenario, we hit disk corruption on the WSFC solution.

In the guest OS, you would see a few events:

Source: NTFS
Event ID: 57
Message: The system failed to flush data to the transaction log. Corruption may occur.


At this point, the RDM (Raw Device Mapping) disk was no longer online.

Source: FailoverClustering
Event ID: 1066
Message: Cluster disk resource indicates corruption for volume


Trying to bring the disk online again is useless. By now the disk is fully corrupted.

Source: NTFS
Event ID: 55
Message: The file system structure on the disk is corrupted and unusable. Please run chkdsk utility on the volume.


We had no luck repairing the disk, even after running chkdsk, and we had to restore the data from a backup.

After some extensive troubleshooting, VMware confirmed that this was a bug triggered when these two things are used in combination:
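The three events above make a recognizable signature. As a rough sketch, assuming you have already exported the guest's events as (source, event ID) pairs by whatever means you prefer, something like the following could flag an affected node. The function name and the input shape are my own choices for illustration, not part of any Microsoft or VMware tool.

```python
# (source, event ID) pairs quoted in this post, in the order they appeared.
CORRUPTION_SIGNATURE = [
    ("NTFS", 57),                  # failed to flush data to the transaction log
    ("FailoverClustering", 1066),  # cluster disk resource indicates corruption
    ("NTFS", 55),                  # file system structure is corrupted and unusable
]


def matched_signature_events(events):
    """Return the signature pairs found in `events`.

    `events` is any iterable of (source, event_id) tuples; how they are
    exported from the guest is left to the reader (hypothetical input).
    """
    seen = set(events)
    return [pair for pair in CORRUPTION_SIGNATURE if pair in seen]


if __name__ == "__main__":
    exported = [("NTFS", 57), ("Service Control Manager", 7036), ("NTFS", 55)]
    print(matched_signature_events(exported))
```

Seeing only some of these events does not prove you hit this bug, so treat a match as a reason to investigate rather than as a diagnosis.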

- vMotion of the guest VM with RDM.
- RDM disk configured with Round Robin as the Path Selection Policy.

Luckily, VMware already released a hotfix for this bug on March 4th.

If you are running an environment like this, it's strongly recommended that you install the hotfix as soon as possible.


If for some reason you cannot install the hotfix, there’s a workaround that prevents the disk corruption: just change the Path Selection Policy to Fixed.
That’s it. The bug only occurs when the Path Selection Policy is set to Round Robin.
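If you have many LUNs, the workaround can be sketched as a small script. This is only an illustration under some assumptions: the sample text imitates the layout of `esxcli storage nmp device list` output, the device IDs are made up, and the `rdm_devices` set (the LUNs you know are RDMs of clustered VMs) is something you have to supply yourself. The script only prints the `esxcli storage nmp device set` commands; it does not run anything.

```python
# Illustrative sample imitating `esxcli storage nmp device list` output.
# The device IDs are invented for this example.
SAMPLE_OUTPUT = """\
naa.60000000000000000000000000000001
   Device Display Name: Example Disk (naa.60000000000000000000000000000001)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: (example)
naa.60000000000000000000000000000002
   Device Display Name: Example Disk (naa.60000000000000000000000000000002)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_FIXED
naa.60000000000000000000000000000003
   Device Display Name: Example Disk (naa.60000000000000000000000000000003)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_RR
"""


def parse_device_policies(text):
    """Return {device_id: path_selection_policy} from the listing text."""
    policies = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device block.
            current = line.strip()
        elif current and line.strip().startswith("Path Selection Policy:"):
            policies[current] = line.split(":", 1)[1].strip()
    return policies


def fixed_policy_commands(policies, rdm_devices):
    """Build commands that switch Round Robin RDM devices to Fixed."""
    return [
        f"esxcli storage nmp device set --device {dev} --psp VMW_PSP_FIXED"
        for dev, psp in policies.items()
        if dev in rdm_devices and psp == "VMW_PSP_RR"
    ]


if __name__ == "__main__":
    # Hypothetical: only the first LUN is an RDM used by the cluster.
    rdm_devices = {"naa.60000000000000000000000000000001"}
    policies = parse_device_policies(SAMPLE_OUTPUT)
    for command in fixed_policy_commands(policies, rdm_devices):
        print(command)
```

Filtering by `rdm_devices` keeps the change limited to the affected RDM LUNs, so other datastores can stay on Round Robin. Review the printed commands before running them on each host.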

See you…


4 comments:

Anonymous said...

Have you tested after applying the fix? Did it work?

Tks!
Luiz Alberto - luiz.valente@hotmail.com

Eduardo Meirelles da Rocha said...

Hi Luiz,

Yes, I applied the fix in a real environment and it fixed the issue as promised.

Anonymous said...

Can you provide a guide on how to change RR to Fixed?

Eduardo Meirelles da Rocha said...

Hi, here's how you change the Path Selection Policy:
https://kb.vmware.com/kb/1036189

Who am I

I’m an IT specialist with over 15 years of experience, ranging from IT infrastructure to management products, with troubleshooting and project management skills in medium to large environments. Nowadays I'm working for VMware as a Senior Consultant, helping customers embrace the Cloud Era and succeed on this journey. Although I'm a VMware employee, these postings reflect my own opinion and do not represent VMware's positions, strategies, or opinions.