Storage Foundation with vSphere Windows OS Virtual Machines

paulfisher27
Not applicable
Hi,

We are currently testing Storage Foundation in conjunction with Microsoft Clustering to mirror and cluster virtual machines across sites. We are using a VMDK (virtual disk) for the OS drive and an RDM for the data drive. To test failover, we unpresent the LUN backing the active node's C: drive so that the cluster fails over to the passive node. The problem we are having is that the virtual machine starts queuing commands and I/O on the HBA or in memory, so the node never fails over to the passive node. Then, when the C: drive is re-presented to the active node, the queued commands and I/O are processed as normal!

This is not a good thing, though: we want it to fail quickly, not only after it runs out of memory or HBA queue depth!

This happens on two vSphere virtual machines. When we test the same scenario on ESX 3.5 it fails instantly, so it must be something to do with the new multipathing, or something like that?
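One thing we can check inside the guest (this is only a guess on our part, not something we've confirmed is the cause) is the Windows disk I/O timeout, which governs how long Windows waits on outstanding disk I/O before failing it. VMware Tools and cluster setups sometimes raise this value, which could delay the guest from failing I/O and hence delay failover:

```shell
REM Query the Windows guest's disk I/O timeout, in seconds.
REM The standard location is the Disk service's TimeOutValue parameter;
REM a large value here could explain I/O queuing instead of fast failure.
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue
```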

Any help would be much appreciated!

Thanks,

Paul
1 REPLY

jlockley
Level 3
Employee Accredited Certified
Hi Paul,

You want to fail over virtual machines from one ESX / vSphere host server to another when the guest runs out of resources (e.g. memory or HBA queue depth)?  And you are running SFW to give you multipathing on RDM devices within the guest.

The way you fail over the machines is to use MS Cluster to detect a failed OS drive (by removing access to the VMDK file), so the cluster should bring up the guest on another ESX / vSphere host.

Assuming the design I've stated is correct, isn't the problem not that the virtual machine won't come up on a new host but, by the sound of it, that it won't fail on the old host?  If the C: drive is not accessible to the guest OS, yet you are saying I/O "starts queuing" within the vSphere host, how does this happen?  The guest should have failed instantly, as it does on ESX 3.5.

I'm wondering why you think this may be a multipathing issue.  I haven't seen this in practice, and assuming you are running fibre HBAs, I would expect the multipathing to be done at the ESX level.  We do see some people having issues with data disks being lost on guest failover, but they are using a cluster disk group that depends on SCSI reservations, and so far the hardware is not capable of transferring the reservation from host to host without the guest knowing about it (i.e. losing it and failing the cluster disk resources).  In your case, I think you would want I/O either queued to the disk or failed back to the application, so that when the guest comes up again the application recovers and continues.

Sounds like an interesting project; if you have any more comments on what you're doing, I'd be interested in following along.  Our dev team and Product Management teams might have related work going on too, as people's thinking about what belongs in a virtual machine changes over time and they request more enhancements.

James.