Solved: vxio: Cluster software communication timeout. Reservation refresh has been suspended

Balthier35 · ‎11-16-2014

Hi,

We are experiencing this error on one of our clusters. It's a two-node campus cluster with the following specifications:

SiteA
Node1 is a Windows Server 2008 R2 virtual machine residing on an ESXi 5.1 host in this site
Disk1 and 3 are LUNs in an enclosure in this site

SiteB
Node2 is a Windows Server 2008 R2 virtual machine residing on an ESXi 5.1 host in this site
Disk2 and 4 are LUNs in an enclosure in this site

We have created two VMDGs, one contains Disk 1 and 2, while the other contains Disk 3 and 4. On these VMDGs, we have created mirrored dynamic volumes. The VMDGs are then presented to the failover cluster. The quorum type on the failover cluster is a file share witness, on another server. We are also running Microsoft System Center Configuration Manager to install updates and patches on Node 1 and 2.

Whenever patches are installed on a node, it gets restarted. When that happens, the cluster resource group fails over from Node 1 to Node 2. Everything seems to fail over just fine, and the VMDG is imported successfully (according to the log). But 10 minutes after the VMDG has been imported, the following error is logged on Node 2:

http://s28.postimg.org/ubh8skfh9/vmdg2.png

If I check the status of the VMDGs in VEA, it is Deported for both VMDGs.

http://s3.postimg.org/72ort9683/vmdg3.png

But even though the disks and VMDGs appear to be offline on the active node, no failover occurs: in Failover Cluster Manager the VMDG is shown as online, but no volumes are enumerated on it.

http://s12.postimg.org/p31vncct9/vmdg1.png

Has anyone else experienced the same, and does anyone know why the status of the disks changes to deported without failover occurring?

RiaanBadenhorst · ‎11-18-2014

Have you seen this note?

http://www.symantec.com/docs/TECH50643

Balthier35 · ‎11-20-2014

Hi, and thanks for the reply. That note was the first thing I checked, since the URL is mentioned in the error. :)

....

"This error in general will not impact cluster operations. The occurrence of this error does not cause the failure of the VMDG resources and the cluster service still has record that the VMDG resources are online and will not attempt to failover the service group."

....

This is exactly what happens, as no failover is attempted by the cluster service. The cluster resource groups stay online, but if I check in VEA, the status of the VMDGs is deported.

.....

Recommendations:

1. Verify that the cluster software is running; this error is indicative of high cluster node resource utilisation
2. Review event logs for indications that other applications are experiencing resource shortage
3. Review event logs for cluster service messages indicating that there are issues monitoring resources
4. Consider separate resource monitors for the VMDG resources

......

1. If the cluster software was not running, wouldn't that trigger a failover?
2. No such indications.
3. No such indications
4. Is this configured in the properties sheet of the VMDG resource in Failover Cluster Manager, on the Advanced Policies Tab?

RiaanBadenhorst · ‎11-20-2014

1. No, because it is responsible for failover operations. What the error is saying is that it had not been contacted by the cluster software (MSCS) within the expected period of time, so it was "concerned" and logged the message.

From your description, though, it seems like there is a disconnect between MSCS and Volume Manager.

I know VCS better than MSCS, but I'm sure it would work the same from a logic point of view. The fact that MSCS shows the VMDg as online while VEA shows it as offline indicates there is something wrong with the monitor process.

Maybe take a look at the vxisis.log to see what is happening when it deports, or why it deports, or even whether it deports because of the communication break with MSCS.

4. Not sure about the MSCS properties.
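
To illustrate point 1, here is a toy Python model of that kind of timeout logic. It is only a sketch of the general idea, not how vxio is actually implemented: the class name, the method names, and the 600-second timeout (picked to match the roughly 10-minute delay reported above) are all assumptions.

```python
import time
from typing import Callable


class ReservationWatchdog:
    """Toy model: keep refreshing a reservation only while the cluster
    software keeps checking in. All names here are hypothetical."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s      # assumed value, not a documented vxio setting
        self.clock = clock
        self.last_contact = clock()
        self.suspended = False
        self.log = []

    def cluster_contact(self) -> None:
        """Called whenever the cluster's resource monitor polls the resource."""
        self.last_contact = self.clock()
        self.suspended = False

    def tick(self) -> bool:
        """Periodic check. Returns True if the reservation would be refreshed."""
        if self.clock() - self.last_contact > self.timeout_s:
            if not self.suspended:
                # Log once per outage, using the message text from this thread.
                self.suspended = True
                self.log.append("Cluster software communication timeout. "
                                "Reservation refresh has been suspended")
            return False
        return True


if __name__ == "__main__":
    now = [0.0]                          # fake clock so the demo runs instantly
    wd = ReservationWatchdog(timeout_s=600.0, clock=lambda: now[0])
    now[0] = 601.0                       # no contact from the cluster for >10 minutes
    wd.tick()
    print(wd.log)
```

Note that nothing in this model tells the cluster about the problem, which would be consistent with the cluster still showing the resource as online.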

Wally_Heim · ‎11-24-2014

Hi Balthier35,

Yes, open the VMDg resource properties -> Advanced Policies -> check the box for "Run this resource in a separate Resource Monitor". This will use a separate process for monitoring this resource.
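
If you would rather script this than click through the GUI, the checkbox should correspond to the SeparateMonitor common property of the cluster resource. Below is a rough Python sketch that builds (and optionally runs) a PowerShell command to set it. The resource name "VMDg1" and the helper names are hypothetical, and it assumes the FailoverClusters PowerShell module is installed on the node; the resource may need to be taken offline and brought back online before the change takes effect.

```python
import subprocess
from typing import List


def separate_monitor_command(resource_name: str, enabled: bool = True) -> List[str]:
    """Build a powershell.exe command line that sets SeparateMonitor
    on a cluster resource. Helper name is hypothetical."""
    # Escape single quotes for a PowerShell single-quoted string.
    quoted = resource_name.replace("'", "''")
    value = 1 if enabled else 0
    script = (
        "Import-Module FailoverClusters; "
        f"(Get-ClusterResource -Name '{quoted}').SeparateMonitor = {value}"
    )
    return ["powershell.exe", "-NoProfile", "-Command", script]


def apply_command(cmd: List[str]) -> None:
    """Run the command on a cluster node; raises if PowerShell reports failure."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "VMDg1" is a placeholder; use the resource name shown in Failover Cluster Manager.
    cmd = separate_monitor_command("VMDg1")
    print(cmd[-1])
    # apply_command(cmd)  # uncomment to actually apply it on a cluster node
```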

Thank you,

Wally

Balthier35 · ‎01-26-2015

Running the resource in a separate monitor did fix the problem. Thanks.

But what I am wondering is whether this can be caused by the cluster having only two nodes, one of which is "dormant" for 95% of the month. The cluster resource groups are only online on this node 1 day in the month.

The reason why I am suspecting this is because I've noticed the same on another cluster, and also there the "dormant" node is the one with problems. The node which is active 29 days in the month, has no problems whatsoever.

Does this theory hold water?

Wally_Heim · ‎01-26-2015

Hi Balthier35,

I don't see how this is related to which node hosts the service group most of the month. If this were a physical system, I would recommend checking all HBA/SCSI adapter drivers and checking the server for general performance issues.

But since this is an ESX 5.1 configuration, you might want to look at some of the new features in the 6.x product line. The new ESX integration resources for storage get around doing SCSI reservations on the LUNs/disks. The new agents use vCenter or the ESX server to attach the disks to the VM during online and detach them during offline. In this situation, SCSI reservations are not needed because only one VM has the disks attached at a time, so there is no possibility of the other node accidentally touching the LUNs/disks.

We have had a lot of customers starting to use these new resources with lots of success. The new resources that you should look into are VMWareDisks and VMNSDg.
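
As a purely conceptual illustration of why attach/detach removes the need for reservations, here is a small Python model. It does not use or represent the actual agents or any vCenter API; the class and function names are made up for the example.

```python
class SharedDisk:
    """Toy model of a disk that can be attached to at most one VM at a time."""

    def __init__(self, name):
        self.name = name
        self.attached_to = None

    def attach(self, vm):
        # Done when the resource comes online on a node.
        if self.attached_to not in (None, vm):
            raise RuntimeError(
                f"{self.name} is still attached to {self.attached_to}")
        self.attached_to = vm

    def detach(self, vm):
        # Done when the resource goes offline on a node.
        if self.attached_to == vm:
            self.attached_to = None


def fail_over(disks, source_vm, target_vm):
    """Offline on the source (detach), then online on the target (attach)."""
    for disk in disks:
        disk.detach(source_vm)
    for disk in disks:
        disk.attach(target_vm)


if __name__ == "__main__":
    disks = [SharedDisk("Disk1"), SharedDisk("Disk2")]
    for d in disks:
        d.attach("Node1")
    fail_over(disks, "Node1", "Node2")
    print([(d.name, d.attached_to) for d in disks])
```

Because the passive node never has the disks attached in this model, there is nothing for it to touch by accident, so no reservation has to be held or refreshed.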

Thank you,

Wally

VOX

vxio: Cluster software communication timeout. Reservation refresh has been suspended