Service Group Failover fails !!!
Hi there,
I'm having an issue with one of our clusters: when I tried to fail over the service group, it failed with the errors below -
2013/10/30 22:22:19 VCS ERROR V-16-2-13006 (XXXXX) Resource(vdaappdg_dg): clean procedure did not complete within the expected time.
2013/10/30 22:22:30 VCS ERROR V-16-2-13006 (XXXXX) Resource(vdrapp_vol): clean procedure did not complete within the expected time.
2013/10/30 22:22:32 VCS ERROR V-16-2-13027 (XXXXX) Resource(mobius_dg) - monitor procedure did not complete within the expected time.
2013/10/30 22:23:30 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(vdrapp_vol). Administrative intervention may be required.
2013/10/30 22:26:48 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(vdaappdg_dg). Administrative intervention may be required.
2013/10/30 22:32:35 VCS ERROR V-16-2-13063 (XXXXX) Agent is calling clean for resource(mobius_dg) because offline did not complete within the expected time.
2013/10/30 22:33:22 VCS INFO V-16-2-13068 (XXXXX) Resource(mobius_dg) - clean completed successfully.
2013/10/30 22:37:24 VCS ERROR V-16-2-13027 (XXXXX) Resource(mobius_dg) - monitor procedure did not complete within the expected time.
2013/10/30 22:37:24 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(mobius_dg). Administrative intervention may be required.
Background -
This cluster has a large number of disks holding about 54 TB of data.
DG-NAME    #VD   TOTAL (GB)   USED (GB)   FREE (GB)
mobius     176     14031.25    14000.00       31.25
vdaappdg   534     42596.69    40628.76     1967.93
When a VCS failover happens, it simply hangs and reports "Agent is unable to offline resource(vdrapp_vol). Administrative intervention may be required."
As far as I understand, this message is displayed when an offline procedure does not complete within the expected time. An offline procedure timeout can occur when the system is overloaded, is busy processing a system call, or is handling a large number of resources.
I'd like to know whether anyone has run into this situation, and whether it could be caused by the large amount of storage.
Waiting for expert advice, and ideally a solution, so that failover works seamlessly.
Thank you/Nilesh
Hello,
The volumes clearly look to be the culprit here, so there are two possibilities:
1. I/O is still in progress, so the volume can't be stopped. You need to make sure all I/O and all applications are stopped to get a clean, graceful offline of the volume.
2. The clean for the DG resource is also failing, most likely because the volume is still in use. It would also be worth checking what the vxconfigd daemon is doing: is it heavily loaded? A simple check is to run a few Veritas commands and see whether they are slow to return. If they are, all Veritas commands will time out, which is the most likely reason for the VCS behaviour above. To see what is keeping vxconfigd so busy, run:
# vxtask list (any background tasks running)
# ps -ef |grep -i vx (is there any old vx command which is stuck in a loop & making vxconfigd uncomfortable)
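If it helps, the "is there a delay" check can be scripted instead of eyeballed. This is only a rough sketch under some assumptions: check_latency is a made-up helper name, the 5-second limit is an arbitrary threshold you should adjust, and it assumes your date command supports +%s (not every Unix does). It runs only read-only commands (vxdctl mode, vxdg list, vxtask list).

```shell
#!/bin/sh
# Hypothetical helper: run a command, measure wall-clock seconds,
# and flag it as SLOW if it took longer than the given limit.
# Usage: check_latency LIMIT_SECONDS CMD [ARGS...]
check_latency() {
    limit=$1
    shift
    start=$(date +%s)          # assumes date supports +%s
    "$@" >/dev/null 2>&1       # discard output, we only care about timing
    rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "SLOW ${elapsed}s rc=$rc: $*"
    else
        echo "OK ${elapsed}s rc=$rc: $*"
    fi
}

# Run the Veritas checks only where the commands are installed.
# The 5-second limit is an assumed threshold, not a documented value.
if command -v vxdctl >/dev/null 2>&1; then
    check_latency 5 vxdctl mode
    check_latency 5 vxdg list
    check_latency 5 vxtask list
fi
```

If several of these come back SLOW, that points towards vxconfigd being the bottleneck rather than the volumes themselves. Note that a command which hangs completely will block the script, so be ready to interrupt it.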
Hope this helps
G