
Service Group Failover fails !!!

Nilesh_Joshi
Level 3
Certified

Hi there,

I'm having an issue with one of our clusters: when I tried to fail over a service group, it failed with the errors below -

2013/10/30 22:22:19 VCS ERROR V-16-2-13006 (XXXXX) Resource(vdaappdg_dg): clean procedure did not complete within the expected time.
2013/10/30 22:22:30 VCS ERROR V-16-2-13006 (XXXXX) Resource(vdrapp_vol): clean procedure did not complete within the expected time.
2013/10/30 22:22:32 VCS ERROR V-16-2-13027 (XXXXX) Resource(mobius_dg) - monitor procedure did not complete within the expected time.
2013/10/30 22:23:30 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(vdrapp_vol). Administrative intervention may be required.
2013/10/30 22:26:48 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(vdaappdg_dg). Administrative intervention may be required.
2013/10/30 22:32:35 VCS ERROR V-16-2-13063 (XXXXX) Agent is calling clean for resource(mobius_dg) because offline did not complete within the expected time.
2013/10/30 22:33:22 VCS INFO V-16-2-13068 (XXXXX) Resource(mobius_dg) - clean completed successfully.
2013/10/30 22:37:24 VCS ERROR V-16-2-13027 (XXXXX) Resource(mobius_dg) - monitor procedure did not complete within the expected time.
2013/10/30 22:37:24 VCS ERROR V-16-2-13077 (XXXXX) Agent is unable to offline resource(mobius_dg). Administrative intervention may be required.

Background -

On this cluster we have a large number of disks, holding around 54 TB of data.

DG NAME       #VD   TOTAL (GB)   USED (GB)   FREE (GB)
mobius        176     14031.25     14000.00      31.25
vdaappdg      534     42596.69     40628.76    1967.93

When a VCS fail-over is attempted it simply hangs and reports "Agent is unable to offline resource(vdrapp_vol). Administrative intervention may be required."

As far as I understand, this message is displayed when an offline procedure does not complete on time. An offline procedure timeout can occur when the system is overloaded, is busy processing a system call, or is handling a large number of resources.
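For reference, I believe the timeouts that govern this are set at the agent type level and can be checked with something like the following (a sketch, assuming the resources use the bundled DiskGroup and Volume agents):

# hatype -display DiskGroup -attribute OfflineTimeout
# hatype -display DiskGroup -attribute MonitorTimeout
# hatype -display Volume -attribute OfflineTimeout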

I wanted to know if anyone has ever gone through such a situation, and whether anyone can advise if this is happening because of the large amount of storage?

Waiting for expert advice, and possibly a solution, to make sure the fail-over functionality works seamlessly.

Thank you/Nilesh


6 REPLIES

Marianne
Moderator
Partner    VIP    Accredited Certified

Forget about VCS for now.

VCS is simply performing OS commands to offline a Service Group.

So, with everything online (file systems mounted), stop VCS:
hastop -all -force

Stop application that depends on filesystem/volume(s) and diskgroup(s).

On the active node, unmount the filesystem(s).
How long does it take?

Deport the diskgroup(s).
How long does it take?
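For example, something along these lines (a rough sketch, using the mobius disk group; the mount point is a placeholder, substitute your own):

# hastop -all -force                (stop VCS, leave everything else running)
  ... stop the application manually ...
# time umount /your/mountpoint      (repeat for each file system in the service group)
# time vxvol -g mobius stopall      (stop all volumes in the disk group)
# time vxdg deport mobius           (deport the disk group)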

Please share all relevant info about your cluster: 

Solaris version (SPARC or x86?)

SF/HA version?

Please post main.cf entries for this SG.

Nilesh_Joshi
Level 3
Certified

Thanks for your response.

Testing how much time the unmount and disk group deport take will need a while to arrange, as it involves downtime, which is very costly while a business freeze is in effect!

However I can certainly provide you the cluster environment details -

Hardware - V890 SPARC server
Operating System - Solaris 9
SF/HA Version - VERITAS-4.1MP2 

I've also attached main.cf entries for problematic service group.

BTW, just another thought -

Could it be that the system is simply incapable of handling this many LUNs / this much storage during a fail-over from one node to another?

  • Hardware is end of service life, V890
  • Operating system is legacy, Solaris 9
  • Volume manager is quite old, 4.1 MP2 of VxVM

Please advise.

Thank you/Nilesh

Marianne
Moderator
Partner    VIP    Accredited Certified

It is very worrying that you are running an old, unsupported OS and VxVM/VCS software in production.

I guess that HBA firmware and drivers are out of date too, resulting in long delays to probe devices at OS and VxVM level....

The reason I asked you to time a manual unmount, deport and import is to determine suitable offline, online and monitor timeouts. These attributes can be adjusted.
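For example, if a manual deport of the bigger disk group turns out to take several minutes, the timeouts could be raised along these lines (a sketch only; the 600-second value is purely illustrative, and per-resource overrides assume your VCS version supports them, otherwise hatype -modify changes the value for all resources of that type):

# haconf -makerw
# hares -override vdaappdg_dg OfflineTimeout
# hares -modify vdaappdg_dg OfflineTimeout 600
# hares -override vdaappdg_dg MonitorTimeout
# hares -modify vdaappdg_dg MonitorTimeout 600
# haconf -dump -makero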

Shaf
Level 6

Nilesh,

 

Please check the following:

VCS services run as the root user on the fail-over node.
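A quick way to confirm that (just a simple check):

# ps -ef | grep -w had      (had and hashadow should be owned by root on both nodes)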

 

Daniel_Matheus
Level 4
Employee Accredited Certified

Hi Nilesh,

 

Did you check whether there were still processes accessing the file system?

This is most likely the reason for the umount, volume stop and dg deport failing.

As this is an NFS share, was the share stopped correctly?

Or the simplest thing, sometimes forgotten: was your current working directory (pwd) inside the share path?

 

You can check that by running either fuser or lsof.
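For example (a sketch; /app/mobius is a placeholder for the real mount point in the service group):

# fuser -cu /app/mobius       (PIDs and users with files open on the mounted file system)
# lsof /app/mobius            (if lsof is installed on the host)
# share                       (is the file system still NFS-exported?)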

 

The problem is that while the file system is still in use you can't unmount it; while it is still mounted you can't stop the volume; and that means you can't deport the diskgroup.

 

In the clean procedure VCS tries to do a force umount, but it seems even that failed.

Is there anything conclusive in the syslogs?

 

Regards,
Daniel

Gaurav_S
Moderator
   VIP    Certified

Hello,

The volumes clearly look to be the culprit here, so there could be two options:

1. I/Os are still going on and hence the volume can't be stopped. You need to ensure that all I/O has stopped and the applications are stopped, to allow a clean, graceful offline of the volume.

2. The DG resource is also reporting an unsuccessful clean, most likely because the volume is still operational. However, it would also be worth checking what the vxconfigd daemon is doing. Is the daemon heavily loaded? A simple check for this is to run some Veritas commands and see if there is a delay in them returning; if so, all Veritas commands will time out, which is the most likely reason for the VCS behaviour above. It would be good to check what is keeping vxconfigd so busy; this can be found with:

# vxtask list  (any background tasks running)

# ps -ef |grep -i vx     (is there any old vx command which is stuck in a loop & making vxconfigd uncomfortable)
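A couple of additional checks to gauge whether vxconfigd itself is responsive (generic commands, nothing specific to your config):

# vxdctl mode                     (should come back quickly with "mode: enabled")
# time vxdisk -o alldgs list      (a long delay here points at vxconfigd / device scanning trouble)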

 

Hope this helps

 

G