Problem in switching over the resource group
Hi All
I am facing a problem switching over a resource group.
When I use the hagrp -switch -to command to switch over a resource group, the group switches over absolutely fine, as shown below:
2012/04/03 13:02:14 VCS INFO V-16-1-50135 User root fired command: hagrp -switch cme-platform-sg dlcmdn1 from localhost
2012/04/03 13:02:14 VCS NOTICE V-16-1-10208 Initiating switch of group cme-platform-sg from system dlcmdn2 to system dlcmdn1
2012/04/03 13:02:14 VCS NOTICE V-16-1-10300 Initiating Offline of Resource wf-lister-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:02:16 VCS INFO V-16-10031-504 (dlcmdn2) Application:wf-lister-res:offline:Executed /opt/cmd/Mediate/script/ha/ as user cmd
2012/04/03 13:02:17 VCS INFO V-16-2-13001 (dlcmdn2) Resource(wf-lister-res): Output of the completed operation (offline)
Apr 03 2012 13:02:15 dlcmdn2 MediationEngine INFO Process with PID 5251 stopped
2012/04/03 13:02:17 VCS INFO V-16-1-10305 Resource wf-lister-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:02:17 VCS NOTICE V-16-1-10300 Initiating Offline of Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:02:21 VCS INFO V-16-2-13001 (dlcmdn2) Resource(cme-platform-res): Output of the completed operation (offline)
Shutting down Platform...done.
2012/04/03 13:02:22 VCS INFO V-16-1-10305 Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:02:22 VCS NOTICE V-16-1-10300 Initiating Offline of Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:02:24 VCS INFO V-16-1-10305 Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:02:24 VCS NOTICE V-16-1-10446 Group cme-platform-sg is offline on system dlcmdn2
2012/04/03 13:02:24 VCS NOTICE V-16-1-10301 Initiating Online of Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn1
2012/04/03 13:02:24 VCS INFO V-16-6-15002 (dlcmdn2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/nfs_postoffline dlcmdn2 cme-platform-sg successfully
2012/04/03 13:02:24 VCS INFO V-16-6-15002 (dlcmdn2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/postoffline dlcmdn2 cme-platform-sg successfully
2012/04/03 13:02:40 VCS INFO V-16-1-10298 Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) is online on dlcmdn1 (VCS initiated)
2012/04/03 13:02:40 VCS NOTICE V-16-1-10301 Initiating Online of Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn1
But when I run hagrp -offline, wait until the group is offline, and then run hagrp -online, the resource group never comes online on the other node and always gets stuck on the cme-platform-res resource. I have tried this many times and the problem always recurs. See below:
2012/04/03 13:08:04 VCS INFO V-16-1-50135 User root fired command: hagrp -offline cme-platform-sg dlcmdn2 from localhost
2012/04/03 13:08:04 VCS NOTICE V-16-1-10167 Initiating manual offline of group cme-platform-sg on system dlcmdn2
2012/04/03 13:08:04 VCS NOTICE V-16-1-10300 Initiating Offline of Resource wf-lister-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:08:06 VCS INFO V-16-10031-504 (dlcmdn2) Application:wf-lister-res:offline:Executed /opt/cmd/Mediate/script/ha/ as user cmd
2012/04/03 13:08:07 VCS INFO V-16-2-13001 (dlcmdn2) Resource(wf-lister-res): Output of the completed operation (offline)
Apr 03 2012 13:08:05 dlcmdn2 MediationEngine INFO Process with PID 6606 stopped
2012/04/03 13:08:07 VCS INFO V-16-1-10305 Resource wf-lister-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:08:07 VCS NOTICE V-16-1-10300 Initiating Offline of Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:08:11 VCS INFO V-16-2-13001 (dlcmdn2) Resource(cme-platform-res): Output of the completed operation (offline)
Shutting down Platform...done.
2012/04/03 13:08:12 VCS INFO V-16-1-10305 Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:08:12 VCS NOTICE V-16-1-10300 Initiating Offline of Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn2
2012/04/03 13:08:14 VCS INFO V-16-1-10305 Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) is offline on dlcmdn2 (VCS initiated)
2012/04/03 13:08:14 VCS NOTICE V-16-1-10446 Group cme-platform-sg is offline on system dlcmdn2
2012/04/03 13:08:14 VCS INFO V-16-6-15002 (dlcmdn2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/nfs_postoffline dlcmdn2 cme-platform-sg successfully
2012/04/03 13:08:14 VCS INFO V-16-6-15002 (dlcmdn2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/postoffline dlcmdn2 cme-platform-sg successfully
2012/04/03 13:08:33 VCS INFO V-16-1-50135 User root fired command: hagrp -online cme-platform-sg dlcmdn1 from localhost
2012/04/03 13:08:33 VCS NOTICE V-16-1-10166 Initiating manual online of group cme-platform-sg on system dlcmdn1
2012/04/03 13:08:33 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group cme-platform-sg on all nodes
2012/04/03 13:08:33 VCS NOTICE V-16-1-10301 Initiating Online of Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn1
2012/04/03 13:08:49 VCS INFO V-16-1-10298 Resource cme-platform-ip-res (Owner: unknown, Group: cme-platform-sg) is online on dlcmdn1 (VCS initiated)
2012/04/03 13:08:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) on System dlcmdn1
2012/04/03 13:10:21 VCS INFO V-16-2-13003 (dlcmdn1) Resource(cme-platform-res): Output of the timed out operation (online)
Starting Platform...
2012/04/03 13:10:21 VCS WARNING V-16-2-13012 (dlcmdn1) Resource(cme-platform-res): online procedure did not complete within the expected time.
2012/04/03 13:10:21 VCS ERROR V-16-2-13065 (dlcmdn1) Agent is calling clean for resource(cme-platform-res) because online did not complete within the expected time.
Can you please help find out what the problem could be here? As I understand it, hagrp -switch and hagrp -offline followed by hagrp -online do the same thing.
The only difference between a switch and an offline followed by an online should be timing, i.e. there is a delay between the group going offline and coming online when you run the two commands separately. Other than that, the routines VCS calls should be identical for the two procedures:
- offline routine for lister
- offline routine for platform
- offline routine for ip
- online routine for ip
- online routine for platform
- online routine for lister
This is slightly simplified, as VCS also calls the monitor routine after each offline and online to check that the resource really is offline or online.
Presumably, once the online has faulted, you clear the fault, run online again, and it works. If it doesn't, what do you have to do to get the online to work?
For the switch, in the logs above you show
2012/04/03 13:02:40 VCS NOTICE V-16-1-10301 Initiating Online of Resource cme-platform-res
But do not show:
Resource cme-platform-res (Owner: unknown, Group: cme-platform-sg) is online on dlcmdn1 (VCS initiated)
How long does this resource take to come online during the switch?
For the offline and online, the online of cme-platform-res seems to time out after about 90 seconds. Have you changed the OnlineTimeout from the default of 300 seconds to 90 seconds? If not, what is the Type of this resource? If you have changed it to 90 seconds, that may be too short.
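The "about 90 seconds" figure can be checked against the timestamps in the log. The following is only a small sketch: elapsed_seconds is a hypothetical helper (not a VCS command), and it assumes both timestamps fall on the same day.

```shell
# Hypothetical helper: seconds between two HH:MM:SS timestamps on the same day.
elapsed_seconds() {
    awk -v start="$1" -v end="$2" 'BEGIN {
        split(start, s, ":"); split(end, e, ":")
        print (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    }'
}

# From the log: online of cme-platform-res was initiated at 13:08:49
# and the agent reported the timeout at 13:10:21.
elapsed_seconds 13:08:49 13:10:21   # prints 92
```

That is roughly 92 seconds, which fits an online timeout of about 90 seconds rather than 300. The configured value can normally be confirmed with hares -display cme-platform-res -attribute OnlineTimeout.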
In terms of debugging you could see what happens in the following scenarios:
- Does offline and online in quick succession work?
I think you may be able to do this with the following (it is a long time since I have used "-wait", so I am not sure of the syntax):
hagrp -offline cme-platform-sg -sys dlcmdn2 ; hagrp -wait cme-platform-sg State OFFLINE -sys dlcmdn2 ; hagrp -online cme-platform-sg -sys dlcmdn1
- Offline the group on dlcmdn2, then online the ip resource manually on dlcmdn1, and then online the platform resource manually without using VCS.
If the platform resource fails or takes longer than 90 seconds, then you will have to debug your app.
- Wait a period of time, maybe 5 minutes, 10 minutes or an hour, between offline and online. Does the online always fail, or does it depend on how long you wait between offline and online?
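To make the quick-succession and delayed tests repeatable, the steps could be wrapped in a small function. This is only a sketch: offline_wait_online is a hypothetical name, the delay values are arbitrary, and the hagrp syntax is the same unverified "-wait" form as in the one-liner above, so check it against your VCS version first.

```shell
# Sketch: offline the group on one node, wait a chosen delay, online it on the other.
# Arguments: group, source system, target system, delay in seconds.
offline_wait_online() {
    group=$1; from=$2; to=$3; delay=$4
    hagrp -offline "$group" -sys "$from" || return 1
    # Block until VCS reports the group OFFLINE on the source node
    hagrp -wait "$group" State OFFLINE -sys "$from" || return 1
    # The gap between offline and online is the variable under test
    sleep "$delay"
    hagrp -online "$group" -sys "$to"
}

# Example runs (names taken from the logs above):
#   offline_wait_online cme-platform-sg dlcmdn2 dlcmdn1 0
#   offline_wait_online cme-platform-sg dlcmdn2 dlcmdn1 300
```

Running it with a delay of 0 approximates what hagrp -switch does, while larger delays show whether the failure depends on the gap between offline and online.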
If you are still not able to solve issue after above, post the results of tests and