Wondering if anyone has seen this before, what the cause may be, and whether there is an automated recovery scenario.
Situation: We run a 3-node cluster (with a 3-node GCO cluster at a remote site) on Dell R411 servers running VCS 5.1 SP1. This past Saturday, our operations team was performing a standard switchover of our Primary resources (applications) to a Standby node. On switchover to the new node, the IP resource (first in the dependency tree) was brought online. VCS then reported it was starting the first of our seven Application resources, but none came online. (As an aside, this node had run the Application resources within the past 3 weeks, and they are currently running on it, so there was no problem with the applications themselves.) It appeared that the Application agent was hung: we could still interact with the had daemon for stats and some commands, but hastop commands (and variants) would not complete, i.e., we had to Ctrl-C them because they would never finish.
This left us in a no-brain situation. There were no log entries or traps indicating that the had daemon was having a problem with the Application agent. Worse, had did not try to recover from the no-brain situation, at least not during the 15 minutes we spent trying CLI commands to clear the issue. We eventually recovered by rebooting the server where the issue was occurring. We run a 24x7 operation, and outages over 4 minutes can be very detrimental to our customers.
How do we know it was an Application agent hang? We have been able to reproduce the same situation in our lab by attaching to one or more of the Application agent threads and halting them on a Standby node, then switching over to that node. The Application resources are not started, and the had daemon does not try to recover (or, if it does, it reports that it is restarting the Application agent and then says the agent is already up), basically leaving us in no-brain. Also, we are migrating to VCS 6.0.1 in the next month and we see the same behavior with that release.
Has anyone seen this before? Is it a known VCS bug? Is there some way to automatically recover from this to keep us out of extended no-brain?
If the Application agent does hang, it should be restarted by VCS. See this extract from the "VCS Attributes" appendix in the VCS Administrator's Guide:
AgentReplyTimeout
The number of seconds the engine waits to receive a heartbeat from the agent before restarting the agent.
■ Type and dimension: integer-scalar
■ Default: 130 seconds
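For reference, AgentReplyTimeout is a cluster-level attribute. A hedged sketch of checking and lowering it with the standard ha commands (the value 60 is purely illustrative, not a tuning recommendation):

```shell
# Check the current agent heartbeat timeout (seconds).
haclus -value AgentReplyTimeout

# The configuration must be writable before modifying it.
haconf -makerw
haclus -modify AgentReplyTimeout 60   # illustrative value only
haconf -dump -makero                  # save config and return it to read-only
```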
If VCS does mess up, you should be able to run "hastop -local -force" if just one node has an issue, or "hastop -all -force" if all nodes have issues. This kills had, hashadow, and all the agent processes while leaving your applications in their current state; then you can restart VCS. You shouldn't need to reboot.
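As a sketch of that recovery sequence on the affected node (applications stay up throughout), assuming the standard VCS binaries are in PATH:

```shell
# Forcibly stop HAD on this node only, WITHOUT taking resources offline.
hastop -local -force

# Confirm had, hashadow, and the hung agent process are really gone
# before restarting (process names may vary by platform).
ps -ef | egrep 'had|hashadow|Application' | grep -v grep

# Restart VCS; it probes resources and rebuilds its view of their state.
hastart
```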
I would say there could be multiple reasons for an Application agent to hang.
My recommendation here would be to increase the logging level and have the logs analyzed to find the actual root cause.
The technote below explains how to increase the debug level.
You may need to open a support case to have the logs analyzed by Symantec. Once you enable debug logging, you may want to take a planned outage, attempt a similar activity, and ensure the logs are captured for analysis.
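For the reproduction attempt, per-agent debug logging is controlled by the type-level LogDbg attribute; a hedged example for the Application type (the specific DBG tags are assumptions here; confirm the tags and log path against the technote and your VCS version):

```shell
haconf -makerw
# Turn on debug message tags for the Application agent.
hatype -modify Application LogDbg DBG_1 DBG_2 DBG_3
haconf -dump -makero

# Watch the Application agent log while reproducing the hang
# (path assumed from the usual VCS log location).
tail -f /var/VRTSvcs/log/Application_A.log
```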