MountV and VMDg agents dying

Marianne
Level 6
Partner    VIP    Accredited Certified

3-node cluster SFW/HA 5.1 SP2 on Windows 2008 SP1 (x64)

Since the first installation of version 5.1 at the end of 2008, our customer has been plagued by these two cluster agents terminating periodically.

I found TN http://www.symantec.com/docs/TECH63436 the first time the problem was seen and ensured that the Veritas paths are in the PATH variable directly after the Windows paths.
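For reference, the quickest check on each node is simply to echo the variable from a command prompt and confirm that the Veritas directories sit directly after the Windows ones (the Veritas directory shown below is only illustrative - the exact paths depend on the products installed):

echo %PATH%
(expected ordering, roughly: C:\Windows\system32;C:\Windows;...;C:\Program Files\Veritas\Cluster Server\bin;...)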

Calls logged with Support over time have resulted in reinstallation of cluster nodes, installation of every available SFW/HA patch and exclusion of the Veritas paths from the anti-virus software, but the problem is still seen.

 

Extracts from Event Viewer Application log:

Error   2011/08/14 04:49:56 AM  Had     13086   None    Agent(MountV) detected that a previously started agent with process-id(15212) for this type is shutting down. Will terminate the old agent in (10) seconds.
Error   2011/08/14 04:49:56 AM  Had     13086   None    Agent(VMDg) detected that a previously started agent with process-id(13412) for this type is shutting down. Will terminate the old agent in (10) seconds.

 Error   2011/08/08 12:29:50 AM  AgentFramework  13120   None    Error receiving from the engine. Agent(MountV) is exiting.
Error   2011/08/08 12:29:50 AM  AgentFramework  13120   None    Error receiving from the engine. Agent(VMDg) is exiting.

Error   2011/08/07 12:14:56 AM  AgentFramework  13120   None    Error receiving from the engine. Agent(MountV) is exiting.
Error   2011/08/07 12:14:56 AM  AgentFramework  13120   None    Error receiving from the engine. Agent(VMDg) is exiting.

Error   2011/08/05 05:48:18 PM  AgentFramework  13120   None    Error receiving from the engine. Agent(MountV) is exiting.
Error   2011/08/05 05:46:55 PM  Had     13085   None    Terminating the old agent for the type(VMDg) with process-id(12528)
Error   2011/08/05 05:46:45 PM  Had     13051   None    Agent(MountV) is exiting because another agent with process-id(24792) is already running for this type
Error   2011/08/05 05:46:45 PM  Had     13086   None    Agent(VMDg) detected that a previously started agent with process-id(12528) for this type is shutting down. Will terminate the old agent in (10) seconds. 

 

Does ANYONE have any more ideas?

11 Replies

Wally_Heim
Level 6
Employee

Hi Marianne,

Are you seeing these messages for other resource types used in the cluster?

This is typical of HAD terminating and being restarted.  When HAD restarts, it kicks off new vcsagdriver.exe processes for each resource type.  If there is another vcsagdriver.exe process still running for a given resource type, then you will see these messages.

You can check the hashadow_A.txt log to see if you are having issues with HAD restarting.  If you are, then we need to determine why it is happening.  HAD and GAB often get restarted because of system load issues.  Are you seeing any messages from GAB in the event logs?  The hex details in the GAB messages show what is actually happening.  If you see it counting up from 7 seconds to 14 and then HAD restarting without a problem, it is typically system load at that time.  If you are not able to determine the cause of the system load, you can try increasing the GAB timeout value to 30 seconds and see whether that allows GAB to survive the issue.

Once HAD no longer restarts, the AgentFramework messages will go away.
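If it helps, a quick spot-check from a command prompt on the affected node would be something like the following (these are the standard VCS/GAB commands, so the output format may vary slightly between versions):

hastatus -sum
gabconfig -a
tasklist /fi "imagename eq vcsagdriver.exe"

hastatus -sum shows whether the MountV and VMDg agents are reported as running, gabconfig -a shows the GAB port membership (port a is GAB itself, port h is HAD), and the tasklist filter shows how many vcsagdriver.exe processes are actually alive so you can compare that with what HAD thinks.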

Thanks,

Wally

mikebounds
Level 6
Partner Accredited

I know this is SFW HA, but I had the same issue in SFHA for UNIX.  The problem turned out to be a memory leak in the Mount agent.  I also had the error "Agent is exiting because another agent with process-id is already running for this type", so I guess the agent was hanging rather than dying.  You might think that if it was a bug in the code then other customers would have the issue, but for me this was specific to the config, as I had NFS mounts and lots of them.  I know Windows doesn't have NFS mounts, but there might be something specific in your environment which means you are the only one (or one of few) who is experiencing a memory leak in the agent.

It sounds like only the MountV and VMDg agents are failing, which means the issue is not likely to be system load, as otherwise other agents would fail as well.  I therefore see a bug in these 2 agents or in the SFW provider as the most likely cause of your issue, since an error in the SAN, for instance, should cause the agents to report errors, not terminate.

Mike

Marianne
Level 6
Partner    VIP    Accredited Certified

Thanks, Wally. Today was actually the first time that I had a closer look at the Application log.

I have not been personally involved with the last couple of occurrences. I was asked last Friday (19 August) to investigate why a Service Group would not offline/fail over on the 17th.

In the engine_A log (difficult to even open because of its size - but that's another issue I need to find a solution for) I could see that the offline was initiated for one service group - everything had been running fine up to that point. The customer wanted to switch one SG because users were complaining about slow response.

2011/08/17 13:17:13 VCS NOTICE V-16-1-10167 Initiating manual offline of group SSR2sg on system PSQL3
2011/08/17 13:17:13 VCS NOTICE V-16-1-10300 Initiating Offline of Resource .....
2011/08/17 13:17:13 VCS NOTICE V-16-1-10300 Initiating Offline of Resource ......
2011/08/17 13:17:27 VCS NOTICE V-16-1-10300 Initiating Offline of Resource SSR2sg-MountV-1

One after the other, the resources reported as Offline - MountV never did. No timeouts and no errors were logged in engine_A.  An hour later the customer decided to switch all remaining SGs and reboot the server. Obviously the same thing happened for the other MountV resources.
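Next time it hangs like this, I plan to capture the resource and agent state while it is stuck, with something along these lines (resource and node names taken from the log extract above - substitute whichever resource/node is affected):

hares -state SSR2sg-MountV-1
hares -display SSR2sg-MountV-1 -sys PSQL3
haagent -display MountV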

Looking at the MountV_A log, the recurring warnings stopped on 14 August and only started again after the system was rebooted at around 15:00 on the 17th.

2011/08/14 21:30:17 VCS WARNING V-16-2-13140 Thread(10988) Could not find timer entry with id 15386
2011/08/17 15:01:11 VCS WARNING V-16-2-13140 Thread(7568) Could not find timer entry with id 87

So, yes, you are 100% correct - the entries that I've seen in the Application log are clearly not related to the MountV agent dying. Looking at the MountV log, the agent died some time after 21:30 on the 14th.

If the customer/colleague realizes that the agents are not running (this can be seen in the hastatus -sum output), they restart the agents with 'haagent', and sometimes with hastop -local -force. This normally 'kick-starts' the agents again.
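For clarity, by restarting with 'haagent' I mean something like the following (the node name is just the one from the log above - substitute the affected node):

haagent -start MountV -sys PSQL3
haagent -start VMDg -sys PSQL3

followed by haagent -display to confirm that they show as running again.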

So, there is no evidence of when/why these agents die...

The weird thing is that it ONLY happens to these 2 agents - nothing else.

I will have another look at the Application log to search for GAB and HAD entries (I'm relying on explicit searches because of the size of the log).
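To keep the searches manageable I will probably pull just the relevant sources out of the Application log with wevtutil, something like this (source names taken from the Source column in the entries above; adjust the event count as needed):

wevtutil qe Application /q:"*[System[Provider[@Name='Had']]]" /f:text /c:50
wevtutil qe Application /q:"*[System[Provider[@Name='AgentFramework']]]" /f:text /c:50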

 


 

Marianne
Level 6
Partner    VIP    Accredited Certified

Thanks for the additional pointers, Mike!

Customer mounts look something like this (4 volumes):

D:\            (Program files)
D:\SQL_Data    (SQL_Data)
D:\Logs        (SQL_Logs)
D:\RegRep      (Directory rep)

Yes, only these 3 agents are terminating, nothing else. We've been installing patches from late 2008 (5.1 GA) until earlier this year - no closer to a solution.

The customer is planning an OS upgrade to 2008 R2 soon (another post coming....)

Wally_Heim
Level 6
Employee

Hi Marianne,

I think the GAB errors should be in the System event log.

I would say that your issue needs a deep investigation of the logs to see what is happening.  I would recommend opening a case with Symantec Support and providing vxexplorer logs from all nodes in the cluster.  Please mention the times when the issue happened when opening the case.

Tech note TECH183192 covers how to collect vxexplorer logs - http://www.symantec.com/docs/TECH83192

As Mike mentioned, it might be an interaction with SFW or with something in the environment.  Are you able to install the latest Cumulative Patch?  A new one (CP3) is due out at the end of this week.

Thanks,

Wally

mikebounds
Level 6
Partner Accredited

It sounds as though this is an active/passive cluster if these are the only 4 mounts. Do the agents die on the inactive node, and if so, does the problem move to the other node when you switch the SG?

Are there 2 or 3 agents failing? If 3, what is the 3rd agent (in addition to MountV and VMDg)?

Mike

Marianne
Level 6
Partner    VIP    Accredited Certified

Hi Mike - these are just the mounts for one SG. Each SG has a drive letter with three mount points (total of 4 MountV resources). There are +- 15 SQL service groups spread over the 3 nodes.

Wally, we have been logging calls over and over, each time providing explorer output. The response each time was upgrade / patch / inconclusive....
The last recommendation was to install CP1 (a couple of weeks ago). We are still waiting for the customer to log a Change Control to install it.
Is there perhaps a README available for CP3, or do we have to wait for it to be released?
Is it best to wait for CP3 rather than installing CP1?

Accepted Solution

Wally_Heim
Level 6
Employee

Hi Marianne,

The CPs are released around the 25th-26th of every month now.  So you are talking about waiting a day or two for the release of CP3.  However, I cannot say that CP3 will resolve your issue.  I won't know what is included in CP3 until it is released.  Once it is released, the readme that comes with it will list all of the patches included.  We will also release a tech note that lists the issues fixed in CP3, but it usually takes us a few days to create it and get it published.

Thanks,

Wally

Wally_Heim
Level 6
Employee

Hi Marianne,

Please send me a private IM or email with the support case IDs.  I would like to take a look at them.

Thanks,

Wally

Marianne
Level 6
Partner    VIP    Accredited Certified

It seems insufficient info was supplied in the previous support cases...

We will have to wait until the problem is seen again and open a new support call.

Will keep you updated.

Marianne
Level 6
Partner    VIP    Accredited Certified

All stable since the installation of CP3 two weekends ago....