Looking for the root cause of an offline resource

Nu-B
Level 2

I recently noticed that I have a resource that is offline. I am fairly new to VCS and I'm looking to track down how I can determine when and why this resource is offline. VCS seems to have a ton of logs and I'm not sure which one will benefit me in my search for the root cause. Can anyone point me in the right direction?

1 ACCEPTED SOLUTION

mikebounds
Level 6
Partner Accredited

I don't agree that these messages are normal. They are of severity ERROR, not INFO, so the agent is exiting abnormally, i.e. crashing. This indicates a problem with VCS or the SRDF agent, not with SRDF itself: if there were a problem with SRDF, the SRDF agent should report it rather than shut down.

Some possible causes are:

  1. The SRDF agent has a memory leak
  2. There are lots of resources and/or service groups (more than 200), so the "had" daemon is busy
  3. The system is too busy, so "had" cannot get enough system resources

It could be a one-off, so to restart the agent you can use:

haagent -start SRDF -sys sys_name_on_which_agent_has_stopped

But if it stops again, you need to investigate the cause, which will probably involve logging a call with Symantec, unless there are obvious system resource problems on your nodes.

Mike

Wally_Heim
Level 6
Employee

Hi Nu-B,

VCS logs are fairly straightforward. The main process of VCS is called HAD. It logs all of its activities in the engine_A.txt file and keeps two older copies in the engine_B.txt and engine_C.txt files. This file keeps entries from all cluster nodes that are running and visible at the time an entry is made in the log.

All agents log information to a file named <agent_type>_A.txt, with older copies kept as <agent_type>_B.txt and <agent_type>_C.txt. For example, the IP resource would log all of its most recent entries in the IP_A.txt log file. This log is unique to the agent process on a given node.

Agents can have their logging level increased or decreased by adjusting the LogDbg attribute for that specific agent. This might be needed to get more details on what an agent is doing.

VCS keeps its log files in the %vcs_home%\log folder on Windows, or in a similar location on other operating systems. On Windows, %vcs_home% is the initial installation path of VCS.

You will need to check the engine_A.txt file on your cluster.
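
To make the naming scheme above concrete, here is a minimal Python sketch that builds the list of log files to look through, newest first, for either the engine or a given agent type. The function names and the example folder are hypothetical and only for illustration; substitute the real %vcs_home%\log path on your node.

```python
from pathlib import PureWindowsPath

# Rotation suffixes as described above: _A is the current log, _B and _C are older copies.
ROTATION_SUFFIXES = ("A", "B", "C")


def vcs_log_names(source="engine"):
    """Return log file names, newest first, for 'engine' or an agent type such as 'IP'."""
    return [f"{source}_{suffix}.txt" for suffix in ROTATION_SUFFIXES]


def vcs_log_paths(log_dir, source="engine"):
    """Join the names onto a Windows-style log folder (log_dir is supplied by the caller)."""
    return [str(PureWindowsPath(log_dir) / name) for name in vcs_log_names(source)]


if __name__ == "__main__":
    # Hypothetical folder; replace with the actual %vcs_home%\log location.
    for path in vcs_log_paths(r"C:\example\vcs_home\log", "SRDF"):
        print(path)
```

Reading the files in this order means you see the most recent entries first and only fall back to the older copies if the event you are chasing has already rotated out.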

I hope this helps.

Thank you,

Wally

Nu-B
Level 2

Good stuff, thanks for the info! I checked both logs, and it seems that my SRDF agent is having issues and my HAD engine is crashing.

2012/09/12 20:11:59 VCS ERROR V-16-2-13120 Thread(7624) Error receiving from the engine. Agent(SRDF) is exiting.
2012/09/12 20:16:43 VCS WARNING V-16-2-13140 Thread(7660) Could not find timer entry with id 6
2012/09/16 19:03:59 VCS ERROR V-16-2-13120 Thread(21356) Error receiving from the engine. Agent(SRDF) is exiting.
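
For anyone who wants to pull entries like these out of a large log, here is a rough Python sketch that splits each line into timestamp, severity, message ID, and text, then keeps only the chosen severities. The regular expression is an assumption based solely on the three lines quoted above, not on any documented log format, and the names `parse_line` and `entries_with_severity` are made up for this example.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the sample lines in this thread (an assumption, not a spec).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS "
    r"(?P<sev>[A-Z]+) (?P<msg_id>V-\d+-\d+-\d+) (?P<text>.*)$"
)


class Entry(NamedTuple):
    when: datetime
    severity: str
    msg_id: str
    text: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S")
    return Entry(when, m["sev"], m["msg_id"], m["text"])


def entries_with_severity(lines, severities=("ERROR",)):
    """Return parsed entries whose severity is in the given set, in file order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.severity in severities]


SAMPLE = [
    "2012/09/12 20:11:59 VCS ERROR V-16-2-13120 Thread(7624) Error receiving from the engine. Agent(SRDF) is exiting.",
    "2012/09/12 20:16:43 VCS WARNING V-16-2-13140 Thread(7660) Could not find timer entry with id 6",
    "2012/09/16 19:03:59 VCS ERROR V-16-2-13120 Thread(21356) Error receiving from the engine. Agent(SRDF) is exiting.",
]

if __name__ == "__main__":
    for entry in entries_with_severity(SAMPLE):
        print(entry.when.isoformat(), entry.msg_id, entry.text)
```

Comparing the timestamps of the ERROR entries in the agent log with entries around the same time in the engine log is one way to see what else was happening when the agent exited.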

Wally_Heim
Level 6
Employee

Hi Nu-B,

The SRDF agent really does not put much into the SRDF_A.txt log file.  The messages that you point out seem to be normal startup/shutdown of the SRDF agent/HAD.

The SRDF agent is a Perl-script-based agent. It should be reporting the EMC commands that it runs, and the output from those commands, in the engine_A.txt log.

Thank you,

Wally

Nu-B
Level 2

Thanks all for the info. From the logs it does look like my node 1 had issues: the HAD engine restarted and, at the same time, my SRDF agent errored out. One of my three SRDF resources didn't come back online, so I'm trying to track down why. It seems that this same problem occurred a few months ago, so this isn't the first time it has happened. Maybe I'll go ahead and open a case with support. Thanks again for everyone's help.