
Query on VCS & VxVM

tanislavm
Level 6

Hi Gaurav,

I would like to verify a few troubleshooting points with you.

A service group could fail over to another node because the currently running node is heavily loaded? If yes, how can I see that this was the reason? Any clue in engine_A.log? For example, CPU load above 60%? Too little physical memory and not enough swap?

If a plex is in the NDEV state, this means that the disk is faulty or the paths to the disk are faulty, right?

If a service group fails over to the other node, will engine_A.log show me the faulty resource that was the culprit? If yes, then let's say a logical volume resource went faulty. Should I next investigate /var/adm/messages to see whether the issue is because of the disk or because of the paths to the disk (HBA)?

 

When I run the "hastart" command, it will start only had, hashadow and the agents based on the resources in main.cf, right?

hastart will not start LLT and GAB, right?

So if LLT and GAB are not started and I issue hastart, then nothing happens?

Could a resource be online at the OS level and offline at the VCS level?

What happens when a resource agent disappears while the resource is online?

 

thanks so much.


8 REPLIES

Gaurav_S
Moderator
VIP Certified

Answers:

A service group could fail over to another node because the currently running node is heavily loaded? If yes, how can I see that this was the reason? Any clue in engine_A.log? For example, CPU load above 60%? Too little physical memory and not enough swap?

>>> Yes, it is possible. If the server is heavily loaded, VCS will not get sufficient resources to process the agents' requests & hence a failover may happen. VCS has a HostMonitor agent which can monitor CPU resource contention & it does log to engine_A.log. Not all performance issues will be captured in engine_A.log; you need to investigate from the OS level as well (SAR reports, /var/adm/messages, etc.). 60% on its own should not cause a crash/failover.
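For example, a quick first pass at the OS level could look like the sketch below (Solaris-style commands; flags and paths may differ on your platform):

sar -u 5 5                                                # CPU utilization samples
vmstat 5 5                                                # run queue, free memory, scan rate
swap -l                                                   # swap devices and remaining free space
grep -i hostmonitor /var/VRTSvcs/log/engine_A.log         # HostMonitor utilization messages
grep -i fault /var/VRTSvcs/log/engine_A.log | tail -20    # recent fault events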

 

If a plex is in the NDEV state, this means that the disk is faulty or the paths to the disk are faulty, right?

>> Yes, NDEV means there is no device; a disk device may have failed.
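To confirm, you could check the states from the VxVM side; a minimal sketch, assuming an illustrative disk group named "datadg" (the DMP command only applies if DMP is in use):

vxprint -g datadg -ht          # volume/plex states (look for NODEVICE)
vxdisk list                    # disk media states (online, error, failed)
vxdisk list c1t0d0s2           # per-disk detail for a suspect device
vxdmpadm getsubpaths ctlr=c1   # DMP path states behind one controller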

If a service group fails over to the other node, will engine_A.log show me the faulty resource that was the culprit? If yes, then let's say a logical volume resource went faulty. Should I next investigate /var/adm/messages to see whether the issue is because of the disk or because of the paths to the disk (HBA)?

>> You need to look in engine_A.log at the sequence of events. If a group went offline, you need to see which was the first resource to go offline & then look at possible reasons. If it was a volume resource, you are right: the first step would be to check whether the disks of the volume have failed or not. /var/adm/messages will be the right place to start.
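As a rough illustration of reading that sequence (the resource name vol_app01 is made up):

grep "V-16-" /var/VRTSvcs/log/engine_A.log | less   # engine events in time order (VCS UMI codes)
grep vol_app01 /var/VRTSvcs/log/engine_A.log        # follow one suspect resource
grep -iE "scsi|disk|offline" /var/adm/messages      # disk/HBA errors around the same timestamps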

 

When I run the "hastart" command, it will start only had, hashadow and the agents based on the resources in main.cf, right?

>> Right. Whatever resource types you have used in main.cf, those agents will be started.
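To see which agents the engine actually started, something like this should work (the agent name Mount is just an example):

haagent -list               # agents known to the running cluster
haagent -display Mount      # attributes of one agent
ps -ef | grep VRTSvcs/bin   # agent processes under /opt/VRTSvcs/bin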

hastart will not start LLT and GAB, right?

>> Correct. LLT & GAB sit below VCS. You need to start LLT first, then GAB, then fencing & then VCS (had). hastart will not start LLT/GAB or fencing.
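A sketch of the manual start order, assuming the usual Solaris init-script locations (paths differ per platform and release):

/etc/init.d/llt start     # or: lltconfig -c
/etc/init.d/gab start     # or: gabconfig -c -n <number_of_nodes>
/etc/init.d/vxfen start   # I/O fencing, if configured
hastart                   # starts had, hashadow and the agents
gabconfig -a              # verify: port a (GAB) and port h (had) memberships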

So if LLT and GAB are not started and I issue hastart, then nothing happens?

>> True, nothing will happen.

Could a resource be online at the OS level and offline at the VCS level?

>> No. If a resource is configured in the cluster, even if the cluster is frozen, VCS should detect its status as online (if the resource is configured correctly).
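You can compare the two views directly; for instance, for a hypothetical Mount resource app_mnt on node01:

hares -state app_mnt               # what VCS believes
mount | grep /app                  # what the OS shows for the same mount point
hares -probe app_mnt -sys node01   # force an immediate monitor cycle to resync the state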

What happens when a resource agent disappears while the resource is online?

>> Disappear means offline? If the agent is hung, VCS will not be able to communicate with the agent & hence the resource monitor will time out, which will generate a fault in the cluster.

 

G

tanislavm
Level 6

Hi Gaurav,

Will had poll the resource agents at certain intervals to see the resource status?

You state:

If the agent is hung, VCS will not be able to communicate with the agent & hence the resource monitor will time out, which will generate a fault in the cluster.

 

Please, what kind of fault?

thanks so much.

Gaurav_S
Moderator
VIP Certified

Hi,

"had" polls agents at certain intervals, I am not sure what exactly it is. Resources have individual monitor intervals & timeouts & I am sure agent will have its value .. look at "haagent -display <value>" , not sure though if it is hard coded.

Let's say the Volume agent is hung: all the volume resources in the cluster will fault, & if the volume resources were critical, the entire group will fault. In the same way, if the Mount agent hangs, all the mount resources will fault & thus fail the application / service group, because the mounts usually host the application filesystems.
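Whether a single resource fault pulls down the whole group depends on its Critical attribute; a quick check (the resource and group names are illustrative):

hares -value app_vol Critical   # 1 = group faults / fails over when this resource faults
hagrp -resources app_sg         # list every resource in the group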

 

G


 

tanislavm
Level 6

Hi,

I would like a rough figure for the CPU load at which a service group could fail over. 80% CPU load? More?

Will the resource fault when the corresponding agent hangs because had will not get any updates, or for some other reason?

thanks so much

Gaurav_S
Moderator
VIP Certified

It may depend on the setup; I can't say exactly whether it is 80%. Please note, the VCS HostMonitor agent just reports the number; it is not going to initiate a failover by itself. If CPU utilization is high, let's say 90%, but "had" is still able to communicate with the agents, nothing will happen. What happens in any specific scenario depends on the overall utilization of the server: if the CPU load is 50% but memory has run out completely, that may cause an issue as well. So, bottom line, the percentage may differ from case to case.

I already explained the rest: if an agent is hung, it won't be able to fetch the resource status & update "had". For example, if the Volume agent is hung, it won't be able to update the status of any volume resources to "had", & since "had" receives no updates, the monitor will eventually time out & the resources will fault.

 

G
 

tanislavm
Level 6

Hi,

If the agents hang, then all the resources go offline, right? How do I bring the group back online? Should I kill -9 the agent processes, then hastop -all -force, then hastart (start the whole of VCS on all nodes)?

If the resources are offline, then in the volume case the plexes of all the volumes are in STALE status, right? So now VCS is running and I perform "vxplex -f att" and "vxvol start" on all of them, right?

Or maybe the above is not necessary, because the group will fail over to the other node and everything will be fine.

thanks so much.

Gaurav_S
Moderator
VIP Certified

If only an agent is hung, there is no need to restart the entire cluster ... you can do "haagent -stop" or kill -9 the agent process & then do "haagent -start <agent>".
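A minimal sketch of restarting a single hung agent, assuming the Volume agent on a node called node01 (adjust the names for your cluster):

haagent -stop Volume -sys node01    # ask the engine to stop the agent
ps -ef | grep VRTSvcs/bin/Volume    # if it will not stop, find the agent process
kill -9 <pid>                       # last resort for a truly hung agent process
haagent -start Volume -sys node01   # restart the agent
hastatus -sum                       # confirm resources are being monitored again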

As mentioned many times before, the plex state will depend on the fault; it will not necessarily be in the "DISABLED STALE" state, as there are many other plex states. If VCS is unable to recover the volumes, it will keep the group in a faulted state; you can then fix the plex states manually &, once done, start the group from VCS.

If the group fails over to the other node & VCS is able to recover, none of these steps will be required. However, again, it depends on what the fault is & whether VCS can recover it automatically or not.
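For the manual path, a hedged sketch (the disk group, volume, plex, group and node names are all illustrative):

vxprint -g datadg -ht                        # inspect current volume/plex states
vxrecover -g datadg -s app_vol               # let VxVM recover & start the volume, or:
vxplex -g datadg -f att app_vol app_vol-01   # force-attach a specific plex
vxvol -g datadg start app_vol                # then start the volume
hagrp -clear app_sg -sys node01              # clear the FAULTED flag in VCS
hagrp -online app_sg -sys node01             # bring the group back online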

 

G

Marianne
Moderator
Partner VIP Accredited Certified

Please, my friend...

We cannot answer all your 'what if' questions...

All the 'what if' scenarios and theory are covered in the manuals as well as in the online documentation on SORT.
And in classroom training...

You seem to be confusing Storage Foundation issues with cluster administration.
Understand that failures at disk level will cause VCS to react in the way it has been configured to, just like any other resource type that VCS is monitoring.
Plex states depend on what exactly is wrong at the hardware level (covered in the VxVM documentation).

There should be no reason for VCS agents to be 'hung'. If they are, you may want to log a call with Symantec Support.

Agents can be stopped and started with 'haagent' commands.
See http://sfdoccentral.symantec.com/sf/5.0/hpux/manpages/vcs/haagent_1m.html for command usage.