
Prab
16 years ago

Cluster Failover

Hi all,

A cluster failover happened in my environment yesterday. Below are the relevant entries I pulled from /var/adm/messages.


Apr 29 23:22:32 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(5) Agent is calling clean for resource(nbugrp_master) because the resource became OFFLINE unexpectedly, on its own.

Apr 29 23:22:32 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 (wiggum) Agent is calling clean for resource(nbugrp_master) because the resource became OFFLINE unexpectedly, on its own.

Apr 29 23:25:03 wiggum vmd[3157]: [ID 631293 daemon.notice] terminating - successful (0)

Apr 29 23:25:03 wiggum vmd[3157]: [ID 715111 daemon.error] volume daemon terminating because it received a signal (15)

Apr 29 23:25:03 wiggum vmd[3157]: [ID 164182 daemon.error] terminating - daemon terminated (7)

Apr 29 23:26:20 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13069 Thread(5) Resource(nbugrp_master) - clean failed.

Apr 29 23:26:20 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13069 (wiggum) Resource(nbugrp_master) - clean failed.

Apr 29 23:27:28 wiggum bpjava-msvc[8270]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.

Apr 29 23:28:55 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13078 Thread(5) Resource(nbugrp_master) - clean completed successfully after 1 failed attempts.

Apr 29 23:28:55 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 Thread(5) Resource(nbugrp_master) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

Apr 29 23:28:55 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 (wiggum) Resource(nbugrp_master) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

Apr 29 23:29:05 wiggum vmd[8896]: [ID 617826 daemon.notice] ready for connections

Apr 29 23:30:14 wiggum bpjava-msvc[9439]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.

Apr 30 07:10:56 wiggum bpjava-msvc[29649]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.

The scenario is as follows:

I am running NBU 6.5.1 on Solaris 10 on the master server, with around 300 clients.
The master server is clustered.
At around 23:55 local time the entire cluster failed over, and many jobs were cancelled.



Could someone help me find the root cause of this failure?

I also have nbsu logs, agent debug logs and core stack logs. Please let me know if I need to upload those as well for further assistance.


Regards,
Prab


PS: I have attached some of the logs. Please find the attachments.
  • Check the log located at:

    /usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log

    and look at the entries from the same time as the failure.

    This will tell you which NetBackup process went offline and caused the failure.

    Then, if logs are available for that process, dig further into them to get the root cause.
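To know which times to look for in AGENT_DEBUG.log, it can help to first pull the VCS error timestamps out of /var/adm/messages. Below is a minimal Python sketch. It assumes the syslog line layout shown in the question and that each message sits on a single line, as it would in the raw file. The names `VCS_LINE` and `vcs_events` are made up for illustration and are not part of VCS or NetBackup.

```python
import re

# Matches syslog lines like the ones quoted in the question, e.g.
# "Apr 29 23:22:32 wiggum Had[12362]: [ID ...] VCS ERROR V-16-1-13067 (wiggum) ..."
# Assumption: one message per line (no hard wrapping).
VCS_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]:.*?"
    r"VCS ERROR (?P<code>V-\d+-\d+-\d+)\s+(?P<text>.*)$"
)


def vcs_events(lines):
    """Return (time, process, code, text) for every VCS ERROR line."""
    events = []
    for line in lines:
        match = VCS_LINE.match(line.strip())
        if match:
            events.append(
                (match["time"], match["proc"], match["code"], match["text"])
            )
    return events
```

Calling `vcs_events(open("/var/adm/messages"))` on the master server would give the list of timestamps and message codes to compare against the agent debug log. Lines that are not VCS errors (for example the vmd and bpjava-msvc entries) are skipped.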

  • Hi all,

    I have uploaded the logs. Can somebody give me an update on this?

    Thanks in advance..


    Regards,
    Prab
  • Apr 29 23:22:32 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(5) Agent is calling clean for resource(nbugrp_master) because the resource became OFFLINE unexpectedly, on its own.


    The line above is a very generic error message that the VCS engine logs. The NetBackup agent for VCS monitors a large number of processes.

    If any of these processes crashes, the VCS engine concludes that the resource went offline unexpectedly.

    So first we need to find out which NetBackup binary is actually causing the problem.
    This can be found in /usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log

    This will have a line which says something like "detected following processes offline: ......."

    This indicates the netbackup process that went offline.

    Once you have identified it, check the logs for that process, and bperror, to see if you get a clue.

    For an accurate RCA, I would recommend that you open a call with Symantec.

    Hope this helps you.
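
The search of AGENT_DEBUG.log described above can be sketched in Python as follows. This is only an illustration: the phrase is the one quoted in the reply as "something like", so the exact wording in a real log may differ, and the function name `find_offline_reports` is hypothetical. The match is a case-insensitive substring test so that small differences in capitalization do not hide a hit.

```python
def find_offline_reports(lines, phrase="detected following processes offline"):
    """Return (line_number, text) for each line containing the phrase.

    The default phrase is taken from the reply above and may not match
    the log's exact wording; pass a different one if needed.
    """
    needle = phrase.lower()
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]
```

For example, `find_offline_reports(open("/usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log"))` would list the candidate lines, which can then be compared with the failure time seen in /var/adm/messages.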