Prab
16 years ago
Cluster Failover
Hi all,
A cluster failover happened yesterday in my environment. Below are the relevant entries I pulled from /var/adm/messages.
Apr 29 23:22:32 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(5) Agent is calling clean for resource(nbugrp_master) because the resource became OFFLINE unexpectedly, on its own.
Apr 29 23:22:32 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 (wiggum) Agent is calling clean for resource(nbugrp_master) because the resource became OFFLINE unexpectedly, on its own.
Apr 29 23:25:03 wiggum vmd[3157]: [ID 631293 daemon.notice] terminating - successful (0)
Apr 29 23:25:03 wiggum vmd[3157]: [ID 715111 daemon.error] volume daemon terminating because it received a signal (15)
Apr 29 23:25:03 wiggum vmd[3157]: [ID 164182 daemon.error] terminating - daemon terminated (7)
Apr 29 23:26:20 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13069 Thread(5) Resource(nbugrp_master) - clean failed.
Apr 29 23:26:20 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13069 (wiggum) Resource(nbugrp_master) - clean failed.
Apr 29 23:27:28 wiggum bpjava-msvc[8270]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.
Apr 29 23:28:55 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13078 Thread(5) Resource(nbugrp_master) - clean completed successfully after 1 failed attempts.
Apr 29 23:28:55 wiggum AgentFramework[12382]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 Thread(5) Resource(nbugrp_master) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.
Apr 29 23:28:55 wiggum Had[12362]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 (wiggum) Resource(nbugrp_master) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.
Apr 29 23:29:05 wiggum vmd[8896]: [ID 617826 daemon.notice] ready for connections
Apr 29 23:30:14 wiggum bpjava-msvc[9439]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.
Apr 30 07:10:56 wiggum bpjava-msvc[29649]: [ID 427199 user.error] pam_dial_auth: terminal-device not specifiedby login, returning Error in underlying service module.
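In case it helps anyone reading the excerpt, here is a rough sketch of how the VCS entries could be separated from the other syslog noise to get a timeline of message IDs. This is only an illustration: `VCS_LINE` and `vcs_events` are names I made up (they are not part of VCS or NetBackup), and the pattern assumes each log entry sits on a single line in the format shown above.

```python
import re

# Hypothetical helper, not part of VCS or NetBackup.
# Assumes the Solaris syslog layout seen in the excerpt:
#   "<Mon DD HH:MM:SS> <host> <proc>[<pid>]: [ID ...] VCS <SEVERITY> <V-x-x-x> <text>"
VCS_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<proc>[\w-]+)\[\d+\]:.*?VCS\s+\w+\s+"
    r"(?P<msgid>V-\d+-\d+-\d+)\s+(?P<text>.*)$"
)

def vcs_events(lines):
    """Return (timestamp, process, message ID, text) for each VCS line."""
    events = []
    for line in lines:
        m = VCS_LINE.match(line.strip())
        if m:  # non-VCS lines (vmd, bpjava-msvc, ...) are skipped
            events.append((m["ts"], m["proc"], m["msgid"], m["text"]))
    return events

# Two entries copied from the excerpt above; only the first is a VCS message.
sample = [
    "Apr 29 23:26:20 wiggum Had[12362]: [ID 702911 daemon.notice] "
    "VCS ERROR V-16-1-13069 (wiggum) Resource(nbugrp_master) - clean failed.",
    "Apr 29 23:29:05 wiggum vmd[8896]: [ID 617826 daemon.notice] "
    "ready for connections",
]

for event in vcs_events(sample):
    print(event)
```

To run it against the real file, the lines could be read with something like `open("/var/adm/messages")` and passed to `vcs_events`.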
The scenario is as follows:
I am running NBU 6.5.1 on a Solaris 10 master server with around 300 clients.
The master server is clustered.
At around 23:55 local time the entire cluster failed over, and many jobs were cancelled.
Could someone help me find the root cause of this failure?
I also have nbsu logs, agent debug logs, and core stack logs. Please let me know if I need to upload those as well for further assistance.
Regards,
Prab
PS: I have attached some of the logs. Please find the attachments.