We replaced one of our Solaris servers (swapped the hard drives into the new server) after a hardware failure. When the server came back up, all the applications on the servers in the cluster stopped functioning. The logs on all the servers show that the resource could not be contacted; the agent then attempts to run clean, and this process repeats until the server we brought up is taken offline. I am not sure why this is occurring and could not find any documentation on the steps needed to re-introduce a server to the cluster.

Sep 20 20:17:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(16) Resource(app1) - monitor procedure did not complete within the expected time.
Sep 20 20:17:52 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(app2) - monitor procedure did not complete within the expected time.
Sep 20 20:17:58 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(18) Resource(app3) - monitor procedure did not complete within the expected time.
Sep 20 20:18:02 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(19) Resource(app4) - monitor procedure did not complete within the expected time.
Sep 20 20:18:13 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(20) Resource(app5) - monitor procedure did not complete within the expected time.
Sep 20 20:18:28 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(21) Resource(app6) - monitor procedure did not complete within the expected time.
Sep 20 20:22:17 app_server1 AgentFramework[1105]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(4) Resource(app7) - monitor procedure did not complete within the expected time.
Sep 20 20:23:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13210 Thread(34) Agent is calling clean for resource(app1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(34) Resource(app1) - clean completed successfully.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13074 Thread(34) The monitoring program for resource(app1) has consistently failed to determine the resource


Cluster fails after solaris server is brought online after hardware replacement

17 Replies

mikebounds
Level 6
11 years ago
The steps to re-introduce a server to the cluster are:

Install VCS and agents

Copy the following files from an existing node:
/etc/llthosts /etc/gabtab /etc/llttab /etc/vx/.uuids/clusuuid (and /etc/vxfen* if you use I/O fencing)

Create /etc/VRTSvcs/conf/sysname containing the hostname of the node

Edit /etc/llttab so that set-node is set to either /etc/VRTSvcs/conf/sysname or the node name

Start LLT and GAB on the new node, then check that "lltstat -nvv" shows all heartbeats are connected and "gabconfig -a" shows port a membership

Run "hastart" - this should do a remote build and create the main.cf and types.cf files in /etc/VRTSvcs/conf/config
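
Before starting LLT and GAB, it can help to confirm that the files listed above are actually present on the node. The following is a minimal sketch, assuming a POSIX shell (for example ksh or bash rather than the legacy Solaris /bin/sh). The function name and the optional root prefix argument are hypothetical, added only so the check can be tried against a copy of the filesystem. The /etc/vxfen* files are left out because they only apply when I/O fencing is in use.

```shell
#!/bin/sh
# Sketch: report which of the cluster files named in the steps are missing.
# The optional first argument is a hypothetical root prefix (for example a
# mounted copy of the node's disk); leave it empty to check the live system.

missing_cluster_files() {
    root=${1:-}
    for f in /etc/llthosts /etc/gabtab /etc/llttab \
             /etc/vx/.uuids/clusuuid /etc/VRTSvcs/conf/sysname; do
        # Print the path of every expected file that does not exist.
        [ -f "$root$f" ] || echo "$f"
    done
}
```

Calling missing_cluster_files with no argument prints nothing when every file is in place; any path it prints would need to be copied or created as described in the steps.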

Mike
mike_ohio
Level 3
11 years ago
Thank you for your reply. The hard drives were transferred to the new hardware, so the files still exist. What confuses me is that when the server was brought up (booted), the monitor could not get the status of all resources across all the nodes in the cluster.
mikebounds
Level 6
11 years ago
Sorry, I read your post wrong. I thought you had put new disks into the existing server, but after reading it again, I see you put the old disks in a new server.

This may mean the devices for the network cards have changed, so you need to check the references to network cards in llttab and main.cf.

However, your logs say the monitor timed out for resource app1, which doesn't sound like a NIC. Can you provide details of this resource (hares -display app1)?
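
For the llttab check mentioned above, a small helper can list the link tags and devices that LLT is configured to use, so they can be compared with the interfaces the new hardware exposes. This is a minimal sketch, assuming a POSIX shell and the usual llttab layout where each link line reads "link <tag> <device> ...". The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print "<tag> <device>" for each LLT link configured in llttab.
# Assumes lines of the form "link <tag> <device> ..." (or "link-lowpri ...").
# The file path defaults to /etc/llttab and can be overridden for testing.

llt_links() {
    awk '$1 == "link" || $1 == "link-lowpri" { print $2, $3 }' "${1:-/etc/llttab}"
}
```

The output can then be compared by hand with the interface list on the new server (for example from "ifconfig -a"). Any device that no longer exists would need to be corrected in llttab, and NIC references in main.cf checked in the same way.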

Mike
mike_ohio
Level 3
11 years ago
Sorry, I cannot get the hares output right now. My thought on this is: if the device names changed, why would the monitor service on other members of the cluster have trouble getting the resource status? I would think that this server would just not be able to bring resources online or respond to the cluster.
mikebounds
Level 6
11 years ago
Are there errors when VCS starts on the new server and joins the cluster?

Is the new server the same hardware as the old one?

Mike
mike_ohio
Level 3
11 years ago
The hardware is the same. Because this is a production environment, we had to shut the server down, as it was causing issues. I am planning on booting the server into single-user mode to view the logs and check for any other issues.
mike_ohio
Level 3
11 years ago
If I remove the server from the cluster, what is required to add it back in?
mike_ohio
Level 3
11 years ago
From the messages log on the problem server. The port a messages repeat until the server was shut down:

Sep 20 20:14:17 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 6 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 1 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 1 active
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep is from a system with ID "84ac2f88"
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: saved /etc/default/SUNWsneep as /etc/default/SUNWsneep.84ac2f88
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep successfully (re)initialized
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: cannot use backup file to restore missing values to eeprom
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial not available from system eeprom
Sep 20 20:14:18 lsappp10.itlogon.com last message repeated 1 time
Sep 20 20:14:19 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial is not in backup file
Sep 20 20:14:20 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Warning: cannot use backup file for this recovery
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=287)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=288)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=289)
Sep 20 20:14:23 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 601491 daemon.notice] Starting up daemon
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 627629 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Sep 20 20:14:37 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:43 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:57 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:02 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:16 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:21 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:35 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:40 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:54 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:59 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:13 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:18 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:32 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:33 lsappp10.itlogon.com syslog[1784]: [ID 702911 daemon.notice] VCS INFO V-16-1-11240 Command Server: running with security OFF
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: lsappp10
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@0 (zcons0) online
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Sep 20 20:16:35 lsappp10.itlogon.com gab: [ID 843912 kern.notice] GAB INFO V-15-1-20005 Port h registration waiting for seed port membership
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11035 Waiting for cluster membership
Sep 20 20:16:37 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:50 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-11306 Did not receive cluster membership, manual intervention may be needed for seeding
Sep 20 20:16:51 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:56 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:10 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:15 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:29 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:34 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:48 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-
Sep 20 20:17:53 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:07 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:11 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:26 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:30 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:45 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:49 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:04 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:09 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:24 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:29 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-
mikebounds
Level 6
11 years ago
If you have one node not getting GAB membership, this usually means it can't see the other nodes over LLT. Please provide the following from the problem node:

output from "lltstat -nvv" and "gabconfig -a"

file /etc/gabtab
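
One thing /etc/gabtab shows is the seed count, the N in "gabconfig -c -nN", which can be compared with the number of nodes listed in /etc/llthosts. The following is a minimal sketch, assuming a POSIX shell, a gabtab containing a single gabconfig line, and an llthosts with one "<node-id> <node-name>" pair per line. The function names are hypothetical, and a mismatch is not necessarily an error, only something worth knowing when GAB keeps waiting for seed port membership.

```shell
#!/bin/sh
# Sketch: compare the GAB seed count with the number of nodes in llthosts.
# File paths default to the standard locations and can be overridden.

gab_seed_count() {
    # Print N from "-nN" (or "-n N") on the gabconfig line; nothing if absent.
    sed -n 's/.*gabconfig.*-n[ ]*\([0-9][0-9]*\).*/\1/p' "${1:-/etc/gabtab}" | head -n 1
}

llt_node_count() {
    # Count non-blank, non-comment "<node-id> <node-name>" lines.
    awk 'NF >= 2 && $1 !~ /^#/ { n++ } END { print n + 0 }' "${1:-/etc/llthosts}"
}

check_seed() {
    seed=$(gab_seed_count "${1:-/etc/gabtab}")
    nodes=$(llt_node_count "${2:-/etc/llthosts}")
    echo "seed count: ${seed:-none}, llthosts nodes: $nodes"
    # Succeeds only when the two numbers agree.
    [ "$seed" = "$nodes" ]
}
```

Running check_seed with no arguments on the problem node prints both numbers and returns non-zero when they differ. It does not replace the "lltstat -nvv" and "gabconfig -a" output requested above, which shows the live link and membership state.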

Mike
mike_ohio
Level 3
11 years ago
The node is disconnected from the network because of the issue it caused with the cluster.
