Cluster fails after Solaris server is brought online after hardware replacement

mike_ohio
Level 3

We replaced one of our Solaris servers (swapped the hard drives into the new server) after a hardware failure. When the server came back up, all the applications we have on the servers in the cluster stopped functioning. All the servers' logs show that a resource could not be contacted; the agent then attempts to run clean and repeats this process until the server we brought up is taken offline. I am not sure why this is occurring and could not find any documentation concerning the steps needed to re-introduce a server to the cluster.

 

Sep 20 20:17:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(16) Resource(app1) - monitor procedure did not complete within the expected time.
Sep 20 20:17:52 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(app2) - monitor procedure did not complete within the expected time.
Sep 20 20:17:58 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(18) Resource(app3) - monitor procedure did not complete within the expected time.
Sep 20 20:18:02 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(19) Resource(app4) - monitor procedure did not complete within the expected time.
Sep 20 20:18:13 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(20) Resource(app5) - monitor procedure did not complete within the expected time.
Sep 20 20:18:28 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(21) Resource(app6) - monitor procedure did not complete within the expected time.
Sep 20 20:22:17 app_server1 AgentFramework[1105]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(4) Resource(app7) - monitor procedure did not complete within the expected time.
Sep 20 20:23:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13210 Thread(34) Agent is calling clean for resource(app1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(34) Resource(app1) - clean completed successfully.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13074 Thread(34) The monitoring program for resource(app1) has consistently failed to determine the resource      

17 Replies

mikebounds
Level 6
Partner Accredited

To re-introduce a server to the cluster, the steps are (a rough command sketch follows the list):

  1. Install VCS and agents

  2. Copy the following files from an existing node:
    /etc/llthosts /etc/gabtab /etc/llttab /etc/vx/.uuids/clusuuid (and /etc/vxfen* if you use I/O fencing)

  3. Create /etc/VRTSvcs/conf/sysname containing the hostname of the node

  4. Edit /etc/llttab so that set-node is set to either /etc/VRTSvcs/conf/sysname or the node name

  5. Start LLT and GAB on the new node and check that "lltstat -nvv" shows all heartbeats are connected and "gabconfig -a" shows port a membership

  6. Run "hastart" - this should do a remote build and create the main.cf and types.cf files in /etc/VRTSvcs/conf/config

Mike

 

mike_ohio
Level 3

Thank you for your reply. The hard drives were transferred to the new hardware, so the files still exist. What confuses me is that when the server was brought up (booted), the monitors could not get the status of resources across all the nodes in the cluster.

mikebounds
Level 6
Partner Accredited

Sorry, I read the post wrong - I thought you had put new disks into the existing server, but after reading again, I see you put the old disks into a new server.

This may mean the device names for the network cards have changed, so you need to check the references to the network cards in llttab and main.cf.
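
For example, a quick way to check that (a rough sketch, Solaris 10 style; file locations are the standard VCS ones and the grep patterns are just illustrative):

# what network devices does the new hardware actually present?
dladm show-dev

# which interfaces is LLT configured to heartbeat over?
grep "^link" /etc/llttab

# which NIC devices does the cluster configuration reference?
egrep "Device|NIC" /etc/VRTSvcs/conf/config/main.cf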

However, your logs say the monitor timed out for resource app1, which doesn't sound like a NIC issue - can you provide details of this resource (hares -display app1)?

Mike

mike_ohio
Level 3

Sorry, I cannot get the hares output right now. My thought on this is: if the device names changed, why would the monitor service on the other members of the cluster have trouble getting the resource status? I would think that this server would just be unable to bring resources online or respond to the cluster.

mikebounds
Level 6
Partner Accredited

Are there errors when VCS starts on the new server and joins the cluster?

Is the new server the same hardware as the old one?

Mike

mike_ohio
Level 3

The hardware is the same. Since this is a production environment, we had to shut the server down because it was causing issues. I am planning on booting the server into single-user mode to view the logs and check for any other issues.

mike_ohio
Level 3

If I remove the server from the cluster, what is required to add it back in?

mike_ohio
Level 3

From the messages log on the problem server - the port a messages repeat until the server was shut down:

 

Sep 20 20:14:17 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 6 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 1 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 1 active
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep is from a system with ID "84ac2f88"
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: saved /etc/default/SUNWsneep as /etc/default/SUNWsneep.84ac2f88
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep successfully (re)initialized
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: cannot use backup file to restore missing values to eeprom
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial not available from system eeprom
Sep 20 20:14:18 lsappp10.itlogon.com last message repeated 1 time
Sep 20 20:14:19 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial is not in backup file
Sep 20 20:14:20 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Warning: cannot use backup file for this recovery
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=287)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=288)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=289)
Sep 20 20:14:23 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 601491 daemon.notice] Starting up daemon
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 627629 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Sep 20 20:14:37 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:43 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:57 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:02 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:16 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:21 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:35 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:40 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:54 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:59 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:13 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:18 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:32 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:33 lsappp10.itlogon.com syslog[1784]: [ID 702911 daemon.notice] VCS INFO V-16-1-11240 Command Server: running with security OFF
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: lsappp10
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@0 (zcons0) online
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Sep 20 20:16:35 lsappp10.itlogon.com gab: [ID 843912 kern.notice] GAB INFO V-15-1-20005 Port h registration waiting for seed port membership
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11035 Waiting for cluster membership
Sep 20 20:16:37 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:50 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-11306 Did not receive cluster membership, manual intervention may be needed for seeding
Sep 20 20:16:51 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:56 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:10 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:15 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:29 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:34 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:48 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-
Sep 20 20:17:53 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:07 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:11 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:26 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:30 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:45 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:49 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:04 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:09 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:24 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:29 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-
 

mikebounds
Level 6
Partner Accredited

If you have one node that is not getting GAB membership, this usually means it can't see the other nodes over LLT - please provide the following from the problem node:

output from "lltstat -nvv" and "gabconfig -a"

file /etc/gabtab

Mike

mike_ohio
Level 3

The node is disconnected from the network because of the issue it caused with the cluster.

mike_ohio
Level 3
lltstat -nvv from the problem host
   * 7 lsappp10          OPEN
                                  nxge2   UP      00:14:4F:DD:60:A8
                                  e1000g2   UP      00:14:4F:D4:D3:40
root@lsappp10.itlogon.com # cat /etc/gabtab
/sbin/gabconfig -c -n8
root@lsappp10.itlogon.com # /sbin/gabconfig -c -n8
root@lsappp10.itlogon.com # cat /etc/llttab
set-node lsappp10
set-cluster 2
link nxge2 /dev/nxge:2 - ether - -
link e1000g2 /dev/e1000g:2 - ether - -

 

mikebounds
Level 6
Partner Accredited

If you have disconnected heartbeats, so that LLT node IDs 0-6 are showing DOWN and only LLT node ID 7 (itself) is showing as UP, then that is why GAB is not seeding, and hence the "port a" messages.
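
A quick way to confirm it is the heartbeat links rather than GAB itself (a sketch; the nxge2/e1000g2 device names are taken from the llttab shown above):

# what LLT sees on each heartbeat link from the problem node
lltstat -nvv

# the private interfaces LLT is configured to use
cat /etc/llttab

# do those devices still exist under the same names on the new hardware?
dladm show-dev | egrep "nxge2|e1000g2"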

Mike

mike_ohio
Level 3

It is only showing node 7 on this server because it cannot see the other servers, as the network is not up on this node. The other cluster node shows all members:
 

root@lsappp09 # lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
     0 lsappp01          OPEN
                                  e1000g3   UP      00:14:4F:24:EF:C3
                                  nxge0   DOWN
     1 lsappp02          OPEN
                                  e1000g3   UP      00:03:BA:B4:5E:73
                                  nxge0   DOWN
     2 lsappp03          OPEN
                                  e1000g3   UP      00:03:BA:B4:60:27
                                  nxge0   DOWN
     3 lsappp04          OPEN
                                  e1000g3   UP      00:03:BA:B1:B7:07
                                  nxge0   DOWN
     4 lsappp07          CONNWAIT
                                  e1000g3   DOWN
                                  nxge0   DOWN
     5 lsappp08          OPEN
                                  e1000g3   UP      00:03:BA:B2:1C:FB
                                  nxge0   DOWN
   * 6 lsappp09          OPEN
                                  e1000g3   UP      00:14:4F:D4:09:CF
                                  nxge0   UP      00:14:4F:DD:68:26
     7 lsappp10          CONNWAIT
                                  e1000g3   DOWN
                                  nxge0   DOWN
 

mike_ohio
Level 3

From another node in the cluster

 

root@lsappp04 # lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
     0 lsappp01          OPEN
                                  ce3   UP      00:14:4F:24:EF:C3
                                  ce7   UP      00:03:BA:B1:B2:1F
     1 lsappp02          OPEN
                                  ce3   UP      00:03:BA:B4:5E:73
                                  ce7   UP      00:03:BA:B1:6E:2F
     2 lsappp03          OPEN
                                  ce3   UP      00:03:BA:B4:60:27
                                  ce7   UP      00:03:BA:B1:6A:2B
   * 3 lsappp04          OPEN
                                  ce3   UP      00:03:BA:B1:B7:07
                                  ce7   UP      00:03:BA:B4:60:13
     4 lsappp07          CONNWAIT
                                  ce3   DOWN
                                  ce7   DOWN
     5 lsappp08          OPEN
                                  ce3   UP      00:03:BA:B2:1C:FB
                                  ce7   UP      00:03:BA:B1:E3:17
     6 lsappp09          OPEN
                                  ce3   UP      00:14:4F:D4:09:CF
                                  ce7   DOWN
     7 lsappp10          CONNWAIT
                                  ce3   DOWN
                                  ce7   DOWN
 
 
from main.cf
root@lsappp10.itlogon.com # cat main.cf|grep lsappp09
system lsappp09 (
        SystemList = { lsappp09 = 0 }
        SystemList = { lsappp07 = 2, lsappp09 = 0, lsappp10 = 1 }
        AutoStartList = { lsappp09 }
        SystemList = { lsappp09 = 1 }
                 lsappp09 = 6 }
                 lsappp09,
                Device @lsappp09 = { e1000g0 = 0, e1000g4 = 0 }
        SystemList = { lsappp09 = 0, lsappp10 = 1 }
 
system lsappp10 (
        SystemList = { lsappp10 = 0 }
        SystemList = { lsappp07 = 2, lsappp09 = 0, lsappp10 = 1 }
        SystemList = { lsappp10 = 0 }
                 lsappp10 = 7,
                 lsappp10 }
                Device @lsappp10 = { e1000g0 = 0, nxge1 = 0 }
        SystemList = { lsappp09 = 0, lsappp10 = 1 }
        SystemList = { lsappp10 = 0 }
 

 

mike_ohio
Level 3

This config does not look right to me, nor does the fact that another node reports a heartbeat issue on lsappp09.

g_lee
Level 6

In addition to the problem on lsappp09 mentioned by Mike, note that more than one node (04 & 09) sees lsappp07 down.

As gabtab is set to seed with 8 nodes, if 07 is also down, lsappp10 won't seed unless you run gabconfig -c -n<number-of-nodes-up>, or -cx to seed regardless of how many nodes are up/down.
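
A hedged sketch of those two options (the node count below is illustrative - only seed manually once you are certain the nodes missing from the count really are down, otherwise you risk a split-brain if they are actually up but unreachable):

# option 1: tell GAB how many nodes to expect before seeding
gabconfig -c -n6        # e.g. 6, if lsappp07 and lsappp10 both stay out of the count

# option 2: force GAB to seed regardless of how many nodes are up
gabconfig -cx

# then confirm that port a membership forms
gabconfig -a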

mike_ohio
Level 3

This has been resolved. I worked with Max from support (who did a great job, thanks), and mikebounds, you were right about the GAB not seeding issue.

 

Output from gabconfig -a 

GAB Port Memberships 
=============================================================== 
Port h gen bf3437 membership 0123 56 
Port h gen bf3437 jeopardy ; 6 
Port h gen bf3437 visible ; 7

 

Note that there is no port a status. So only HAD was showing, but this was probably not updating, since node 7, the server that was down, was still showing as visible. So GAB was effectively stuck. The hardware replacement had nothing to do with this situation; it was just a coincidence.
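
Reading that output against the lltstat listings earlier in the thread (my interpretation, so treat it as a guide rather than gospel):

# Port h gen bf3437 membership 0123 56   -> the digits are node IDs: 0,1,2,3,5,6 are members;
#                                           node 4 (lsappp07) is absent from the membership
# Port h gen bf3437 jeopardy ; 6         -> node 6 (lsappp09) is in jeopardy, i.e. down to a single
#                                           heartbeat link, matching the DOWN links in lltstat above
# Port h gen bf3437 visible ; 7          -> node 7 (lsappp10) is visible but not a member
# and there is no port a line at all, which is the GAB seeding problem described above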

 

To resolve this, we needed to restart the cluster services across the whole cluster.

 

First, turn off HAD

hastop -all -force

 

Then on each of the nodes run

gabconfig -U (unloads GAB)

lltconfig -U (unloads LLT)

 

Then reload LLT, GAB and HAD. On the first server, GAB needs to be started specially because no other nodes are running yet. So on the first node do:

lltconfig -c

gabconfig -cx

Check the status reported by gabconfig -a to see the port a status.

Then, on all the other servers in the cluster, do the following:

lltconfig -c

sh /etc/gabtab

gabconfig -a (make sure port a shows the node)

Once each server is reporting correctly in gabconfig -a, start HAD on each server:

hastart

Then check gabconfig -a for the port h status.
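
Pulling that together as one rough sketch (same commands as the steps above; "first node" just means whichever node you bring the stack up on first):

# 1. stop HAD cluster-wide but leave the applications running (run once, from any node)
hastop -all -force

# 2. on every node, unload GAB and then LLT
gabconfig -U
lltconfig -U

# 3. on the first node, start LLT and force-seed GAB
lltconfig -c
gabconfig -cx
gabconfig -a        # check that port a membership appears

# 4. on each of the other nodes, start LLT and GAB normally and verify
lltconfig -c
sh /etc/gabtab
gabconfig -a        # port a should now include this node

# 5. once every node reports correctly, start HAD on each node and check port h
hastart
gabconfig -a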