09-21-2013 06:12 PM
We replaced one of our Solaris servers (swapped the hard drives into the new server) after a hardware failure. When the server came back up, all the applications we have on the servers in the cluster stopped functioning. All the servers' logs show that the resource could not be contacted; the cluster then attempts to run a clean and repeats this process until the server we brought up is taken offline. I am not sure why this is occurring and could not find any documentation on the steps needed to re-introduce a server to the cluster.
09-22-2013 02:12 AM
To re-introduce a server to the cluster, the steps are:
Mike
09-22-2013 05:37 AM
Thank you for your reply. The hard drives were transferred to the new hardware, so the files still exist. What confuses me is that when the server was brought up (booted), the monitor could not get the status of all resources across all the nodes in the cluster.
09-22-2013 06:14 AM
Sorry, I read the post wrong; I thought you had put new disks into the existing server, but after reading again I see you put the old disks into a new server.
This may mean the device names for the network cards have changed, so you need to check the references to network cards in /etc/llttab and main.cf.
However, your logs say the monitor timed out for resource app1, which doesn't sound like a NIC issue - can you provide details of this resource (hares -display app1)?
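If it helps, here is a rough sketch of how you could grep for NIC device references after a hardware swap. The paths are the usual VCS defaults and the device-name patterns (e1000g, bge, nxge) are just common Solaris examples - adjust for your environment:

```shell
#!/bin/sh
# Sketch only: look for NIC device references that may no longer match
# the replacement server's hardware. Paths are standard VCS defaults;
# the grep patterns are assumed common Solaris NIC names.
check_nic_refs() {
  for f in /etc/llttab /etc/VRTSvcs/conf/config/main.cf; do
    if [ -f "$f" ]; then
      echo "== $f =="
      # llttab "link" lines and main.cf "Device" attributes name the NICs
      grep -E 'link|Device|e1000g|bge|nxge' "$f"
    else
      echo "missing: $f"
    fi
  done
}
check_nic_refs
```

Compare the device names found against what the new server actually presents (e.g. via ifconfig -a).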
Mike
09-22-2013 07:51 AM
Sorry, I cannot get the hares output right now. My thought on this is: if the device names changed, why would the monitor service on the other members of the cluster have trouble getting the resource status? I would think that this server would simply be unable to bring resources online or respond to the cluster.
09-22-2013 08:19 AM
Are there errors when VCS starts on the new server and joins the cluster?
Is the new server the same hardware as the old one?
Mike
09-22-2013 09:49 AM
The hardware is the same. Since this is a production environment, we had to shut the server down, as it was causing issues. I am planning to boot the server into single-user mode to view the logs and check for any other issues.
09-24-2013 10:58 AM
If I remove the server from the cluster, what is required to add it back in?
09-24-2013 11:40 AM
From the messages log on the problem server, the "port a" messages repeated until the server was shut down.
09-24-2013 12:28 PM
If you have only one node in the GAB membership, this usually means it can't see the other nodes over LLT - please provide the following from the problem node:
output from "lltstat -nvv" and "gabconfig -a"
file /etc/gabtab
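A small collection script along these lines can gather everything in one pass - a sketch only, assuming the VCS binaries are on root's PATH (the output file name is my own choice):

```shell
#!/bin/sh
# Sketch: collect LLT/GAB diagnostics from the problem node into one file.
# Assumes lltstat/gabconfig are on PATH; errors are captured in the file too.
OUT=/var/tmp/vcs_diag.$$
{
  echo "== lltstat -nvv =="
  lltstat -nvv 2>&1          # per-link LLT heartbeat state for every node
  echo "== gabconfig -a =="
  gabconfig -a 2>&1          # current GAB port memberships
  echo "== /etc/gabtab =="
  cat /etc/gabtab 2>&1       # seeding configuration (-n<count>)
} > "$OUT"
echo "Diagnostics written to $OUT"
```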
Mike
09-24-2013 12:57 PM
The node is disconnected from the network because of the issue it caused with the cluster.
09-24-2013 01:12 PM
09-24-2013 02:23 PM
If the heartbeats are disconnected, so that LLT node IDs 0-6 are showing DOWN and only LLT node ID 7 (the node itself) is showing as UP, then that is why GAB is not seeding, and hence the "port a" messages.
Mike
09-24-2013 02:33 PM
It is only showing 7 on this server because it cannot see the other servers, as the network is not up on this node. The other cluster node shows all members.
09-24-2013 03:12 PM
From another node in the cluster
09-24-2013 03:13 PM
This config does not look right to me, nor does the fact that another node reports a heartbeat issue on lsappp09.
09-25-2013 03:21 AM
In addition to the problem on lsappp09 mentioned by Mike, note that more than one node (04 and 09) sees lsappp07 as down.
As gabtab is set to seed with 8 nodes, if 07 is also down, lsappp10 won't seed unless you run "gabconfig -c -n<number-of-nodes-up>" or "gabconfig -cx" to seed regardless of how many nodes are up or down.
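To illustrate the two seeding options (the commands are only echoed here, and the node counts are taken from this thread's example, so adjust them for your own cluster):

```shell
#!/bin/sh
# Seeding sketch. EXPECTED/DOWN values are illustrative, based on this
# thread's 8-node cluster with lsappp07 and the rebuilt node both down.
EXPECTED=8                  # nodes gabtab normally waits for
DOWN=2                      # nodes currently unreachable
UP=$((EXPECTED - DOWN))

# Option 1: seed with only the nodes that are actually up
echo "gabconfig -c -n${UP}"

# Option 2: force seeding regardless of how many nodes are up.
# Only do this if you are certain no other partition of the cluster
# is already running, otherwise you risk split-brain.
echo "gabconfig -cx"
```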
09-27-2013 08:40 PM
This has been resolved. I worked with Max from support (who did a great job, thanks), and mikebounds, you were right about GAB not seeding.
Output from gabconfig -a
GAB Port Memberships
===============================================================
Port h gen bf3437 membership 0123 56
Port h gen bf3437 jeopardy ; 6
Port h gen bf3437 visible ; 7
Note there is no port a status. Only HAD (port h) was showing, but this was probably not updating, since node 7, the server that was down, was still showing as visible. So GAB was effectively stuck. The hardware replacement had nothing to do with this situation; it was just a coincidence.
To resolve this we needed to restart the cluster services across the whole cluster.
First, stop HAD on all nodes:
hastop -all -force
Then on each of the nodes run:
gabconfig -U (unconfigures GAB)
lltconfig -U (unconfigures LLT)
Then reload LLT, GAB and HAD. On the first server, GAB needs to be started specially because no other nodes are running yet, so on the first node do:
lltconfig -c
gabconfig -cx
Check the status reported by gabconfig -a to see the port a status.
Then on all the other servers in the cluster do:
lltconfig -c
sh /etc/gabtab
gabconfig -a (make sure port a shows the node)
Once each server is reporting correctly in gabconfig, start HAD on each server:
hastart
Then check gabconfig -a for the port h status.
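For reference, the whole sequence collected into one dry-run sketch. The run wrapper only prints each command, so nothing here touches a live cluster; swap the wrapper body for "$@" to actually execute, and note which steps run on which nodes:

```shell
#!/bin/sh
# Whole-cluster LLT/GAB/HAD restart, as described above (dry-run sketch).
run() { echo "+ $*"; }    # prints commands only; use "$@" to really execute

# 1. Stop HAD everywhere, leaving applications running
run hastop -all -force

# 2. On EACH node: unconfigure GAB, then LLT
run gabconfig -U
run lltconfig -U

# 3. On the FIRST node only: start LLT and force-seed GAB
run lltconfig -c
run gabconfig -cx         # seeds even though no other node is up yet

# 4. On every OTHER node: start LLT and GAB normally
run lltconfig -c
run sh /etc/gabtab        # runs the configured "gabconfig -c -n<count>"
run gabconfig -a          # verify port a shows the node

# 5. Once gabconfig -a looks right on all nodes, start HAD on each
run hastart
```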