
mike_ohio
11 years ago

Cluster fails after Solaris server is brought online after hardware replacement

We replaced one of our Solaris servers (swapped the hard drives into the new server) after a hardware failure. When the server came back up, all the applications we have on the servers in the cluster stopped functioning. The logs on every server show that the resource could not be contacted; the agent then attempts to run clean, and this cycle repeats until the server we brought up is taken offline. I am not sure why this is occurring, and I could not find any documentation on the steps needed to re-introduce a server to the cluster.

 

Sep 20 20:17:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(16) Resource(app1) - monitor procedure did not complete within the expected time.
Sep 20 20:17:52 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(app2) - monitor procedure did not complete within the expected time.
Sep 20 20:17:58 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(18) Resource(app3) - monitor procedure did not complete within the expected time.
Sep 20 20:18:02 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(19) Resource(app4) - monitor procedure did not complete within the expected time.
Sep 20 20:18:13 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(20) Resource(app5) - monitor procedure did not complete within the expected time.
Sep 20 20:18:28 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(21) Resource(app6) - monitor procedure did not complete within the expected time.
Sep 20 20:22:17 app_server1 AgentFramework[1105]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(4) Resource(app7) - monitor procedure did not complete within the expected time.
Sep 20 20:23:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13210 Thread(34) Agent is calling clean for resource(app1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(34) Resource(app1) - clean completed successfully.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13074 Thread(34) The monitoring program for resource(app1) has consistently failed to determine the resource      
  • The steps to re-introduce a server to the cluster are:

    1. Install VCS and agents
       
    2. Copy the following files from an existing node:
      /etc/llthosts /etc/gabtab /etc/llttab /etc/vx/.uuids/clusuuid (and /etc/vxfen* if you use I/O fencing)
       
    3. Create /etc/VRTSvcs/conf/sysname containing the hostname of the node
       
    4. Edit /etc/llttab so that set-node is set to either /etc/VRTSvcs/conf/sysname or the node name
       
    5. Start LLT and GAB on the new node, then check that "lltstat -nvv" shows all heartbeats connected and that "gabconfig -a" shows port a membership
       
    6. Run "hastart" - this should do a remote build and create main.cf and types.cf files in /etc/VRTSvcs/conf/config
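
Steps 3 and 4 above could be scripted roughly as follows. This is only a minimal sketch, not an official Veritas procedure: the function names and the ROOT staging prefix are made up for illustration, and it assumes /etc/llttab already contains a set-node line copied over from the existing node in step 2.

```shell
#!/bin/sh
# ROOT is a hypothetical staging prefix (empty means the real filesystem),
# so the edits can be tried on a copy before touching the live /etc.
ROOT="${ROOT:-}"

# Step 3: create the sysname file holding this node's name.
# $1 = node name as it should appear in the cluster.
set_sysname() {
    mkdir -p "$ROOT/etc/VRTSvcs/conf" &&
    printf '%s\n' "$1" > "$ROOT/etc/VRTSvcs/conf/sysname"
}

# Step 4: make the set-node line in llttab refer to the sysname file.
# Every other line is passed through unchanged.
point_set_node_at_sysname() {
    llttab="$ROOT/etc/llttab"
    awk '$1 == "set-node" { print "set-node /etc/VRTSvcs/conf/sysname"; next }
         { print }' "$llttab" > "$llttab.new" &&
    mv "$llttab.new" "$llttab"
}
```

For example, running set_sysname app_server1 and then point_set_node_at_sysname as root on the replaced node would cover both steps. The "lltstat -nvv" and "gabconfig -a" checks in step 5 still have to be done by hand afterwards.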

    Mike

     

  • If you have disconnected heartbeats, so that LLT node IDs 0-6 are showing as down and only LLT node ID 7 (the node itself) is showing as UP, then that is why GAB is not seeding, and hence why you are seeing the "port a" messages.

    Mike
