
Faulted on node2

solom
Level 4

Hi

 

I get "Faulted on node2". This message appears when I turn off node1 to test whether failover works: my service groups and resources do not fail over to node2.

(The service groups' state is online, and the SystemList for the service groups includes both nodes.)

Please help.

 

Regards



mikebounds
Level 6
Partner Accredited

Please provide the hastatus -sum output from before you run the test, plus extracts from engine_A.log and main.cf.

Mike

mikebounds
Level 6
Partner Accredited

Please provide the logs, as these should say what the issue is.  One probable cause is that you only have one heartbeat working, as this will stop failover, so please also provide the output of "lltstat -nvv".  Resources may also not be probed, so please provide the output of "hastatus -sum" as well.

As an aside, I would recommend setting AutoStartList for your service groups so that they start on a cold cluster start.
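For example, a minimal sketch of setting this attribute from the command line — the group name "mygrp" and the system names "node1"/"node2" here are placeholders, so substitute your actual names:

```shell
haconf -makerw
# Start this group automatically on a cold cluster start, preferring node1.
hagrp -modify mygrp AutoStartList node1 node2
haconf -dump -makero
```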

Mike

arangari
Level 5

I think the ClusterService group will still fail over even if we have only one LLT link active, as a fault of this group is special and handled differently from other groups.

mikebounds
Level 6
Partner Accredited

Solom,

If, when you turn off node1, the ClusterService group fails over to node2 but the other groups do not, and if Amit is right that the ClusterService group fails over even with only one heartbeat (which makes sense, as the ClusterService group has no storage), then your issue is almost certainly the probable cause I mentioned in my last post: only one heartbeat is working.  So please provide the output of "lltstat -nvv" as requested earlier.

Mike

arangari
Level 5

The engine logs can also be useful.

solom
Level 4
Where is the directory for the logs? Is it in /etc?

mikebounds
Level 6
Partner Accredited

Logs are in /var/VRTSvcs/logs, but the output of "lltstat -nvv" may show that the issue is that only one heartbeat is working.

Mike

solom
Level 4

[root@tcpri-clu1 ~]# lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 tcpri-clu1        OPEN    
                                  eth4   UP      3C:4A:92:EF:8B:7B
                                  eth5   UP      3C:4A:92:EF:8B:7F
     1 tcpri-clu2        OPEN    
                                  eth4   UP      3C:4A:92:EF:7B:73
                                  eth5   UP      3C:4A:92:EF:7B:77
     2                   CONNWAIT
                                  eth4   DOWN    
                                  eth5   DOWN    
    [nodes 3-63 omitted: all CONNWAIT with eth4 and eth5 DOWN]

mikebounds
Level 6
Partner Accredited

How are you turning off your node? From the logs I see:

2013/06/17 21:52:47 VCS NOTICE V-16-1-10322 System tcpri-clu1 (Node '0') changed state from RUNNING to LEAVING

as opposed to "changed state from RUNNING to FAULTED", so "LEAVING" means VCS is being shut down, rather than the node being "killed".  If the RC scripts are being called, then historically the RC scripts ran "hastop -evacuate", which means service groups would be failed over; but since at least 5.1 the RC scripts run "hastop -sysoffline", and I am not sure whether this fails over service groups, as the "-sysoffline" flag is undocumented.

The best way to test a node failing is to power down the system boards if the server is part of a logical domain, or to flick the power switch.  The best O/S command I got to work for this test was "uadmin 2 0", and even this was sometimes not quick enough in bringing down the server: VCS knew the command was run, so it saw the shutdown as administrative and didn't fail over the groups.
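Since the nodes in this thread run Linux (Red Hat 6.3), a comparable hard kill — an assumption on my part, not something tested on this cluster — is the magic-sysrq crash trigger, which panics the kernel immediately so the surviving node sees a genuine failure rather than an administrative shutdown.  Destructive, so test systems only:

```shell
# WARNING: this crashes the node immediately (kernel panic) - test systems only.
# It bypasses the RC scripts, so VCS on the surviving node sees a real failure.
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo c > /proc/sysrq-trigger      # trigger an immediate crash
```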

Mike

 

arangari
Level 5

Mike,

The 'hastop -sysoffline' command is the same as 'hastop -local -evacuate -noautodisable', except that it is not interactive. In VCS 5.1 we made the 'hastop' command ask for confirmation, but we still wanted to handle the reboot case, hence this special flag was introduced.

Thus 'hastop -sysoffline' will also evacuate the online groups to another node where possible. However, it is possible that, due to dependencies etc., the evacuation may not happen.

 

Solom: The best way to provide these logs/main.cf is to attach them, as then they are better handled. I will look into the logs and then respond.

 

solom
Level 4

Please check attached file

mikebounds
Level 6
Partner Accredited

Thanks Amit.  Do you know why "-sysoffline" is not documented in the man pages or the VCS user guide?  Also, do any of the VCS guides tell you that service groups will be evacuated by the O/S shutdown scripts?  I couldn't find this info in the VCS user or install guide.

Mike

arangari
Level 5

The 'hastop -sysoffline' command is not expected to be used outside the shutdown RC script, which is the reason it was not documented.

At reboot, it is known that, while stopping VCS in the reboot sequence, the online SGs will be evacuated if appropriate. This should be in the documentation, as the behaviour has been present since at least 4.1.

I will recheck the documentation.

 

solom
Level 4

Sorry for the cutoff; I was assigned temporarily to a different project.

My problem is still there, and I just realized something that could be its source. Previously, when I was running Veritas 5.1 on Red Hat 5.6, the disk groups were defined on all hosts of the cluster. I could, at the time, manually deport and import disk groups between servers in the cluster.

Now I can only see the disk groups on the primary node of the cluster, even though the LUNs are visible from all nodes. So I have a feeling that my problem lies in the original setup of the disk groups or the cluster itself. I tried to delete the disk groups and create them again, but the option to share the disk group between hosts is greyed out.

 

Thoughts??

solom
Level 4

Two quick follow-ups:

1. I managed to do a manual deport and import between the nodes in the cluster using vxdiskadm, but not through VOM.

2. I ran a fire drill on the ClusterService group and it executed fine, but it failed on the service groups for the mounts.

solom
Level 4

The engine logs are also attached.

solom
Level 4

I am working in VOM. The ClusterService group is online on node1, and when I turn off node1 it comes online on node2. The problem is that the service groups and resources I created can't fail over to node2, although the service group type is failover. My nodes run Red Hat 6.3.

 

main.cf attached.

mikebounds
Level 6
Partner Accredited

I have looked at the logs and main.cf more closely now and I can see your issue.  You have nested mounts:

 

        MountPoint = "/trakpri/meam/live/tc"
        MountPoint = "/trakpri/meam/live/tc/jrn/alt"
        MountPoint = "/trakpri/meam/live/tc/jrn/pri"
        MountPoint = "/trakpri/meam/live/tc/wij"
 
But you have not defined the right dependencies in VCS.  You should have the dependencies:
 
  live-tc-jrn-alt requires trakpri-meam-live-tc
  live-tc-jrn-pri requires trakpri-meam-live-tc
  trakpri-meam-live-tc-wij requires trakpri-meam-live-tc
 
 
 
i.e. VCS needs to mount "/trakpri/meam/live/tc" first and then the others; when VCS is offlining, it needs to umount all subdirectory mounts of "/trakpri/meam/live/tc" (I call these submounts) first, and then umount "/trakpri/meam/live/tc" itself.  As you have not defined these dependencies, VCS will try to online and offline them all at the same time, and this will cause problems most of the time.
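A sketch of adding these links from the command line — assuming the resource names are exactly as listed above from main.cf (in "hares -link" the first argument is the parent, i.e. the resource that requires the other):

```shell
haconf -makerw
# Parent requires child: each submount resource depends on the base mount.
hares -link live-tc-jrn-alt          trakpri-meam-live-tc
hares -link live-tc-jrn-pri          trakpri-meam-live-tc
hares -link trakpri-meam-live-tc-wij trakpri-meam-live-tc
haconf -dump -makero
```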
 
If the directories are created correctly, then the directory "/trakpri/meam/live/tc" should be empty (you shouldn't have files or directories in a mount point), and the filesystem that is mounted at "/trakpri/meam/live/tc" should contain the directories:
/wij
/jrn/alt
/jrn/pri
 
so that the sub-mounts can mount.  This means the fire drill may fail for the mounts, as it won't see the mount points for the sub-mounts, and that is fine: if it fails, the fire drill is just written badly, as it should check for nested mounts and not expect to find mount points for the sub-mounts.
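As an illustration of the required ordering (not of VCS itself): for these particular mount points, sorting lexically puts each parent before its submounts, because a submount's path extends its parent's path, so the sorted list is a safe mount order and the reverse is a safe umount order:

```shell
# Illustration only: lexical sort gives a safe order for nested mounts,
# because each submount's path is an extension of its parent's path.
mounts="/trakpri/meam/live/tc
/trakpri/meam/live/tc/jrn/alt
/trakpri/meam/live/tc/jrn/pri
/trakpri/meam/live/tc/wij"

printf '%s\n' "$mounts" | sort     # online (mount) order: parent first
printf '%s\n' "$mounts" | sort -r  # offline (umount) order: submounts first
```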
 
Also, I don't know why you are making dependencies like "ip requires trakpri-meam-live-tc".  An IP doesn't need a mount; they are independent.  I have seen this done the other way round ("trakpri-meam-live-tc requires ip"), and that is a good idea: if the IP fails to online because the service group is already online on the other system (but VCS can't see this, as VCS is not running on the other system or the heartbeats are down), then the mounts won't online because they depend on the IP, and this can prevent mounts onlining on both nodes at the same time, which can cause corruption.  When you add your app to the service group and it needs the IP, just make the app dependent on the mounts and the IP, and remove the dependencies of the IP needing the mounts.
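A sketch of reversing that link, again assuming the resource names "ip" and "trakpri-meam-live-tc" as they appear in main.cf ("hares -unlink" takes the same parent-then-child order as "hares -link"):

```shell
haconf -makerw
hares -unlink ip trakpri-meam-live-tc   # drop "ip requires mount"
hares -link   trakpri-meam-live-tc ip   # add "mount requires ip" instead
haconf -dump -makero
```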
 
Mike

 

solom
Level 4

Thank you very much for the thorough answer; I really didn't think about it this way. I will give it a shot and get back to you.

 

Solom