
Faulted on node2

solom
Level 4

Hi

 

I get "Faulted on node2". This message appears when I turn off node1 to test whether failover works: my service groups and resources do not fail over to node2.

(The service groups' state is online, and the SystemList for the service groups includes both nodes.)

Please help.

 

Regards



mikebounds
Level 6
Partner Accredited

Please provide the hastatus -sum output from before you run the test, plus extracts from engine_A.log and main.cf.

Mike

mikebounds
Level 6
Partner Accredited

Please provide the logs, as these should say what the issue is.  One probable cause is that you only have one heartbeat working, as this will stop failover, so please also provide the output of "lltstat -nvv".  Resources may also not be probed, so please provide the output of "hastatus -sum" as well.

As an aside, I would recommend setting AutoStartList for your service groups so that they start on a cold cluster start.
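For example, a minimal sketch of setting this attribute from the command line — the group name "mygrp" and the system names "node1"/"node2" here are placeholders, so substitute your actual names:

```shell
haconf -makerw
# Start this group automatically on a cold cluster start, preferring node1.
hagrp -modify mygrp AutoStartList node1 node2
haconf -dump -makero
```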

Mike

arangari
Level 5

I think the ClusterService group will still fail over even if we have only one LLT link active, as a fault of this group is special and handled differently from other groups.

mikebounds
Level 6
Partner Accredited

Solom,

If, when you turn off node1, the ClusterService group fails over to node2 but the other groups do not, and if Amit is right that the ClusterService group fails over even with only one heartbeat (which makes sense, as the ClusterService group has no storage), then your issue is almost certainly the probable cause I mentioned in my last post: only one heartbeat is working.  So please provide the output of "lltstat -nvv" as requested earlier.

Mike

arangari
Level 5

The engine logs can also be useful.

solom
Level 4
Where is the directory for the logs? Is it in /etc?

mikebounds
Level 6
Partner Accredited

Logs are in /var/VRTSvcs/logs, but the output of "lltstat -nvv" may show that the issue is that only one heartbeat is working.

Mike

solom
Level 4

[root@tcpri-clu1 ~]# lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 tcpri-clu1        OPEN    
                                  eth4   UP      3C:4A:92:EF:8B:7B
                                  eth5   UP      3C:4A:92:EF:8B:7F
     1 tcpri-clu2        OPEN    
                                  eth4   UP      3C:4A:92:EF:7B:73
                                  eth5   UP      3C:4A:92:EF:7B:77
     2                   CONNWAIT
                                  eth4   DOWN    
                                  eth5   DOWN    
    [nodes 3-63 omitted: all CONNWAIT with eth4 and eth5 DOWN]

mikebounds
Level 6
Partner Accredited

How are you turning off your node? From the logs I see:

2013/06/17 21:52:47 VCS NOTICE V-16-1-10322 System tcpri-clu1 (Node '0') changed state from RUNNING to LEAVING

as opposed to "changed state from RUNNING to FAULTED", so "LEAVING" means VCS is being shut down, rather than the node being "killed".  If the RC scripts are being called, then historically the RC scripts ran "hastop -evacuate", which means service groups would be failed over; but since at least 5.1 the RC scripts run "hastop -sysoffline", and I am not sure whether this fails over service groups, as the "-sysoffline" flag is undocumented.

The best way to test a node failing is to power down the system boards if the server is part of a logical domain, or to flick the power switch.  The best O/S command I got to work for this test was "uadmin 2 0", and even this was sometimes not quick enough in bringing down the server: VCS knew the command was run, so it saw the shutdown as administrative and didn't fail over the groups.
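Since the nodes in this thread run Linux (Red Hat 6.3), a comparable hard kill — an assumption on my part, not something tested on this cluster — is the magic-sysrq crash trigger, which panics the kernel immediately so the surviving node sees a genuine failure rather than an administrative shutdown.  Destructive, so test systems only:

```shell
# WARNING: this crashes the node immediately (kernel panic) - test systems only.
# It bypasses the RC scripts, so VCS on the surviving node sees a real failure.
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo c > /proc/sysrq-trigger      # trigger an immediate crash
```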

Mike

 

arangari
Level 5

Mike,

The 'hastop -sysoffline' command is the same as 'hastop -local -evacuate -noautodisable', except that it is not interactive. In VCS 5.1 we made the 'hastop' command ask for confirmation, but we still wanted to handle the reboot case, hence this special flag was introduced.

Thus 'hastop -sysoffline' will also evacuate the online groups to another node where possible. However, it is possible that, due to dependencies etc., the evacuation may not happen.

 

Solom: The best way to provide these logs/main.cf is to attach them, as then they are better handled. I will look into the logs and then respond.

 

solom
Level 4

Please check attached file

mikebounds
Level 6
Partner Accredited

Thanks Amit.  Do you know why "-sysoffline" is not documented in the man pages or the VCS user guide?  Also, do any of the VCS guides tell you that service groups will be evacuated by the O/S shutdown scripts?  I couldn't find this info in the VCS user or install guide.

Mike

arangari
Level 5

The 'hastop -sysoffline' command is not expected to be used outside the shutdown RC script, which is the reason it was not documented.

At reboot, it is known that, while stopping VCS in the reboot sequence, the online SGs will be evacuated if appropriate. This should be in the documentation, as the behaviour has been present since at least 4.1.

I will recheck the documentation.

 

solom
Level 4

Sorry for the cutoff; I was assigned temporarily to a different project.

My problem is still there, and I just realized something that could be its source. Previously, when I was running Veritas 5.1 on Red Hat 5.6, the disk groups were defined on all hosts of the cluster. I could, at the time, manually deport and import disk groups between servers in the cluster.

Now I can only see the disk groups on the primary node of the cluster, even though the LUNs are visible from all nodes. So I have a feeling that my problem lies in the original setup of the disk groups or the cluster itself. I tried to delete the disk groups and create them again, but the option to share the disk group between hosts is greyed out.

 

Thoughts??

solom
Level 4

Two quick follow-ups:

1. I managed to do a manual deport and import between the nodes in the cluster using vxdiskadm, but not through VOM.

2. I ran a fire drill on the ClusterService group and it executed fine, but it failed on the service groups for the mounts.

solom
Level 4

The engine logs are also attached.

solom
Level 4

I am working in VOM. The ClusterService group is online on node1, and when I turn off node1 it comes online on node2. The problem is that the service groups and resources I created can't fail over to node2, although the service group type is failover. My nodes run Red Hat 6.3.

 

main.cf attached.

mikebounds
Level 6
Partner Accredited

I have looked at the logs and main.cf more closely now and I can see your issue.  You have nested mounts:

 

        MountPoint = "/trakpri/meam/live/tc"
        MountPoint = "/trakpri/meam/live/tc/jrn/alt"
        MountPoint = "/trakpri/meam/live/tc/jrn/pri"
        MountPoint = "/trakpri/meam/live/tc/wij"
 
But you have not defined the right dependencies in VCS.  You should have the dependencies:
 
  live-tc-jrn-alt requires trakpri-meam-live-tc
  live-tc-jrn-pri requires trakpri-meam-live-tc
  trakpri-meam-live-tc-wij requires trakpri-meam-live-tc
 
 
 
i.e. VCS needs to mount "/trakpri/meam/live/tc" first and then the others; when VCS is offlining, it needs to umount all subdirectory mounts of "/trakpri/meam/live/tc" (I call these submounts) first, and then umount "/trakpri/meam/live/tc" itself.  As you have not defined these dependencies, VCS will try to online and offline them all at the same time, and this will cause problems most of the time.
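A sketch of adding these links from the command line — assuming the resource names are exactly as listed above from main.cf (in "hares -link" the first argument is the parent, i.e. the resource that requires the other):

```shell
haconf -makerw
# Parent requires child: each submount resource depends on the base mount.
hares -link live-tc-jrn-alt          trakpri-meam-live-tc
hares -link live-tc-jrn-pri          trakpri-meam-live-tc
hares -link trakpri-meam-live-tc-wij trakpri-meam-live-tc
haconf -dump -makero
```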
 
If the directories are created correctly, then the directory "/trakpri/meam/live/tc" should be empty (you shouldn't have files or directories in a mount point), and the filesystem that is mounted at "/trakpri/meam/live/tc" should contain the directories:
/wij
/jrn/alt
/jrn/pri
 
so that the sub-mounts can mount.  This means the fire drill may fail for the mounts, as it won't see the mount points for the sub-mounts, and that is fine: if it fails, the fire drill is just written badly, as it should check for nested mounts and not expect to find mount points for the sub-mounts.
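As an illustration of the required ordering (not of VCS itself): for these particular mount points, sorting lexically puts each parent before its submounts, because a submount's path extends its parent's path, so the sorted list is a safe mount order and the reverse is a safe umount order:

```shell
# Illustration only: lexical sort gives a safe order for nested mounts,
# because each submount's path is an extension of its parent's path.
mounts="/trakpri/meam/live/tc
/trakpri/meam/live/tc/jrn/alt
/trakpri/meam/live/tc/jrn/pri
/trakpri/meam/live/tc/wij"

printf '%s\n' "$mounts" | sort     # online (mount) order: parent first
printf '%s\n' "$mounts" | sort -r  # offline (umount) order: submounts first
```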
 
Also, I don't know why you are making dependencies like "ip requires trakpri-meam-live-tc".  An IP doesn't need a mount; they are independent.  I have seen this done the other way round ("trakpri-meam-live-tc requires ip"), and that is a good idea: if the IP fails to online because the service group is already online on the other system (but VCS can't see this, as VCS is not running on the other system or the heartbeats are down), then the mounts won't online because they depend on the IP, and this can prevent mounts onlining on both nodes at the same time, which can cause corruption.  When you add your app to the service group and it needs the IP, just make the app dependent on the mounts and the IP, and remove the dependencies of the IP needing the mounts.
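A sketch of reversing that link, again assuming the resource names "ip" and "trakpri-meam-live-tc" as they appear in main.cf ("hares -unlink" takes the same parent-then-child order as "hares -link"):

```shell
haconf -makerw
hares -unlink ip trakpri-meam-live-tc   # drop "ip requires mount"
hares -link   trakpri-meam-live-tc ip   # add "mount requires ip" instead
haconf -dump -makero
```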
 
Mike

 

solom
Level 4

Thank you very much for the thorough answer; I really didn't think about it this way. I will give it a shot and get back to you.

 

Solom