11-08-2017 08:19 AM
Hi Mates,
I just want to know how the cluster learns the service group status in the remote cluster.
I know it is done via the ICMP heartbeat and the wac process.
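For reference, the configured inter-cluster heartbeats can be inspected with the hahb command; a quick sketch, option names quoted from memory:
# hahb -list
# hahb -display Icmp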
But I am confused by the hastatus -sum output below, taken from one site while the secondary site was isolated due to a network/router issue.
hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A adm1area2 RUNNING 0
A adm2area2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B BkupLan adm1area2 Y N ONLINE
B BkupLan adm2area2 Y N ONLINE
B ClusterService adm1area2 Y N OFFLINE|FAULTED
B ClusterService adm2area2 Y N OFFLINE|FAULTED
B app2Mon adm1area2 Y N ONLINE
B app2Mon adm2area2 Y N ONLINE
B app1 adm1area2 Y N OFFLINE
B app1 adm2area2 Y N OFFLINE
B app1fs adm1area2 Y N OFFLINE|FAULTED
B app1fs adm2area2 Y N OFFLINE|FAULTED
B PrivLan adm1area2 Y N ONLINE
B PrivLan adm2area2 Y N ONLINE
B PubLan adm1area2 Y N OFFLINE|FAULTED
B PubLan adm2area2 Y N OFFLINE|FAULTED
B Site adm1area2 Y N OFFLINE
B Site adm2area2 Y N OFFLINE
B StorLan adm1area2 Y N ONLINE
B StorLan adm2area2 Y N ONLINE
B assp31 adm1area2 Y N OFFLINE|FAULTED
B assp31 adm2area2 Y N OFFLINE|FAULTED
-- RESOURCES FAILED
-- Group Type Resource System
D ClusterService IPMultiNICB wac_mip adm1area2
D ClusterService IPMultiNICB wac_mip adm2area2
D app1fs Proxy app1fs_p1 adm1area2
D app1fs Proxy app1fs_p1 adm2area2
D PubLan MultiNICB pub_mnic adm1area2
D PubLan MultiNICB pub_mnic adm2area2
D assp31 Proxy syb1_p1 adm1area2
D assp31 Proxy syb1_p1 adm2area2
-- WAN HEARTBEAT STATE
-- Heartbeat To State
M Icmp area1app1rc_cluster DOWN
-- REMOTE CLUSTER STATE
-- Cluster State
N area1app1rc_cluster FAULTED
-- REMOTE SYSTEM STATE
-- cluster:system State Frozen
O area1app1rc_cluster:adm1area1 FAULTED 0
O area1app1rc_cluster:adm2area1 FAULTED 0
-- REMOTE GROUP STATE
-- Group cluster:system Probed AutoDisabled State
P app1 area1app1rc_cluster:adm1area1 Y N OFFLINE
P app1 area1app1rc_cluster:adm2area1 Y N OFFLINE
P app1fs area1app1rc_cluster:adm1area1 Y N ONLINE
P app1fs area1app1rc_cluster:adm2area1 Y N OFFLINE
P Site area1app1rc_cluster:adm1area1 Y N ONLINE
P Site area1app1rc_cluster:adm2area1 Y N OFFLINE
P assp31 area1app1rc_cluster:adm1area1 Y N OFFLINE
P assp31 area1app1rc_cluster:adm2area1 Y N ONLINE
When the ICMP heartbeat shows DOWN, how come it still sees the status of the remote cluster and the remote service groups?
Regards
S
11-08-2017 08:38 AM
Is it still showing the summary like this or has it changed to OFFLINE for the remote resources?
11-08-2017 10:00 AM
Hi,
The ICMP heartbeat is alive and OK now.
But I need to understand: when ICMP shows DOWN here, it means the heartbeat between the two clusters should be down.
In that case, what should the status of the remote cluster and its service groups be in hastatus -sum?
11-09-2017 06:57 AM
Can anyone reply, please?
11-10-2017 05:43 PM
"when ICMP is showing down here, it means the heartbeat between the two clusters should be down"
- Correct.
"then what should be the status of the remote cluster and service groups in hastatus -sum"
- Unknown, or
- Exited if the remote cluster was gracefully shut down, or
- Faulted if the remote cluster went down due to a failure
11-10-2017 05:47 PM
Correction:
these two lines below
- Exited if the remote cluster was gracefully shut down, or
- Faulted if the remote cluster went down due to a failure
should be corrected to
- Exited if the remote cluster was gracefully shut down and the inter-cluster heartbeat was then lost
- Faulted if the remote cluster went down due to a failure and the inter-cluster heartbeat was then lost
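As a side note, you can also query these states directly, assuming the ha* option names below are right (a sketch, using the cluster name from the original post):
# haclus -state area1app1rc_cluster <<< state of the remote cluster
# haclus -status <<< status summary of the remote clusters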
11-14-2017 08:34 PM
Well, if it shows remote resources ONLINE when the cluster is FAULTED, that is a bit weird. That's why I asked if it was still showing that.
11-15-2017 02:58 AM
The state of the resources in the remote cluster is "handled" by the had daemon on the remote cluster and communicated to the local cluster via the inter-cluster heartbeat. Since the remote cluster is down, so are the resources in the remote cluster; what the local cluster keeps displaying is the last known state it received before the heartbeat was lost.
If the remote cluster is down, to check the state of the resources in the remote cluster, run the command below:
# hares -state -clus <remote_cluster>
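For example, with the resource and cluster names from your hastatus output (illustrative only):
# hares -state app1fs_p1 -clus area1app1rc_cluster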
If you suspect the ha* commands are not showing resource and service group states correctly, you can try the simple workaround below:
# hastop -all -force <<< run this command on one node only
# hastart <<< run this command on each node
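(Note: hastop -all -force stops had on every node but leaves the service groups running, so the applications themselves are not interrupted by this workaround.)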
There are some known defects in some early VCS releases. Make sure your VCS is patched up to date.
Can you run the command below and post the output here?
hasys -display | grep -i vers
11-15-2017 09:16 AM - edited 11-15-2017 09:17 AM
Hello Frank,
I understand the situation now .. thanks for your help.
On one node I am getting the message below when running hastatus -sum:
hastatus -sum
VCS ERROR V-16-1-10600 Cannot connect to VCS engine
VCS WARNING V-16-1-11046 Local system not available
However, on the other node this system shows as RUNNING, with service groups ONLINE on both nodes.
So the situation is: hastatus -sum says the VCS engine is not running, but the had process is running and GAB port h shows membership 01 in gabconfig -a.
ps -ef | grep -i had
root 5673 1 0 Oct 09 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 5660 1 0 Oct 09 ? 117:48 /opt/VRTSvcs/bin/had
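For comparison, a healthy node should show membership on both the GAB port (a) and the HAD port (h) in gabconfig -a, roughly like this (generation numbers here are made up for illustration):
GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
Port h gen fd570002 membership 01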
On hastart, the errors below are seen:
Nov 15 15:20:32 xxxxx syslog[29067]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11103 VCS exited. It will restart
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: xxxxx
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Nov 15 15:21:02 xxxxx genunix: [ID 159711 kern.notice] GAB ERROR V-15-1-20054 Port h registration failed, device busy
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11032 Registration failed. Exiting
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10116 GabHandle::open failed errno = 16
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11033 GAB open failed. Exiting
Nov 15 15:21:07 xxxxx syslog[29067]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11103 VCS exited. It will restart
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: xxxxx
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Is it because port h has already formed a membership in gabconfig?
Regards
S.
11-15-2017 01:56 PM
had is monitored by another daemon called hashadow; when had goes down abnormally, hashadow restarts it. had records this error (V-16-1-11103) when there is an issue transferring data between itself and GAB. Since the issue is within the daemon, a system restart is needed.
To fix a GAB/HAD communication issue, you can also stop had, close all GAB ports, and restart GAB and then had. I am not sure how familiar you are with VCS, so my advice is to get a maintenance window and restart the systems (both nodes; do a cluster reboot, meaning take all the nodes down and then start them up).
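If you do get a window and want to try the non-reboot path first, a rough outline is below (assuming a 2-node cluster with no fencing/CVM ports in use; adjust -n to your node count, and treat this as a sketch rather than a tested procedure):
# hastop -local -force <<< stop had but leave the applications running
# gabconfig -a <<< confirm no clients remain on the GAB ports
# gabconfig -U <<< unconfigure GAB
# gabconfig -c -n 2 <<< re-seed GAB for a 2-node cluster
# hastart <<< restart had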
Also keep a close eye on the cluster load. If the load on one system is always very high, move some load to the other node. If all the nodes in the cluster are under high load on a regular basis, a hardware upgrade is needed.