Query about ICMP in GCO

symsonu · ‎11-08-2017

Hi Mates,

I just want to know how the local cluster learns the service group status in the remote cluster.

I know it is done through the ICMP heartbeat and the wac process.

But I am confused by the output below of hastatus -sum from one site, taken when the secondary site was isolated due to a network/router issue.

hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A adm1area2             RUNNING              0
A adm2area2             RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B BkupLan         adm1area2             Y          N               ONLINE
B BkupLan         adm2area2             Y          N               ONLINE
B ClusterService adm1area2             Y          N               OFFLINE|FAULTED
B ClusterService adm2area2             Y          N               OFFLINE|FAULTED
B app2Mon          adm1area2             Y          N               ONLINE
B app2Mon          adm2area2             Y          N               ONLINE
B app1             adm1area2             Y          N               OFFLINE
B app1             adm2area2             Y          N               OFFLINE
B app1fs           adm1area2             Y          N               OFFLINE|FAULTED
B app1fs           adm2area2             Y          N               OFFLINE|FAULTED
B PrivLan         adm1area2             Y          N               ONLINE
B PrivLan         adm2area2             Y          N               ONLINE
B PubLan          adm1area2             Y          N               OFFLINE|FAULTED
B PubLan          adm2area2             Y          N               OFFLINE|FAULTED
B Site            adm1area2             Y          N               OFFLINE
B Site            adm2area2             Y          N               OFFLINE
B StorLan         adm1area2             Y          N               ONLINE
B StorLan         adm2area2             Y          N               ONLINE
B assp31         adm1area2             Y          N               OFFLINE|FAULTED
B assp31         adm2area2             Y          N               OFFLINE|FAULTED

-- RESOURCES FAILED
-- Group           Type                 Resource             System

D ClusterService IPMultiNICB          wac_mip              adm1area2
D ClusterService IPMultiNICB          wac_mip              adm2area2
D app1fs           Proxy                app1fs_p1             adm1area2
D app1fs           Proxy                app1fs_p1             adm2area2
D PubLan          MultiNICB            pub_mnic             adm1area2
D PubLan          MultiNICB            pub_mnic             adm2area2
D assp31         Proxy                syb1_p1              adm1area2
D assp31         Proxy                syb1_p1              adm2area2

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

M Icmp            area1app1rc_cluster    DOWN

-- REMOTE CLUSTER STATE
-- Cluster         State

N area1app1rc_cluster FAULTED

-- REMOTE SYSTEM STATE
-- cluster:system       State                Frozen

O area1app1rc_cluster:adm1area1 FAULTED              0
O area1app1rc_cluster:adm2area1 FAULTED              0

-- REMOTE GROUP STATE
-- Group           cluster:system       Probed     AutoDisabled    State

P app1             area1app1rc_cluster:adm1area1 Y          N               OFFLINE
P app1             area1app1rc_cluster:adm2area1 Y          N               OFFLINE
P app1fs           area1app1rc_cluster:adm1area1 Y          N               ONLINE
P app1fs           area1app1rc_cluster:adm2area1 Y          N               OFFLINE
P Site            area1app1rc_cluster:adm1area1 Y          N               ONLINE
P Site            area1app1rc_cluster:adm2area1 Y          N               OFFLINE
P assp31         area1app1rc_cluster:adm1area1 Y          N               OFFLINE
P assp31         area1app1rc_cluster:adm2area1 Y          N               ONLINE

When ICMP is showing DOWN, how come it is still showing the status of the remote cluster and the remote service groups?

Regards

S

RiaanBadenhorst · ‎11-08-2017

Is it still showing the summary like this or has it changed to OFFLINE for the remote resources?

symsonu · ‎11-08-2017

Hi,

ICMP is alive and ok now.

But I need to understand: when ICMP is showing down here, it means the heartbeat between the two clusters should be down.

Then what should the status of the remote cluster and service groups be in hastatus -sum?

symsonu · ‎11-09-2017

Could anyone reply, please?

frankgfan · ‎11-10-2017

when ICMP is showing down here, it means heartbeat between the two clusters should be down

- Correct.

then what should be the status of remote cluster and service groups in hastatus -sum

- Unknown, or

- Exited if the remote cluster was gracefully shut down, or

- Faulted if the remote cluster went down due to a failure

frankgfan · ‎11-10-2017

Correction:

These two lines below

- Exited if the remote cluster was gracefully shut down, or

- Faulted if the remote cluster went down due to a failure

should be corrected to

- Exited if the remote cluster was gracefully shut down and the inter-cluster HB was then lost

- Faulted if the remote cluster went down due to a failure and the inter-cluster HB was then lost

RiaanBadenhorst · ‎11-14-2017

Well, if it shows remote resources online when the cluster is faulted, it's a bit weird. That's why I asked if it was still showing that.

frankgfan · ‎11-15-2017

The state of the resources in the remote cluster is "handled" by the had daemon on the remote cluster and communicated to the local cluster via the inter-cluster heartbeat. Since the remote cluster is down, so are the resources in the remote cluster.

If the remote cluster is down, to check the state of the resources in the remote cluster, run the command below:

#hares -state -clus <remote_cluster>
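
To cross-check the summary itself, the situation from the original post (heartbeat DOWN while some remote groups still read ONLINE) can be flagged with a small filter. This is only a sketch: check_gco_summary is a hypothetical helper name, it assumes the hastatus -sum layout shown earlier in this thread ("M" lines for the WAN heartbeat, "P" lines for remote groups, state in the last column), and it assumes a POSIX awk. Whether such ONLINE entries are simply the last state received before the heartbeat was lost is an assumption, not something the output proves.

```shell
#!/bin/sh
# Hypothetical helper: reads `hastatus -sum` output on stdin and reports
# remote groups that still read ONLINE while a WAN heartbeat is DOWN.
check_gco_summary() {
    awk '
        # "M" lines describe the WAN heartbeat; the last field is its state.
        $1 == "M" && $NF == "DOWN" { hb_down = 1 }
        # "P" lines describe remote groups; only an exact ONLINE state is matched.
        $1 == "P" && $NF == "ONLINE" { online[++n] = $2 " on " $3 }
        END {
            if (hb_down && n > 0) {
                print "WARNING: WAN heartbeat DOWN, but remote groups still reported ONLINE:"
                for (i = 1; i <= n; i++) print "  " online[i]
            } else if (hb_down) {
                print "WAN heartbeat DOWN, no remote group reported ONLINE"
            } else {
                print "WAN heartbeat not reported DOWN"
            }
        }
    '
}
```

Usage would be along the lines of: hastatus -sum | check_gco_summary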

If you suspect the ha* commands do not show the resource and service group state correctly, you can try the simple workaround below:

#hastop -all -force <<< run this command on one node

#hastart <<< run this command on each node

There are some known defects in some early VCS releases. Make sure to patch up your VCS.

Can you run the command below and post the output here?

hasys -display | grep -i vers

symsonu · ‎11-15-2017

Hello Frank,

I understood the situation. Thanks for your help.

On one node I am getting the message below when running hastatus -sum:

hastatus -sum
VCS ERROR V-16-1-10600 Cannot connect to VCS engine
VCS WARNING V-16-1-11046 Local system not available

However, on the other node this system is showing as running, and the service groups are online on both nodes.

So the situation is that hastatus -sum reports that it cannot connect to the VCS engine, but the had process is running and GAB port h shows membership 01 in gabconfig -a.

ps -ef | grep -i had
root 5673 1 0 Oct 09 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 5660 1 0 Oct 09 ? 117:48 /opt/VRTSvcs/bin/had
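
The port h check mentioned above can be scripted. This is a sketch under assumptions: gab_port_membership is a hypothetical helper name, it expects gabconfig -a lines of the form "Port h gen <generation> membership 01", and it needs an awk that accepts -v (nawk or a POSIX awk on Solaris).

```shell
#!/bin/sh
# Hypothetical helper: reads `gabconfig -a` output on stdin and prints the
# membership string (node IDs) for the given GAB port, or nothing if the
# port has no membership line.
gab_port_membership() {
    port=$1
    awk -v p="$port" '
        # Expected line shape: Port <p> gen <generation> membership <nodes>
        $1 == "Port" && $2 == p && $5 == "membership" { print $6 }
    '
}
```

For example, gabconfig -a | gab_port_membership h prints the node IDs currently registered on port h, and prints nothing when the port has no membership.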

On hastart, the error below is seen:

Nov 15 15:20:32 xxxxx syslog[29067]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11103 VCS exited. It will restart
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: xxxxx
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Nov 15 15:21:02 xxxxx genunix: [ID 159711 kern.notice] GAB ERROR V-15-1-20054 Port h registration failed, device busy
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11032 Registration failed. Exiting
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10116 GabHandle::open failed errno = 16
Nov 15 15:21:02 xxxxx Had[4808]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11033 GAB open failed. Exiting
Nov 15 15:21:07 xxxxx syslog[29067]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11103 VCS exited. It will restart
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: xxxxx
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Nov 15 15:21:47 xxxxx Had[6870]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership

Is it because port h has already formed membership in gabconfig?
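
The log excerpt above shows a loop: had starts, fails to register on GAB port h with errno 16 ("device busy"), exits, and is started again. A rough way to count how often that has happened is sketched below. summarize_had_restarts is a hypothetical helper name, and the message IDs it matches are taken only from the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and counts the message IDs
# seen in the excerpt above.
summarize_had_restarts() {
    awk '
        /V-16-1-10619/ { starts++ }    # HAD starting
        /V-16-1-11032/ { regfail++ }   # Registration failed. Exiting
        /V-15-1-20054/ { busy++ }      # GAB: port h registration failed, device busy
        END {
            printf "had starts: %d, registration failures: %d, port h busy: %d\n",
                   starts, regfail, busy
        }
    '
}
```

It could be fed from the system log, for example summarize_had_restarts < /var/adm/messages, where the log path is an assumption about this host's syslog configuration.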

Regards

S.

frankgfan · ‎11-15-2017

had is monitored by another daemon called hashadow. When had goes down abnormally, hashadow restarts it. had records this error (V-16-1-11103) if there is an issue while data is being transferred between itself and GAB. Since the issue is within the daemon, a system restart is needed.

To fix the GAB/HAD communication issue, you can also stop had, close all GAB ports, and restart GAB and then had. I am not sure how familiar you are with VCS, so my advice is to get a maintenance window to restart the systems (both nodes; do a cluster reboot, meaning bring down all the nodes and then start them up).

Also keep a close eye on the cluster load. If the load on one system is always very high, move some load to the other node. If all the nodes in the cluster are under high load on a regular basis, a hardware upgrade is needed.
