Veritas Cluster Server heartbeat link down, jeopardy state

br1an
Level 3

Hello Everyone,
I am having a problem with the VCS heartbeat links.

VCS is running on Sun Fire V440 machines. The VCS version is 4.0 on Solaris 9; I know it's old and EOL. I'm just hoping to pinpoint the solution to this problem.
The VCS heartbeat links run on 2 separate VLANs. This is a 2-node cluster.
Recently the old switch was taken out and a new Cisco 3750 switch was added. The switch shows the cables are connected and I am able to see link up from the switch side.
The ce4 links on both servers are not linking. Any ideas besides a faulty VLAN? How do I test communications on that particular VLAN?
Here are the results of various commands; any help is appreciated!
Thank you!

#lltstat -n
LLT node information:
    Node                 State    Links
     0    node1          OPEN        1
   * 1    node2          OPEN        2

#lltstat -nvv|head
LLT node information:
    Node                 State    Link  Status  Address
     0 node1          OPEN
                                  ce4   DOWN
                                  ce6   UP      00:03:BA:94:F8:61
   * 1 node2          OPEN
                                  ce4   UP      00:03:BA:94:A4:6F
                                  ce6   UP      00:03:BA:94:A4:71
     2                   CONNWAIT
                                  ce4   DOWN

#lltstat -n
LLT node information:
    Node                 State    Links
   * 0    node1          OPEN        2
     1    node2          OPEN        1

#lltstat -nvv|head
LLT node information:
    Node                 State    Link  Status  Address
   * 0 node1          OPEN
                                  ce4   UP      00:03:BA:94:F8:5F
                                  ce6   UP      00:03:BA:94:F8:61
     1 node2          OPEN
                                  ce4   DOWN
                                  ce6   UP      00:03:BA:94:A4:71
     2                   CONNWAIT
                                  ce4   DOWN

#gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   49c917 membership 01
Port a gen   49c917   jeopardy ;1
Port h gen   49c91e membership 01
Port h gen   49c91e   jeopardy ;

 


Marianne
Level 6
Partner    VIP    Accredited Certified

Plumb test IP addresses on ce4 on both nodes and see if you can ping from both machines.

Also check /var/adm/messages on both nodes for possible NIC errors.
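A quick way to scan for those (just a sketch; the interface name and line count here are examples):

```shell
# look for link up/down or driver errors mentioning ce4
grep -i ce4 /var/adm/messages | tail -20
```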

br1an
Level 3
When I plumbed up the interface and assigned it a temporary IP, I got the following message:

node1 had[5167]: [ID 702911 daemon.notice] VCS ERROR V-16-1-6525 (node2) MultiNICB:MNIC:monitor:Interface configuration is not valid. Check IP addresses, subnet masks, flags, groupnames on all interfaces. Make sure that all the interfaces of only one IP subnet are included under the resource. See Bundled Agents Reference Guide.

That did not look good, and I immediately unplumbed it.

#netstat -I ce4 shows some traffic:

         input    ce4      output           input   (Total)   output
packets  errs  packets  errs  colls  packets  errs  packets  errs  colls
37525    0     145600   0     0      3682466  0     4771776  0     0
1        0     0        0     0      12       0     0        0     0
0        0     0        0     0      9        0     0        0     0
1        0     0        0     0      10       0     0        0     0
0        0     0        0     0      9        0     0        0     0
1        0     0        0     0      10       0     0        0     0

g_lee
Level 6

Can you try snooping the interfaces to see if there's any traffic?

Also - testing with dlpiping may help:

http://www.symantec.com/business/support/index?page=content&id=TECH19998
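As a sketch of both checks (the dlpiping path assumes the default /opt/VRTSllt install location from the TechNote, and the MAC below is node1's ce4 address from your lltstat output; substitute as appropriate):

```shell
# Watch raw frames on the suspect interface; healthy LLT heartbeats show
# up in snoop as ETHER Type=CAFE frames
snoop -d ce4

# dlpiping: start a server on node1 ...
/opt/VRTSllt/dlpiping -s /dev/ce:4
# ... and from node2, probe node1's ce4 MAC over raw DLPI (no IP needed)
/opt/VRTSllt/dlpiping -c /dev/ce:4 00:03:BA:94:F8:5F
```

If the dlpiping client gets replies, the layer-2 path through that VLAN is good.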

Daniel_Matheus
Level 4
Employee Accredited Certified

Hi Brian,

 

The error was probably seen because you brought the interface online on the same subnet as the MultiNICB resource uses.

Excerpt from the agent guide:

One of the requirements of the MultiNICB agent is that the MultiNICB resource must include ALL of the interfaces that belong to the same IP subnet. If some interfaces on the same IP subnet are outside of MultiNICB control,
this can lead to complications in the event of device failures.
 
Can you check that the link is up from the host side like below (this is on Sol10, but I assume it will also work on Sol9)?

# kstat -p |grep link_up | awk -F: '{ print $3 }'
ce0
ce1

If the link is up on both sides you can check llt communication using lltping:

http://sfdoccentral.symantec.com/sf/5.0/solaris/manpages/vcs/lltping_1m.html

Further, you can check for any incoming LLT packets using lltdump:

http://sfdoccentral.symantec.com/sf/5.0/solaris/manpages/vcs/lltdump_1m.html

 

Both commands reside in /opt/VRTSllt/bin.

In case the link is up on both nodes, but you can't get a connection using lltping or by configuring IPs on the link, you can be certain that the problem is within your network.

 

Thanks,
Dan

 

br1an
Level 3

# kstat -p | grep link_up showed me that the links are up.
However, when I tried lltping there was no response.

#lltstat -n
LLT node information:
    Node                 State    Links
   * 0    node1          OPEN        1
     1    node2          OPEN        2

From node 1:
bash-2.05$ ./lltping -c 1 -v
lltping: opening LLT dev: /dev/llt port: 30
lltping: mynodeid: 0
lltping: sending a msg to node 1
lltping: waiting for a msg from node 1
lltping: alarmhandler, exit on alarm
lltping: no response from server on node 1

From node 2:
bash-2.05$ ./lltping -c 0 -v
lltping: opening LLT dev: /dev/llt port: 30
lltping: mynodeid: 1
lltping: sending a msg to node 0
lltping: waiting for a msg from node 0
lltping: alarmhandler, exit on alarm
lltping: no response from server on node 0

I checked on the switches and the connections are on the same VLAN and it is active.
ce6 heartbeats are in VLAN 1 and ce4 heartbeats are in VLAN 6. How can I tell which one is the primary heartbeat and which is the secondary heartbeat? I'm clueless as to what to do next.

mikebounds
Level 6
Partner Accredited

Try temporary IPs on a different subnet from the existing IPs - I usually use:

1.1.1.1 on node1 and 1.1.1.2 on node2 with mask 255.255.255.0,

and then ping 1.1.1.1 from node2 and ping 1.1.1.2 from node1.
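A minimal sketch of that test on Solaris (the addresses and mask are just the examples above; run as root, and unplumb afterwards so the cluster agents don't see unexpected addresses):

```shell
# node1: bring up ce4 with a temporary test address
ifconfig ce4 plumb
ifconfig ce4 1.1.1.1 netmask 255.255.255.0 up

# node2: same, with the second address
ifconfig ce4 plumb
ifconfig ce4 1.1.1.2 netmask 255.255.255.0 up

# test in both directions
ping 1.1.1.2          # from node1
ping 1.1.1.1          # from node2

# clean up on both nodes afterwards
ifconfig ce4 unplumb
```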

All the tools you have used so far just tell you the NIC is up, so there is an issue with the connection between the 2 NICs.

Mike

 

br1an
Level 3
Seems that way. I plumbed the interfaces and tried pinging them, but there was no response. netstat shows some traffic going through that interface, and I have no idea what it is or how to look for it.

g_lee
Level 6

netstat shows some traffic going through that interface, and I have no idea what it is or how to look for it.

per earlier suggestion - have you tried looking at the output from snoop? dlpiping?

https://www-secure.symantec.com/connect/forums/veritas-cluster-server-heartbeat-link-down-jeapordy-state#comment-9082981

mikebounds
Level 6
Partner Accredited

If ping is not working, then either there is an issue with the connection between the interfaces, or the switches may be blocking ping. So, assuming ce4 and ce6 have the same firewall configuration, unplumb the IPs from ce4 and plumb them onto ce6: if ping works using ce6, this proves the issue is with the network. (This proof is usually needed if you have a separate network team, as I have always found the network team says the issue is the O/S until you can prove otherwise.)

The other thing you can do is a traceroute (do this first while the IPs are on ce4, then repeat when the IPs are on ce6). Also, if you want IPs on both interfaces at the same time so you don't have to keep unplumbing to compare, use 1.1.2.1 and 1.1.2.2 (with mask 255.255.255.0) on the second set of interfaces.
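A sketch of that comparison (the addresses and mask are just the examples from this thread; run as root):

```shell
# While the test IPs are still on ce4, record the path:
traceroute 1.1.1.2                                # from node1

# Move the test IPs onto ce6 and repeat the ping/traceroute:
ifconfig ce4 unplumb                              # on both nodes
ifconfig ce6 plumb
ifconfig ce6 1.1.1.1 netmask 255.255.255.0 up     # node1; 1.1.1.2 on node2
ping 1.1.1.2

# Or keep both tests up at once with a second subnet on ce6:
ifconfig ce6 plumb
ifconfig ce6 1.1.2.1 netmask 255.255.255.0 up     # node1; 1.1.2.2 on node2
ping 1.1.2.2
```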

Mike

br1an
Level 3
I did. But the moment I started the piping server I got this message on the engineA_log. As this is on the production side I'm very afraid to take chances. I don't want any failover happening or unavailability of resources while doing this. Dev is down; it will be some time next month before it gets back up.

VCS NOTICE V-16-1-10438 Group group1 has been probed on system node2
VCS NOTICE V-16-1-10442 Initiating auto-start online of group group1 on system node1
VCS NOTICE V-16-1-10233 Clearing Restart attribute for group group1 on all nodes
VCS ERROR V-16-10001-6525 (node2) MultiNICB:MNIC:monitor:Interface configuration is not valid. Check IP addresses, subnet masks, flags, groupnames on all interfaces. Make sure that all the interfaces of only one IP subnet are included under the resource. See Bundled Agents Reference Guide

g_lee
Level 6

I did. But the moment I started the piping server I got this message on the engineA_log.

by piping do you mean dlpiping?

How are the interfaces configured at the moment?

# cat /etc/llttab

What is the output from snoop -d ce4 on either/both the servers?

br1an
Level 3
Sorry, yes I meant dlpiping.

# cat /etc/llttab
set-node node2
set-cluster 10
link ce4 /dev/ce:4 - ether - -
link ce6 /dev/ce:6 - ether - -

Output of snoop -d ce4:

Using device /dev/ce (promiscuous mode)
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (multicast) ETHER Type=2000 (Unknown), size = 458 bytes
? -> * ETHER Type=9000 (Loopback), size = 60 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (broadcast) ETHER Type=CAFE (Unknown), size = 74 bytes
? -> (multicast) ETHER Type=0000 (LLC/802.3), size = 52 bytes
? -> * ETHER Type=9000 (Loopback), size = 60 bytes

g_lee
Level 6

For sanity's sake / to double check, can you please send the output from:

# hares -display MNIC   ### ie: the multinicb resource that logged errors in engine_A.log when you attempted the dlpiping

to make sure this isn't being used by llt, as dlpiping on its own should not have caused the multinicb res to complain

br1an
Level 3
I apologize; if I even knew where to begin or what to look for, I would have been much more effective in figuring out this problem. Thank you very much for all the help; I need all the info I can get.

hares -display
#Resource  Attribute               System  Value
MNIC       Group                   global  Group1
MNIC       Type                    global  MultiNICB
MNIC       AutoStart               global  1
MNIC       Critical                global  0
MNIC       Enabled                 global  1
MNIC       LastOnline              global  node1
MNIC       MonitorOnly             global  0
MNIC       ResourceOwner           global  unknown
MNIC       TriggerEvent            global  0
MNIC       ArgListValues           node2   0 /sbin/in.mpathd 1 1 4 ce0 0 ce3 1 0 1 1 100 3 3 0 0.0.0.0 0
MNIC       ArgListValues           node1   0 /sbin/in.mpathd 1 1 4 ce0 0 ce3 1 0 1 1 100 3 3 0 0.0.0.0 0
MNIC       ConfidenceLevel         node2   0
MNIC       ConfidenceLevel         node1   0
MNIC       Flags                   node2
MNIC       Flags                   node1
MNIC       IState                  node2   not waiting
MNIC       IState                  node1   not waiting
MNIC       Probed                  node2   1
MNIC       Probed                  node1   1
MNIC       Start                   node2   0
MNIC       Start                   node1   0
MNIC       State                   node2   ONLINE
MNIC       State                   node1   ONLINE
MNIC       ComputeStats            global  0
MNIC       ConfigCheck             global  1
MNIC       DefaultRouter           global  0.0.0.0
MNIC       Device                  global  ce0 0 ce3 1
MNIC       Failback                global  0
MNIC       IgnoreLinkStatus        global  1
MNIC       LinkTestRatio           global  1
MNIC       MpathdCommand           global  /sbin/in.mpathd
MNIC       MpathdRestart           global  1
MNIC       NetworkHosts            global
MNIC       NetworkTimeout          global  100
MNIC       NoBroadcast             global  0
MNIC       OfflineTestRepeatCount  global  3
MNIC       OnlineTestRepeatCount   global  3
MNIC       ResourceInfo            global  State Valid Msg TS
MNIC       UseMpathd               global  0
MNIC       MonitorTimeStats        node2   Avg 0 TS
MNIC       MonitorTimeStats        node1   Avg 0 TS

g_lee
Level 6

The LLT interfaces don't seem to be used by MultiNICB (unless you have other MultiNICB resources configured) - which is good/correct.

Are you sure the MultiNICB messages are from the dlpiping, or from when you tried to configure an interface on the same subnet as mentioned above ( https://www-secure.symantec.com/connect/forums/veritas-cluster-server-heartbeat-link-down-jeapordy-state#comment-9082031 ) - it's hard to tell as there are no timestamps on the log messages provided

I still think the dlpiping would be the most straightforward way to see if the interfaces can see each other correctly

node1# /opt/VRTSllt/dlpiping -s /dev/ce:4

node2# /opt/VRTSllt/dlpiping -c /dev/ce:4 00:03:BA:94:F8:5F

or reverse

node2# /opt/VRTSllt/dlpiping -s /dev/ce:4

node1# /opt/VRTSllt/dlpiping -c /dev/ce:4 00:03:BA:94:A4:6F

Daniel_Matheus
Level 4
Employee Accredited Certified

Hello Brian,

 

I just reproduced this in-house by disabling the LLT link on the 2nd node.

 

Link working:

[root@server102 /]# lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 server102         OPEN
                                  e1000g0   UP      00:50:56:05:3C:E2
                                  e1000g1   UP      00:50:56:05:3C:E3
     1 server101         OPEN
                                  e1000g0   UP      00:50:56:05:3A:E0
                                  e1000g1   UP      00:50:56:05:3A:E1

 

 

You see packets get transferred


[root@server102 /]# snoop -d e1000g0
Using device e1000g0 (promiscuous mode)
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
server102-priv -> (broadcast)  ARP C Who is 192.168.40.102, server102-priv ?
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
           ? -> *            ETHER Type=CAFE (Unknown), size=62 bytes
 

 

Link disabled on the 2nd node

[root@server102 /]# lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 server102         OPEN
                                  e1000g0   UP      00:50:56:05:3C:E2
                                  e1000g1   UP      00:50:56:05:3C:E3
     1 server101         OPEN
                                  e1000g0   DOWN
                                  e1000g1   UP      00:50:56:05:3A:E1

 

Now you see the same broadcast messages as in your snoop


[root@server102 /]# snoop -d e1000g0
Using device e1000g0 (promiscuous mode)
           ? -> (broadcast)  ETHER Type=CAFE (Unknown), size=90 bytes
           ? -> (broadcast)  ETHER Type=CAFE (Unknown), size=90 bytes
           ? -> (broadcast)  ETHER Type=CAFE (Unknown), size=90 bytes
           ? -> (broadcast)  ETHER Type=CAFE (Unknown), size=90 bytes

 

As you have the same problem on both nodes, the issue is definitely in your network.

You need to check the cabling and your switch/VLAN settings.

 

Thanks,
Dan

Marianne
Level 6
Partner    VIP    Accredited Certified

Extract from one of your older posts:

Seems that way. I plumbed the interfaces and tried pinging them, but there was no response.

You need to call your network administrator and show that ping is not working.

 

kittu_pandu
Level 2

Hi

I have only a little knowledge of VCS. After a reboot of a server I found one node in the ADMIN_WAIT state and the other node in the LEAVING state. This is a 2-node cluster.

Did this occur due to an incorrect configuration file? Correct me if I am wrong.

Please explain in which situations we see these errors,

and in which situations we find a .stale file / STALE error with the main.cf file.

Thanks and Regards