Veritas Cluster LLT link failure

RayButler
Level 2

I'm having an issue I can't identify with Veritas Cluster 4.0 MP1 on Solaris 9. The cluster supports an instance of Oracle 9i. A node falls out of membership and then reconnects a short while later. Below are snippets from /var/adm/messages and the logs from the Cisco switch.

Jul  4 04:02:50 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 04:02:51 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 04:02:53 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 04:02:58 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 04:02:58 jfkdbsp1 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 650 ticks from 1 link 0 (ce1)
Jul  4 04:02:58 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 12 hb seq 35184900 from 1 link 0 (ce1)
Jul  4 18:34:07 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 18:34:11 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 18:34:11 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 10 hb seq 35289448 from 1 link 0 (ce1)
Jul  4 18:34:13 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 18:34:19 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 8 sec (36427270)
Jul  4 18:34:20 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 18:34:20 jfkdbsp1 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 850 ticks from 1 link 0 (ce1)
Jul  4 18:34:20 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 16 hb seq 35289466 from 1 link 0 (ce1)
Jul  4 18:34:22 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 18:34:24 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 18:34:24 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 8 hb seq 35289475 from 1 link 0 (ce1)
Jul  4 18:34:35 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 18:34:41 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 8 sec (36427294)
Jul  4 18:34:42 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 9 sec (36427294)
Jul  4 18:34:43 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 10 sec (36427294)
Jul  4 18:34:44 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 11 sec (36427294)
Jul  4 18:34:45 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 12 sec (36427294)
Jul  4 18:34:46 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 13 sec (36427294)
Jul  4 18:34:47 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 14 sec (36427294)
Jul  4 18:34:48 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 15 sec (36427294)
Jul  4 18:34:49 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 16 sec (36427294)
Jul  4 18:34:49 jfkdbsp1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 0 (ce1) node 1 expired
Jul  4 18:34:54 jfkdbsp1 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port a gen   53bd93 membership 01
Jul  4 18:34:54 jfkdbsp1 gab: [ID 608499 kern.notice] GAB INFO V-15-1-20037 Port a gen   53bd93   jeopardy ;1
Jul  4 18:34:54 jfkdbsp1 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen   53bd9a membership 01
Jul  4 18:34:54 jfkdbsp1 gab: [ID 608499 kern.notice] GAB INFO V-15-1-20037 Port h gen   53bd9a   jeopardy ;1
Jul  4 18:34:54 jfkdbsp1 Had[2025]: [ID 702911 daemon.notice] VCS INFO V-16-1-10077 Received new cluster membership
Jul  4 18:34:54 jfkdbsp1 Had[2025]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10087 System jfkdbsf1 (Node '1') is in Regardy Membership - Membership: 0x3, Jeopardy: 0x2
Jul  4 18:34:55 jfkdbsp1 genunix: [ID 408789 kern.warning] WARNING: ce1: fault detected external to device; service degraded
Jul  4 18:34:55 jfkdbsp1 genunix: [ID 451854 kern.warning] WARNING: ce1: xcvr addr:0x00 - link down

Jul  4 18:34:56: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/12, changed state to down
Jul  4 18:34:57: %LINK-3-UPDOWN: Interface GigabitEthernet0/12, changed state to down

Jul  4 18:35:54 jfkdbsp1 genunix: [ID 408789 kern.notice] NOTICE: ce1: fault cleared external to device; service available
Jul  4 18:35:54 jfkdbsp1 genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x00 - link up 1000 Mbps full duplex

Jul  4 18:35:56: %LINK-3-UPDOWN: Interface GigabitEthernet0/12, changed state to up
Jul  4 18:35:58: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/12, changed state to up

I've replaced the patch cable between the server and the switch, with no change. engine_a.log goes back to the original install in 2008, and this issue has occurred 700+ times. Since these interfaces aren't plumbed by the OS, is there any way to get diagnostic information from LLT that can shed some light on the cause? I have three sites with an identical config. Site two shows the same errors, but fewer than half as many, and site three has zero errors. Any help is appreciated. Thanks.

 

 

6 REPLIES 6

mikebounds
Level 6
Partner Accredited

If you have one VLAN for both heartbeats, this can cause lost heartbeats. If you can provide the output of "lltstat -nvv", it usually shows duplicate MAC addresses when two separate VLANs are not used.

You can plumb IPs on the interfaces to diagnose the network. LLT doesn't require IPs, but it doesn't matter if they are there, except that with IPs configured it is more likely that someone will use the interfaces and affect traffic on the heartbeats.

Mike

Gaurav_S
Moderator
   VIP    Certified

Hi,

Did you get a chance to look at the traffic on the switch? Is the switch heavily loaded, or is there excessive broadcasting? It would be worth looking at this from a network perspective.

LLT is performing as expected: after the disconnect, it waited 16 seconds and then declared the node expired. Later we can see the interface going down and then coming back up.
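To see how often each link goes through this sequence, the LLT lines in /var/adm/messages can be tallied with a short script. The following is only a rough sketch: the regular expressions are written against the log lines quoted in the original post, the function name `summarize` is made up for illustration, and other LLT versions may word their messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the LLT lines quoted in the original post (assumption:
# other LLT releases may use different wording or message IDs).
LLT_LINK_RE = re.compile(
    r"LLT INFO (?P<msgid>V-14-1-\d+) link (?P<link>\d+) \((?P<nic>\w+)\) "
    r"node (?P<node>\d+) (?P<event>in trouble|active|expired|inactive)"
)
LLT_LOST_RE = re.compile(
    r"LLT INFO V-14-1-10023 lost (?P<count>\d+) hb seq \d+ from (?P<node>\d+) "
    r"link (?P<link>\d+) \((?P<nic>\w+)\)"
)


def summarize(lines):
    """Return (events, lost): state changes per (nic, event), lost heartbeats per nic."""
    events = Counter()
    lost = Counter()
    for line in lines:
        m = LLT_LINK_RE.search(line)
        if m:
            events[(m.group("nic"), m.group("event"))] += 1
            continue
        m = LLT_LOST_RE.search(line)
        if m:
            lost[m.group("nic")] += int(m.group("count"))
    return events, lost


# A few lines copied from the original post, used as sample input.
SAMPLE = """\
Jul  4 04:02:50 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul  4 04:02:51 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul  4 04:02:58 jfkdbsp1 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 650 ticks from 1 link 0 (ce1)
Jul  4 04:02:58 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 12 hb seq 35184900 from 1 link 0 (ce1)
Jul  4 18:34:19 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 8 sec (36427270)
Jul  4 18:34:49 jfkdbsp1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 0 (ce1) node 1 expired
"""

if __name__ == "__main__":
    events, lost = summarize(SAMPLE.splitlines())
    for (nic, event), count in sorted(events.items()):
        print(f"{nic}: {event} x{count}")
    for nic, count in sorted(lost.items()):
        print(f"{nic}: {count} heartbeats reported lost")
```

Running the same function over the full messages file from each site would give a per-link count that can be compared between the three sites.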

I would prefer to troubleshoot from the network side (either the switch port or the switch settings, since you have already tried replacing the cable). You can also check the interface settings at the OS layer to make sure nothing looks suspicious there (use ndd -get to see the details and compare with the other sites).

Also, the VCS version you are using is very old and went EOL long ago. I would strongly recommend upgrading to the latest version; there are lots of bug fixes in later versions.

If you want to know more about the LLT tools, have a look at the /opt/VRTSllt folder. There are a couple of tools there, such as lltstat and llttest. Some give only high-level information, while others give detailed output that needs to be sent to support.

 

G

TonyGriffiths
Level 6
Employee Accredited Certified

Hi

As Gaurav and Mike commented, LLT is performing as expected: heartbeats are missed, the threshold is reached, and the link is declared down. This is also backed up by the driver messages (ce1: fault detected, etc.).

In the extract you pasted, I could only find references to NIC ce1. Are there other messages related to the other heartbeat links in your cluster?

 

cheers

tony

 

RayButler
Level 2

Thanks for the comments. This link is in its own dedicated VLAN, and switch port usage is low, with only the LLT traffic. The switch backbone does see high load during backups, but the interface drops occur both during and outside of the backup windows. We use both ce1 and ce6 for LLT; ce1 is in a dedicated VLAN, but ce6 is using VLAN1.

bash-2.05# lltstat -l
LLT link information:
    Link  Tag   State  Type  Pri     SAP    MTU    Addrlen
          Xmit          Recv          Err           LateHB
          Broadcast
    0            ce1  on     etherfp   hipri   0xCAFE 1500   6
          1292283       999385        237795        0             4
          FF:FF:FF:FF:FF:FF

    1            ce6  on     etherfp   hipri   0xCAFE 1500   6
          1295561       1004699       240467        0             6
          FF:FF:FF:FF:FF:FF

bash-2.05# lltstat -nvv|head
LLT node information:
    Node                 State    Link  Status  Address
   * 0 jfkdbsp1          OPEN
                                  ce1   UP      00:03:BA:93:2D:08
                                  ce6   UP      00:03:BA:85:53:60
     1 jfkdbsf1          OPEN
                                  ce1   UP      00:03:BA:93:2C:DB
                                  ce6   UP      00:03:BA:95:05:81
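
The duplicate-MAC check Mike mentioned can be run mechanically against this kind of output. Here is a minimal sketch, assuming the `lltstat -nvv` layout is exactly as pasted above (a node line followed by one line per link); the helper names `parse_nvv` and `find_duplicate_macs` are made up for illustration and are not part of any Veritas tool.

```python
import re
from collections import defaultdict

# Node line, e.g. "   * 0 jfkdbsp1          OPEN" (the "*" marks the local node).
NODE_RE = re.compile(r"^\s*\*?\s*(\d+)\s+(\S+)\s+(\w+)\s*$")
# Link line, e.g. "      ce1   UP      00:03:BA:93:2D:08".
LINK_RE = re.compile(
    r"^\s*(\w+)\s+(\w+)\s+((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s*$"
)


def parse_nvv(text):
    """Return a list of (node_name, nic, mac) tuples from lltstat -nvv text."""
    links = []
    node = None
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m and node is not None:
            links.append((node, m.group(1), m.group(3).upper()))
            continue
        m = NODE_RE.match(line)
        if m:
            node = m.group(2)
    return links


def find_duplicate_macs(links):
    """Map each MAC seen more than once to the (node, nic) pairs reporting it."""
    by_mac = defaultdict(list)
    for node, nic, mac in links:
        by_mac[mac].append((node, nic))
    return {mac: users for mac, users in by_mac.items() if len(users) > 1}


# Output copied from the post above.
SAMPLE = """\
LLT node information:
    Node                 State    Link  Status  Address
   * 0 jfkdbsp1          OPEN
                                  ce1   UP      00:03:BA:93:2D:08
                                  ce6   UP      00:03:BA:85:53:60
     1 jfkdbsf1          OPEN
                                  ce1   UP      00:03:BA:93:2C:DB
                                  ce6   UP      00:03:BA:95:05:81
"""

if __name__ == "__main__":
    duplicates = find_duplicate_macs(parse_nvv(SAMPLE))
    print(duplicates if duplicates else "no duplicate MAC addresses found")
```

For the output pasted here, all four addresses are distinct, so this particular check does not flag anything.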

Gaurav_S
Moderator
   VIP    Certified

The errors are definitely looking high, in fact on both links. I think it would be worth getting the switch checked.
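
To put a number on "high", the counters from the lltstat -l output can be turned into a ratio. This is a rough sketch only: the values are copied from the post, the mapping of the first three numbers to the Xmit, Recv and Err headers is assumed to be positional, and the thread does not say exactly what LLT counts under Err.

```python
# Counters copied from the lltstat -l output in this thread.
# Assumption: the first three numbers on each row line up with the
# Xmit, Recv and Err column headers in that order.
LINKS = {
    "ce1": {"xmit": 1292283, "recv": 999385, "err": 237795},
    "ce6": {"xmit": 1295561, "recv": 1004699, "err": 240467},
}


def error_ratio(counters, against="xmit"):
    """Return err divided by the chosen packet counter (0.0 if that counter is 0)."""
    total = counters[against]
    if total == 0:
        return 0.0
    return counters["err"] / total


if __name__ == "__main__":
    for nic, counters in LINKS.items():
        print(
            f"{nic}: err/xmit = {error_ratio(counters):.1%}, "
            f"err/recv = {error_ratio(counters, 'recv'):.1%}"
        )
```

On these figures the ratio comes out at a similar level for ce1 and ce6, which fits the observation that both links are affected rather than just the one logging link-down events.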

 

If it is possible to test something, you could try crossover cables so that you can eliminate the switch and see how it performs.

This adds a few steps, though. Don't remove both links together; try putting one link on a crossover cable and see if the situation improves.

G

RayButler
Level 2

The nodes are in different server rooms, so I can't cross-connect the links, but I will look at the network config. I don't understand why we don't have dedicated VLANs for both links. Thanks.