Veritas Cluster LLT link failure
I'm having an issue with Veritas Cluster 4.0 MP1 on Solaris 9 that I'm unable to identify. The cluster supports an instance of Oracle 9i. The node falls out of membership and then reconnects a short while later. Below are snippets from /var/adm/messages and the logs from the Cisco switch.
Jul 4 04:02:50 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 04:02:51 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul 4 04:02:53 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 04:02:58 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul 4 04:02:58 jfkdbsp1 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 650 ticks from 1 link 0 (ce1)
Jul 4 04:02:58 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 12 hb seq 35184900 from 1 link 0 (ce1)
Jul 4 18:34:07 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 18:34:11 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul 4 18:34:11 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 10 hb seq 35289448 from 1 link 0 (ce1)
Jul 4 18:34:13 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 18:34:19 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 8 sec (36427270)
Jul 4 18:34:20 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul 4 18:34:20 jfkdbsp1 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 850 ticks from 1 link 0 (ce1)
Jul 4 18:34:20 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 16 hb seq 35289466 from 1 link 0 (ce1)
Jul 4 18:34:22 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 18:34:24 jfkdbsp1 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce1) node 1 active
Jul 4 18:34:24 jfkdbsp1 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 8 hb seq 35289475 from 1 link 0 (ce1)
Jul 4 18:34:35 jfkdbsp1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce1) node 1 in trouble
Jul 4 18:34:41 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 8 sec (36427294)
Jul 4 18:34:42 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 9 sec (36427294)
Jul 4 18:34:43 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 10 sec (36427294)
Jul 4 18:34:44 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 11 sec (36427294)
Jul 4 18:34:45 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 12 sec (36427294)
Jul 4 18:34:46 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 13 sec (36427294)
Jul 4 18:34:47 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 14 sec (36427294)
Jul 4 18:34:48 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 15 sec (36427294)
Jul 4 18:34:49 jfkdbsp1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (ce1) node 1 inactive 16 sec (36427294)
Jul 4 18:34:49 jfkdbsp1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 0 (ce1) node 1 expired
Jul 4 18:34:54 jfkdbsp1 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port a gen 53bd93 membership 01
Jul 4 18:34:54 jfkdbsp1 gab: [ID 608499 kern.notice] GAB INFO V-15-1-20037 Port a gen 53bd93 jeopardy ;1
Jul 4 18:34:54 jfkdbsp1 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen 53bd9a membership 01
Jul 4 18:34:54 jfkdbsp1 gab: [ID 608499 kern.notice] GAB INFO V-15-1-20037 Port h gen 53bd9a jeopardy ;1
Jul 4 18:34:54 jfkdbsp1 Had[2025]: [ID 702911 daemon.notice] VCS INFO V-16-1-10077 Received new cluster membership
Jul 4 18:34:54 jfkdbsp1 Had[2025]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10087 System jfkdbsf1 (Node '1') is in Regardy Membership - Membership: 0x3, Jeopardy: 0x2
Jul 4 18:34:55 jfkdbsp1 genunix: [ID 408789 kern.warning] WARNING: ce1: fault detected external to device; service degraded
Jul 4 18:34:55 jfkdbsp1 genunix: [ID 451854 kern.warning] WARNING: ce1: xcvr addr:0x00 - link down
Jul 4 18:34:56: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/12, changed state to down
Jul 4 18:34:57: %LINK-3-UPDOWN: Interface GigabitEthernet0/12, changed state to down
Jul 4 18:35:54 jfkdbsp1 genunix: [ID 408789 kern.notice] NOTICE: ce1: fault cleared external to device; service available
Jul 4 18:35:54 jfkdbsp1 genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x00 - link up 1000 Mbps full duplex
Jul 4 18:35:56: %LINK-3-UPDOWN: Interface GigabitEthernet0/12, changed state to up
Jul 4 18:35:58: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/12, changed state to up
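In this excerpt LLT declares node 1 expired on ce1 at 18:34:49, six seconds before Solaris logs "ce1 ... link down" at 18:34:55 and seven seconds before the switch logs the line protocol on Gi0/12 going down (assuming the server and switch clocks agree). To see whether that ordering holds for every occurrence rather than just this one, here is a minimal Python sketch. It assumes the messages files are plain text in the same format as the lines above. The function names and the 120-second pairing window are illustrative choices and are not part of any Veritas tool.

```python
import re
from datetime import datetime

# Patterns copied from the wording of the excerpt above.
EXPIRED_RE = re.compile(r"LLT INFO V-14-1-10033 link \d+ \((\w+)\) node \d+ expired")
LINK_DOWN_RE = re.compile(r"WARNING: (\w+): xcvr addr:\S+ - link down")


def parse_ts(line):
    """Parse the leading 'Mon D HH:MM:SS' syslog stamp.

    Syslog does not record the year, so a placeholder leap year (2000) is
    prepended; only differences between stamps are used below.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"2000 {month} {day} {clock.rstrip(':')}",
                             "%Y %b %d %H:%M:%S")


def expiry_to_link_down(lines, iface, window_s=120):
    """Pair each LLT 'expired' on iface with the next OS 'link down' on iface.

    Returns (expired_at, link_down_at, gap_seconds) tuples. A positive gap
    means LLT gave up on the peer before the NIC reported losing link.
    """
    pairs, pending = [], None
    for line in lines:
        m = EXPIRED_RE.search(line)
        if m and m.group(1) == iface:
            pending = parse_ts(line)  # remember the most recent expiry
            continue
        d = LINK_DOWN_RE.search(line)
        if d and d.group(1) == iface and pending is not None:
            down = parse_ts(line)
            gap = (down - pending).total_seconds()
            if 0 <= gap <= window_s:
                pairs.append((pending, down, gap))
            pending = None  # each expiry is paired at most once
    return pairs
```

Fed the lines above, it would report one pair on ce1 with a 6-second gap. Running it over the full history for each site would show whether the heartbeat loss consistently precedes the physical link drop.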
I've replaced the patch cable between the server and the switch, with no change. The engine_a.log file goes back to the original install in 2008, and this issue has occurred 700+ times. Since these interfaces aren't plumbed by the OS, is there any way to get diagnostic information from LLT that can shed some light on the cause? I have three sites with identical configurations. Of the three, site two also shows these errors, but fewer than half as many, and site three has none. Any help is appreciated. Thanks.
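One rough way to put numbers on the site-to-site difference is to tally the LLT message IDs in each site's messages files. The following Python sketch keys on the V-14-1 IDs that appear in the excerpt above. It labels each ID with the wording its line carries and sums the "lost N hb" counts per interface. The script name llt_summary.py is hypothetical, and the script assumes plain-text syslog files in the same format as the excerpt.

```python
import re
import sys
from collections import Counter

# Labels taken from the wording of the LLT lines in the excerpt above.
LLT_LABELS = {
    "V-14-1-10205": "in trouble",
    "V-14-1-10024": "active",
    "V-14-1-10019": "delayed hb",
    "V-14-1-10023": "lost hb",
    "V-14-1-10032": "inactive",
    "V-14-1-10033": "expired",
}

LLT_RE = re.compile(r"LLT INFO (V-14-1-\d+)")
LOST_RE = re.compile(r"lost (\d+) hb seq \d+ from \d+ link \d+ \((\w+)\)")


def summarize_llt(lines):
    """Count LLT messages by label and total lost heartbeats per interface."""
    by_label = Counter()
    lost_by_iface = Counter()
    for line in lines:
        m = LLT_RE.search(line)
        if not m:
            continue  # gab, Had, genunix and switch lines are ignored
        by_label[LLT_LABELS.get(m.group(1), m.group(1))] += 1
        lost = LOST_RE.search(line)
        if lost:
            lost_by_iface[lost.group(2)] += int(lost.group(1))
    return by_label, lost_by_iface


if __name__ == "__main__":
    # Usage (hypothetical): python llt_summary.py /var/adm/messages /var/adm/messages.0
    totals, lost = Counter(), Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            t, l = summarize_llt(f)
        totals.update(t)
        lost.update(l)
    for label, n in totals.most_common():
        print(f"{label:12} {n}")
    for iface, n in lost.items():
        print(f"lost hb on {iface}: {n}")
```

Running the same script at all three sites gives directly comparable counts, for example whether site two's events are also confined to one link.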