Forum Discussion

vostrushka
Level 3
12 years ago

LLT connections mismatch

My cluster was configured some time ago with two dedicated LLT links and one low-priority link.
About a week ago I found that one of the dedicated links was down. It turned out that the ports on the switch had been disabled.
So I asked for them to be enabled, and the OS on both nodes now shows the interfaces as running. But the cluster still reports that one link is down.
Please find below the lltstat output and the llttab files from both nodes:

As you can see, each node sees all of its own links but only two from the other node. I thought a restart of the cluster, and perhaps of the node, might help, but I am not sure. I also do not want to restart the application if it is not necessary.

My question is: will it help to stop the cluster with hastop -all -force and then restart or reconfigure the LLT links, or do I really have to stop everything, fix the LLT links, and then start the application? Only one node of the cluster is active; the other does not have any application services running.


node-ora01:root ~ # lltstat -n
LLT node information:
    Node                 State    Links
   * 0 node-ora01   OPEN        3
     1 node-ora02   OPEN        2

node-ora01:root ~ # cat /etc/llttab
set-node node-ora01
set-cluster 3415
link eth1 eth-9c:8e:99:fa:21:0a - ether - -
link eth3 eth-9c:8e:99:fa:21:0e - ether - -
link-lowpri bond0 bond0 - ether - -
 

node-ora02:root ~ # lltstat -n
LLT node information:
    Node                 State    Links
     0 node-ora01   OPEN        2
   * 1 node-ora02   OPEN        3

node-ora02:root ~ # cat /etc/llttab
set-node node-ora02
set-cluster 3415
link eth1 eth-9c:8e:99:f9:ec:bc - ether - -
link eth3 eth-9c:8e:99:f9:ec:c0 - ether - -
link-lowpri bond0 bond0 - ether - -

 


  • Can you post the output of "lltstat -nvv" from each node. It sounds as though the connection is broken between a pair of the interfaces - you can test this by plumbing IPs onto the broken link and seeing if you can ping.

    Mike

  • Please find below the lltstat -nvv output:

     

    Node 1
    ---------
    LLT node information:
        Node                 State    Link  Status  Address
       * 0 node-ora01   OPEN    
                                      eth1   UP      9C:8E:99:FA:21:0A
                                      eth3   UP      9C:8E:99:FA:21:0E
                                      bond0   UP      9C:8E:99:FA:21:08
         1 node-ora02   OPEN    
                                      eth1   UP      9C:8E:99:F9:EC:BC
                                      eth3   DOWN    
                                      bond0   UP      9C:8E:99:F9:EC:BA
         2                   CONNWAIT
                                      eth1   DOWN    
                                      eth3   DOWN    
                                      bond0   DOWN    
     
    Node 2
    ---------
    LLT node information:
        Node                 State    Link  Status  Address
         0 node-ora01   OPEN    
                                      eth1   UP      9C:8E:99:FA:21:0A
                                      eth3   DOWN    
                                      bond0   UP      9C:8E:99:FA:21:08
       * 1 node-ora02   OPEN    
                                      eth1   UP      9C:8E:99:F9:EC:BC
                                      eth3   UP      9C:8E:99:F9:EC:C0
                                      bond0   UP      9C:8E:99:F9:EC:BA
         2                   CONNWAIT
                                      eth1   DOWN    
                                      eth3   DOWN    
                                      bond0   DOWN    
     
  • This output means the connection between the eth3 interfaces is down. The output should be interpreted as follows:

     

         0 node-ora01   OPEN
                                      eth1    UP      9C:8E:99:FA:21:0A   <- Can see interface on other node
                                      eth3    DOWN                        <- Can NOT see interface on other node
                                      bond0   UP      9C:8E:99:FA:21:08   <- Can see interface on other node
       * 1 node-ora02   OPEN
                                      eth1    UP      9C:8E:99:F9:EC:BC   <- Local interface is UP
                                      eth3    UP      9C:8E:99:F9:EC:C0   <- Local interface is UP
                                      bond0   UP      9C:8E:99:F9:EC:BA   <- Local interface is UP
     
    So the local interfaces are OK, but the connection is broken for eth3.
     
    Mike

     

  • Yes, I understand that.

    The way I tried to restart it this morning did not help. Perhaps I did not follow the right sequence.

    I'll try it on my test cluster first and then try again.

    Leonid

  • Perhaps I wasn't clear enough in my last post - this is not an issue with the cluster or the node, it is an issue with the cables or the switch. You do not need to restart anything on the host for it to see a network connection that was previously broken. The only possible problem on the host side would be if the eth3 interfaces on the two machines are running at different speeds, but if it was working previously and you have not changed anything, then this is very unlikely. As I said earlier, to verify that the link is down, what I would do is plumb IPs on the interfaces - example:

    First test that eth1 works with IPs (i.e. you are testing that there are no firewalls that allow LLT but not ping):

    plumb 1.1.1.1, mask 255.255.255.0 on eth1, node-ora01 

    plumb 1.1.1.2, mask 255.255.255.0 on eth1, node-ora02

    Then test that you can ping 1.1.1.2 from node-ora01; if that doesn't work, you could try ssh, traceroute or "telnet 1.1.1.2 <port>" to see whether other ports work. You can test the connection the other way too.

    Once you have verified this works - then test eth3:

     

    plumb 1.1.3.1, mask 255.255.255.0 on eth3, node-ora01 

    plumb 1.1.3.2, mask 255.255.255.0 on eth3, node-ora02

    Repeat the connection tests for 1.1.3.2 from node-ora01 (a Linux command sketch of these steps follows at the end of this reply).

    Mike
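
    A minimal sketch of those steps on Linux, assuming the LLT NICs carry no other addresses and that iproute2, ping and ethtool are available (the 1.1.1.x / 1.1.3.x addresses are just the examples above):

    # On node-ora01 (temporary test addresses only)
    ip addr add 1.1.1.1/24 dev eth1
    ip addr add 1.1.3.1/24 dev eth3

    # On node-ora02
    ip addr add 1.1.1.2/24 dev eth1
    ip addr add 1.1.3.2/24 dev eth3

    # From node-ora01: eth1 should answer, eth3 is the suspect link
    ping -c 3 -I eth1 1.1.1.2
    ping -c 3 -I eth3 1.1.3.2

    # Optional: confirm both eth3 NICs negotiated the same speed/duplex
    ethtool eth3

    # Remove the temporary addresses afterwards, e.g. on node-ora01:
    ip addr del 1.1.1.1/24 dev eth1
    ip addr del 1.1.3.1/24 dev eth3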

  • I see. No, I am not giving up. ;-)

    I am just taking a step back to see what I can do. I will do some testing with IP addresses and check the link speeds.

    Then I'll try to reproduce it on my test cluster. I'll report back in a couple of days with the solution.

    Leonid

  • Hi Leonid,

    If you are only running VCS with GAB ports a and h, then yes, you can force-stop VCS while leaving the applications running and then restart GAB/LLT for eth3 to come up (a rough sketch of the sequence is below). But as others mentioned, first make sure eth3 is connected between the nodes by pinging a temporary IP.

     

    Regards

    Srini
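
    A rough sketch of that sequence, assuming a stock VCS command set and that only GAB ports a and h are in use; on many installs the /etc/init.d/llt and /etc/init.d/gab scripts are used instead of the raw lltconfig/gabconfig calls:

    gabconfig -a            # confirm only ports a (GAB) and h (VCS) are open
    hastop -all -force      # run once: stops VCS on all nodes, applications keep running

    # Then on each node in turn:
    gabconfig -U            # unconfigure GAB (it sits on top of LLT)
    lltconfig -U            # unconfigure LLT
    lltconfig -c            # reconfigure LLT from /etc/llttab
    gabconfig -c -n 2       # reseed GAB for this 2-node cluster
    hastart                 # restart VCS

    lltstat -nvv            # verify all three links now show UP for both nodes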

     

  • I did get downtime for the cluster, but it turned out that something was wrong with the ports on the switch or the cables.

    So far only the node itself sees its NIC as up and running; the other node does not.

    I also found out how to disable and re-enable an LLT link without restarting the whole stack or the cluster:

    lltconfig -u eth3

    lltconfig -t eth3 -d eth3

    Thank you all.

    Leonid
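
    For reference, a minimal sketch of how those two commands fit together with a verification step, assuming the eth3 tag and device from the llttab files above, run on the node whose link shows DOWN:

    lltconfig -u eth3            # drop the eth3 LLT link (by its llttab tag)
    lltconfig -t eth3 -d eth3    # re-add it: -t <tag> -d <device>
    lltstat -nvv                 # confirm eth3 now shows UP for both nodes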