vostrushka · ‎01-20-2013

My cluster was configured some time ago with two dedicated LLT links and one low-priority LLT link.
About a week ago I found that one of the dedicated links was down. It turned out that the ports on the switch had been disabled.
I asked for them to be re-enabled, and the OS on both nodes now sees the interfaces as running. But the cluster still shows that one link is down.
Please find below the lltstat output and the llttab files from both nodes:

As you can see, each node sees all three of its own links but only two from the other node. I thought that restarting the cluster, and perhaps the node, would help, but I am not sure. I also do not want to restart the application if it is not necessary.

My question: will it help to stop the cluster with hastop -all -force and then restart or reconfigure the LLT links, or do I really have to stop everything, fix the LLT links, and then start the application? Only one node of the cluster is active; the other does not have any application services running.

node-ora01:root ~ # lltstat -n
LLT node information:
    Node                 State    Links
   * 0 node-ora01   OPEN        3
     1 node-ora02   OPEN        2

node-ora01:root ~ # cat /etc/llttab
set-node node-ora01
set-cluster 3415
link eth1 eth-9c:8e:99:fa:21:0a - ether - -
link eth3 eth-9c:8e:99:fa:21:0e - ether - -
link-lowpri bond0 bond0 - ether - -

node-ora02:root ~ # lltstat -n
LLT node information:
    Node                 State    Links
     0 node-ora01   OPEN        2
   * 1 node-ora02   OPEN        3

node-ora02:root ~ # cat /etc/llttab
set-node node-ora02
set-cluster 3415
link eth1 eth-9c:8e:99:f9:ec:bc - ether - -
link eth3 eth-9c:8e:99:f9:ec:c0 - ether - -
link-lowpri bond0 bond0 - ether - -

vostrushka · ‎01-31-2013

I did get downtime for the cluster, but it turned out that something is wrong with the ports on the switch or with the cables.

So far each node sees its own NIC as up and running, but the other node does not see it.

I also found how to disable and re-enable an LLT link without restarting the whole stack or the cluster:

lltconfig -u eth3

lltconfig -t eth3 -d eth3

Thank you all.

Leonid
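
For reuse on other links, the two commands above can be wrapped as a small helper. This is only a sketch: `relink_cmds` is a hypothetical name, it prints the commands rather than running them, and it assumes the device name equals the link tag (as in the post) unless a second argument is given. In llttab the device is written in the `eth-<MAC>` form, so the second argument may be needed on some setups.

```shell
#!/bin/sh
# Print the commands to unconfigure and re-add one LLT link, then re-check it.
# usage: relink_cmds <tag> [device]   (device defaults to the tag; an assumption)
relink_cmds() {
    tag=$1
    dev=${2:-$1}
    echo "lltconfig -u $tag"            # remove the link from the running LLT
    echo "lltconfig -t $tag -d $dev"    # add it back under the same tag
    echo "lltstat -nvv"                 # confirm the peer now shows the link UP
}
```

Running `relink_cmds eth3` prints the three commands so they can be reviewed before being pasted into a root shell on the affected node.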

mikebounds · ‎01-20-2013

Can you post the output of "lltstat -nvv" from each node? It sounds as though the connection is broken between one pair of interfaces. You can test this by plumbing IPs onto the broken link and seeing whether you can ping across it.

Mike

vostrushka · ‎01-20-2013

Please find below lltstat -nvv output:

Node 1
---------
LLT node information:
    Node                 State    Link  Status  Address
   * 0 node-ora01        OPEN
                                  eth1   UP      9C:8E:99:FA:21:0A
                                  eth3   UP      9C:8E:99:FA:21:0E
                                  bond0  UP      9C:8E:99:FA:21:08
     1 node-ora02        OPEN
                                  eth1   UP      9C:8E:99:F9:EC:BC
                                  eth3   DOWN
                                  bond0  UP      9C:8E:99:F9:EC:BA
     2                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN
                                  bond0  DOWN

Node 2
---------
LLT node information:
    Node                 State    Link  Status  Address
     0 node-ora01        OPEN
                                  eth1   UP      9C:8E:99:FA:21:0A
                                  eth3   DOWN
                                  bond0  UP      9C:8E:99:FA:21:08
   * 1 node-ora02        OPEN
                                  eth1   UP      9C:8E:99:F9:EC:BC
                                  eth3   UP      9C:8E:99:F9:EC:C0
                                  bond0  UP      9C:8E:99:F9:EC:BA
     2                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN
                                  bond0  DOWN

mikebounds · ‎01-21-2013

This output means the connection between the two eth3 interfaces is down. The output from node-ora02 should be interpreted as follows:

     0 node-ora01   OPEN
          eth1   UP     9C:8E:99:FA:21:0A   Can see interface on other node
          eth3   DOWN                       Can NOT see interface on other node
          bond0  UP     9C:8E:99:FA:21:08   Can see interface on other node
   * 1 node-ora02   OPEN
          eth1   UP     9C:8E:99:F9:EC:BC   Local interface is UP
          eth3   UP     9C:8E:99:F9:EC:C0   Local interface is UP
          bond0  UP     9C:8E:99:F9:EC:BA   Local interface is UP

So the local interfaces are OK, but the connection is broken for eth3.

Mike
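
The reading of the output described above can be automated with a small filter. This is a minimal sketch, not a Veritas tool: `down_links` is a hypothetical helper that reads `lltstat -nvv` output on stdin and assumes the layout shown in this thread (a node line with an optional `*`, a numeric ID, an optional name and a state, followed by one line per link). It reports DOWN links only for nodes in the OPEN state, so unused node slots in CONNWAIT are ignored.

```shell
#!/bin/sh
# List LLT links that are DOWN toward nodes that are otherwise OPEN.
# Reads `lltstat -nvv` output on stdin; layout assumed as posted in this thread.
down_links() {
    awk '
        # Drop the leading "*" that marks the local node, so fields line up.
        { sub(/^[ \t]*\*[ \t]*/, "") }
        # Node line: numeric ID, optional node name, then the state.
        $1 ~ /^[0-9]+$/ { state = $NF; node = (NF >= 3 ? $2 : $1); next }
        # Link line: "<tag> UP <mac>" or "<tag> DOWN".
        state == "OPEN" && $2 == "DOWN" { print node ": " $1 " DOWN" }
    '
}
```

On a cluster node it would be used as `lltstat -nvv | down_links`. Fed the output posted from node-ora01, it prints a single line naming eth3 toward node-ora02.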

vostrushka · ‎01-21-2013

Yes, I understand that.

The way I tried to restart it this morning did not help. Perhaps I did not follow the right sequence.

I'll try it on my test cluster first and then try again.

Leonid

mikebounds · ‎01-22-2013

Perhaps I wasn't clear enough in my last post: this is not an issue with the cluster or the node, it is an issue with the cables or the switch. You do not need to restart anything on the host for it to see a network connection that was previously broken. The only possible problem on the host side is the eth3 interfaces on the two machines running at different speeds, but if it was working previously and you have not changed anything, this is very unlikely. As I said earlier, to verify that the link is down, I would plumb IPs on the interfaces. For example:

First test that eth1 works with IPs (i.e. you are checking that there is no firewall that allows LLT but blocks ping):

plumb 1.1.1.1, mask 255.255.255.0 on eth1, node-ora01

plumb 1.1.1.2, mask 255.255.255.0 on eth1, node-ora02

Then test that you can ping 1.1.1.2 from node-ora01. If that doesn't work, you could try ssh, traceroute or "telnet 1.1.1.2 port" to see whether other ports work. You can test the connection in the other direction too.

Once you have verified this works, test eth3:

plumb 1.1.3.1, mask 255.255.255.0 on eth3, node-ora01

plumb 1.1.3.2, mask 255.255.255.0 on eth3, node-ora02

Repeat the connection tests for 1.1.3.2 from node-ora01.

Mike
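
On Linux, the test described above might look like the following. This is a sketch under assumptions: it uses the standard `ip`, `ping` and `ethtool` commands rather than anything VCS-specific, `plumb_cmds` is a hypothetical helper name, and it only prints the commands so they can be reviewed and run as root by hand. The addresses are the example ones from the post, and /24 corresponds to the 255.255.255.0 mask.

```shell
#!/bin/sh
# Print the commands for a temporary-IP test of one private link.
# usage: plumb_cmds <interface> <local_ip> <peer_ip>
plumb_cmds() {
    dev=$1
    local_ip=$2
    peer_ip=$3
    echo "ethtool $dev"                         # compare Speed/Duplex on both nodes
    echo "ip addr add $local_ip/24 dev $dev"    # plumb the temporary address
    echo "ping -c 3 -I $dev $peer_ip"           # run only after BOTH nodes are plumbed
    echo "ip addr del $local_ip/24 dev $dev"    # remove the address when finished
}
```

For eth3 this would be `plumb_cmds eth3 1.1.3.1 1.1.3.2` on node-ora01 and `plumb_cmds eth3 1.1.3.2 1.1.3.1` on node-ora02. Add the addresses on both nodes before running either ping, and delete them afterwards so the link is left exactly as LLT expects it.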

vostrushka · ‎01-22-2013

I see. No, I am not giving up. ;)

I am just taking a step back to see what I can do. I will do some testing with IP addresses and check the speed.

Then I will try it out on my test cluster. I'll report in a couple of days on what the solution turns out to be.

Leonid

avsrini · ‎01-30-2013

Hi Leonid,

If you are only running VCS with GAB ports a and h, then yes, you can force-stop VCS with the applications running and then restart GAB / LLT for eth3 to come up. But as others mentioned, first make sure eth3 is connected between the nodes by pinging a temporary IP.

Regards

Srini

vostrushka · ‎01-31-2013

I did get down time for the cluster but it turned out that something wrong with ports on the switch or cables.

So far only node itself sees NIC is up and running, other node does not.

I also found how to enable or disable LLT link without restarting the whole stack or the cluster:

lltconfig -u eth3

lltconfig -t eth3 -d eth3

Thank you all.

Leonid

LLT connections mismatch