07-15-2015 10:00 AM
Hi all,
I am running an SFRAC environment, and one node of my two-node cluster frequently reboots unexpectedly.
I went through the OS logs and the VCS engine_A log but didn't find any clue.
Is this an eviction?
When I run lltstat, I can see some error counters - what exactly do they mean?
LLT errors:
0 Rcv not connected
0 Rcv unconfigured
0 Rcv bad dest address
0 Rcv bad source address
0 Rcv bad generation
0 Rcv no buffer
0 Rcv malformed packet
0 Rcv wrong length packet
0 Rcv bad SAP
0 Rcv bad STREAM primitive
0 Rcv bad DLPI primitive
0 Rcv DLPI error
120 Snd not connected
0 Snd no buffer
0 Snd stream flow drops
42867 Snd no links up
0 Rcv bad checksum
0 Rcv bad udp/ether source address
0 Rcv DLPI link-down error
How can I efficiently trace these reboots?
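One way to make the counters meaningful is to sample them over time, so a jump in "Snd no links up" can be correlated with a reboot timestamp. A minimal sketch, assuming the VRTSllt package is installed and `lltstat` is in the PATH (the log file path is an arbitrary choice):

```shell
# Sample LLT link state and the two non-zero error counters every minute.
# "Snd no links up" incrementing while the system runs points at the
# private interconnect; a static count may just date from boot time.
while true; do
    date
    lltstat -nvv active                                # per-node, per-link state
    lltstat | grep -E 'Snd no links up|Snd not connected'
    sleep 60
done >> /var/tmp/lltstat_trace.log 2>&1
```

Comparing two samples taken a few hours apart tells you whether the counters are live or historical.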
07-15-2015 07:50 PM
There could be many reasons, ranging from hardware problems to network link problems, system load, or the applications themselves.
Please provide your engine log and system kernel logs so the issue can be determined.
If sufficient system logs are not being generated, edit syslog.conf to log the required messages.
Also, are any core dumps being generated?
Regards,
Sudhir
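As a hedged example of the syslog.conf change Sudhir suggests (the exact facility names, file paths, and refresh command vary by OS; this follows the classic Solaris/AIX syslogd syntax, where fields must be tab-separated):

```shell
# Example /etc/syslog.conf entry to capture kernel and daemon messages:
#
#   kern.debug;daemon.notice        /var/adm/messages
#
# The target file must already exist (touch /var/adm/messages), then
# refresh syslogd so it rereads its configuration, e.g. on Solaris 10+:
#   svcadm refresh svc:/system/system-log:default
```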
07-15-2015 08:41 PM
Hi,
I agree with Sudhir, there are any number of possibilities.
I would recommend configuring a crash dump on the server. If an unexpected reboot happens, it should generate a system dump; provide that to the vendor and have the crash dump analyzed.
Are you saying there is no panic string or any related message in the errpt log at, or just before, the time of the reboot? Do you see any VCS action just before the reboot happens?
Regarding the LLT packets, there is no way to tell when these errors occurred. Do you see the errors increasing while the system is up and running?
G
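The original post doesn't name the platform (the mention of errpt suggests AIX, while Mike's reply assumes Solaris), so here is a hedged sketch of verifying crash-dump configuration on either:

```shell
# On Solaris, dumpadm shows and configures the dump device:
dumpadm                  # current dump device and savecore directory
dumpadm -y               # enable running savecore automatically on reboot

# On AIX (if errpt is available, this is likely the platform):
sysdumpdev -l            # list the primary/secondary dump devices
errpt -a | more          # detailed error-log entries around the reboot time
```

With a dump device configured, an unexpected reboot caused by a panic should leave a crash dump the vendor can analyze.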
07-16-2015 12:53 AM
If VCS panics the box via fencing, then there should be messages in the O/S system log (it is not shown in the VCS log) - this is certainly the case for Solaris. So if there are no messages, something else may be causing the reboot, and I would increase O/S logging and enable crash dumps as others have said.
If it is the same node that is rebooting and you still suspect fencing, then you can configure preferred fencing (see "Preferred fencing" in the VCS admin guide) to give a higher weight to the node that doesn't reboot, and see whether the other node starts rebooting instead.
Mike
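A hedged sketch of system-based preferred fencing as Mike describes it; the node names here are hypothetical placeholders, and the attribute names should be checked against your VCS version's admin guide:

```shell
# Make the VCS configuration writable, set system-based preferred
# fencing, and bias the race toward the node that should survive.
haconf -makerw
haclus -modify PreferredFencingPolicy System
hasys -modify nodeA FencingWeight 100    # nodeA: node that should win the race
hasys -modify nodeB FencingWeight 10     # nodeB: the frequently rebooting node
haconf -dump -makero
```

If the rebooting node then survives fencing races, that strongly suggests the reboots were fencing panics rather than an unrelated hardware or OS problem.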
10-09-2015 02:44 AM
Sorry, it's been a while since I last came to the forum.
For the above case, the problem was ultimately caused by a fencing issue.
The nodes were addressing the shared fencing disks in different ways (checked with vxddladm get namingscheme).
I set the naming scheme on both nodes to enclosure-based with the following command:
vxddladm set namingscheme=ebn persistence=yes
and then I updated /etc/vxfentab on both nodes.
It is also possible to clear the stale keys and verify with vxfenadm -s all -f /etc/vxfentab.
After that, there were no stale keys and no reboot was recorded again.
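For anyone hitting the same symptom, a hedged sketch of the key inspection and cleanup steps described above (the vxfenclearpre path is the usual install location; verify it on your system, and run it only with the cluster/fencing stopped per the SFRAC documentation):

```shell
# Read the SCSI-3 registration keys on the coordinator disks from each
# node; the registered keys should be consistent across both nodes.
vxfenadm -s all -f /etc/vxfentab

# If stale keys from an ejected node remain, the vxfenclearpre utility
# can clear them (destructive to registrations -- read the docs first).
/opt/VRTSvcs/vxfen/bin/vxfenclearpre
```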
10-09-2015 03:32 AM
Glad to know ..