unexpected reboot

kwakou
Level 4

Hi all,

I am running an SFRAC environment, and one node of my two-node cluster frequently reboots unexpectedly.

I went through the OS logs and the VCS engine_A log but didn't find any clue.

Is that an eviction?

When I run lltstat, I can see some error counters. What exactly do they mean?

LLT errors:
    0          Rcv not connected
    0          Rcv unconfigured
    0          Rcv bad dest address
    0          Rcv bad source address
    0          Rcv bad generation
    0          Rcv no buffer
    0          Rcv malformed packet
    0          Rcv wrong length packet
    0          Rcv bad SAP
    0          Rcv bad STREAM primitive
    0          Rcv bad DLPI primitive
    0          Rcv DLPI error
    120        Snd not connected
    0          Snd no buffer
    0          Snd stream flow drops
    42867      Snd no links up
    0          Rcv bad checksum
    0          Rcv bad udp/ether source address
    0          Rcv DLPI link-down error

 

How can I trace these reboots efficiently?

Accepted Solutions

sudhir_h
Level 4
Employee

There could be many causes: hardware problems, network link problems, system load, applications, and so on.

Please provide your engine log and system kernel logs so that we can determine the issue.

If the system is not logging enough detail, edit syslog.conf so that the required messages are logged.

Also, are any core dumps being generated?

 

Regards,

Sudhir

Gaurav_S
Moderator
Moderator
   VIP    Certified

Hi,

I agree with Sudhir, there are any number of possibilities.

I would recommend configuring crash dumps on the server. If an unexpected reboot happens, the system should generate a dump; provide it to the vendor and have the crash dump analyzed.

Are you saying that there is no panic string or any related message in the errpt log at the time of the reboot or just before it? Do you see any VCS action just before the reboot happens?

Regarding the LLT errors, the counters alone cannot tell you when these errors occurred. Do you see them increasing while the system is up and running?
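One way to answer that question is to save the "LLT errors" section of the lltstat output twice, some time apart, and compare the counters. The following is only a minimal sketch: it assumes the plain "count, then counter name" layout shown in the original post, the function names are made up for illustration, and the values in the second snapshot are invented.

```python
import re


def parse_llt_errors(text):
    """Parse '<count> <counter name>' lines, as in the 'LLT errors:' section."""
    counters = {}
    for line in text.splitlines():
        # Lines that do not start with a number (e.g. the header) are skipped.
        m = re.match(r"\s*(\d+)\s+(\S.*?)\s*$", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters


def changed_counters(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# First snapshot: a few values taken from the original post.
SNAPSHOT_1 = """LLT errors:
    120        Snd not connected
    0          Snd no buffer
    42867      Snd no links up
"""

# Second snapshot: hypothetical values, for illustration only.
SNAPSHOT_2 = """LLT errors:
    120        Snd not connected
    0          Snd no buffer
    42901      Snd no links up
"""

if __name__ == "__main__":
    delta = changed_counters(
        parse_llt_errors(SNAPSHOT_1), parse_llt_errors(SNAPSHOT_2)
    )
    for name, d in delta.items():
        print(f"{name}: {d:+d}")
```

A counter that keeps growing while the node is up points at an ongoing problem; a counter that stays flat was probably incremented at some earlier point, for example during startup.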

 

G

mikebounds
Level 6
Partner Accredited

If VCS panics the box via fencing, there should be messages in the O/S system log (they are not shown in the VCS log). This is certainly the case for Solaris. If there are no such messages, something else may be causing the reboot, so I would increase O/S logging and enable crash dumps, as others have said.

If it is always the same node that reboots and you still suspect fencing, you can configure Preferred fencing (see "Preferred fencing" in the VCS admin guide) to give a higher weight to the node that doesn't reboot, to see whether that node starts rebooting.

Mike

kwakou
Level 4

Sorry, it has been a while since I last came to the forum.

For the above case, the problem turned out to be caused by a fencing issue.

The nodes were addressing the shared fencing disks in different ways (I checked this with vxddladm get namingscheme).

I set the naming scheme on both nodes to enclosure-based with the following command:

vxddladm set namingscheme=ebn persistence=yes

and then I updated /etc/vxfentab on both nodes.

It is also possible to clear the keys and check them with vxfenadm -s all -f /etc/vxfentab

After that, there were no stale keys and no further reboots were recorded.

 

Gaurav_S
Moderator
Moderator
   VIP    Certified

Glad to know.