Forum Discussion

kwakou
Level 4
10 years ago

unexpected reboot

Hi all,

I am running an SFRAC environment, and one of the two nodes in my cluster frequently reboots unexpectedly.

I went through the OS logs and the VCS engine_A log but didn't find any clues.

Is this an eviction?

When I run lltstat, I can see some error counters - what exactly do they mean?

LLT errors:
    0          Rcv not connected
    0          Rcv unconfigured
    0          Rcv bad dest address
    0          Rcv bad source address
    0          Rcv bad generation
    0          Rcv no buffer
    0          Rcv malformed packet
    0          Rcv wrong length packet
    0          Rcv bad SAP
    0          Rcv bad STREAM primitive
    0          Rcv bad DLPI primitive
    0          Rcv DLPI error
    120        Snd not connected
    0          Snd no buffer
    0          Snd stream flow drops
    42867      Snd no links up
    0          Rcv bad checksum
    0          Rcv bad udp/ether source address
    0          Rcv DLPI link-down error

 

How can I efficiently trace these reboots?


5 Replies

  • There could be many reasons, ranging from hardware problems to network link problems, system load, and applications.

    Kindly provide your engine log and system kernel logs so the issue can be determined.

    If sufficient system logs are not being generated, kindly edit syslog.conf to log the required messages.

    Also, are any core dumps being generated?

     

    Regards,

    Sudhir
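The syslog.conf suggestion above can be sketched as a minimal fragment; the facility levels and file paths here are assumptions and vary by platform (on many systems the separator between selector and target must be a tab):

```
# /etc/syslog.conf fragment (illustrative; exact syntax varies by OS)
# Kernel messages at debug level and above, where panic/fencing output usually lands
kern.debug	/var/adm/messages
# Catch-all for everything at info level and above
*.info	/var/adm/syslog.log
```

After editing, the syslog daemon typically needs to be signaled or restarted to pick up the change.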

  • Hi,

    I agree with Sudhir, there are any number of possibilities.

    I would recommend configuring a crash dump on the server. If an unexpected reboot happens, it should generate a system dump; provide that to the vendor and get the crash dump analyzed.

    Are you saying there is no panic string or any related message in the errpt log at, or just before, the time of the reboot? Do you see any VCS action just before the reboot happens?

    Regarding the LLT counters, there is no way to tell when those errors occurred. Do you see the errors increasing while the system is up and running?

     

    G
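To answer G's question about whether the counters are still climbing, one approach is to take two lltstat snapshots a few minutes apart and diff them. This is a sketch: the `compare_llt` helper is a hypothetical name, and the snapshot data below is made-up sample output, not from a real cluster; on a live node you would capture real snapshots, e.g. `lltstat | sed -n '/LLT errors:/,$p' > /tmp/llt.1`.

```shell
#!/bin/sh
# Sketch: report LLT error counters that increased between two snapshots.
# Each snapshot file holds "count  counter-name" lines as printed by lltstat.
compare_llt() {  # usage: compare_llt old_snapshot new_snapshot
  awk 'FNR==NR { c=$1; $1=""; old[$0]=c; next }
       { c=$1; $1=""
         if (($0 in old) && c+0 > old[$0]+0)
           printf "%s grew by %d\n", substr($0,2), c-old[$0] }' "$1" "$2"
}

# Made-up sample snapshots for illustration only
cat > /tmp/llt.1 <<'EOF'
    0          Rcv no buffer
    120        Snd not connected
    42867      Snd no links up
EOF
cat > /tmp/llt.2 <<'EOF'
    0          Rcv no buffer
    120        Snd not connected
    42901      Snd no links up
EOF

compare_llt /tmp/llt.1 /tmp/llt.2
```

A counter such as "Snd no links up" that keeps growing while the system is up points at a live interconnect problem rather than historical noise.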

  • If VCS panics the box via fencing, there should be messages in the O/S system log (they are not shown in the VCS log) - this is certainly the case on Solaris. So if there are no messages, something else may be causing the reboot, and I would increase O/S logging and enable crash dumps as others have said.

    If it is the same node that keeps rebooting and you still suspect fencing, you can configure preferred fencing (see "Preferred fencing" in the VCS admin guide) to give a higher weight to the node that doesn't reboot, and see whether the other node starts rebooting instead.

    Mike
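A quick way to act on Mike's point is to scan the O/S system log for fencing or panic evidence around the reboot time. This is a sketch: the grep pattern covers typical VxFEN/GAB keywords but is not exhaustive, and the sample log lines below are illustrative, not captured from a real system.

```shell
#!/bin/sh
# Sketch: pull fencing/panic-related lines out of the system log.
scan_log() {  # usage: scan_log /var/adm/messages (path varies by OS)
  grep -iE 'vxfen|gab|panic|ejected|fencing' "$1"
}

# Illustrative sample log, made up for this example
cat > /tmp/messages.sample <<'EOF'
May  4 02:11:07 node2 kernel: VXFEN: node ejected from cluster to prevent potential data corruption
May  4 02:11:08 node2 kernel: panic: vxfen critical failure
May  4 02:15:31 node2 syslogd: restart
EOF

scan_log /tmp/messages.sample
```

If nothing like this appears in the real log around the reboot timestamps, that supports Mike's point that something other than fencing is panicking the node.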

  • Sorry, it's been a while since I last came to the forum.

    In the above case, the problem turned out to be a fencing issue.

    The nodes were addressing the shared fencing disks in different ways (checked with vxddladm get namingscheme).

    I set the naming scheme on both nodes to enclosure-based with the following command:

    vxddladm set namingscheme=ebn persistence=yes

    and then I updated /etc/vxfentab on both nodes.

    It is also possible to clear the keys and check with vxfenadm -s all -f /etc/vxfentab.

    After that, there were no stale keys and no further reboots were recorded.
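The fix above can be recapped as a sketch. The vxddladm and vxfenadm commands are the ones from the post; `same_scheme` is just a hypothetical helper for comparing the scheme strings captured from each node.

```shell
#!/bin/sh
# Recap of the fix (commands from the post, shown as comments since they
# require a live SFRAC node):
#   On each node, check how disks are named:  vxddladm get namingscheme
#   Make both nodes enclosure-based:          vxddladm set namingscheme=ebn persistence=yes
#   Then regenerate /etc/vxfentab on both nodes and verify the keys:
#                                             vxfenadm -s all -f /etc/vxfentab

same_scheme() {  # usage: same_scheme "<node1 scheme>" "<node2 scheme>"
  if [ "$1" = "$2" ]; then echo "consistent"; else echo "MISMATCH"; fi
}

# Example with illustrative values: a mismatch like this is what caused
# the stale-key fencing problem described above
same_scheme "Enclosure Based" "OS Native"
```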