Forum Discussion

kwakou
Level 4
10 years ago

unexpected reboot

Hi all,

I am running an SFRAC environment, and one of the two nodes in my cluster frequently reboots unexpectedly.

I went through the OS logs and the VCS engine_A log but didn't find any clues.

Is this an eviction?

When I run lltstat, I can see some error counters - what exactly do they mean?

LLT errors:
    0          Rcv not connected
    0          Rcv unconfigured
    0          Rcv bad dest address
    0          Rcv bad source address
    0          Rcv bad generation
    0          Rcv no buffer
    0          Rcv malformed packet
    0          Rcv wrong length packet
    0          Rcv bad SAP
    0          Rcv bad STREAM primitive
    0          Rcv bad DLPI primitive
    0          Rcv DLPI error
    120        Snd not connected
    0          Snd no buffer
    0          Snd stream flow drops
    42867      Snd no links up
    0          Rcv bad checksum
    0          Rcv bad udp/ether source address
    0          Rcv DLPI link-down error

 

How can I efficiently trace these reboots?


5 Replies

  • There could be many reasons, ranging from hardware problems to network link problems, system load, and applications.

    Kindly provide your engine log and system kernel logs so the issue can be determined.

    If sufficient system logs are not being generated, kindly edit syslog.conf to log the required messages.

    Also, are any core dumps being generated?

     

    Regards,

    Sudhir
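The syslog.conf suggestion above can be sketched as a minimal fragment; the facility levels and file paths here are assumptions and vary by platform (on many systems the separator between selector and target must be a tab):

```
# /etc/syslog.conf fragment (illustrative; exact syntax varies by OS)
# Kernel messages at debug level and above, where panic/fencing output usually lands
kern.debug	/var/adm/messages
# Catch-all for everything at info level and above
*.info	/var/adm/syslog.log
```

After editing, the syslog daemon typically needs to be signaled or restarted to pick up the change.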

  • Hi,

    I agree with Sudhir, there are any number of possibilities.

    I would recommend configuring a crash dump on the server. If an unexpected reboot happens, it should generate a system dump; provide that to the vendor and get the crash dump analyzed.

    Are you saying there is no panic string or any related message in the errpt log at, or just before, the time of the reboot? Do you see any VCS action just before the reboot happens?

    Regarding the LLT counters, there is no way to tell when those errors occurred. Do you see the errors increasing while the system is up and running?

     

    G
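To answer G's question about whether the counters are still climbing, one approach is to take two lltstat snapshots a few minutes apart and diff them. This is a sketch: the `compare_llt` helper is a hypothetical name, and the snapshot data below is made-up sample output, not from a real cluster; on a live node you would capture real snapshots, e.g. `lltstat | sed -n '/LLT errors:/,$p' > /tmp/llt.1`.

```shell
#!/bin/sh
# Sketch: report LLT error counters that increased between two snapshots.
# Each snapshot file holds "count  counter-name" lines as printed by lltstat.
compare_llt() {  # usage: compare_llt old_snapshot new_snapshot
  awk 'FNR==NR { c=$1; $1=""; old[$0]=c; next }
       { c=$1; $1=""
         if (($0 in old) && c+0 > old[$0]+0)
           printf "%s grew by %d\n", substr($0,2), c-old[$0] }' "$1" "$2"
}

# Made-up sample snapshots for illustration only
cat > /tmp/llt.1 <<'EOF'
    0          Rcv no buffer
    120        Snd not connected
    42867      Snd no links up
EOF
cat > /tmp/llt.2 <<'EOF'
    0          Rcv no buffer
    120        Snd not connected
    42901      Snd no links up
EOF

compare_llt /tmp/llt.1 /tmp/llt.2
```

A counter such as "Snd no links up" that keeps growing while the system is up points at a live interconnect problem rather than historical noise.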

  • If VCS panics the box via fencing, there should be messages in the O/S system log (they are not shown in the VCS log) - this is certainly the case on Solaris. So if there are no messages, something else may be causing the reboot, and I would increase O/S logging and enable crash dumps as others have said.

    If it is the same node that keeps rebooting and you still suspect fencing, you can configure preferred fencing (see "Preferred fencing" in the VCS admin guide) to give a higher weight to the node that doesn't reboot, and see whether the other node starts rebooting instead.

    Mike
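A quick way to act on Mike's point is to scan the O/S system log for fencing or panic evidence around the reboot time. This is a sketch: the grep pattern covers typical VxFEN/GAB keywords but is not exhaustive, and the sample log lines below are illustrative, not captured from a real system.

```shell
#!/bin/sh
# Sketch: pull fencing/panic-related lines out of the system log.
scan_log() {  # usage: scan_log /var/adm/messages (path varies by OS)
  grep -iE 'vxfen|gab|panic|ejected|fencing' "$1"
}

# Illustrative sample log, made up for this example
cat > /tmp/messages.sample <<'EOF'
May  4 02:11:07 node2 kernel: VXFEN: node ejected from cluster to prevent potential data corruption
May  4 02:11:08 node2 kernel: panic: vxfen critical failure
May  4 02:15:31 node2 syslogd: restart
EOF

scan_log /tmp/messages.sample
```

If nothing like this appears in the real log around the reboot timestamps, that supports Mike's point that something other than fencing is panicking the node.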

  • Sorry, it's been a while since I last came to the forum.

    In the above case, the problem turned out to be a fencing issue.

    The nodes were addressing the shared fencing disks in different ways (checked with vxddladm get namingscheme).

    I set the naming scheme on both nodes to enclosure-based with the following command:

    vxddladm set namingscheme=ebn persistence=yes

    and then I updated /etc/vxfentab on both nodes.

    It is also possible to clear the keys and check with vxfenadm -s all -f /etc/vxfentab.

    After that, there were no stale keys and no further reboots were recorded.
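The fix above can be recapped as a sketch. The vxddladm and vxfenadm commands are the ones from the post; `same_scheme` is just a hypothetical helper for comparing the scheme strings captured from each node.

```shell
#!/bin/sh
# Recap of the fix (commands from the post, shown as comments since they
# require a live SFRAC node):
#   On each node, check how disks are named:  vxddladm get namingscheme
#   Make both nodes enclosure-based:          vxddladm set namingscheme=ebn persistence=yes
#   Then regenerate /etc/vxfentab on both nodes and verify the keys:
#                                             vxfenadm -s all -f /etc/vxfentab

same_scheme() {  # usage: same_scheme "<node1 scheme>" "<node2 scheme>"
  if [ "$1" = "$2" ]; then echo "consistent"; else echo "MISMATCH"; fi
}

# Example with illustrative values: a mismatch like this is what caused
# the stale-key fencing problem described above
same_scheme "Enclosure Based" "OS Native"
```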