The unexpected reboot

kingtux · ‎01-16-2012

I have a three-node cluster running HP-UX 11.31 with CFS and a custom package agent. Today I rebooted one of the nodes after manually switching its service groups to the 2nd node, and this caused all the other nodes in the cluster to reboot as well. What could cause this?

Gaurav_S · ‎01-16-2012

Hi,

There could be a couple of causes; the most likely suspects are:

1. Do you use SFRAC in this cluster (with the Oracle RAC CRS component)? If yes, was there heavy system load at the time of the reboot? If CRS can't establish internode communication, either because of heavy load or because of a network issue on the heartbeats, it may panic the other nodes.

2. Do you use Symantec I/O fencing? If yes, have there been any recent issues on the heartbeat links?

The best thing I can suggest is to look at the logs and at the crash info. If Veritas fencing caused the issue, you may see the "vxfen" module triggering the panic (visible in the crash stack or crash info).

There could be other reasons as well (some unknown issue), but I would suggest checking the above first.

It would be worth pasting the syslog or any other useful log you have from the time of the crash/reboot.

Gaurav

kingtux · ‎01-17-2012

Thanks for the response. I do not have SFRAC components in the cluster, and I do have I/O fencing set up, which I believe is configured correctly. I will check the logs and upload them once I have a moment.

mikebounds · ‎01-17-2012

I would say that I/O fencing is almost certainly causing this issue. Please give the fencing config from each node:

head /etc/*fen*
vxdisk -o alldgs list | egrep "vxfen|coord"

Mike

Gaurav_S · ‎01-17-2012

I/O fencing might be set up the right way; however, the interesting point is whether the SCSI-3 registration and reservation keys exist.

As an example, you might have I/O fencing running on all the nodes. When one node was shut down (effectively a change in GAB membership), I/O fencing would verify the registration keys on the coordinator disks, and if the registrations of the other nodes were found missing (keys deleted somehow), it would panic those nodes as well.

Since the setup has already been rebooted, the fencing module may have registered the keys again, but it is still worth checking. Check the keys using:

# vxfenadm -g all -f /etc/vxfentab

Also verify that the disks defined in /etc/vxfentab are indeed the disks used for the I/O fencing coordinator DG.
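
To make it easier to compare nodes, the checks requested so far in this thread can be gathered into one file per node. This is only a rough sketch: the function name collect_fencing_info and the output layout are made up for illustration, and the three commands are exactly the ones quoted above, run unchanged. It will most likely need to be run as root on a node with the Veritas tools installed.

```shell
#!/bin/sh
# collect_fencing_info OUTFILE
# Hypothetical helper: runs the fencing checks quoted in this thread and
# writes their output (including any error messages) to OUTFILE.
# A failing or missing command is recorded in the file and does not
# stop the remaining checks.
collect_fencing_info() {
    out=$1
    {
        echo "=== node: $(uname -n) ==="

        echo "=== head /etc/*fen* ==="
        head /etc/*fen* || echo "(command failed)"

        echo "=== vxdisk -o alldgs list | egrep \"vxfen|coord\" ==="
        vxdisk -o alldgs list | egrep "vxfen|coord" \
            || echo "(no matching lines or command failed)"

        echo "=== vxfenadm -g all -f /etc/vxfentab ==="
        vxfenadm -g all -f /etc/vxfentab || echo "(command failed)"
    } > "$out" 2>&1
}

# Example (run on each node, then compare the files):
#   collect_fencing_info "/tmp/fencing_$(uname -n).txt"
```

Having one file per node makes it simple to diff the fencing configuration and the key listings side by side.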

Gaurav

B__Havey · ‎01-19-2012

What is the value of the PanicSystemOnDGLoss attribute of the DiskGroup resource?

A possible cause of panic is loss of access to shared storage.

kingtux · ‎01-22-2012

Sorry for the late post, but I've been kinda busy...

I've checked the coordinator DG, which is online on all hosts.

I've checked the keys on all hosts, and they all match (but that could be due to the reboot).

I should probably schedule a maintenance reboot and see if this happens again.

Gaurav_S · ‎01-22-2012

Hello,

When you say that the keys match on all the hosts, make sure that keys FROM all the hosts exist. For example, in a 3-node cluster you should see A--- , B----- & C---- keys (one from each of the 3 nodes).
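
The "one key per node" check can be sketched as a small shell function. It is only an illustration under a stated assumption: that each key begins with a letter identifying the node (A for the first node, B for the second, and so on), as in the example above. Other versions may print keys in a different format. The function name missing_node_keys is hypothetical, and it expects the key strings to have been pulled out of the vxfenadm output already, one per line.

```shell
#!/bin/sh
# missing_node_keys NODE_COUNT
# Reads key strings on stdin, one per line (for example "A-------").
# Prints the letters of nodes that have no key at all.
# Assumption: the first character of a key identifies the node
# (A = first node, B = second node, ...); supports up to 8 nodes here.
missing_node_keys() {
    nodes=$1
    keys=$(cat)
    i=0
    missing=""
    for letter in A B C D E F G H; do
        # Stop once every expected node has been checked.
        [ "$i" -ge "$nodes" ] && break
        # Look for at least one key starting with this node's letter.
        if ! printf '%s\n' "$keys" | grep -q "^[[:space:]]*${letter}"; then
            missing="$missing $letter"
        fi
        i=$((i + 1))
    done
    # Unquoted on purpose so the leading space is dropped.
    echo $missing
}

# Example: only keys from the first two nodes of a 3-node cluster
#   printf 'A-------\nB-------\n' | missing_node_keys 3
```

An empty result means every expected node has at least one key. Any letter printed points at a node whose registration is missing from that disk.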

As I said before, it is quite possible that after the restart of the servers I/O fencing registered the keys again, so you might not see the issue again. However, if you have any old Veritas Explorer output, it would be worth checking whether the correct keys existed on the coordinator disks.

G

HPUX Cluster 5.1