Got STALE_ADMIN_WAIT state after rebooting both servers simultaneously with init 6

574308445
Level 2
Hi,

I'm doing the standard reboot test in the VCS cluster. When I execute init 6 on both servers simultaneously, both nodes come up in the STALE_ADMIN_WAIT state. To bring them online I have to execute the command:
# hasys -force <system_name>
and then everything is OK.
This shouldn't be happening. Do you know why I'm getting this behaviour?

Thanks
Cecilia
5 REPLIES

Stumpr2
Level 6
source technote:
After a reboot, a node in a VERITAS Cluster Server (VCS) environment is in an ADMIN_WAIT state or in a STALE_ADMIN_WAIT state

http://seer.support.veritas.com/docs/199462.htm


If all systems are in STALE_ADMIN_WAIT or ADMIN_WAIT, first validate the configuration file (/etc/VRTSvcs/conf/config/main.cf) on all systems in the cluster. Run the 'hacf -verify .' command to check for syntax errors (make sure it is run from the directory containing the main.cf file), and review the file's contents for proper resource and service group definitions.
Then enter the following command on the system with the correct configuration file to force start VCS.


# hasys -force system_name


This starts Cluster Server on that node and then on all other nodes that are in the ADMIN_WAIT or STALE_ADMIN_WAIT state.
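The technote's verify-then-force sequence can be sketched as a small script. This is only an illustration: hacf and hasys are the real VCS commands from the technote, but the node-name variable and the installed-tools guard are my additions.

```shell
# Recovery sketch for STALE_ADMIN_WAIT (assumes the standard VCS config path).
# Guarded so it is a no-op on machines without the VCS tools installed.
CONF_DIR="${CONF_DIR:-/etc/VRTSvcs/conf/config}"
NODE="${NODE:-$(hostname)}"

if command -v hacf >/dev/null 2>&1; then
    # hacf -verify must be run from the directory that contains main.cf.
    cd "$CONF_DIR" && hacf -verify . && hasys -force "$NODE"
else
    echo "VCS tools not found; skipping (illustrative only)"
fi
```

Run this only on the node whose main.cf you trust; the other waiting nodes will pick the configuration up from it.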

One of the most common causes of a node being in one of these states is the existence of /etc/VRTSvcs/conf/config/.stale. This file is typically left behind if Cluster Server is stopped while the configuration is still open, i.e. someone has forgotten to save changes made to a running main.cf configuration. The .stale file is deleted automatically when changes are saved correctly, so it will not force the node into an ADMIN state the next time Cluster Server restarts. As indicated earlier, the file can be safely removed if the main.cf file is known to be OK.

574308445
Level 2
Thanks a lot for your reply!
I was able to bring it back with the command:
hasys -force <system_name>

However, the next time I rebooted both servers simultaneously I again got the ADMIN_WAIT status and needed to run the
hasys -force command again.

I have verified and validated the configuration file by running hacf -verify . under /etc/VRTSvcs/conf/config, and it's OK. I also don't have the /etc/VRTSvcs/conf/config/.stale script on either machine.

So I'm able to bring them online by forcing them, but the problem is still open.
When I reboot one machine at a time the problem doesn't happen; it only occurs when rebooting both simultaneously.

Thanks for your help. It's greatly appreciated.

Janette

Robert_Bailey_3
Level 2
This error is the result of the main.cf being open for write at the time of reboot. You might want to check whether a user is accessing the cluster via the GUI with a login that opens the main.cf for writing.

To validate, try running haconf -dump -makerw directly before your reboot. If the command doesn't fail, then the configuration is being opened read-write somewhere in your environment.

Robert_Bailey_3
Level 2
Sorry, pre-coffee typo:
haconf -dump -makero
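Robert's pre-reboot probe can be wrapped in a small guard. haconf -dump -makero succeeds only when the configuration is currently writable, so success right before a reboot means something in your environment had opened it read-write. This is a sketch: the result labels and the installed-tools guard are mine, only the haconf command itself is from the thread.

```shell
# Pre-reboot probe: does closing the configuration actually change anything?
if command -v haconf >/dev/null 2>&1; then
    if haconf -dump -makero 2>/dev/null; then
        result="was-open-read-write"   # something had the config open rw
    else
        result="already-read-only"     # config was closed; safe to reboot
    fi
else
    result="no-vcs-installed"          # not a cluster node; illustrative only
fi
echo "pre-reboot check: $result"
```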

Gene_Henriksen
Level 6
Accredited Certified
The reason it works fine when you reboot one node at a time is that, on reboot, the node broadcasts that it is initializing and looks for a running member of the cluster; finding one, it gets a fresh copy of the configuration from the memory of the running node.

.stale is not a script; it is an empty file created by opening the configuration to make changes. Once changes are made you should close the configuration. Watch the .stale file appear and disappear as you execute haconf -makerw and haconf -dump -makero. The STALE condition can occur because you are shutting down the systems with the configuration in a writeable condition. STALE is a warning to the admin that the configuration may have been open (it could also be a syntax error introduced by editing the file) and may be incorrect. This is analogous to the Lockout/Tagout warnings placed on equipment undergoing repair, to avoid someone starting the equipment and injuring a person or breaking it.
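Gene's suggestion of watching the marker can be scripted. A sketch, under assumptions: the stale_check helper and its wording are mine; the .stale path and the haconf commands are from the thread.

```shell
# Report whether the cluster configuration is open, via the .stale marker file.
CONF_DIR="${CONF_DIR:-/etc/VRTSvcs/conf/config}"
stale_check() {
    if [ -f "$CONF_DIR/.stale" ]; then
        echo ".stale present: configuration is open"
    else
        echo "no .stale: configuration is closed"
    fi
}

if command -v haconf >/dev/null 2>&1; then
    haconf -makerw       && stale_check   # marker should appear
    haconf -dump -makero && stale_check   # marker should disappear
else
    stale_check                           # no VCS here; just report marker state
fi
```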

Close the configuration with haconf -dump -makero and retry the reboot of both servers.