I have a scenario where the applications are under cluster control [2-node cluster]. I had an issue in the past where hastop -local was executed before the applications were brought down manually, which resulted in appsg going into a faulted state.
My question is: should I freeze the service groups individually, or should I freeze the nodes, while performing OS patch maintenance? Does freezing a node serve the same purpose as freezing the SGs on a particular node?
If you are stopping VCS for maintenance and you want the service groups to stay online, then you should run "hastop -local -force". The -force option will leave all service groups online (except for the ClusterService group).
If you need to stop and start the cluster for your maintenance and you do not want the service groups to react, I would recommend persistently freezing the node. That will affect all service groups on the node, so you will not need to worry about each group individually. A persistent freeze will survive a restart of HAD on the node; a temporary freeze will be automatically removed when HAD is restarted on the node.
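As a minimal sketch of the persistent node freeze described above (the system name sysA is hypothetical), the commands would look like this; RUN="echo" makes it a dry run that only prints the commands:

```shell
#!/bin/sh
# Dry-run sketch of a persistent node freeze. The node name sysA is
# hypothetical; clear RUN to execute against a live cluster.
RUN="echo"

freeze_node() {
    $RUN haconf -makerw                  # persistent attributes need a writable config
    $RUN hasys -freeze -persistent "$1"  # survives HAD restarts on the node
    $RUN haconf -dump -makero            # save and close the config
}

freeze_node sysA
```

The haconf -makerw / -dump -makero wrapper is needed because a persistent freeze is stored in the cluster configuration.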
If a group is frozen:
VCS will not take any failover action on the group. The agents will only monitor the resources in the group and report any state change; the group may be seen as faulted if a resource faults. However, there will be no failover action for the group.
If the group is frozen persistently, all of the above applies for the current life of the cluster. However, if the cluster were to bootstrap again (e.g., hastop -all followed by starting all nodes), the group's state on first probe will be honored; no further actions will be taken, such as AutoStart of the group.
If a system is frozen:
VCS will not choose this node as a target for onlining any group. However, groups that are already online will continue to be online until they need to be failed over due to a user action (switch/offline) or a fault event.
A temporarily frozen system shows this behavior for the current life of the cluster, whereas a persistently frozen system continues this behavior even if the cluster bootstraps again.
Coming back to the question: for OS patch maintenance, I would suggest evaluating the following:
1. If the OS maintenance does not impact the health of the applications (VCS and the related service groups), then you may just apply the patch.
2. While in maintenance, if you do not want any service groups to be brought online on this particular node, but you need VCS to be running during this time, you would want the node to be frozen. You may already have evacuated the node by switching all online service groups to the other node. If the patch needs a reboot and the system may not yet be stable after the reboot, you would not want this node to bring groups online; in this case the node should be persistently frozen.
3. If a reboot is required, VCS will evacuate the service groups to the other node. If you do not want the service groups to be evacuated, a persistent freeze of the service groups is a good idea.
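Option 2 above could be sketched as follows, assuming hypothetical names appsg, node1, and node2; hagrp -switch evacuates the group to the peer before the node is persistently frozen (RUN="echo" keeps this a dry run):

```shell
#!/bin/sh
# Dry run of: evacuate the node, then persistently freeze it (option 2).
# Group/node names appsg, node1, node2 are hypothetical.
RUN="echo"

evacuate_and_freeze() {
    $RUN hagrp -switch appsg -to node2    # move the online group to the peer node
    $RUN haconf -makerw
    $RUN hasys -freeze -persistent node1  # node stays frozen across HAD restarts
    $RUN haconf -dump -makero
}

evacuate_and_freeze
```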
Thanks, Amit and Wally, for your valuable answers. That really helps.
So now, could you advise whether this is correct:
1) The respective service groups are persistently frozen on both nodes, since OS maintenance is being carried out on both nodes at the same time and both nodes will require a reboot.
2) Apps will be brought down manually.
3) OS maintenance starts, for which I am going to move the rc2.d and rc3.d scripts for VCS on both nodes so that I don't have to worry about the cluster part with every reboot. I will do hastop -all before this.
4) Once the OS patches are applied in single-user mode, the scripts will be put back so that the cluster comes up with the reboot [in the same state as it was before doing hastop -all ???]
5) Since the groups are frozen persistently, the apps will be brought up manually, after which the SGs will be persistently unfrozen.
Use the following steps:
Since you would need to offline all the apps for OS maintenance anyway, take these steps:
1. Offline all SGs one after another (hagrp -offline) in the appropriate dependency order.
2. Freeze the SGs persistently:
2.1 Open the configuration for writing (haconf -makerw)
2.2 Persistently freeze all SGs (hagrp -freeze -persistent)
This will make sure that the SGs will not go online due to AutoStart/AutoStartList after the cluster is formed. [Any resources found online at first probe will certainly be reflected in the state of the SG as PARTIAL or ONLINE, but VCS will not have issued any online.]
2.3 Dump the configuration (haconf -dump -makero)
3. If you are okay with the VCS stack starting during the reboot, the rc scripts can stay. However, I will let you make this decision.
4. hastop -all on one of the nodes, to ensure VCS is stopped on all nodes.
5. Apply the patches and complete the OS maintenance. Reinstate the rc scripts if they were moved. This step can be done just before the last required reboot of the maintenance window.
6. Once the nodes are up and VCS has started, forming the complete cluster, unfreeze all the SGs:
6.1 Open the configuration for writing (haconf -makerw)
6.2 Unfreeze the SGs (hagrp -unfreeze -persistent).
This will not initiate the AutoStart logic, so bring the SGs online using the 'hagrp -online' command in the appropriate order.
6.3 Dump the configuration (haconf -dump -makero)
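The whole procedure above can be sketched as a dry-run script; the group/node names appsg and node1 are hypothetical placeholders, and RUN="echo" prints each command instead of executing it:

```shell
#!/bin/sh
# Dry-run sketch of the full offline/freeze/unfreeze procedure.
# Group/node names (appsg, node1) are hypothetical.
RUN="echo"

pre_maintenance() {
    $RUN hagrp -offline appsg -sys node1    # 1. offline in dependency order
    $RUN haconf -makerw                     # 2.1 open config for writing
    $RUN hagrp -freeze appsg -persistent    # 2.2 freeze survives HAD restarts
    $RUN haconf -dump -makero               # 2.3 save and close config
    $RUN hastop -all                        # 4. stop VCS on all nodes
}

post_maintenance() {
    $RUN haconf -makerw                     # 6.1
    $RUN hagrp -unfreeze appsg -persistent  # 6.2
    $RUN haconf -dump -makero               # 6.3
    $RUN hagrp -online appsg -sys node1     # online manually, in dependency order
}

pre_maintenance
post_maintenance
```

With multiple service groups you would repeat the hagrp lines per group, keeping the dependency order for offline/online.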
What if I don't offline the SGs prior to freezing? The apps are going to be brought down manually anyway, and since the SGs are persistently frozen, onlining and offlining are disabled. So will it impact anything if I don't offline the SGs and just freeze them?
hastop -local was executed before bringing down applications manually which resulted in appsg to go in faulted state.
It seems something is wrong with the service group config.
hastop -local will offline all online service groups on the local node. This should not leave your appsg in a faulted state. You need to double-check your config and see what errors were reported in engine_A.log.
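To dig into that earlier appsg fault, a quick filter over the engine log might look like this (the log path shown is the usual VCS default on UNIX; adjust it for your install):

```shell
#!/bin/sh
# Pull recent fault/error lines for a group out of the VCS engine log.
# /var/VRTSvcs/log/engine_A.log is the usual default location.
check_engine_log() {
    grep -i "$2" "$1" 2>/dev/null | grep -iE "fault|error" | tail -n 20
}

check_engine_log /var/VRTSvcs/log/engine_A.log appsg
```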
I am going to move rc2.d and rc3.d scripts for VCS on both the nodes
Where are you moving these scripts to?
It seems you want to do a permanent move, and not just a temporary one for system maintenance?
Best practice for system maintenance is to bring the system up to single-user mode (rc1/rcS).
Offlining SGs before freezing is best practice.
arangari has given the best possible solution.
If the apps are under the control of VCS, why not use its intelligence to bring the apps down in the appropriate order? There is less chance of errors. If you just freeze the SGs and bring the apps offline manually, the state will show as faulted; no action will be taken by VCS, but it may generate unwanted notifications if they are configured. Also, as said earlier, it is easier to operate this way if you are not well versed in all the applications' start/stop logic.
@Marianne and Saurabh:
If 'hastop -local' resulted in a fault of the SG, there could be a genuine fault, or the offline may have been issued outside VCS control. There could also be an issue with the SG config. Checking engine_A.log would certainly help.
My thinking is that if you are offlining the service groups, stopping VCS, and disabling VCS from being started during reboot, why are you worried about freezing the service groups?
I would do the following (high-level steps):
1. hastop -all - this will stop all service groups on all nodes and then stop VCS on all nodes.
2. Disable VCS startup - no need to worry about freezing the service groups because VCS will not auto start.
3. Perform OS maintenance on all nodes as required and reboot as needed.
4. Enable VCS startup.
5. Manually start VCS if not rebooting after step 4. All service groups set to auto-start will start when VCS enters a running state.
Freezing the groups would be a good preventive step, but if VCS is not able to run, it is really not needed.
I think that moving rc scripts etc. is really complicating things for no reason.
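Under the assumption of Solaris-style rc scripts, those steps might be sketched as below; the S99vcs script name is an assumption (startup script names vary by platform and VCS version), and RUN="echo" keeps this a dry run:

```shell
#!/bin/sh
# Dry-run sketch of the high-level steps above. The rc-script name
# S99vcs is a Solaris-style assumption; verify the real script names
# on your platform before using this.
RUN="echo"

maintenance_plan() {
    $RUN hastop -all                              # 1. offline groups, stop VCS everywhere
    $RUN mv /etc/rc3.d/S99vcs /etc/rc3.d/s99vcs   # 2. lowercase name = not run at boot
    # 3. patch and reboot the nodes as required
    $RUN mv /etc/rc3.d/s99vcs /etc/rc3.d/S99vcs   # 4. re-enable VCS startup
    $RUN hastart                                  # 5. start VCS now if not rebooting
}

maintenance_plan
```

Renaming the script in place (rather than moving it to another directory) makes it easy to see at a glance that startup was intentionally disabled.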
Pre-patching highly recommended, but not required, steps:
1. Evacuate the first system in the cluster. This will temporarily freeze it and fail all of its service groups over to the second system. This is to confirm that all of the groups will still fail over to/run on the second system.
2. Reboot, and then unfreeze the first system (since second system never loses cluster state, the freeze remains in effect). Evacuate the second system in the cluster. Again, you are confirming that your service groups will run on either system in the cluster.
3. Reboot, and then unfreeze the second system. Evacuate the first system in the cluster.
This is just best practice - you want to find out BEFORE you do any patching etc. that your applications are not capable of failover (technically an hagrp -switch in this case), have undergone some type of configuration drift, etc.
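The pre-check above can be sketched with hasys -freeze -evacuate, which temporarily freezes a system and fails its groups over to the peer in one step; sysA and sysB are hypothetical names and RUN="echo" keeps this a dry run:

```shell
#!/bin/sh
# Dry-run sketch of the pre-patching failover check.
# System names sysA/sysB are hypothetical.
RUN="echo"

precheck() {
    $RUN hasys -freeze -evacuate "$1"   # temp freeze + fail its groups to the peer
    # reboot "$1" here; once it rejoins the cluster:
    $RUN hasys -unfreeze "$1"
}

precheck sysA    # steps 1-2: prove the groups run on sysB
precheck sysB    # steps 2-3: prove the groups run on sysA
```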
1. Persistently freeze at least the first system in the cluster (or both).
2. Dump/close the cluster configuration.
3. Offline whichever service groups (all?) that you want. A system freeze will not affect this.
4. Patch/reboot/etc. as desired. I recommend doing all of this on one node, unfreezing and confirming applications come up on that node, then doing the other node - but you can do both at once if your confidence is high or your application requires it.
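The numbered steps above might be sketched per node as follows; sysA and appsg are hypothetical names, and RUN="echo" prints the commands instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of the patching sequence above.
# Names sysA and appsg are hypothetical.
RUN="echo"

patch_node() {
    $RUN haconf -makerw
    $RUN hasys -freeze -persistent "$1"   # 1. freeze survives the reboots
    $RUN haconf -dump -makero             # 2. dump/close the configuration
    $RUN hagrp -offline appsg -sys "$1"   # 3. offline whichever groups you want
    # 4. patch and reboot "$1" here; once done, unfreeze:
    $RUN haconf -makerw
    $RUN hasys -unfreeze -persistent "$1"
    $RUN haconf -dump -makero
}

patch_node sysA
```

Run the same function against the second node once you have confirmed the applications come up cleanly on the first.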