Hi all,
THE BACKGROUND:
We have implemented a custom Veritas Agent for monitoring our application, and we are using the IntentionalOffline feature of V51 agents (VCS 5.0 MP3 or later) to signal to VCS that the application has been brought "Intentionally Offline". From the Veritas Agent Developer's Guide:
About intentional offline of applications
Certain agents can identify when an application has been intentionally shut down outside of VCS control. If an administrator intentionally shuts down an application outside of VCS control, VCS does not treat it as a fault. VCS sets the service group state as offline or partial, depending on the state of other resources in the service group.
This feature allows administrators to stop applications without causing failovers.
This is achieved by using a specific return code (RC 200 indicates intentional offline) from the custom agent monitoring script when the script detects that the application is offline outside of Veritas control.
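For readers who want to see the return-code logic concretely, here is a minimal sketch in Python of how a monitor script might choose its exit code. It is not our actual agent. The value 200 is the one described above. The values 100 (offline) and 110 (online) are the conventional script-agent codes as far as I know, so please check them against the guide for your VCS version. The function name and the "clean_shutdown_recorded" flag are hypothetical: how a script decides that a shutdown was deliberate is application-specific.

```python
# Exit codes reported by a script-based monitor entry point.
# 200 comes from the description above; 100 and 110 are assumed to be the
# conventional offline/online codes and should be verified for your version.
MONITOR_OFFLINE = 100
MONITOR_ONLINE = 110
MONITOR_INTENTIONAL_OFFLINE = 200


def monitor_exit_code(app_running: bool, clean_shutdown_recorded: bool) -> int:
    """Pick the exit code for one monitor cycle.

    app_running: whether the application process was found (hypothetical check).
    clean_shutdown_recorded: whether the application left evidence of a
        deliberate stop, e.g. a marker written by its own stop command
        (hypothetical mechanism, not part of VCS).
    """
    if app_running:
        return MONITOR_ONLINE
    if clean_shutdown_recorded:
        # Stopped on purpose outside VCS control: not to be treated as a fault.
        return MONITOR_INTENTIONAL_OFFLINE
    # Not running and no sign of a deliberate stop: plain offline.
    return MONITOR_OFFLINE
```

A real monitor script would perform its own checks, call this function, and exit with the result (for example via sys.exit).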
Intentional Offline is supposed to set resources offline in much the same way as manually taking a resource offline using 'hares -offline ...' or the VCS GUI. In general it works as intended, and we are very happy with it. We also set the ExternalStateChange attribute to 'OnlineGroup & OfflineGroup' on our resources which support Intentional Offline, meaning that VCS takes the service group online or offline as appropriate in response to an external state change. Our service groups are very simple NIC->IP->Application resource dependencies, with no service group to service group dependencies defined.
THE PROBLEM:
Consider this scenario:
The "custom_app_grp" service group is online on "nodeA" of our cluster. Our cluster contains three nodes: nodeA, nodeB and nodeC. The "custom_app_grp" group contains NIC ("custom_app_nic_res"), IP ("custom_app_ip_res") and Application ("custom_app_srv_res") resources, and the monitoring script for the Application resource supports "Intentional Offline". The service group is allowed to start on any node in the cluster, but it is a failover service group, so it can only run on one node at a time.
We take our application offline using the application controls outside of Veritas. The "custom_app_srv_res" resource goes OFFLINE in response, and since ExternalStateChange is set, this brings the "custom_app_grp" service group offline too. The "custom_app_grp" is now OFFLINE on "nodeA". All nodes, "nodeA", "nodeB" and "nodeC", are online, but none of them is running any service group apart from the cluster service group.
We now reboot the "nodeA" server. What we see is that "custom_app_grp" invokes its fail-over behaviour, and VCS attempts to restart the service group on "nodeB".
Note: if, instead of taking "custom_app_grp" offline outside of VCS control, we simply offline the group using 'hagrp' or the VCS GUI and then reboot "nodeA", "custom_app_grp" does NOT begin to fail over.
...
So: would people expect a service group which is OFFLINE in the cluster to suddenly be marked as FAULTED, triggering VCS to begin restarting the service group on another node?
We wouldn't.
Has anyone else experienced this behaviour with Intentional Offline? Can Symantec support representatives reading the forum comment on whether they would consider this behaviour a bug?
Many thanks.
Kind regards,
Dave Hassett