Hi Team,
Because ManageFaults is ALL by default, when a resource faults, VCS calls the Clean entry point. What is the task of the Clean entry point for that resource? As far as I know, when VCS declares the resource as faulted, then depending on the resource's Critical attribute and the group's AutoFailOver attribute, VCS fails over the service group. The service group stays faulted on the first node, so I need to clear the fault manually before I can bring the service group online on that node again.
So, what is the effect of the Clean entry point when the resource faults? If the service group is faulted on both the primary and the secondary node, as far as I know it will not be failed over; it stays faulted on both nodes until the fault is cleared manually. Is there any way to automate failover while the service group is faulted on both nodes?
In short, I would like to understand the role of the Clean entry point when a resource faults, because I am always clearing the resource's fault manually. Does Clean run before the resource is declared faulted? Please explain.
The Clean entry point can do different things depending on the resource type. Take the DiskGroup resource as an example: its Clean action terminates all ongoing resource actions and takes the resource offline, forcibly when necessary.
Similarly, the Clean action for the Mount agent forcefully unmounts the mounted file system.
I would say the Clean action is more of a corrective action that brings VCS (resources/service groups) to a stable state (online, offline, or partial). Based on what other parameters are set (RestartLimit, ToleranceLimit, etc.), VCS will decide whether the fault can be fixed automatically or manual intervention is required.
When you say you are clearing the resource fault manually, by that time VCS has already made all the attempts to clean the resource that its parameters/tunables allow.
In support of Gaurav's excellent post - the Clean entry point will clean up any 'left over' processes. For a process, if offline does a 'kill PID' and the process fails to terminate, the Clean entry point will do a 'kill -9 PID'.
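That offline-then-clean sequence can be mimicked with plain shell, outside of VCS. This is only a generic sketch, not the actual Process agent script; the trap simulates a hung application that ignores SIGTERM:

```shell
# Start a process that ignores SIGTERM, like a hung application.
# "exec" makes the sleep inherit the ignored-signal disposition.
sh -c 'trap "" TERM; exec sleep 300' &
PID=$!
sleep 1                              # let the trap install before signalling

kill "$PID" 2>/dev/null              # "offline": polite SIGTERM
sleep 1

if kill -0 "$PID" 2>/dev/null; then  # still alive -> offline failed
    echo "offline failed, running clean"
    kill -9 "$PID"                   # "clean": SIGKILL cannot be ignored
fi
wait "$PID" 2>/dev/null || true      # reap so the PID truly disappears

kill -0 "$PID" 2>/dev/null && echo "still running" || echo "terminated"
```

The real agent does the same thing in spirit: try the graceful path first, and only escalate to the unignorable signal when monitoring shows the polite attempt failed.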
This is what VCS admin guide says:
Cleans up after a resource fails to come online, fails to go offline, or
fails to detect as ONLINE when resource is in an ONLINE state. The
clean entry point is designed to clean up after an application fails.
The function ensures that the host system is returned to a valid state.
For example, the clean function may remove shared memory segments
or IPC resources that are left behind by a database.
You can read in VCS Bundled Agent Guide in the Agent functions section of each agent what the Clean entry point does. (Not all of the Bundled Agents have a Clean entry point.)
Do not confuse Clean entry point with Clear Fault of a faulted resource.
This is something that the VCS admin must do after the reason for the fault was examined and fixed.
e.g. if Mount resource could not come online because a mount point does not exist and faulted the resource, you will create the mountpoint, then clear the fault, after which the resource can be brought online.
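Following that Mount example, the manual recovery would look roughly like this at the CLI (a sketch using the standard `hares`/`hagrp` commands; `oramnt`, `grp_ora`, and `node1` are made-up names):

```shell
# 1. Fix the underlying problem first: recreate the missing mount point.
mkdir -p /oradata

# 2. Clear the FAULTED state of the resource on that node...
hares -clear oramnt -sys node1

#    ...or clear every faulted resource in the group at once.
hagrp -clear grp_ora -sys node1

# 3. The group is now eligible to be brought online there again.
hagrp -online grp_ora -sys node1
```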
This is what VCS admin guide says about Clearing a resource:
Clear a resource to remove a fault and make the resource available to go online.
A resource fault can occur in a variety of situations, such as a power failure or a [...]
Hope this helps.
In addition to what was explained above I'll give you an example using mount. Please remember that the cluster is programmed to perform certain tasks. Those tasks are to online, monitor, and offline resources.
To online a resource (mount), the cluster has been "taught" (scripted in the agent) to online a mount using the mount command plus the attributes that we place in the configuration (main.cf). The cluster uses this knowledge and information to online the mount resource: "mount -t vxfs /dev/vx/dsk/oradg/oravo /oradata"
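For reference, the piece of main.cf that supplies those attributes to the Mount agent might look something like this (a sketch with the same made-up names; the attribute names are the standard Mount agent ones):

```
Mount oramnt (
    MountPoint = "/oradata"
    BlockDevice = "/dev/vx/dsk/oradg/oravo"
    FSType = vxfs
    FsckOpt = "-y"
    )
```

The agent simply assembles its mount command from these values, which is why a wrong BlockDevice or a missing MountPoint directory shows up as an online failure rather than a configuration error.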
Great, the resource is online now.
Say some other admin unmounts the file system (from the CLI, not using VCS). The cluster's other task is to monitor the mount resource. If it goes down, the cluster should (depending on restart limits and tolerance levels) either try to mount it again, or fail over to the other node (taking the remaining resources offline in a controlled manner).
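Those restart limits and tolerance levels are ordinary agent attributes, so you can inspect and tune them. For example (the values shown are illustrative, not recommendations):

```shell
# Show the current values for the Mount resource type
hatype -display Mount -attribute RestartLimit ToleranceLimit

# Allow one in-place restart attempt before the resource faults
hatype -modify Mount RestartLimit 1

# Or override the value for a single resource instead of the whole type
hares -override oramnt RestartLimit
hares -modify oramnt RestartLimit 1
```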
So now we'll get to your questions about clearing faults.
Suppose the admin who unmounted the FS also deletes the folder used by the mount. If the cluster sees it, it may try to mount the FS again (once again depending on restart limits and tolerance levels) but will fail, because there is no mount point folder to mount the FS on.
Your resource is now faulted. Do you think that clearing the fault automatically would resolve the problem here? No, of course not, because you know there is no way the mount will succeed without the folder being recreated. And that cannot happen until an admin (you) investigates and resolves the issue.
The cluster works within the boundaries of what the agent is programmed to do (online, clean, monitor, and offline). It cannot be smart and go and troubleshoot issues like this.
Now, to get to the Clean action. Suppose that, in the example described above, there is another mount configured in your cluster, one that was not unmounted by the admin. During the failover process the cluster has to unmount that resource, but some user is logged in and currently working in the folder. If the cluster tries to unmount the folder, it will not work because the user is in it. So a more drastic approach is needed, and the Clean action is called. The Clean in this case would perform a force unmount (and maybe an fuser too) to kick that user out and unmount the mount point so the service group can fail over. The Clean action is basically "calling in the big guns" to get the job done and to get the failover moving.
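What the Mount agent's Clean amounts to in that busy-mount situation is essentially the following (a simplified sketch run as root; the real agent script is more careful, and the exact fuser/umount options vary by platform):

```shell
MNT=/oradata

# Kill any processes that are holding the mount point busy
fuser -ck "$MNT"

# Then force the unmount so the failover can proceed
umount -f "$MNT"
```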
Hope that makes sense :)