Solved: RESNOTOFF not triggered.

justinfay · ‎06-16-2015

Hi,

I am following the Veritas Cluster Server Administrator's Guide for Linux and trying to trigger the resnotoff script. From the documentation it is my understanding that if a resource faults and the clean command returns 1, resnotoff should be triggered.

To begin, my service group is in an ONLINE state:

[root@node1 ~]# hastatus -sum | grep test
B Grp_CS_c1_testservice node1                Y          N               ONLINE
B Grp_CS_c1_testservice node2                Y          N               ONLINE

I have the clean limit set to 1 and the clean script set to /bin/false to force this to return an error exit code.

Res_App_c1_fmmed1_testapplication ArgListValues         node1      User 1       root    StartProgram    1       "/usr/share/litp/vcs_lsb_start vmservice 5" StopProgram     1       "/usr/share/litp/vcs_lsb_stop vmservice 5"      CleanProgram    1       /bin/false MonitorProgram   1       "/usr/share/litp/vcs_lsb_status vmservice"      PidFiles        0       MonitorProcesses        0       EnvFile     1       ""      UseSUDash       1       0       State   1       2       IState 1       0
Res_App_c1_fmmed1_testapplication ArgListValues         node2      User 1       root    StartProgram    1       "/usr/share/litp/vcs_lsb_start vmservice 5" StopProgram     1       "/usr/share/litp/vcs_lsb_stop vmservice 5"      CleanProgram    1       /bin/false MonitorProgram   1       "/usr/share/litp/vcs_lsb_status vmservice"      PidFiles        0       MonitorProcesses        0       EnvFile     1       ""      UseSUDash       1       0       State   1       2       IState 1       0
Res_App_c1_fmmed1_testapplication CleanProgram          global     /bin/false
Res_App_c1_fmmed1_testapplication CleanRetryLimit       global     1

The resnotoff trigger is enabled for this resource:

Res_App_c1_fmmed1_testapplication TriggersEnabled global RESNOTOFF

Now I manually kill the service in Grp_CS_c1_testservice on node1 and see the following in /var/log/messages:

Jun 16 17:02:33 node1 AgentFramework[10323]: VCS ERROR V-16-2-13067 Thread(4147325808) Agent is calling clean for resource(Res_App_c1_fmmed1_testapplication) because the resource became OFFLINE unexpectedly, on its own.
Jun 16 17:02:33 node1 Had[9975]: VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(Res_App_c1_fmmed1_testapplication) because the resource became OFFLINE unexpectedly, on its own.
Jun 16 17:02:34 node1 AgentFramework[10323]: VCS ERROR V-16-2-13069 Thread(4147325808) Resource(Res_App_c1_fmmed1_testapplication) - clean failed.

and in engine_A.log:

2015/06/16 17:02:33 VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(Res_App_c1_fmmed1_testapplication) because the resource became OFFLINE unexpectedly, on its own.
2015/06/16 17:02:34 VCS INFO V-16-10031-504 (node1) Application:Res_App_c1_fmmed1_testapplication:clean:Executed /bin/false as user root
2015/06/16 17:02:35 VCS ERROR V-16-2-13069 (node1) Resource(Res_App_c1_fmmed1_testapplication) - clean failed.
2015/06/16 17:03:35 VCS ERROR V-16-1-50148 ADMIN_WAIT flag set for resource Res_App_c1_fmmed1_testapplication on system node1 with the reason 4
2015/06/16 17:03:35 VCS INFO V-16-10031-504 (node1) Application:Res_App_c1_fmmed1_testapplication:clean:Executed /bin/false as user root

From my understanding of the VCS Administrator's Guide section titled 'VCS behavior when an online resource faults', resnotoff should be triggered. However, it is not, and the resource goes to an ADMIN WAIT state.

group           resource             system          message
--------------- -------------------- --------------- --------------------
                Res_App_c1_fmmed1_testapplication node1           |ADMIN WAIT|

Is it possible to get resnotoff triggered for a cluster in this state, or do I need to use the resadminwait trigger (contrary to the documentation)?

Thanks,

Sunil_Yadav · ‎06-16-2015

Hi,

1. The “resnotoff” trigger is invoked on the system if a resource in a service group does not go offline even after the offline command has been issued to the resource. For the scenario you described, “resnotoff” is not the right trigger.

2. “From the documentation it is my understanding that if a resource faults and the clean command returns 1, resnotoff should be triggered.”

No, not “resnotoff”. When a resource faults, the “resfault” trigger is invoked.

3. A resource is marked FAULTED if the clean entry point succeeds after the unexpected offline. If the clean entry point fails, the resource is marked ADMIN_WAIT. The engine log shows the same behavior: the clean entry point failed, so resource Res_App_c1_fmmed1_testapplication was marked ADMIN_WAIT (not FAULTED).
This scenario is neither “resnotoff” nor “resfault”. In this case, the “resadminwait” trigger will be invoked (if configured).
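
The three points above can be summarized as a small decision function. This is only a toy model of the behavior described in this thread, not VCS code: the function name `expected_trigger` and its scenario labels are made up for illustration, and in a real cluster a trigger only runs if it is configured and enabled.

```shell
#!/bin/sh
# Toy model of the trigger selection described in this thread.
# Hypothetical helper, not part of VCS.
#   $1: "unexpected_offline" (resource went offline on its own) or
#       "offline_failed"     (resource did not go offline when asked to)
#   $2: exit code of the clean entry point (only used for unexpected_offline)
expected_trigger() {
    case "$1" in
        unexpected_offline)
            if [ "$2" -eq 0 ]; then
                echo resfault       # clean succeeded: resource is marked FAULTED
            else
                echo resadminwait   # clean failed: resource is marked ADMIN_WAIT
            fi
            ;;
        offline_failed)
            echo resnotoff          # resource did not go offline
            ;;
        *)
            echo "unknown scenario: $1" >&2
            return 1
            ;;
    esac
}

# The scenario from the original post: resource killed, CleanProgram=/bin/false
expected_trigger unexpected_offline 1   # prints "resadminwait"
```

With these assumptions, the only path that reaches resnotoff is the one where VCS itself tried to take the resource offline.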

Thanks & Regards,
Sunil Y

mikebounds · ‎06-16-2015

I have had a look at the 6.1 documentation with regard to resfault and resnotoff, and it is misleading at best in some places and clearly wrong in others. There are so many mistakes that I will try to write an article, as it is too long to document everything here.

But your issue is a little simpler:

If an ONLINE resource reports offline, VCS calls clean. Because the monitor reported offline, resNOToff does not apply: the resource IS offline. By having clean return 1, you are telling VCS that you were unable to clean up the resource (for example, the process is down but its shared memory segments can't be removed). Since VCS cannot make the resource clean, it puts the resource into ADMIN_WAIT.

If VCS offlines a resource, and the resource is still NOT offline after the offline runs AND still NOT offline after the clean runs, then the resnotoff trigger is called. However, in this scenario, if you are using the Application agent, the CleanProgram must exit 0 even when it is unsuccessful, as the Bundled Agents guide says for the Application agent:

Note: If the CleanProgram executable returns a non-zero value, the agent treats it as a clean failure and the resource will not fault.

If you write an agent from scratch, I am pretty sure VCS passes the clean reason to the agent framework, so you can exit non-zero when clean fails after being called for "resource fault" and exit zero when clean fails after being called for "offline failed". The Application agent's CleanProgram is not passed the clean reason, though, so it is probably better to always exit 0.

If this doesn't help with what you are trying to achieve, then please explain what you are trying to achieve with a real scenario you want VCS to react to.
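
As a rough illustration of the "always exit 0" advice, a CleanProgram could be a thin wrapper around the real cleanup steps. This is a minimal sketch under assumptions: `run_clean` is a hypothetical helper, not something shipped with VCS, and `false` stands in for a cleanup command that fails.

```shell
#!/bin/sh
# Hypothetical CleanProgram wrapper (not from the VCS documentation).
# Runs a best-effort cleanup command and always reports success, so a failed
# cleanup is not treated by the Application agent as a clean failure.
run_clean() {
    if ! "$@"; then
        # Record the failure for the administrator, but do not propagate it.
        echo "cleanup command '$*' failed; reporting success anyway" >&2
    fi
    return 0
}

# Example: 'false' stands in for a cleanup step that fails.
run_clean false
echo "exit code: $?"   # prints "exit code: 0"
```

The trade-off is that a genuine cleanup failure after an unexpected offline is hidden as well, so the resource would be treated as cleaned rather than going to ADMIN_WAIT.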

Mike

Sunil_Yadav · ‎08-04-2015

Gentle reminder! This discussion has been open for the last 1.5 months and keeps popping up in the "Can you solve these?" section. If your query is resolved, please mark the appropriate comments as the solution.

Thanks & Regards,
Sunil Y

RESNOTOFF not triggered.