
Manage time interval of ERROR V-16-2-13074 has consistently failed to determine the resource status.

nandoammeon2020
Level 2

Hi everyone, 

I've been running some tests with the monitor program.

After many monitor failures, the messages below came up:

2020/10/15 14:37:45 VCS ERROR V-16-2-13074 (cloud-svc-4) The monitoring program for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 1 of 3) the resource.
2020/10/15 15:13:50 VCS ERROR V-16-2-13074 (cloud-svc-4) The monitoring program for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 2 of 3) the resource.
2020/10/15 15:40:49 VCS ERROR V-16-2-13074 (cloud-svc-4) The monitoring program for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 3 of 3) the resource.

After the third message above, there was another monitor program issue and my service became faulted:

2020/10/15 16:39:49 VCS ERROR V-16-2-13027 (cloud-svc-4) Resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) - monitor procedure did not complete within the expected time.
2020/10/15 16:40:49 VCS ERROR V-16-2-13210 (cloud-svc-4) Agent is calling clean for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) because 3 successive invocations of the monitor procedure did not complete within the expected time.
2020/10/15 16:41:47 VCS INFO V-16-10031-504 (cloud-svc-4) Application:Res_App_svc_cluster_cmserv_vm_service_cmserv:clean:Executed /sbin/service as user root
2020/10/15 16:41:58 VCS INFO V-16-2-13716 (cloud-svc-4) Resource(Res_App_svc_cluster_cmserv_vm_service_cmserv): Output of the completed operation (clean)


2020/10/15 16:41:58 VCS INFO V-16-2-13068 (cloud-svc-4) Resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) - clean completed successfully.
2020/10/15 16:41:59 VCS INFO V-16-2-13026 (cloud-svc-4) Resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) - monitor procedure finished successfully after failing to complete within the expected time for (3) consecutive times.
2020/10/15 16:41:59 VCS INFO V-16-1-10307 Resource Res_App_svc_cluster_cmserv_vm_service_cmserv (Owner: Unspecified, Group: Grp_CS_svc_cluster_cmserv) is offline on cloud-svc-4 (Not initiated by VCS)
2020/10/15 16:41:59 VCS ERROR V-16-1-10205 Group Grp_CS_svc_cluster_cmserv is faulted on system cloud-svc-4

 

My question is about the message I mentioned first:

V-16-2-13074 -> The monitoring program for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 1 of 3) the resource.

Is there anything that controls the limit on these attempts? Here the limit appears to be 3. Where can I see this configuration? I did not see anything related to it in main.cf.

The second and main question is about the time interval of this procedure.

We have the first error at 14:37:45, the second at 15:13:50, the third at 15:40:49, and then the monitor program failed again and the service faulted at 16:41:59.

How do I manage the time interval of this procedure / error? It looks like whenever the error "V-16-2-13074 -> has consistently failed" occurs, at any time of day, the attempt counter keeps running.

If the service has had 3 attempts between 2am and 5am, there is a risk that it becomes faulted at 11am if the monitor program does not get the service status within the expected time.
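
For reference, the gaps between the V-16-2-13074 entries quoted above can be worked out with a small script. This is only a sketch: the function name `restart_gaps` is made up for illustration, the message text in the sample is shortened, and it assumes all entries fall on the same calendar day.

```shell
#!/bin/sh
# Print the gap between consecutive V-16-2-13074 entries read from stdin.
# Assumption: field 2 is HH:MM:SS and all entries are from the same day.
restart_gaps() {
  awk '/V-16-2-13074/ {
    split($2, t, ":")
    now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
    if (seen) {
      d = now - prev
      printf "%s -> %s  %dm%02ds\n", prev_ts, $2, int(d / 60), d % 60
    }
    prev = now; prev_ts = $2; seen = 1
  }'
}

# Sample input: the three timestamps from this post (message text shortened).
restart_gaps <<'EOF'
2020/10/15 14:37:45 VCS ERROR V-16-2-13074 (cloud-svc-4) attempt number 1 of 3
2020/10/15 15:13:50 VCS ERROR V-16-2-13074 (cloud-svc-4) attempt number 2 of 3
2020/10/15 15:40:49 VCS ERROR V-16-2-13074 (cloud-svc-4) attempt number 3 of 3
EOF
```

For the sample this prints gaps of 36m05s and 26m59s. On a cluster node the engine log could be piped in instead of the sample (typically /var/VRTSvcs/log/engine_A.log, but check the path on your installation).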

 

Regards,

Fernando Santos

3 REPLIES

frankgfan
Level 6
   VIP   

There are many tunables on each VCS agent/resource which the user can tune to make VCS perform/"behave" as needed.

Take a look at this technote https://sort.veritas.com/public/documents/vcs/6.0.1/aix/productguides/html/vcs_admin/ch01s04s10.htm to get a basic idea of resource monitoring.

To display resource or agent settings, run the commands below:

hatype -display | grep <agent_name>

hares -display | grep <resource_name>

 

Example:

hatype -display | grep Application

Application AEPTimeout 0
Application ActionTimeout 30
Application AdvDbg
Application AgentClass TS
Application AgentDirectory
Application AgentFailedOn
Application AgentFile
Application AgentPriority 0
Application AgentReplyTimeout 130
Application AgentStartTimeout 60
Application AlertOnMonitorTimeouts 0
Application ArgList State IState User StartProgram StopProgram CleanProgram MonitorProgram PidFiles MonitorProcesses EnvFile UseSUDash
Application AttrChangedTimeout 60
Application CleanRetryLimit 0
Application CleanTimeout 60
Application CloseTimeout 60
Application ConfInterval 600
Application ContainerOpts RunInContainer 1 PassCInfo 0
Application EPClass -1
Application EPPriority -1
Application ExternalStateChange
Application FaultOnMonitorTimeouts 4
Application FaultPropagation 1
Application FireDrill 0
Application IMF Mode 3 MonitorFreq 1 RegisterRetryLimit 3
Application IMFRegList MonitorProcesses User PidFiles MonitorProgram StartProgram LevelTwoMonitorFreq
Application InfoInterval 0
Application InfoTimeout 30
Application LevelTwoMonitorFreq 1
Application LogDbg
Application LogFileSize 33554432
Application LogViaHalog 0
Application MonitorInterval 60
Application MonitorStatsParam Frequency 0 ExpectedValue 100 ValueThreshold 100 AvgThreshold 40
Application MonitorTimeout 60
Application NumThreads 10
Application OfflineMonitorInterval 300
Application OfflineTimeout 300
Application OfflineWaitLimit 0
Application OnlineClass -1
Application OnlinePriority -1
Application OnlineRetryLimit 0
Application OnlineTimeout 300
Application OnlineWaitLimit 2
Application OpenTimeout 60
Application Operations OnOff
Application RegList MonitorProcesses User
Application RestartLimit 0
Application ScriptClass TS
Application ScriptPriority 0
Application SourceFile ./types.cf
Application SupportedActions program.vfd user.vfd cksum.vfd getcksum propcv
Application SupportedOperations
Application ToleranceLimit 0
Application TypeOwner
Application TypeRecipients
#
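
Most of the listing above is unrelated to the restart question. A small filter can narrow it down; this is a sketch only. The function name `restart_attrs` is hypothetical, and the choice of attributes is a guess at which ones matter for monitor timeouts and restarts (all of them do appear in the listing above).

```shell
#!/bin/sh
# Keep only attributes that look relevant to monitor timeouts and restarts.
# Reads "Type Attribute Value" lines on stdin, as in the listing above.
# Assumption: the attribute selection below is illustrative, not exhaustive.
restart_attrs() {
  awk '$2 ~ /^(MonitorInterval|MonitorTimeout|FaultOnMonitorTimeouts|RestartLimit|ToleranceLimit|ConfInterval)$/'
}

# Sample input: a few lines copied from the listing above.
restart_attrs <<'EOF'
Application CleanTimeout 60
Application ConfInterval 600
Application FaultOnMonitorTimeouts 4
Application MonitorInterval 60
Application RestartLimit 0
EOF
```

On a cluster node the same filter could be fed from the real command, for example `hatype -display | restart_attrs`.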

To tune an existing agent's or resource's parameters, run the commands below:

haconf -makerw

hatype -modify <modify_opt>    --- please see hatype manpage for the command syntax and usages

haconf -dump -makero

hatype -display <agent_name> to check the new parameters

PS - for tuning resource parameters, just run hares -modify instead of hatype -modify
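
The steps above can be strung together as one small script. This is a sketch under assumptions: the helper names `run` and `set_type_attr` are made up, the value 3 is only an example, and with `DRY_RUN=1` the commands are printed rather than executed so the sequence can be reviewed before touching a cluster.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Change one type-level attribute, following the steps listed above.
# $1 = agent type, $2 = attribute name, $3 = new value
set_type_attr() {
  run haconf -makerw &&                # open the configuration for writing
  run hatype -modify "$1" "$2" "$3" &&
  run haconf -dump -makero &&          # save and return to read-only
  run hatype -display "$1"             # check the new value
}

# Preview only: nothing is changed on the cluster.
DRY_RUN=1
set_type_attr Application RestartLimit 3
```

In dry-run mode this prints the four commands in order. For the interval question, ConfInterval (600 in the listing above) may be worth a look: as far as I understand the VCS attribute descriptions, it is the time a resource must stay online before earlier restart attempts are ignored, but please verify that against the guide for your version.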

For most installations, the default VCS agent/resource parameters should be good enough, and tuning is generally not required.

If you keep receiving resource monitor errors or agent restarts, check the system load to make sure there is no load-related system performance issue.

 

Hi @frankgfan , 

First of all, thanks for your help. 

As per the error below:

2020/10/15 15:40:49 VCS ERROR V-16-2-13074 (cloud-svc-4) The monitoring program for resource(Res_App_svc_cluster_cmserv_vm_service_cmserv) has consistently failed to determine the resource status within the expected time. Agent is restarting (attempt number 3 of 3) the resource.

The attempt number is related to the RestartLimit attribute. Apparently, we can manage the number of these attempts by changing the RestartLimit value.

But it is not clear whether RestartLimit has a time interval.

As mentioned in my previous comment, how do I manage the time interval of RestartLimit? It looks like whenever the error "V-16-2-13074 -> has consistently failed" occurs, at any time of day, the attempt counter keeps running.

So if the RestartLimit value is 3 and the service has had 3 attempts (Agent is restarting (attempt number 3 of 3)) between 2am and 6am, apparently the service can become faulted at 11am if the monitor program does not get the service status within the expected time.

Apparently, there is no time interval setting to manage RestartLimit, because the attempt counter does not return to ZERO one hour after the first attempt.

Is it possible to manage it?

Regards,

Fernando Santos

 

frankgfan
Level 6
   VIP   

Please see my answers/comments to your questions below, labeled A1 and A2.

Q1 "the RestartLimit instruction has a time interval."

A1 There is a MonitorInterval attribute for each agent.

Q2 "if RestartLimit value is 3 and the service has had 3 attempts (Agent is restarting (attempt number 3 of 3)) during 2am and 6am, apparently the service can become faulted at 11am"

A2 If the next monitor is successful, the counter will be reset and the restart limit becomes 3 again. In other words, whether a resource is marked as faulted depends very much on the number of consecutive monitor timeouts.

To resolve the issue, consider the following:

1. patch up VCS

2. review cluster load (load pattern)

I am pretty sure a cluster restart (or a HAD restart) would TEMPORARILY resolve the issue.