Solved: VCS Warning for Unknown State

Elvis_L_ · ‎03-24-2011

Hi,

I'm just curious why I received a Warning notification for the Netlsnr resource group when the error was not logged in engine_A.log.

I have read the VCS documentation but the only hint I have is this.

Resource state is unknown. | Warning | VCS cannot identify the state of the resource.

Can anyone provide a better explanation of what could have caused VCS to send the warning email?

Wally_Heim · ‎03-24-2011

Hi Elvis L,

The monitor entry point of every resource has basically 3 return values for the state of a resource: Online, Offline, or Unknown. If the monitor entry point is not able to determine whether a resource is Online or Offline, it returns Unknown.

The Unknown warning is just to let you know that VCS was unable to determine the state of the resource. Given that the state of the resource is unknown to VCS, VCS will not be able to control the resource. In other words, VCS cannot online or offline a resource that is in an Unknown state.

Most of the time there is nothing really reported in the engine log for this. You might find something in the agent-specific log on the node that was having problems probing the resource, but you might not. If you don't, then try increasing the LogDbg settings on that resource to increase logging and wait for the issue to reoccur.

Thanks,

Wally

FYI - There are actually 10 return codes for the Online state, with increasing levels of confidence in the Online state of a given resource. The 3 return states that I mentioned are for simplicity.

Gaurav_S · ‎03-24-2011

Hello,

Do you mean to say there was nothing in engine_A.log at that time? Not even a warning/notice for the Netlsnr resource? Did you see anything in Netlsnr_A.log? Was there any spike in CPU/memory/load around the time this email came?

What VCS version and OS version are you using?

Gaurav

mikebounds · ‎03-24-2011

"Resource state is unknown" could mean the monitor entry point failed or timed out, but this message should have been logged to the engine log. Any message you get sent via SNMP or SMTP should be in the engine log.

So it seems you have 2 issues:

1. The message sent via SNMP or SMTP was not logged in the engine log.
2. The Netlsnr agent could not probe the resource. As Gaurav says, you should look at Netlsnr_A.log.

Mike

Elvis_L_ · ‎03-24-2011

Mike and Gaurav,

I appreciate the analysis done here. There was not a single entry for that event in either log (engine_A or Netlsnr). However, I did notice from sar that the CPU spiked within about ±5 minutes of the moment the notification came. To update the situation: I eventually got the error below approximately 2 hours later, and system CPU also spiked within the same ±5 minute window. Could this be linked to the CPU spike, or was there an issue with the Netlsnr monitoring script interval?

Entity Name: OracleLSNR
Entity Type: Resource
Entity Subtype: Netlsnr
Entity State: Resource monitoring has timed out
Traps Origin: Veritas_Cluster_Server
System Name: server01
Entities Container Name: LsnrSG
Entities Container Type: Service Group
Entities Owner: unknown

I'm using VCS 4.1 on Solaris 8 U5.

Elvis_L_ · ‎03-24-2011

Wally,

Great! I believe I have found the answer to the puzzle. Where in the official Symantec documentation could I find a specific explanation of the reasons behind the Unknown state of a resource group? Also, what are the possible reasons behind the Unknown state? What system factors might lead VCS to issue an Unknown warning notification for Netlsnr? Why did it affect only Netlsnr and not the rest of the resource groups?

May I also ask which configuration setting directly adjusts LogDbg, and what its default value is?

Thanks in advance!

Wally_Heim · ‎03-24-2011

Hi Elvis,

You should check the Oracle Agent Guide for the specific version of the product that you are using for details on the Netlsnr resource states.

In general, the Bundled Agents Guide and the Agent Developer's Guide will provide general information on resource states.

As for the LogDbg settings, they are different for each resource type and for each platform. In general, I would use DBG_1, DBG_2, DBG_20 and DBG_21 in this case.

Thanks,

Wally

Wally_Heim · ‎03-24-2011

Hi Elvis,

The default values are also dependent on the resource and the platform. I don't know offhand what the default LogDbg value is for the Netlsnr resource. When you go to edit the attribute you will be able to see what it is currently set to. If I had to guess, I would say that it is blank, as that seems to be the default for most resources, but not all.

Thanks,

Wally

Wally_Heim · ‎03-24-2011

Hi Elvis,

Sorry to keep posting bits of this at a time. I'm doing a little research on this because the Oracle agents are not my strong point. I'm also a Windows guy. But VCS is very similar on all platforms supported.

In Windows, the Netlsnr resource queries the SCM for the status of the listener service. If the service is not in a "Stopped" (offline) state or a "Started" (online) state, then it would be in an "Unknown" state. VCS does allow time during starting and stopping of the service to account for service states of "stopping" and "starting", but during normal monitoring it is looking for just the "Stopped" or "Started" states.

Other platforms would be monitoring via other methods than querying SCM but would basically be doing the same types of checks.
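
The check described above can be pictured as a simple mapping from service state to resource state. This is only an illustrative Python sketch, not the agent's actual code: the function name `resource_state_from_service` and the state strings are assumptions taken from the wording in this post.

```python
def resource_state_from_service(service_state: str) -> str:
    """Map a listener service state to a resource state (hypothetical helper)."""
    normalized = service_state.strip().lower()
    if normalized == "started":
        return "ONLINE"
    if normalized == "stopped":
        return "OFFLINE"
    # Transitional or unexpected states ("starting", "stopping", ...) cannot be
    # classified during normal monitoring, so they are reported as unknown.
    return "UNKNOWN"
```

With this sketch, a service caught in "stopping" during a normal monitor cycle would be reported as UNKNOWN rather than OFFLINE.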

Thanks,

Wally

Elvis_L_ · ‎03-24-2011

Wally, thanks for the prompt update. May I know where in the official Symantec documentation I could find your statements below? I need a citable source so that my customers can see I am referring to official documentation from Symantec.

The monitor entry point of every resource has basically 3 return values for the state of a resource: Online, Offline, or Unknown. If the monitor entry point is not able to determine whether a resource is Online or Offline, it returns Unknown.

The Unknown warning is just to let you know that VCS was unable to determine the state of the resource. Given that the state of the resource is unknown to VCS, VCS will not be able to control the resource. In other words, VCS cannot online or offline a resource that is in an Unknown state.

g_lee · ‎03-24-2011

Elvis,

Note: The following references are taken from the VCS 4.1 (Solaris) documentation (since you stated you're "using VCS 4.1 on Solaris 8 U5"). For other versions, select the relevant platform/version category from the SORT documents page: https://sort.symantec.com/documents

VCS 4.1 User's Guide ( https://sort.symantec.com/public/documents/sf/4.1/solaris/pdf/vcs_users.pdf ) p13
-------------------------
Agent Operations
Agents carry out specific operations on resources on behalf of the cluster engine. The functions an agent performs are entry points, code sections that carry out specific functions, such as online, offline, and monitor. Entry points can be compiled into the agent itself or can be implemented as individual Perl scripts. For details on any of the following entry points, see the VERITAS Cluster Server Agent Developers Guide.
• Online. Brings a specific resource ONLINE from an OFFLINE state.
• Offline. Takes a resource from an ONLINE state to an OFFLINE state.
• Monitor. Tests the status of a resource to determine if the resource is online or offline.
During initial node startup, the monitor entry point probes and determines the status of all resources on the system. The monitor entry point runs after every online and offline operation to verify the operation was successful.
The monitor entry point is also run periodically to verify that the resource remains in its correct state. Under normal circumstances, the monitor is run every 60 seconds when a resource is online, and every 300 seconds when a resource is expected to be offline.
-------------------------

VCS 4.1 Agent Developer's Guide ( https://sort.symantec.com/public/documents/sf/4.1/solaris/pdf/vcs_agent_dev.pdf ) p20

(note: sections in bold are my emphasis to assist with answering your queries)

Agent Entry Points -> monitor
-------------------------
monitor
The monitor entry point typically contains the code to determine status of the resource. For example, the monitor entry point of the IP agent checks whether or not an IP address is configured, and returns the state online, offline, or unknown.
[...]
It returns the resource status (online, offline, or unknown), and the confidence level 0-100. The confidence level is informative only and is not used by VCS. It is returned only when the resource status is online.
[...]
If the exit value of the monitor script entry point falls outside the range 100-110, the status is considered unknown.
-------------------------

Again from the Agent Developer's Guide -> State Transition Diagrams (p121)
-------------------------
Opening a resource (p122)
[...]
When the resource is enabled (Enabled=1), the open entry point is invoked. Periodic Monitoring begins after open times out or succeeds. Depending on the return value of monitor, the resource transitions to either the Online or the Offline state. In the unlikely event that monitor times out or returns unknown, the resource stays in a Probing state.

Onlining a resource: ManageFaults = ALL (p124)
[...]
If monitor returns a status of online, the resource moves to the Online state.
If, however, the monitor times out, or returns a status of “not Online” (that is, unknown or offline), the agent returns the resource to the Going Online Waiting state and waits for the next monitor cycle.

Resource fault without automatic restart (p126)
If clean succeeds, the resource is placed in the Going Offline Waiting state, where the agent waits for the next monitor.
[...]
• If monitor reports unknown or times out, the agent places the resource back into the Going Offline Waiting state, and sets the UNABLE_TO_OFFLINE flag in the engine.
-------------------------

i.e., although there may not be a single definitive statement/document with Wally's exact wording, the statements are still correct and supported by the documentation.

Hope that helps.

regards,

Grace

Daniel_Schnack · ‎03-24-2011

Hi Elvis,

This information is documented in the VCS 5.1 SP2 Agent Developer's Guide (for Windows) which can be downloaded from the Symantec Operations Readiness site, https://sort.symantec.com/documents/doc_details/sfha/5.1%20SP2/Windows/ProductGuides/. The name of the guide (if you search on the page) is "Veritas Cluster Server 5.1 SP2 Agent Developer's Guide ". The information you are looking for is on page 34 (please excuse the formatting as this is a copy-and-paste):

Entry Point Return Values

Monitor

C++ Based Returns ResStateValues:

VCSAgResOnline
VCSAgResOffline
VCSAgResUnknown
VCSAgResIntentionalOffline

Script-Based Exit values:

99 - Unknown
100 - Offline
101-110 - Online
200 - Intentional Offline
Other values - Unknown.
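
To make the exit-value table above concrete, here is a minimal Python sketch of how a script-based monitor exit value could be interpreted. It is not VCS code: the function name `interpret_monitor_exit` is hypothetical, the value 200 comes from the 5.1 SP2 Windows guide quoted here (the 4.1 guide treats anything outside 100-110 as unknown), and the linear mapping of 101-110 onto a 10-100 confidence level is an assumption.

```python
from typing import Optional, Tuple


def interpret_monitor_exit(code: int) -> Tuple[str, Optional[int]]:
    """Return (state, confidence) for a monitor script exit value.

    Hypothetical helper based on the table quoted above.
    """
    if code == 100:
        return ("OFFLINE", None)
    if 101 <= code <= 110:
        # Assumption: confidence scales linearly, 101 -> 10 ... 110 -> 100.
        return ("ONLINE", (code - 100) * 10)
    if code == 200:
        return ("INTENTIONAL_OFFLINE", None)
    # 99 and every other value are treated as unknown.
    return ("UNKNOWN", None)
```

Note that a script that dies with an ordinary error status such as 1 falls into the last branch, which is one way a resource can end up Unknown without the script ever deciding so explicitly.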

I hope that this answers your question.

Daniel

mikebounds · ‎03-24-2011

Can you post your Netlsnr resource config from main.cf? If you have it set up with the defaults, then all the agent is doing is looking for the lsnr process, so monitor should not have any problems; and if the CPU spiked so severely that the system couldn't get a process listing, I would expect a lot of agents to have monitor problems. If you have configured advanced monitoring, then a slow database can cause monitor to time out. In UNIX, a monitor timeout should always be reported to the engine log, and I am surprised that this is not true on Windows also. If the monitor times out 4 times in a row, the resource will fault.
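
The "4 times in a row" rule can be sketched as a consecutive-timeout counter. This Python snippet is only an illustration under assumptions: the function name `faults_on_timeouts` and the outcome strings are made up for the example, and the limit of 4 is simply the figure stated in this post.

```python
from typing import Iterable


def faults_on_timeouts(outcomes: Iterable[str], limit: int = 4) -> bool:
    """Return True if `limit` monitor timeouts occur back to back.

    `outcomes` is a sequence of monitor results such as "online",
    "offline", "unknown", or "timeout" (hypothetical labels).
    """
    consecutive = 0
    for outcome in outcomes:
        if outcome == "timeout":
            consecutive += 1
            if consecutive >= limit:
                return True
        else:
            # Any completed monitor cycle resets the streak.
            consecutive = 0
    return False
```

Under this model, isolated timeouts separated by successful monitor cycles would produce notifications but never a fault, which matches a resource that warns occasionally yet stays online.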

Mike

Elvis_L_ · ‎03-24-2011

Mike,

Netlsnr config as posted below.

Netlsnr OracleLSNR (
    Critical = 0
    Owner = oracle
    Home = "/opt/oracle/product/8.1.7.4.0"
    TnsAdmin = "/var/opt/oracle/network/admin"
    MonScript = "./bin/Netlsnr/LsnrTest.pl"
    )

mikebounds · ‎03-25-2011

Looking at the agent documentation, it appears the Netlsnr agent changed from 5.0 to 5.1 so that it now does detailed monitoring by default (but the Oracle agent still does basic monitoring by default). So the agent is running the script specified by MonScript, which runs some command to check the health of the listener. This command probably timed out, in which case the error may be logged to Netlsnr_A.log rather than engine_A.log.

Mike

Elvis_L_ · ‎03-25-2011

Mike,

I'm fairly confident it was caused by the monitor timeout. But can we assume the logger also failed to record the event in the Netlsnr or engine_A log?

mikebounds · ‎03-25-2011

If you have no messages in your engine log or Netlsnr log, then unless /var was full I would consider this to be a bug, and I would log a call with Symantec Support. If you look at the code of the Netlsnr monitor script, there are 3 reasons you could get "Resource State unknown" (this is from the 5.1 agent code, so it may be different for 4.1):

1. VCSAG_LOG_MSG ("E","Oracle home directory $Home does not exist",1,$Home);
   exit $VCSAG_RES_UNKNOWN;

2. VCSAG_LOG_MSG ( "E" ,"lsnrctl not found in $Home/bin", 4,$Home);
   exit $VCSAG_RES_UNKNOWN; # Resource state is UNKNOWN

3. VCSAG_LOG_MSG ( "E" , "lsnrctl operation timed out",14);
   exit $VCSAG_RES_UNKNOWN;

I am guessing 1 and 2 are unlikely, so the issue is more than likely 3. Logging the message to the agent log (Netlsnr log) should be the first thing that happens; then the Notifier daemon gets the message from the queue and sends a notification depending on the severity.

Mike

Elvis_L_ · ‎03-25-2011

I believe this is the Perl equivalent of the same script for VCS 5.1:

sub catch_alrm {
    if ( $AgentDebug == 1 ) {
        VCSAG_LOG_MSG ( "E", "lsnrctl operation timed out", 14 );
    }
    exit (99); # Resource state is UNKNOWN
}
#
$ret = netlsnrlib::check_oracle_client ( $Home, $LSNRMGR );

if ($ret) {
    VCSAG_LOG_MSG ( "E", "lsnrctl not found in $Home/bin", 4, $Home );
    exit 99; # Resource state is UNKNOWN
}
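
In the snippet above, the timeout message is only logged when $AgentDebug is 1, while the script exits with 99 either way. A rough Python sketch of the same pattern follows. It is not the agent's code: `run_check`, its parameters, and the in-memory `log` list are hypothetical stand-ins for the lsnrctl call and the agent log.

```python
import subprocess
import sys
from typing import List

RES_UNKNOWN = 99  # exit value for "unknown", as in the Perl snippet above


def run_check(cmd: List[str], timeout_s: float, debug: bool, log: List[str]) -> int:
    """Run a check command; return its exit code, or 99 if it times out."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # Mirrors catch_alrm: the message is written only in debug mode,
        # but the unknown status is returned regardless.
        if debug:
            log.append("lsnrctl operation timed out")
        return RES_UNKNOWN
    return completed.returncode
```

Calling `run_check([sys.executable, "-c", "import time; time.sleep(5)"], 0.5, debug=False, log=[])` returns 99 and leaves the log empty, which would look like an Unknown notification with no matching log entry.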

I will definitely log a case with Symantec, as VCS has never exhibited this issue before in our environment, rather than beating around the bush guessing whether it was a timeout or a monitoring lapse.
