07-17-2020 02:17 AM
I have a problem with my VCS setup. It's quite an old version, VCS 6.0.50.0.
Since its initial setup it has been running fine with no issues. However, in the last few days our database listener was suddenly restarted. I did not find any errors in the DB and listener logs, but I did notice from Netlsnr_A.log that the LsnrTest.pl script received a timeout while running and assumed the listener service was OFFLINE, so the cluster issued a kill command.
2020/07/14 20:31:13 VCS INFO V-16-20002-211 (server1) Netlsnr:listener:monitor:Monitor procedure /opt/VRTSagents/ha/bin/Netlsnr/LsnrTest.pl returned the output: LD_LIBRARY_PATH - /usr/lib:
LSNRCTL for Linux: Version 11.2.0.2.0 - Production on 14-JUL-2020 20:30:47
Copyright (c) 1991, 2010, Oracle. All rights reserved.
TNS-12545: Connect failed because target host or object does not exist
TNS-12560: TNS:protocol adapter error
TNS-00515: Connect failed because target host or object does not exist
Linux Error: 110: Connection timed out
2020/07/14 20:31:13 VCS ERROR V-16-2-13067 (server1) Agent is calling clean for resource(listener) because the resource became OFFLINE unexpectedly, on its own.
2020/07/14 20:31:13 VCS NOTICE V-16-20002-42 (server1) Netlsnr:listener:clean:Listener(listener) kill TERM 1390
2020/07/14 20:31:24 VCS INFO V-16-2-13068 (server1) Resource(listener) - clean completed successfully.
Weirdly, I did not notice any spike in server load in the SAR report pulled from that day. With no obvious errors on either the OS or the DB, it's unclear why VCS killed the listener. Has anyone faced a similar issue before, and how did you resolve it?
A workaround I'm using now is to disable the monitoring. I know this isn't recommended, as it disables automatic failover as well, but I've run out of ideas on where else to troubleshoot.
07-17-2020 10:28 PM
This is an Oracle-related issue;
the TNS errors you sent are Oracle errors.
For example, the first TNS error below
TNS-12545: Connect failed because target host or object does not exist
is almost always an issue in your tnsnames.ora file, specifically your ADDRESS parameter, often a bad host name (node name).
If you are not very familiar with Oracle troubleshooting, engage your DBA.
The VCS listener agent monitors as follows: "In the basic monitoring mode, the agent scans the process table for the tnslsnr process to verify that the listener process is running."
You can also manually run the listener monitor script to see if the agent exit code returned is correct when the listener is up.
07-19-2020 06:34 PM
Hi Frank, I do agree that TNS errors are Oracle-related errors. However, I've cross-checked the VCS logs against the Oracle logs for both the DB and the listener and did not find any matching errors. The Oracle logs show everything running fine until the moment the listener was restarted.
This is the part I find hard to understand. Furthermore, this setup has been this way for years without any patching or changes to the infrastructure, or even to tnsnames.ora. The issue crept up just a couple of weeks ago and had been happening intermittently since, until I turned off the monitoring.
I've worked with Oracle's principal support for many years, and normally they won't take any action on an issue like this since there are no errors in the logs pointing to an Oracle-related issue.
Is it possible to alter this script and have it send out warning emails instead of initiating the process restart? That way, if an alert is triggered, I can at least log on to the affected server and verify the issue.
07-20-2020 01:12 AM
There are a few things you can do to troubleshoot the issue further.
As you mentioned that you did not observe a server load spike, we can temporarily rule out high system load as the cause.
Since you noticed that both Oracle and the listener were up and running at the times the listener resource was killed and restarted, we simply assume that the VCS agent "misbehaved".
For instance, the VCS listener agent may not have been able to complete the monitor in the time specified. One thing you can do is increase this resource's timeout value from 60 (the default) to, say, 180 to see if it helps.
You can list most of the agent parameters by running the command below:
hatype -display <agent_name> | grep MonitorTimeout
The parameter can be tuned online.
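As a rough sketch of what that online change might look like for the Netlsnr type (the type name is taken from the agent log in the first post): haconf and hatype are the usual VCS configuration commands, but the dry-run wrapper and the function name set_monitor_timeout are made up for illustration, and 180 is only the example value mentioned above. Please check the commands against the documentation for your release before running them for real.

```shell
#!/bin/sh
# Sketch only: raise MonitorTimeout for an agent type.
# DRY_RUN=1 (the default here) just prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# set_monitor_timeout is a hypothetical helper, not a VCS command.
set_monitor_timeout() {
    agent_type="$1"   # e.g. Netlsnr
    seconds="$2"      # e.g. 180
    run haconf -makerw || return 1                                    # open the cluster config for writing
    run hatype -modify "$agent_type" MonitorTimeout "$seconds" || return 1
    run haconf -dump -makero                                          # save and close the config
}
```

Calling set_monitor_timeout Netlsnr 180 with the default DRY_RUN=1 only echoes the three commands; set DRY_RUN=0 as root on a cluster node to apply them. Note that a type-level change affects every resource of that type.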
Please also take a look at this technote https://www.veritas.com/support/en_US/article.100004986
which covers some info about tuning the VCS agent NumThreads attribute.
If all is OK (load is not high, no changes made to the system and network/SAN, the configuration has worked for years, etc.) but the listener resource keeps restarting, you can place the Listener agent in error logging mode to capture all the info needed to troubleshoot the issue further.
If this is a prod cluster, when was the cluster last rebooted? I hope the cluster has not been up and running for over 3 years.
By the way, if possible do not modify any VCS agent online/offline/monitor scripts, as Veritas only supports its products as they are released. If there are defects in the products, engage Veritas for a "hot fix" patch.
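One way to get the kind of warning asked about earlier without touching the agent scripts is a separate probe that runs beside the agent and only logs. This is a minimal sketch, assuming the coreutils timeout command is available on the server; the function name probe, the 30-second limit, the log path and the listener name LISTENER used below are placeholders, not anything VCS or Oracle ships.

```shell
#!/bin/sh
# Sketch only: run a command under a time limit and print one log line
# with a timestamp, the exit code and the elapsed seconds.
# usage: probe LIMIT_SECONDS COMMAND [ARGS...]
probe() {
    limit="$1"
    shift
    start=$(date +%s)
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?                      # timeout exits with 124 when the limit is hit
    end=$(date +%s)
    echo "$(date '+%Y/%m/%d %H:%M:%S') rc=$rc elapsed=$((end - start))s cmd=$*"
    return "$rc"
}
```

Run as the oracle user with the Oracle environment set, a line such as probe 30 lsnrctl status LISTENER >> /tmp/lsnr_probe.log, repeated every minute from cron or a loop, would leave a record of slow or failed checks (rc=124 means the limit was hit) that can be compared with the restart times in the agent log.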
07-21-2020 12:49 AM
Hi Frank,
Noted on the suggestion, I'll try to change the timeout value.
Most of my production servers are rebooted annually, depending on whether we can get any downtime.
On server load: yes, the issue so far is impacting servers which aren't heavily utilized, though I can only check based on the SAR report pulled from the server. If it is network related, I'm unable to check since I do not have any access. Furthermore, the listener script is pinging the listener on the same server, so I doubt the network matters.
Do you know if I can manually initiate the LsnrTest.pl script? I tried to execute it from its folder but got errors, I guess because the variables weren't declared for my current user ID (root):
Use of uninitialized value $Home in substitution (s///) at ./LsnrTest.pl line 69.
Use of uninitialized value $Home in substitution (s///) at ./LsnrTest.pl line 69.
Use of uninitialized value $Owner in concatenation (.) or string at ./LsnrTest.pl line 72.
Use of uninitialized value $Owner in regexp compilation at ./LsnrTest.pl line 72.
07-21-2020 08:06 PM
Sure, you can run the VCS agent online/offline/monitor scripts manually. As a matter of fact, this is sometimes a way to troubleshoot resource performance issues.
The errors you encountered running the LsnrTest.pl script were due to some environment variable settings. You can su - oracle and then try it again.
The other way to test resource monitoring is during a maintenance window: check and make sure all resources are OK, with none in a failed or partial state, then
manually kill the listener process,
run
hastatus
and watch the output for the listener resource to show "unexpected offline outside vcs on its own".
You should see the same errors as the ones you first sent, as well as in engine_A.log.
How many times has the issue occurred? Did it always occur at around the same time? Were any cron jobs, such as NBU backup jobs, running during the times the issue occurred?
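To answer the timing questions above from the logs themselves, something like the following could tally the unexpected-offline events by hour of day. It is a sketch that assumes the log line layout shown in the first post (date, time, then the message ID V-16-2-13067); the function name and the log path mentioned afterwards are assumptions.

```shell
#!/bin/sh
# Sketch only: count "calling clean ... OFFLINE unexpectedly" events
# (message ID V-16-2-13067) per hour of day in a VCS log file.
# usage: clean_events_by_hour /path/to/log
clean_events_by_hour() {
    log_file="$1"
    grep -F 'V-16-2-13067' "$log_file" |
        awk '{ split($2, t, ":"); n[t[1]]++ }     # $2 is the HH:MM:SS field
             END { for (h in n) printf "%s:00 %d\n", h, n[h] }' |
        sort
}
```

For example, clean_events_by_hour /var/VRTSvcs/log/Netlsnr_A.log (adjust the path to wherever your agent log lives) prints one line per hour with the number of events in that hour. If the counts cluster around one hour, compare that hour with the cron and backup schedules.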
The Oracle agent is a very well developed agent; most issues of this kind are not directly related to VCS.
You can verify the log files to confirm whether the shutdown was graceful.
Sample log message:
VCS INFO V-16-1-13470 Resource ORA_oraprod
(Owner: Unspecified, Group: ORA_PROD_Group) is offline on system.
(Intentional But NOT initiated by VCS)
Oracle agent has identified the Intentional offline for the resource.
07-21-2020 08:29 PM
Please ignore the last 6 lines in the previous email.
07-29-2020 10:21 PM
Any progress on troubleshooting the listener unexpected-offline issue? An update on the status would be much appreciated.
frank