Forum Discussion

desmond1212
5 years ago

LsnrTest script receive timeout and terminated Oracle Listener process

I have a problem with my VCS setup; it's quite an old version, VCS 6.0.50.0.

From its initial setup until now it has been running fine with no issues. However, over the last few days our database listener was suddenly restarted. I did not find any errors in the DB and listener logs, but I did notice from Netlsnr_A.log that the LsnrTest.pl script received a timeout while running, so VCS assumed the listener service was OFFLINE and the cluster initiated a kill command.

2020/07/14 20:31:13 VCS INFO V-16-20002-211 (server1) Netlsnr:listener:monitor:Monitor procedure /opt/VRTSagents/ha/bin/Netlsnr/LsnrTest.pl returned the output: LD_LIBRARY_PATH - /usr/lib:

LSNRCTL for Linux: Version 11.2.0.2.0 - Production on 14-JUL-2020 20:30:47

Copyright (c) 1991, 2010, Oracle. All rights reserved.

TNS-12545: Connect failed because target host or object does not exist
TNS-12560: TNS:protocol adapter error
TNS-00515: Connect failed because target host or object does not exist
Linux Error: 110: Connection timed out

2020/07/14 20:31:13 VCS ERROR V-16-2-13067 (server1) Agent is calling clean for resource(listener) because the resource became OFFLINE unexpectedly, on its own.
2020/07/14 20:31:13 VCS NOTICE V-16-20002-42 (server1) Netlsnr:listener:clean:Listener(listener) kill TERM 1390
2020/07/14 20:31:24 VCS INFO V-16-2-13068 (server1) Resource(listener) - clean completed successfully.

Weirdly, I did not see any spike in server load in the SAR report pulled for that day. With no obvious errors on either the OS or the DB, it's unclear why VCS killed the listener. Has anyone faced a similar issue before, and how did you resolve it?

A workaround I'm using now is to disable the monitoring. I know this isn't recommended, as it disables automatic failover as well, but I've run out of ideas on where else to troubleshoot.

  • This is an Oracle-related issue:

    the TNS errors you posted are Oracle errors.

    For example, the first TNS error below

    TNS-12545: Connect failed because target host or object does not exist

    is almost always an issue in your tnsnames.ora file, specifically your ADDRESS parameter, often a bad host name (node name).

    If you are not very familiar with Oracle troubleshooting, engage your DBA.

    The VCS listener agent documentation describes its monitoring as follows: "In the basic monitoring mode, the agent scans the process table for the tnslsnr process to verify that the listener process is running."

    You can also manually run the listener monitor script to see if the exit code it returns is correct when the listener is up.
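
As a rough illustration of what the quoted "basic monitoring" amounts to, here is a minimal shell sketch that scans ps-style output for a tnslsnr process. It is not the agent's actual code; the function name `has_tnslsnr` and the listener name `LISTENER` are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: returns 0 if the ps-style command lines on stdin include a
# tnslsnr process started for the given listener name, 1 otherwise.
# has_tnslsnr is a hypothetical helper, not part of VCS or Oracle.
has_tnslsnr() {
    name=$1
    awk -v name="$name" '
        # $1 is the executable path, $2 is the listener name argument.
        $1 ~ /(^|\/)tnslsnr$/ && toupper($2) == toupper(name) { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Usage against the live process table (LISTENER is an assumed name):
#   ps -eo args | has_tnslsnr LISTENER && echo "listener process present"
```

If this check passes at the moment VCS reports the resource OFFLINE, the process itself was alive and the failure came from the connection test rather than from a missing process.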

    • desmond1212
      Level 2

      Hi Frank, I agree that TNS errors are Oracle-related errors. However, I've cross-checked the VCS logs against the Oracle logs for both the DB and the listener and did not find any matching errors. The Oracle logs show everything running fine until the moment the listener was restarted.

      This is the part I find hard to understand. Furthermore, this setup has been this way for years without any patching or changes to the infrastructure, or even to tnsnames.ora. The issue crept up just a couple of weeks ago and had been happening intermittently ever since, until I turned off the monitoring.

      I've worked with Oracle's principal support for many years now, and normally they won't take any action on an issue like this, since they did not find any errors in the logs pointing to an Oracle-related issue.

      Is it possible to alter this script so that it sends out warning emails instead of initiating the process restart? That way, if an alert is triggered, I can at least log on to the affected server and verify the issue.
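
One way to get a warning without touching the agent's own scripts would be a separate check run outside VCS, for example from cron. The sketch below only classifies `lsnrctl status` output by looking for `TNS-nnnnn` error lines like the ones in the log above. The function name `report_tns_errors`, the listener name `LISTENER`, and the mail address are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: prints one WARNING line per TNS-nnnnn error found on stdin
# and returns 1 if any were found; returns 0 on clean output.
# report_tns_errors is a hypothetical helper, not part of VCS or Oracle.
report_tns_errors() {
    awk '
        /^[[:space:]]*TNS-[0-9]+:/ { print "WARNING: " $0; bad = 1 }
        END { exit(bad ? 1 : 0) }
    '
}

# Hypothetical cron usage, kept separate from the VCS agent scripts
# (LISTENER and the address are placeholders):
#   if ! out=$(lsnrctl status LISTENER 2>&1 | report_tns_errors); then
#       printf '%s\n' "$out" | mailx -s "listener check failed" dba@example.com
#   fi
```

Because this runs independently of the cluster, it only reports; it does not change how the Netlsnr agent decides that the resource is OFFLINE.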

      • frankgfan
        Moderator

        There are a few things you can do to troubleshoot the issue further.

        As you mentioned, you did not observe a server load spike, so we can temporarily rule out high system load as the cause.

        Since you noticed that both Oracle and the listener were up and running at the times the listener resource was killed and restarted, we can simply assume that the VCS agent "misbehaved".

        For instance, if the VCS listener agent was not able to complete the monitor in the time specified, one thing you can do is increase this resource's timeout value (MonitorTimeout) from 60 seconds (the default) to, say, 180 to see if it helps.

        You can list most of the agent parameters by running the command below:

        hatype -display <agent_name> | grep MonitorTimeout

        The parameter can be tuned online.
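
For reference, an online change of this kind usually follows the standard VCS sequence: open the configuration for writing, modify the attribute, then save and close it. The sketch below prints the commands by default and only runs them when `APPLY=1` is set. The type name `Netlsnr` is taken from the log lines in the original post and 180 from the suggestion above; the `run` and `raise_monitor_timeout` helpers are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: dry-run wrapper. Prints each command unless APPLY=1 is set,
# in which case the command is actually executed.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Raise MonitorTimeout for every resource of the given agent type.
raise_monitor_timeout() {
    type_name=$1   # agent type, e.g. Netlsnr
    seconds=$2     # new timeout in seconds, e.g. 180
    run haconf -makerw                                    # open config read-write
    run hatype -modify "$type_name" MonitorTimeout "$seconds"
    run haconf -dump -makero                              # save and set read-only
}

raise_monitor_timeout Netlsnr 180
```

Note that a type-level change applies to all resources of that type; check the result with the `hatype -display` command shown above before and after.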

        Please also take a look at this technote: https://www.veritas.com/support/en_US/article.100004986

        It covers some info about tuning the VCS agent NumThreads attribute.

        If all is OK (load is not high, no changes have been made to the system or network/SAN, the configuration has worked for years, etc.) but the listener resource keeps restarting, you can place the Listener agent in error logging mode to capture all the info needed to troubleshoot the issue further.

        If this is a prod cluster, when was the cluster last rebooted? Hopefully the cluster has not been up and running for over 3 years.

        By the way, if possible, do not modify any VCS agent online/offline/monitor scripts, as Veritas only supports its products as they are released. If there are defects in the products, engage Veritas for a "hot fix" patch.