
Patrol Agent Oracle going offline why?

semi_vcs_expert
Level 3

Hi,

 

We have a server where, for some reason, the Patrol agent resources keep going offline. I found out that the DBAs were using a local script to stop port 3181, and I thought that script might be calling the stop and start scripts outside of VCS. However, I'm no longer convinced, going by these logs:

 

2010/11/26 07:10:19 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB3P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 07:15:32 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB3P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 07:17:39 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB3P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 2 of 2) the resource.

2010/11/26 07:26:59 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSWINP_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 07:27:37 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSWINP_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 07:35:22 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSUNXP_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 07:36:06 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSUNXP_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 07:44:48 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB1P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 07:46:53 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB1P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 08:00:41 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB2P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 08:01:37 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB2P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 08:30:49 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB3P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 08:32:50 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB3P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 08:49:02 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB1P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 08:51:13 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB1P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 09:16:18 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSUNXP_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 09:16:55 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSUNXP_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 09:34:06 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSUNXP_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 09:34:36 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSUNXP_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 09:58:43 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB2P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 09:59:15 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB2P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 10:33:54 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB3P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 10:34:27 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB3P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 12:50:18 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB2P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 12:50:56 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB2P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.

2010/11/26 13:50:24 VCS ERROR V-16-2-13067 (euqstdb1p) Agent is calling clean for resource(FMSDB1P_app_patrol) because the resource became OFFLINE unexpectedly, on its own.

2010/11/26 13:50:57 VCS ERROR V-16-2-13073 (euqstdb1p) Resource(FMSDB1P_app_patrol) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.
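
To see which resources are flapping most often, the restart messages above can be tallied per resource. This is only a rough sketch: the function name is made up for illustration, and the engine log path in the usage comment is an assumption about a default install.

```shell
# Tally "became OFFLINE unexpectedly" events (message ID V-16-2-13073)
# per resource. Reads engine log lines on stdin and prints
# "<count> <resource>", highest count first.
tally_offline_events() {
    sed -n 's/.*V-16-2-13073.*Resource(\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Usage (log path is an assumption; adjust for your installation):
#   tally_offline_events < /var/VRTSvcs/log/engine_A.log
```

Only the V-16-2-13073 lines are counted, so the matching "calling clean" lines (V-16-2-13067) do not inflate the totals.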

 

One of the DBAs pointed out that the script used to monitor the Patrol Agent is returning status codes 110 and 105. What do these mean in terms of VCS?

 


Gaurav_S
Moderator

Hello,

In VCS, a monitor return code of 110 means the resource is online with a 100% confidence level, and a return code of 100 means it is offline.

I would suggest checking the monitor script to make sure it returns 110 for the online case and 100 for the offline case, so that VCS understands these codes.

If you still face issues after correcting the return codes, I would recommend going through the monitor script thoroughly to see where its logic is failing, as the logs suggest that something happened to the resource outside of VCS.

 

Gaurav

semi_vcs_expert
Level 3

Agreed. This is what I advised the DBA. I have also suggested turning on debug logging for the Application agent on the cluster:

 

/opt/VRTSvcs/bin/haconf -makerw
/opt/VRTSvcs/bin/hatype -modify Application LogDbg DBG_AGDEBUG
/opt/VRTSvcs/bin/haconf -makero 

 

perhaps this might help shed some light.

Gaurav_S
Moderator

Yes, that may help, but the debug output written to the engine log comes from the agent's C code and is best analyzed by Symantec support, so you would need to open a support case for that.

However, before that, I would recommend checking the monitor script logic. If it is a shell script, you can run it outside VCS and look at its exit code (with echo $?). That may tell you whether the problem lies in VCS monitoring or in the monitoring logic itself.
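
For example, running the script by hand could look something like the sketch below. The function name and the stand-in script are hypothetical and are there only so the example runs anywhere; point the function at the real patrol_vcs.sh instead.

```shell
# Run a monitor script by hand and print the exit code VCS would see.
# $1 is the path to the script; "monitor" is the argument that selects
# the monitor branch.
show_monitor_code() {
    "$1" monitor
    echo "monitor exit code: $?"
}

# Stand-in script that always exits 105 (hypothetical, for illustration).
demo=$(mktemp)
printf '#!/bin/sh\nexit 105\n' > "$demo"
chmod +x "$demo"
show_monitor_code "$demo"
rm -f "$demo"
```

Running it a few times in a row, ideally around the times the resource faults, should show whether the exit code really changes between 100, 105 and 110.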

Can you paste or attach the monitor script?

Gaurav

semi_vcs_expert
Level 3

#!/bin/ksh

#

#       BMC PATROL Agent start/stop/monitor script

#

# Filename:         <ORACLE_ADMIN/SID/PATROL>/patrol_vcs.sh

#

# Syntax:           must be run only from ROOT user for start/stop

#

# Script Usage:     <ORACLE_ADMIN>/<SID>/PATROL/patrol_vcs.sh stop/start/monitor

# Description:      This shell script is the action script to stop / start / monitor the PATROL Agent for Oracle

#

#############################################################################

#

# HISTORY:

#

# Date              Version  Modified By                        Modifications

#

# 19/07/10          1.0          mp61105 ? DBA infra support    Initial release.

# 20/07/10          1.1          dr69316 ? EDUXSOL              Monitoring

#

#############################################################################

#

# Files Accessed: /opt/bmc/patrol/Patrol3/patrolrc.sh

#

# Files Created:

#

# Workfiles:

#

#############################################################################

#

#########################

# Environment Variables #

#########################

ORACLE_SID=FMSWINP;export ORACLE_SID

PATROL_HOME=/opt/bmc/patrol/Patrol3

CLUSTER_LOG=/tmp"/cluster_"$ORACLE_SID".log"

PATROL_ORA_HOME=/opt/VRTSvcs/bin/${ORACLE_SID}/PATROL

# #

###     For each Database to be monitored on the same server you'll need a

###     different unique portnumber

# #

PATROL_PORT=4002;export PATROL_PORT

# #

####    This is normaly set to the alias name of the IP for that Database

# #

 

#PATROL_ORA_DISPLAYNAME=eurrdc-ora-rad9.emea.citicorp.com;export PATROL_ORA_DISPLAYNAME

PATROL_ORA_DISPLAYNAME=fmswinp-ln-dbs;export PATROL_ORA_DISPLAYNAME

SU=/sbin/su.static

 

################

# Main Script body  #

################

case "$1" in

'start')

        ps -ef | grep "PatrolAgent -p $PATROL_PORT" | grep -v grep >/dev/null

        if [ $? != 0 ]

        then

                # cd $PATROL_HOME;. ./patrolrc.sh;PatrolAgent -p $PATROL_PORT -id $PATROL_ORA_DISPLAYNAME & > /dev/null

                $SU - patrolag -c "cd $PATROL_HOME;. ./patrolrc.sh;PatrolAgent -p $PATROL_PORT -id $PATROL_ORA_DISPLAYNAME & > /dev/null"

        fi

        exit 0

        ;;

'monitor')

        ECODE=100

        # pmon for that DB instance

        PC=`ps -fe | grep -w ora_pmon_${ORACLE_SID} | grep -v grep | wc -l`

        # PatrolAgent for that DB instance

        FC=`ps -fe | grep "PatrolAgent -p ${PATROL_PORT}" | grep -v grep | wc -l`

        if [ $FC = 1 ] ; then

           ###  We know know that the process is running

           ECODE=105

           if [ $PC = 1 ] ; then

           ###  Were confident that what's to be monitored is running

              ECODE=110

           fi

        fi

        exit $ECODE

        ;;

'stop')

        ps -ef | grep "PatrolAgent -p $PATROL_PORT" | grep -v grep >/dev/null

        if [ $? = 0 ]

        then

                # cd $PATROL_HOME;. ./patrolrc.sh;pconfig +KILL -p $PATROL_PORT > /dev/null

                $SU - patrolag -c "cd $PATROL_HOME;. ./patrolrc.sh;pconfig +KILL -p $PATROL_PORT > /dev/null"

                count=0

                while true

                do

                        ps -ef | grep "PatrolAgent -p $PATROL_PORT" | grep -v grep >/dev/null

                        if [ $? != 0 ]

                        then

                                break

                        fi

 

                        if [[ $count == "4" ]]

                        then

                                for i in `ps -ef | grep "PatrolAgent -p $PATROL_PORT" | awk '{print $2}'`

                                do

                                        kill -9 $i

                                done

                        fi

                        count=`expr $count + 1 `

                        sleep 10

                done

        fi

    exit 0

    ;;

*)

        echo "Usage: $0 { start | stop | monitor }"

        exit 1

        ;;

esac

exit 0

Gaurav_S
Moderator

so here is the monitor code:

==========================================

ECODE=100

        # pmon for that DB instance

        PC=`ps -fe | grep -w ora_pmon_${ORACLE_SID} | grep -v grep | wc -l`

        # PatrolAgent for that DB instance

        FC=`ps -fe | grep "PatrolAgent -p ${PATROL_PORT}" | grep -v grep | wc -l`

        if [ $FC = 1 ] ; then

           ###  We know know that the process is running

           ECODE=105

           if [ $PC = 1 ] ; then

           ###  Were confident that what's to be monitored is running

              ECODE=110

           fi

        fi

        exit $ECODE

======================================================

Have you tried running the ps commands used for the PC and FC variables to see what output they produce?

Moreover, the logic goes like this:

The exit code is first set to 100 (offline). However, if FC or PC comes to 1, it counts as online (105 or 110). I am not so sure, so please correct me if I am wrong, but wouldn't you want BOTH FC and PC to be 1 for the monitor to report online? In the above code, what if FC=0 and PC=1? Will it still return 110, which means online to VCS?

 

Gaurav

semi_vcs_expert
Level 3

You make a very good point about what would happen if FC=0 and PC=1.

Here is the output from ps:

ps -fe | grep -w ora_pmon_${ORACLE_SID} | grep -v grep | wc -l
       1

ps -fe | grep "PatrolAgent -p ${PATROL_PORT}" | grep -v grep | wc -l
       4
 

So in this case PC=1 but FC=4. The logic above only checks for a value of 1. Could this be why we are seeing those offlines?
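
One way to check is to feed a few count combinations through the same nested if logic. This is a small sketch, not the script itself: the function name is made up, and it takes the counts as arguments instead of running ps.

```shell
# Same decision logic as the posted monitor branch.
# $1 = FC (PatrolAgent process count), $2 = PC (pmon process count)
monitor_ecode() {
    ecode=100
    if [ "$1" = 1 ]; then
        ecode=105
        if [ "$2" = 1 ]; then
            ecode=110
        fi
    fi
    echo "$ecode"
}

# Try a few FC/PC combinations, including the observed FC=4 PC=1.
for pair in "1 1" "1 0" "0 1" "4 1"; do
    set -- $pair
    echo "FC=$1 PC=$2 -> $(monitor_ecode "$1" "$2")"
done
```

Because the PC test sits inside the FC test, any FC other than exactly 1 gives 100, whatever PC is. So FC=4 with PC=1 is reported as offline, and FC=0 with PC=1 gives 100 as well, not 110.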
 

Gaurav_S
Moderator

Well, I am not sure at this point whether this is the exact cause, as there might also be something wrong at the application end (whether the application is really going down or VCS is just detecting it that way), which I can't predict. But I am not confident about the values coming out here for FC and PC.

In your initial comment you said that the exit codes include 105 as well, which means that at some point FC was also 1. So I suspect that the number of processes may be changing, and hence wc -l is reporting different values.

I would still recommend testing and modifying the monitor script. I would suggest taking the following into consideration:

-- Put in an AND condition so that both FC and PC are checked; if either one doesn't report the expected value, the script should return 100 (unsuccessful).

-- If you observe that the number of processes changes or may change, use a comparison operator in your logic rather than an exact value of 1, e.g. if [ $PC -ge 1 ] or if [ $FC -le 1 ]. This will be more robust and should take care of changing values.

 

Hope this helps..

Gaurav

semi_vcs_expert
Level 3

Ah, well, in the output above the wc counter is showing 4 for FC and 1 for PC.

You mention using AND but then also mention using OR.

So should I be doing:

 

[[ $PC -ge 1 ]] || [[ $FC -le 1 ]]

or [[ $PC = 1 ]] && [[ $FC = 1 ]]

 

Gaurav_S
Moderator
Moderator
   VIP    Certified

The "or" in my last post was just joining two examples; it was not meant as an operator.

You should use the AND operator.

-- If you think the values of PC and FC should be, or always will be, 1:

if [[ $PC = 1 ]] && [[ $FC = 1 ]]

then

exit 110

else

exit 100

fi

-- If you think the values of PC and FC could be 1 or greater, then:

if [[ $PC -ge 1 ]] && [[ $FC -ge 1 ]]

then

exit 110

else

exit 100

fi

You don't need the OR operator.
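
Put together as a function, the second variant might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and the counts are passed in as arguments so the logic can be tried without a running PatrolAgent.

```shell
# Revised decision logic: report ONLINE (110) only when both counts
# are at least 1, otherwise OFFLINE (100).
# $1 = FC (PatrolAgent process count), $2 = PC (pmon process count)
monitor_status() {
    if [ "$1" -ge 1 ] && [ "$2" -ge 1 ]; then
        return 110
    fi
    return 100
}

# In the real monitor branch the counts would come from ps, for example:
#   FC=`ps -fe | grep "PatrolAgent -p ${PATROL_PORT}" | grep -v grep | wc -l | tr -d ' '`
#   PC=`ps -fe | grep -w ora_pmon_${ORACLE_SID} | grep -v grep | wc -l | tr -d ' '`
#   monitor_status "$FC" "$PC"; exit $?
```

The tr -d ' ' strips the leading blanks that wc -l prints on some platforms, so the numeric comparison does not trip over them.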

Gaurav

semi_vcs_expert
Level 3

Thanks. VCS reckons anything that is not 110 is a problem, but I haven't seen anything about 105. Any idea?

Gaurav_S
Moderator

From the VCS Bundled Agents guide:

MonitorProgram can return the following VCSAgResState values: OFFLINE value is 100; ONLINE values range from 101 to 110 (depending on the confidence level); 110 equals confidence level of 100%. Any other value = UNKNOWN.

So 105 is an online state, but with a 50% confidence level.
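
As a rough illustration of those ranges, a helper like the one below classifies an exit code. The function name is made up, and the linear 10%-per-step confidence mapping is an assumption drawn from 110 being 100% and 105 being 50%.

```shell
# Classify a MonitorProgram exit code using the ranges quoted above.
vcs_state() {
    code=$1
    if [ "$code" -eq 100 ]; then
        echo "OFFLINE"
    elif [ "$code" -ge 101 ] && [ "$code" -le 110 ]; then
        # Assumption: confidence rises by 10% per step above 100.
        echo "ONLINE (confidence $(( (code - 100) * 10 ))%)"
    else
        echo "UNKNOWN"
    fi
}

vcs_state 100
vcs_state 105
vcs_state 110
vcs_state 1
```

Anything outside 100 to 110, such as an exit code of 1 from a failed command, falls into UNKNOWN.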

Have a look at the VCS User's Guide and the VCS Bundled Agents guide for more details.

The guides can be found at

https://sort.symantec.com/documents

 

Gaurav

semi_vcs_expert
Level 3

So maybe when it becomes 105 the resource is 50% online but waiting for some dependent resource to come online as well?

Gaurav_S
Moderator

Nope, there is no relation to dependent resources. It simply means that the resource is online, but VCS is not fully confident about it.