problem with VRTS ONG resource

bonny6
Level 3

I've got a problem. I have two servers in a VRTS cluster.

Although the VRTS ONG process is online, I'm getting this message very often:

 MIG_MPM_1a mpm1a (Veritas_Cluster_Server): ONG (ONG): Resource state is unknown

and also this message

 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300

I have compared both configuration files and nothing is missing.

What could be the problem here?

Thanks,


kunal
Level 4
Employee
Hi,

This seems to be a problem with the monitor script.

When monitor procedure is run, VCS expects the return code of 100-110 to determine the state of the resource.

Talking about the errors:

"MIG_MPM_1a mpm1a (Veritas_Cluster_Server): ONG (ONG): Resource state is unknown"

This error means that the monitor procedure did not return a status code in the range 100-110.

" VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300"

This seems to be an error returned by the script "/opt/VRTSvcs/bin/ONG/monitor".

Looking at the error, it seems that this is a shell script and one of the "if" conditions in the script is failing to evaluate.

I hope this helps.

Regards,
Kunal


Anoop_Kumar1
Level 5

Kunal pointed in the right direction.

A monitor script should be able to declare a known state (i.e. online/offline/faulty) of the resource. If the state is unknown, the monitor is failing to detect any known state of the resource. For the known states of the resource, valid exit codes (i.e. 100/110) should be defined in the monitor script.

You can perform a simple test yourself by running the monitor script to find out what exit code it returns to the agent.

# /opt/VRTSvcs/bin/ONG/monitor
# echo $?

If the echo command above shows an exit code other than 100/110, that could be the problem. Please check the monitor script.
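The check above can be wrapped in a small helper. This is only a sketch: describe_rc is a hypothetical name, and the mapping assumes the convention mentioned in this thread (100 for offline, 110 for online), plus the assumption that any code from 101 to 110 counts as online and everything else is reported as unknown.

```shell
#!/bin/sh
# describe_rc is a hypothetical helper, not part of VCS or the ONG agent.
# Assumed mapping: 100 = offline, 101-110 = online, anything else = unknown.
describe_rc() {
    rc=$1
    if [ "$rc" -eq 100 ]; then
        echo "offline"
    elif [ "$rc" -ge 101 ] && [ "$rc" -le 110 ]; then
        echo "online"
    else
        echo "unknown"
    fi
}

# Literal codes are used here so the sketch runs anywhere.
describe_rc 110
describe_rc 100
describe_rc 2
```

On the cluster node the helper would be fed the real exit code, for example by running /opt/VRTSvcs/bin/ONG/monitor and then describe_rc $? on the next line.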

bonny6
Level 3

Thanks a lot, but after comparing the monitor script with the one on another VRTS machine that is working fine, I saw that nothing is different in the syntax.

I also ran the script as user oracle and got 110 from the command # echo $?
so I also get the right output.

I'm still trying to figure out what it could be and what else I can check.

And again, thanks.


Anoop_Kumar1
Level 5
Then it could be a user permission issue.

Is there any user attribute for this resource? Compare with another good VRTS machine and check.

Can you paste the output of the command below from a good VRTS machine and from this machine?

# hares -display ONG

bonny6
Level 3

The user attributes are good; everyone can read and execute.

I've attached two screenshots, one of a good VRTS machine and one of the bad VRTS machine.

Please let me know if this helps.


Thanks ,


Anoop_Kumar1
Level 5

The snapshot above shows that the ONG resource is online on node mpm1a. That means it's working fine.

Now we need to check the function "test" in the monitor script.

/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300

It's not causing any problem for the resource, but it is generating messages in the engine log file. I feel "unknown operator 2300" belongs to some function error.

Is it possible to extract the test function from /opt/VRTSvcs/bin/ONG/monitor for us?

bonny6
Level 3
Yes, thanks. I know that everything is working fine, but I still get these error messages about once a day:


2010/08/04 06:36:38 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300
2010/08/05 12:07:20 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 20069
2010/08/05 21:58:01 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 12070
2010/08/06 00:59:10 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300


I don't have a function named test; I don't know why the error displays "test".

And, as I mentioned at the start, I also get a second error:

MIG_MPM_1a mpm1a (Veritas_Cluster_Server): ONG (ONG): Resource state is unknown

I've attached the monitor script.

Thanks ,

bonny6
Level 3

I think this is a bug, because the script returns the values 100 and 110, and once a day, as the log shows, the monitor script dumps a number that is not supposed to be there.

Does that make sense?

Can someone tell me whether these random numbers are related to something?

Gaurav_S
Moderator
   VIP    Certified

Just to double check, can you manually run the "ps -ef" commands defined in the monitor script for the process1 & process2 variables and paste the output here?

Gaurav

bonny6
Level 3
Hi

As requested, here is the "ps -ef" output for ONG process1 & process2. I have also included a screenshot of the monitor script's output, just to show that the script works fine.


Thanks,


Gaurav_S
Moderator
   VIP    Certified

Hi Bonny..

Thanks for pasting it... 

I have two things to say. Your monitor script returns code 110 if either one of the processes exists (ong_agent OR ong_alerter), since they are part of an if/elif chain. Is this correct? Or do you need both processes to be up?

Your process logic seems correct; however, if we look at the error:

2010/08/04 06:36:38 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300
2010/08/05 12:07:20 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 20069
2010/08/05 21:58:01 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 12070
2010/08/06 00:59:10 VCS INFO V-16-2-13001 (mpm1a) Resource(ONG): Output of the completed operation (monitor)
/opt/VRTSvcs/bin/ONG/monitor: test: unknown operator 2300

See the random numbers? 2300 / 20069 / 12070. I am wondering what these are! Are these PIDs?

If I take a look at your ps output, 2300 seems to be the PID for ong_agent. This might be a coincidence, but the ps -ef pipeline defined in the monitor script ends with awk '{print $2}', which prints the PID, in this case 2300.

I can't say for sure, but I would suggest you try a different logic in your monitor script, e.g.

process1=`ps -ef |grep "ong_agent"  |grep -v grep |wc -l`

You can then branch on the result:

If the above gives 0, that means there is no process, so the script can return code 100.

If the above gives 1 (that means the process exists), the script can return code 110.


Gaurav

bonny6
Level 3
Hi Gaurav,

For your first question: yes, I need both processes up, so I want code 110 only if they are both up.

I think the 2300 in the error output was a coincidence, because none of the other PIDs exist.

The problem here is that I have the same script on another VRTS server and it works fine.

It's so weird.

I will check the change you wrote here, but I still don't know if it is the cause.

Thanks

Gaurav_S
Moderator
   VIP    Certified
Hi Bonny,

As you say, no PID other than 2300 exists. Is there a possibility that the same process is restarting with different PIDs? If the process restarts, it might get a different PID.

Regarding the other point: the monitor script logic is OK (since the elif is applied to return code 100). The modification I suggested could serve as a test. Again, this test is useful only if we confirm that the numbers coming up in the errors are nothing but PIDs.

Gaurav

rregunta
Level 4
Hello Bonny,

Is the ONG agent provided by Symantec, or is it a custom-built agent?

Regards
Rajesh

bonny6
Level 3

Hi ,

I'm positive that the error numbers are not PIDs. The ONG resource is provided by a third-party company for Oracle.

I have no idea how to continue investigating this. All the configuration is the same as on another good server, the script returns the right results, and still this error keeps coming even though the resource is still working.

kunal
Level 4
Employee
Hi,

I still believe that this error " test: unknown operator 2300" is due to the if condition.

the shell used for your script is /bin/sh

I just tried a small test on this shell.


if [ "foo" abc "foo" ] ;
> then
> echo ABC
> fi
test: unknown operator abc


 if [ "foo" = "foo" ] ;
> then
> echo ABC
> fi
ABC

I understand that in your case the second example applies, but looking at the output above, it seems that at some point of the day the if condition acts strangely; no idea why.

This still does not answer the question of why it happens just once a day and with random numbers. The only number in your if condition is the PID. So, as Gaurav said, it might be worth considering a different approach to the script.

Another option might be to use a different shell for your script:

instead of /bin/sh, try /usr/bin/bash. This will show whether the error changes.

Hope this helps.
 

g_lee
Level 6
bonny,

The error you are getting suggests the ps command in the monitor script is picking up more than one instance of ong_agent and/or ong_alerter, so the test is failing.

example:
Here is a process with many instances:
# ps -ef |grep httpd
   juser  9111  9094   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser 10190  9094   0 12:31:56 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser  9112  9094   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser  9109  9094   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
    root  9094  6589   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser  9110  9094   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser 11247  9094   0 12:39:14 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser 12314  9094   0 12:42:03 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser  9113  9094   0 12:27:48 ?           0:00 /usr/local/apache2/bin/httpd -k start
   juser 10195  9094   0 12:32:29 ?           0:00 /usr/local/apache2/bin/httpd -k start
    root 16818 16541   0 13:06:11 pts/4       0:00 grep httpd

Substitute this process into the monitor script logic:
# process1=`ps -ef | grep httpd | grep -v grep | awk '{ print $2 }'`
# if [ X$process1 = "X" ]; then
> echo 100
> else
> echo 110
> fi
test: unknown operator 10190
^^^^^^^ this is the error you are getting (with a different process number, obviously)
this is because test expects $process1 to be a single argument, but it's not:
# echo $process1
9111 10190 9112 9109 9094 9110 11247 12314 9113 10195

Compare to a "good" process with only one instance:
# ps -ef |grep lpsched
    root  6858  6589   0   Jan 21 ?           0:00 /usr/lib/lp/local/lpsched
    root 17617 16541   0 13:07:45 pts/4       0:00 grep lpsched
# process2=`ps -ef | grep lpsched | grep -v grep | awk '{print $2 }'`
# if [ X$process2 = "X" ]; then
> echo 100
> else
> echo 110
> fi
110
^^^^^^^ correct output
# echo $process2
6858
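A minimal sketch of one way to make the original comparison itself tolerate several PIDs, assuming the monitor script builds its variables as shown above: quoting the expansion hands the whole PID list to test as a single argument. check_pids is a hypothetical helper, and the PID lists are simulated stand-ins for the awk '{ print $2 }' output.

```shell
#!/bin/sh
# Simulated PID lists standing in for the awk '{ print $2 }' output.
many_pids="9111 10190 9112"
one_pid="6858"
no_pids=""

# check_pids is a hypothetical helper, not part of the ONG monitor script.
check_pids() {
    # The quotes keep the whole list as one argument to test, so a list of
    # several PIDs no longer produces "unknown operator".
    if [ "X$1" = "X" ]; then
        echo 100
    else
        echo 110
    fi
}

check_pids "$many_pids"
check_pids "$one_pid"
check_pids "$no_pids"
```

With quoting, any non-empty list is treated as "process found", so this only hides the extra instances; it does not explain where they come from.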

Re: why it's picking up multiple instances, it's not possible to determine that from here (some possibilities: someone or something else might be running the program manually at the same time, or running a process/file with the same name).

If the application can have multiple instances running (i.e. it is fine as long as the monitor can find at least one instance), then the monitor script can be modified as follows as a workaround (similar to Gaurav's suggestion, but it will account for multiple lines found in ps):

process1=`ps -ef | grep '/'ong_agent | grep -vc grep`
process2=`ps -ef | grep '/'ong_alerter | grep -vc grep`

# Check the process of the ODM
if [ $process1 -le 0 ] ; then
    retcode=100
elif [ $process2 -le 0 ] ; then
    retcode=100
else
    retcode=110
fi

If the application cannot handle multiple processes running (i.e. there should only be one process running at a time, and any more is a problem), then you will need to investigate on your system or follow up with the ONG vendor to see where/how the extra processes are being run.
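Since the error shows up only about once a day, one way to investigate is to capture the full ps lines at the moment more than one match is seen. The following is only a sketch under assumptions: snapshot_if_multiple is a hypothetical helper, and the pattern and log file are whatever you pass in. It reads a ps -ef listing on stdin so it can be tried on saved output first.

```shell
#!/bin/sh
# snapshot_if_multiple is a hypothetical helper, not part of the ONG agent.
# It reads a ps -ef listing on stdin. When more than one line matches the
# pattern, it appends the matching lines to a log file with a timestamp.
# It always prints the number of matches on stdout.
snapshot_if_multiple() {
    pattern=$1
    logfile=$2
    # "|| true" keeps the helper usable under "set -e" when nothing matches.
    matches=`grep "$pattern" | grep -v grep || true`
    count=`printf '%s\n' "$matches" | grep -c . || true`
    if [ "$count" -gt 1 ]; then
        {
            echo "=== `date`: $count matches for $pattern ==="
            printf '%s\n' "$matches"
        } >> "$logfile"
    fi
    echo "$count"
}
```

A possible call from the monitor script would be ps -ef | snapshot_if_multiple '/ong_agent' /var/tmp/ong_extra.log, where the log path is only an example. The logged lines show the owner, parent PID and full command of each extra match.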


bonny6
Level 3
Hi, thanks a lot. As it seems from your reproduction, the problem is that the variable can't handle multiple PIDs. I will take it from here.

One more little thing: as I mentioned at the start, I've got a second error:

MIG_MPM_1a mpm1a (Veritas_Cluster_Server): ONG (ONG): Resource state is unknown

Is it possible that it happens because of the first error, i.e. the ONG resource goes to an unknown state after the monitor script fails?

Gaurav_S
Moderator
   VIP    Certified
Perfect catch Grace....

Hi Bonny,

Does this message come along with the "unknown operator" message, or does it come at different times?

Gaurav