03-31-2011 09:48 AM
I recently upgraded to VCS 5.1, and I'm noticing that when I bring my application offline, the clean script abnormally exits with exit code (0x9). Another effect, which is what made me notice this in the first place, is that when I bring my application offline, VCS doesn't actually recognize that it has gone offline and doesn't fail it over. I can switch the application manually between both nodes just fine.
Has anyone experienced this behavior before and know what might be causing it? Let me know what logs might be helpful and I can provide them. The VCS clean script is very simple.
03-31-2011 01:10 PM
Another effect (potentially related) is that when I offline my application from the command line (/etc/init.d/app stop), VCS never realizes it's offline, even though the PID I've specified to monitor has been removed. I've noticed that there are different options in 5.1 than in 5.0.3 (like confidence level), and I don't know if that has anything to do with it.
03-31-2011 03:04 PM
It depends on what your monitor script is observing... if your app is offline, the monitor script should detect that. Can you paste the monitoring script?
G
04-01-2011 02:27 AM
Information that would be helpful is an engine log (/var/VRTSvcs/log/engine_A.log) extract from when the offline of the resource was initiated to the point the clean failed, and also an extract of your resource from main.cf (in /etc/VRTSvcs/conf/config).
The problem could be that your offline script exits too quickly, so the PID file is still there when the monitor runs. You could temporarily try putting a sleep at the end of the offline script; if that works, then code something more sophisticated.
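A rough sketch of the "something more sophisticated" idea: after stopping the app, wait (up to a timeout) for the PID file to disappear instead of sleeping a fixed amount. The helper name and paths here are examples, not from the actual config:

```shell
# Example helper for the end of an offline script: poll until the PID file
# is gone, or give up after TIMEOUT seconds. Returns 0 once the file is
# removed, non-zero if it is still present after the timeout.
wait_for_pidfile_removal() {
    file=$1
    timeout=$2   # seconds
    i=0
    while [ "$i" -lt "$timeout" ]; do
        [ ! -f "$file" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    [ ! -f "$file" ]   # final check decides the return code
}
```

The offline script would then end with something like `/etc/init.d/httpd stop` followed by `wait_for_pidfile_removal /var/run/httpd.pid 30`.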
Mike
04-01-2011 12:37 PM
I have a support case open now, and I've narrowed it down to specific, repeatable behavior. It has to do with the clean script... but the clean script is written correctly.
I've simplified things and am only dealing with Apache now. The scripts are below. I've modified the clean script to be a little smarter so it isn't trying to remove PIDs that are already gone, and things like that.
If I offline Apache with the init script, or by causing a fault (rm pid), Veritas complains that the clean script is not happy. If I run the clean script manually, I get a return code of 0.
3 scripts:
2011/04/01 15:24:57 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:24:57 VCS ERROR V-16-2-13069 (linopstfg02.prod.domain.com) Resource(teamforge_app_server) - clean failed.
2011/04/01 15:25:58 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25595
0
==============================================
2011/04/01 15:25:58 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:26:59 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25642
0
==============================================
2011/04/01 15:26:59 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:28:00 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25739
0
==============================================
2011/04/01 15:28:00 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:29:02 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25847
0
==============================================
2011/04/01 15:29:02 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:30:03 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25910
0
==============================================
2011/04/01 15:30:03 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:30:12 VCS INFO V-16-1-50135 User vcsadmin fired command: MSG_RES_PROBE teamforge_app_server linopstfg02.prod.domain.com from 10.17.32.130
2011/04/01 15:30:12 VCS INFO V-16-10031-504 (linopstfg02.prod.domain.com) Application:teamforge_app_server:clean:Executed /usr/local/bin/apachekill as user root
2011/04/01 15:30:22 VCS WARNING V-16-10031-542 (linopstfg02.prod.domain.com) Application:teamforge_app_server:clean:PidFile </var/run/httpd.pid> does not exist, process will not be killed
2011/04/01 15:30:23 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
hi <----this is me echoing hi and exiting 0 immediately; basically disabling the clean script from actually doing anything it's supposed to do
==============================================
2011/04/01 15:30:23 VCS INFO V-16-2-13078 (linopstfg02.prod.domain.com) Resource(teamforge_app_server) - clean completed successfully after 6 failed attempts.
2011/04/01 15:30:23 VCS INFO V-16-1-10307 Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) is offline on linopstfg02.prod.domain.com (Not initiated by VCS)
2011/04/01 15:30:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS INFO V-16-1-10305 Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) is offline on linopstfg02.prod.domain.com (VCS initiated)
2011/04/01 15:30:24 VCS ERROR V-16-1-10205 Group VM_VCS_Teamforge is faulted on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS NOTICE V-16-1-10446 Group VM_VCS_Teamforge is offline on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS INFO V-16-1-10493 Evaluating linopstfg01.prod.domain.com as potential target node for group VM_VCS_Teamforge
2011/04/01 15:30:24 VCS INFO V-16-1-10493 Evaluating linopstfg02.prod.domain.com as potential target node for group VM_VCS_Teamforge
2011/04/01 15:30:24 VCS INFO V-16-1-50010 Group VM_VCS_Teamforge is online or faulted on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS NOTICE V-16-1-10301 Initiating Online of Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg01.prod.domain.com
2011/04/01 15:30:32 VCS INFO V-16-1-10298 Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) is online on linopstfg01.prod.domain.com (VCS initiated)
2011/04/01 15:30:32 VCS NOTICE V-16-1-10301 Initiating Online of Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg01.prod.domain.com
2011/04/01 15:30:32 VCS INFO V-16-10031-504 (linopstfg01.prod.domain.com) Application:teamforge_app_server:online:Executed /usr/local/bin/apachestart as user root
2011/04/01 15:30:33 VCS INFO V-16-2-13716 (linopstfg01.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (online)
==============================================
Starting httpd: [ OK ]
==============================================
2011/04/01 15:30:33 VCS INFO V-16-1-10298 Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) is online on linopstfg01.prod.domain.com (VCS initiated)
2011/04/01 15:30:33 VCS NOTICE V-16-1-10447 Group VM_VCS_Teamforge is online on system linopstfg01.prod.domain.com
2011/04/01 15:30:33 VCS NOTICE V-16-1-10448 Group VM_VCS_Teamforge failed over to system linopstfg01.prod.domain.com
04-01-2011 12:45 PM
main.cf export
04-01-2011 03:44 PM
VCS comes bundled with an Apache agent. Can you not use that agent rather than the Application agent?
Mike
04-01-2011 03:46 PM
Hello,
Looking at the very first stage of the script: VCS doesn't understand exit codes of 0 and 1; you need to check for this in the monitor script.
For VCS, success is exit code 110, while failure is exit code 100. There are more levels, from 101-109, which define the confidence level; however, for simplicity you can use 100 and 110 instead of 0 and 1.
Now, I understand that at the Unix level the exit codes would be 0 and 1; however, they should be fed to VCS in the terms VCS understands. Now, in the resource definition in main.cf:
Application teamforge_app_server (
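The exit-code convention above can be sketched as a small shell helper (hypothetical; the PID-file path in the usage note is just an example):

```shell
# Hypothetical monitor helper following the VCS convention described above:
# return 110 when the monitored process is alive (ONLINE), 100 otherwise
# (OFFLINE). Codes 101-109 would express intermediate confidence levels.
check_state() {
    pidfile=$1
    # kill -0 sends no signal; it only tests whether the PID exists.
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 110   # ONLINE
    fi
    return 100       # OFFLINE
}
```

A MonitorProgram script could then end with `check_state /var/run/httpd.pid; exit $?`.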
04-01-2011 05:53 PM
re: VCS comes bundled with an Apache agent - can you not use this agent, rather than using Application agent?
I guess I could. The only reason I'm monitoring the Apache PID is that I figured it was the easiest thing to do. The application isn't just Apache; it's actually Apache plus four Java processes, but I had trouble trying to check the Java processes.
re: the monitor program... I don't have a monitor program specified (none of our VCS-enabled resources in other clusters do). We usually just have the start/stop/clean programs and check PIDs. Is the monitor program something the Application agent relies on more in 5.1, since this setup used to work in 5.0.3? I'll create a monitor program and test to see if that improves the situation.
04-01-2011 06:43 PM
I just tried the monitor program, and it caused a clean to run and bring down the application (without faulting the node). This is strange, because the monitor program should have responded saying that Apache was alive (100 code).
Below is the script, and the return code of a run on a node that has it running.
04-02-2011 05:51 AM
Your exit codes are the wrong way round:
110 should be ONLINE
100 should be OFFLINE
Mike
04-02-2011 07:01 AM
I reversed the codes... it doesn't look like that made a difference. I'm still getting the abnormal clean exit code. I also added exit 110 to the end of the clean script in case Veritas was looking for that to report a different code, but that didn't seem to be the case either.
04-02-2011 01:27 PM
Looking at your logs, this looks like a bug in the Application agent, based on the log entry:
Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
This is /opt/VRTSvcs/bin/Application/clean itself abnormally terminating, not your clean script /usr/local/bin/apachekill. Regardless of what your script does, the Application clean entry point should not abnormally terminate. I would push back on support. Alternatively, you could try using Agent Builder (see https://sort.symantec.com/agents/detail/171) rather than the Application agent.
Mike
04-02-2011 04:13 PM
I've noticed some really interesting behavior... I shortened my script further, and now I'm just killing httpd and removing the PID file, then exiting 0.
What I've noticed is that the clean script (the Perl script that Veritas provides) is abnormally exiting with the SAME signal that I'm using to kill Apache. Look at the logs below: the first set terminates due to signal 9 (kill -9), exit code 0x9, whereas the second terminates abnormally due to signal 15 (kill -TERM), exit code 0xf (15).
The question is... why is the Perl program killing itself when my clean script executes (which is what it seems to be doing)?
[root@linopstfg02 bin]# vi apachekill
#!/bin/bash
set -x
kill -9 `ps -ef | grep httpd | grep -v grep | awk {'print $2'}`
rm /var/run/httpd.pid
echo hi
exit 0
[root@linopstfg02 bin]# ./apachekill
++ ps -ef
++ grep httpd
++ grep -v grep
++ awk '{print $2}'
+ kill -9 29355 29358 29359 29360 29361 29362 29363 29364 29365
+ rm /var/run/httpd.pid
+ echo hi
hi
+ exit 0
[root@linopstfg02 bin]# 2011/04/02 19:08:07 VCS ERROR V-16-2-13067 Thread(4153691024) Agent is calling clean for resource(teamforge_app_server) because the resource became OFFLINE unexpectedly, on its own.
2011/04/02 19:08:08 VCS WARNING V-16-10031-13196 Thread(4153691024) script (/opt/VRTSvcs/bin/Application/clean) terminated due to signal (9)
2011/04/02 19:08:08 VCS ERROR V-16-2-13040 Thread(4153691024) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/02 19:08:08 VCS ERROR V-16-2-13069 Thread(4153691024) Resource(teamforge_app_server) - clean failed.
[root@linopstfg02 bin]#
[root@linopstfg02 bin]# vi apachekill
#!/bin/bash
set -x
kill -TERM `ps -ef | grep httpd | grep -v grep | awk {'print $2'}`
rm /var/run/httpd.pid
echo hi
exit 0
"apachekill" 25L, 359C written
[root@linopstfg02 bin]# ./apachekill
++ ps -ef
++ grep httpd
++ grep -v grep
++ awk '{print $2}'
+ kill -TERM 29813 29816 29817 29818 29819 29820 29821 29822 29823
+ rm /var/run/httpd.pid
+ echo hi
hi
+ exit 0
[root@linopstfg02 bin]# 2011/04/02 19:08:59 VCS ERROR V-16-2-13067 Thread(4153691024) Agent is calling clean for resource(teamforge_app_server) because the resource became OFFLINE unexpectedly, on its own.
2011/04/02 19:09:00 VCS WARNING V-16-10031-13196 Thread(4153691024) script (/opt/VRTSvcs/bin/Application/clean) terminated due to signal (15)
2011/04/02 19:09:00 VCS ERROR V-16-2-13040 Thread(4153691024) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0xf).
2011/04/02 19:09:00 VCS ERROR V-16-2-13069 Thread(4153691024) Resource(teamforge_app_server) - clean failed.
04-02-2011 04:53 PM
Confirming this... I changed my "clean script" to shut down Apache gracefully (/etc/init.d/httpd stop). That isn't really useful, because if the clean script needs to run it's because the graceful stop didn't work, but it does allow faulting to occur and Apache to fail over between the nodes.
So there's definitely something funny about the Veritas Perl script and running a kill command in a clean script in VCS 5.1, because these same scripts worked in 5.0.3 :(
04-03-2011 02:05 AM
I have figured out what your issue probably is: you are killing any process matching "httpd", but the clean entry point also matches this, so your script is killing /opt/VRTSvcs/bin/Application/clean. If you exclude it (e.g., add "grep -v /opt/VRTSvcs/bin/Application/clean"), the script should work.
But actually, you don't need a clean script, as the agent's clean (i.e., the code in /opt/VRTSvcs/bin/Application/clean) will kill the processes in the PID files. You would only need a clean script if you were running something like /etc/init.d/httpd -force stop (if such an option existed).
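To illustrate with a made-up ps listing (the clean entry point's command line can contain "httpd", e.g. via an argument like the PID-file path, so a bare `grep httpd` selects it too):

```shell
# Hypothetical ps output: two real Apache processes plus the agent's clean
# entry point, whose arguments happen to contain the string "httpd".
ps_output='root 29355 httpd -k start
root 29356 httpd -k start
root 30001 /opt/VRTSvcs/bin/Application/clean teamforge_app_server /var/run/httpd.pid'

# Original pipeline: also picks up the clean entry point (PID 30001).
bad_pids=$(printf '%s\n' "$ps_output" | grep httpd | grep -v grep | awk '{print $2}')

# Fixed pipeline: exclude the clean entry point before killing anything.
good_pids=$(printf '%s\n' "$ps_output" | grep httpd | grep -v grep \
    | grep -v /opt/VRTSvcs/bin/Application/clean | awk '{print $2}')

echo "without the extra grep -v:" $bad_pids
echo "with the extra grep -v:" $good_pids
```

The first pipeline selects 29355, 29356, and 30001; the second selects only 29355 and 29356.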
Mike
04-03-2011 08:25 AM
I had the clean script in place because originally I had trouble monitoring the actual processes, and Apache is the only thing in the stack of processes that writes a PID. Now that I know about the monitor program, I might make it more sophisticated, have it check all my processes, and see if that handles a failover without a problem. I'll also grep out the clean program in my kill script to make sure it can't match it. The processes that will potentially fail are Java, and my guess is that when they fail they'll need to be kill -9'ed to terminate... let me play a bit and see what the results are.
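A sketch of that more sophisticated monitor check (the process patterns in the usage note are placeholders, not the actual TeamForge process names):

```shell
# Hypothetical multi-process check: print 110 (ONLINE) only if the PID in
# the Apache PID file is alive AND every expected pattern (e.g. the Java
# components) has a matching process; otherwise print 100 (OFFLINE).
monitor_all() {
    pidfile=$1
    shift   # remaining args: full-command-line patterns for pgrep -f
    [ -f "$pidfile" ] || { echo 100; return; }
    kill -0 "$(cat "$pidfile")" 2>/dev/null || { echo 100; return; }
    for pat in "$@"; do
        pgrep -f "$pat" >/dev/null || { echo 100; return; }
    done
    echo 110
}
```

A MonitorProgram could then end with something like `exit $(monitor_all /var/run/httpd.pid jboss tomcat)`, substituting the real Java process patterns.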
04-03-2011 11:28 AM
Mike,
I think that did the trick. I've been testing all morning, and it looks like failover is working just fine. I'm not declaring total victory until testing completes... but I think I'm in good shape. Thanks a million.