03-31-2011 09:48 AM
I recently upgraded to VCS 5.1, and I'm noticing that when I bring my application offline, the clean script abnormally exits with exit code (0x9). Another effect, which is what made me notice this in the first place, is that when I bring my application offline, VCS doesn't actually recognize that it has gone offline and doesn't fail it over. I can switch the application manually between both nodes just fine.
Has anyone experienced this behavior before and know what might be causing it? Let me know what logs might be helpful and I can provide them. The VCS clean script is very simple.
03-31-2011 01:10 PM
Another effect (potentially related) is that when I offline my application from the command line (/etc/init.d/app stop), VCS never realizes it's offline, even though the PID I've specified to monitor has been removed. I've noticed that there are different options in 5.1 than in 5.0.3 (like confidence level), and I don't know if that has anything to do with it.
03-31-2011 03:04 PM
It depends on what your monitor script is observing... if your app is offline, the monitor script should detect that. Can you paste the monitoring script?
G
04-01-2011 02:27 AM
Information that would be helpful is an engine log (/var/VRTSvcs/log/engine_A.log) extract from when the offline of the resource was initiated to the point the clean failed, and also an extract of your resource from main.cf (in /etc/VRTSvcs/conf/config).
The problem could be that your offline script exits too quickly, so the PID file is still there when the monitor runs. You could temporarily try putting a sleep at the end of the offline script; if that works, then code something more sophisticated.
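A rough sketch of the "something more sophisticated" idea: after stopping the app, wait (up to a timeout) for the PID file to disappear instead of sleeping a fixed amount. The helper name and paths here are examples, not from the actual config:

```shell
# Example helper for the end of an offline script: poll until the PID file
# is gone, or give up after TIMEOUT seconds. Returns 0 once the file is
# removed, non-zero if it is still present after the timeout.
wait_for_pidfile_removal() {
    file=$1
    timeout=$2   # seconds
    i=0
    while [ "$i" -lt "$timeout" ]; do
        [ ! -f "$file" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    [ ! -f "$file" ]   # final check decides the return code
}
```

The offline script would then end with something like `/etc/init.d/httpd stop` followed by `wait_for_pidfile_removal /var/run/httpd.pid 30`.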
Mike
04-01-2011 12:37 PM
I have a support case open now, and I've narrowed it down to specific, repeatable behavior. It has to do with the clean script... but the clean script is written correctly.
I've simplified things and am only dealing with Apache now. The scripts are below. I've modified the clean script to be a little smarter so it isn't trying to remove PIDs that are already gone, and things like that.
If I offline Apache with the init script, or by causing a fault (rm pid), Veritas complains that the clean script is not happy. If I run the clean script manually, I get a return code of 0.
3 scripts:
2011/04/01 15:24:57 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:24:57 VCS ERROR V-16-2-13069 (linopstfg02.prod.domain.com) Resource(teamforge_app_server) - clean failed.
2011/04/01 15:25:58 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25595
0
==============================================
2011/04/01 15:25:58 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:26:59 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25642
0
==============================================
2011/04/01 15:26:59 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:28:00 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25739
0
==============================================
2011/04/01 15:28:00 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:29:02 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25847
0
==============================================
2011/04/01 15:29:02 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:30:03 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
25910
0
==============================================
2011/04/01 15:30:03 VCS ERROR V-16-2-13040 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/01 15:30:12 VCS INFO V-16-1-50135 User vcsadmin fired command: MSG_RES_PROBE teamforge_app_server linopstfg02.prod.domain.com from 10.17.32.130
2011/04/01 15:30:12 VCS INFO V-16-10031-504 (linopstfg02.prod.domain.com) Application:teamforge_app_server:clean:Executed /usr/local/bin/apachekill as user root
2011/04/01 15:30:22 VCS WARNING V-16-10031-542 (linopstfg02.prod.domain.com) Application:teamforge_app_server:clean:PidFile </var/run/httpd.pid> does not exist, process will not be killed
2011/04/01 15:30:23 VCS INFO V-16-2-13716 (linopstfg02.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (clean)
==============================================
hi <----this is me echoing hi and exiting 0 immediately; basically disabling the clean script from actually doing anything it's supposed to do
==============================================
2011/04/01 15:30:23 VCS INFO V-16-2-13078 (linopstfg02.prod.domain.com) Resource(teamforge_app_server) - clean completed successfully after 6 failed attempts.
2011/04/01 15:30:23 VCS INFO V-16-1-10307 Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) is offline on linopstfg02.prod.domain.com (Not initiated by VCS)
2011/04/01 15:30:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS INFO V-16-1-10305 Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) is offline on linopstfg02.prod.domain.com (VCS initiated)
2011/04/01 15:30:24 VCS ERROR V-16-1-10205 Group VM_VCS_Teamforge is faulted on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS NOTICE V-16-1-10446 Group VM_VCS_Teamforge is offline on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS INFO V-16-1-10493 Evaluating linopstfg01.prod.domain.com as potential target node for group VM_VCS_Teamforge
2011/04/01 15:30:24 VCS INFO V-16-1-10493 Evaluating linopstfg02.prod.domain.com as potential target node for group VM_VCS_Teamforge
2011/04/01 15:30:24 VCS INFO V-16-1-50010 Group VM_VCS_Teamforge is online or faulted on system linopstfg02.prod.domain.com
2011/04/01 15:30:24 VCS NOTICE V-16-1-10301 Initiating Online of Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg01.prod.domain.com
2011/04/01 15:30:32 VCS INFO V-16-1-10298 Resource vcs_teamforge_vip (Owner: Unspecified, Group: VM_VCS_Teamforge) is online on linopstfg01.prod.domain.com (VCS initiated)
2011/04/01 15:30:32 VCS NOTICE V-16-1-10301 Initiating Online of Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) on System linopstfg01.prod.domain.com
2011/04/01 15:30:32 VCS INFO V-16-10031-504 (linopstfg01.prod.domain.com) Application:teamforge_app_server:online:Executed /usr/local/bin/apachestart as user root
2011/04/01 15:30:33 VCS INFO V-16-2-13716 (linopstfg01.prod.domain.com) Resource(teamforge_app_server): Output of the completed operation (online)
==============================================
Starting httpd: [ OK ]
==============================================
2011/04/01 15:30:33 VCS INFO V-16-1-10298 Resource teamforge_app_server (Owner: Unspecified, Group: VM_VCS_Teamforge) is online on linopstfg01.prod.domain.com (VCS initiated)
2011/04/01 15:30:33 VCS NOTICE V-16-1-10447 Group VM_VCS_Teamforge is online on system linopstfg01.prod.domain.com
2011/04/01 15:30:33 VCS NOTICE V-16-1-10448 Group VM_VCS_Teamforge failed over to system linopstfg01.prod.domain.com
04-01-2011 12:45 PM
main.cf export
04-01-2011 03:44 PM
VCS comes bundled with an Apache agent. Can you not use that agent rather than the Application agent?
Mike
04-01-2011 03:46 PM
Hello,
Looking at the very first stage of the script: VCS doesn't understand exit codes of 0 and 1; you need to check for this in the monitor script.
For VCS, success is exit code 110, while failure is exit code 100. There are more levels, from 101-109, which define the confidence level; however, for simplicity you can use 100 and 110 instead of 0 and 1.
Now, I understand that at the Unix level the exit codes would be 0 and 1; however, they should be fed to VCS in the terms VCS understands. Now, in the resource definition in main.cf:
Application teamforge_app_server (
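The exit-code convention above can be sketched as a small shell helper (hypothetical; the PID-file path in the usage note is just an example):

```shell
# Hypothetical monitor helper following the VCS convention described above:
# return 110 when the monitored process is alive (ONLINE), 100 otherwise
# (OFFLINE). Codes 101-109 would express intermediate confidence levels.
check_state() {
    pidfile=$1
    # kill -0 sends no signal; it only tests whether the PID exists.
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 110   # ONLINE
    fi
    return 100       # OFFLINE
}
```

A MonitorProgram script could then end with `check_state /var/run/httpd.pid; exit $?`.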
04-01-2011 05:53 PM
re: VCS comes bundled with an Apache agent - can you not use this agent, rather than using Application agent?
I guess I could. The only reason I'm monitoring the Apache PID is that I figured it was the easiest thing to do. The application isn't just Apache; it's actually Apache plus four Java processes, but I had trouble trying to check the Java processes.
re: the monitor program... I don't have a monitor program specified (none of our VCS-enabled resources in other clusters do). We usually just have the start/stop/clean programs and check PIDs. Is the monitor program something the Application agent relies on more in 5.1, since this setup used to work in 5.0.3? I'll create a monitor program and test to see if that improves the situation.
04-01-2011 06:43 PM
I just tried the monitor program, and it caused a clean to run and bring down the application (without faulting the node). This is strange, because the monitor program should have responded saying that Apache was alive (100 code).
Below is the script, and the return code of a run on a node that has it running.
04-02-2011 05:51 AM
Your exit codes are the wrong way round:
110 should be ONLINE
100 should be OFFLINE
Mike
04-02-2011 07:01 AM
I reversed the codes... it doesn't look like that made a difference. I'm still getting the abnormal clean exit code. I also added exit 110 to the end of the clean script in case Veritas was looking for that to report a different code, but that didn't seem to be the case either.
04-02-2011 01:27 PM
Looking at your logs, this looks like a bug in the Application agent, based on the log entry:
Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
This is /opt/VRTSvcs/bin/Application/clean itself abnormally terminating, not your clean script /usr/local/bin/apachekill. Regardless of what your script does, the Application clean entry point should not abnormally terminate. I would push back on support. Alternatively, you could try using Agent Builder (see https://sort.symantec.com/agents/detail/171) rather than the Application agent.
Mike
04-02-2011 04:13 PM
I've noticed some really interesting behavior... I shortened my script further, and now I'm just killing httpd and removing the PID file, then exiting 0.
What I've noticed is that the clean script (the Perl script that Veritas provides) is abnormally exiting with the SAME signal that I'm using to kill Apache. Look at the logs below: the first set terminates due to signal 9 (kill -9), exit code 0x9, whereas the second terminates abnormally due to signal 15 (kill -TERM), exit code 0xf (15).
The question is... why is the Perl program killing itself when my clean script executes (which is what it seems to be doing)?
[root@linopstfg02 bin]# vi apachekill
#!/bin/bash
set -x
kill -9 `ps -ef | grep httpd | grep -v grep | awk {'print $2'}`
rm /var/run/httpd.pid
echo hi
exit 0
[root@linopstfg02 bin]# ./apachekill
++ ps -ef
++ grep httpd
++ grep -v grep
++ awk '{print $2}'
+ kill -9 29355 29358 29359 29360 29361 29362 29363 29364 29365
+ rm /var/run/httpd.pid
+ echo hi
hi
+ exit 0
[root@linopstfg02 bin]# 2011/04/02 19:08:07 VCS ERROR V-16-2-13067 Thread(4153691024) Agent is calling clean for resource(teamforge_app_server) because the resource became OFFLINE unexpectedly, on its own.
2011/04/02 19:08:08 VCS WARNING V-16-10031-13196 Thread(4153691024) script (/opt/VRTSvcs/bin/Application/clean) terminated due to signal (9)
2011/04/02 19:08:08 VCS ERROR V-16-2-13040 Thread(4153691024) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0x9).
2011/04/02 19:08:08 VCS ERROR V-16-2-13069 Thread(4153691024) Resource(teamforge_app_server) - clean failed.
[root@linopstfg02 bin]#
[root@linopstfg02 bin]# vi apachekill
#!/bin/bash
set -x
kill -TERM `ps -ef | grep httpd | grep -v grep | awk {'print $2'}`
rm /var/run/httpd.pid
echo hi
exit 0
"apachekill" 25L, 359C written
[root@linopstfg02 bin]# ./apachekill
++ ps -ef
++ grep httpd
++ grep -v grep
++ awk '{print $2}'
+ kill -TERM 29813 29816 29817 29818 29819 29820 29821 29822 29823
+ rm /var/run/httpd.pid
+ echo hi
hi
+ exit 0
[root@linopstfg02 bin]# 2011/04/02 19:08:59 VCS ERROR V-16-2-13067 Thread(4153691024) Agent is calling clean for resource(teamforge_app_server) because the resource became OFFLINE unexpectedly, on its own.
2011/04/02 19:09:00 VCS WARNING V-16-10031-13196 Thread(4153691024) script (/opt/VRTSvcs/bin/Application/clean) terminated due to signal (15)
2011/04/02 19:09:00 VCS ERROR V-16-2-13040 Thread(4153691024) Resource(teamforge_app_server): Program(/opt/VRTSvcs/bin/Application/clean) was abnormally terminated with the exit code(0xf).
2011/04/02 19:09:00 VCS ERROR V-16-2-13069 Thread(4153691024) Resource(teamforge_app_server) - clean failed.
04-02-2011 04:53 PM
Confirming this... I changed my "clean script" to shut down Apache gracefully (/etc/init.d/httpd stop). That isn't really useful, because if the clean script needs to run it's because the graceful stop didn't work, but it does allow faulting to occur and Apache to fail over between the nodes.
So there's definitely something funny about the Veritas Perl script and running a kill command in a clean script in VCS 5.1, because these same scripts worked in 5.0.3 :(
04-03-2011 02:05 AM
I have figured out what your issue probably is: you are killing any process matching "httpd", but the clean entry point also matches this, so your script is killing /opt/VRTSvcs/bin/Application/clean. If you exclude it (e.g., add "grep -v /opt/VRTSvcs/bin/Application/clean"), the script should work.
But actually, you don't need a clean script, as the agent's clean (i.e., the code in /opt/VRTSvcs/bin/Application/clean) will kill the processes in the PID files. You would only need a clean script if you were running something like /etc/init.d/httpd -force stop (if such an option existed).
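To illustrate with a made-up ps listing (the clean entry point's command line can contain "httpd", e.g. via an argument like the PID-file path, so a bare `grep httpd` selects it too):

```shell
# Hypothetical ps output: two real Apache processes plus the agent's clean
# entry point, whose arguments happen to contain the string "httpd".
ps_output='root 29355 httpd -k start
root 29356 httpd -k start
root 30001 /opt/VRTSvcs/bin/Application/clean teamforge_app_server /var/run/httpd.pid'

# Original pipeline: also picks up the clean entry point (PID 30001).
bad_pids=$(printf '%s\n' "$ps_output" | grep httpd | grep -v grep | awk '{print $2}')

# Fixed pipeline: exclude the clean entry point before killing anything.
good_pids=$(printf '%s\n' "$ps_output" | grep httpd | grep -v grep \
    | grep -v /opt/VRTSvcs/bin/Application/clean | awk '{print $2}')

echo "without the extra grep -v:" $bad_pids
echo "with the extra grep -v:" $good_pids
```

The first pipeline selects 29355, 29356, and 30001; the second selects only 29355 and 29356.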
Mike
04-03-2011 08:25 AM
I had the clean script in place because originally I had trouble monitoring the actual processes, and Apache is the only thing in the stack of processes that writes a PID. Now that I know about the monitor program, I might make it more sophisticated, have it check all my processes, and see if that handles a failover without a problem. I'll also grep out the clean program in my kill script to make sure it can't match it. The processes that will potentially fail are Java, and my guess is that when they fail they'll need to be kill -9'ed to terminate... let me play a bit and see what the results are.
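A sketch of that more sophisticated monitor check (the process patterns in the usage note are placeholders, not the actual TeamForge process names):

```shell
# Hypothetical multi-process check: print 110 (ONLINE) only if the PID in
# the Apache PID file is alive AND every expected pattern (e.g. the Java
# components) has a matching process; otherwise print 100 (OFFLINE).
monitor_all() {
    pidfile=$1
    shift   # remaining args: full-command-line patterns for pgrep -f
    [ -f "$pidfile" ] || { echo 100; return; }
    kill -0 "$(cat "$pidfile")" 2>/dev/null || { echo 100; return; }
    for pat in "$@"; do
        pgrep -f "$pat" >/dev/null || { echo 100; return; }
    done
    echo 110
}
```

A MonitorProgram could then end with something like `exit $(monitor_all /var/run/httpd.pid jboss tomcat)`, substituting the real Java process patterns.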
04-03-2011 11:28 AM
Mike,
I think that did the trick. I've been testing all morning, and it looks like failover is working just fine. I'm not declaring total victory until testing completes... but I think I'm in good shape. Thanks a million.