How come the Application resource neglects the dependency constraint?

543700257
Level 3
The platform is Solaris 8 with Oracle 8i and SAP R/3, clustered with VCS for Solaris.
After several unexpected power-offs, VCS would not start up normally. When I forcibly started it, the top-level SAP resource came online while the resources it depends on, such as Oracle and lsnrctl, were offline. Taking it offline does not work either: the status stays at 'waiting for going offline'. This happens on both nodes of the cluster, so the same Application-type resource is online on both nodes simultaneously.

Can anyone tell me what the matter is?

Part of engine_A.log:

TAG_D 2005/01/26 21:48:45 (sapprd) VCS:13064:Agent is calling clean for resource (SAPapp) because the resource is up even after offline completed.
TAG_C 2005/01/26 21:50:56 VCS:10023:Agent Application not sending alive messages since Wed Jan 26 21:48:44 2005

TAG_B 2005/01/26 21:50:56 VCS:10008:Warning: Agent Application has faulted 6 times since Wed Jan 26 21:39:56 2005

TAG_B 2005/01/26 21:50:56 VCS:10009:Agent Application has faulted 6 times in less than 950 seconds -- Will not attempt to restart
Correct the problem and use haagent -start to start the agent

Eric_Hennessey1
Level 5
Employee Certified
Without knowing how your Application agent is configured to start, stop and monitor SAP, it's impossible to tell what's really happening. There's an agent available for SAP that has been developed, tested and fully supported by VERITAS. May I suggest you consider using that instead?

543700257
Level 3
Dear Eric Hennessey,

This is part of main.cf:


include "types.cf"
include "OracleTypes.cf"

cluster vcs-cluster (
UserNames = { admin = kErBn2JgMH282, dengsk = "803K4AWvIrgDE" }
Administrators = { admin, dengsk }
Operators = { dengsk }
CounterInterval = 5
)

system sapprd (
)

system sapqas (
)

group oradb (
SystemList = { sapprd = 1, sapqas = 2 }
AutoStartList = { sapprd }
Administrators = { dengsk }
Operators = { dengsk }
)

Application SAPapp (
StartProgram = "/usr/bin/startPRD"
StopProgram = "/usr/bin/stopPRD"
CleanProgram = "/usr/bin/stopPRD"
PidFiles = { "/export/home/r3padm/pid_sap_dw",
"/export/home/r3padm/pid_sap_co",
"/export/home/r3padm/pid_sap_se",
"/export/home/r3padm/pid_sap_ms" }
)

Application lsnrctl (
User = root
StartProgram = "/usr/bin/startlsn"
StopProgram = "/usr/bin/stoplsn"
CleanProgram = "/usr/bin/stoplsn"
PidFiles = { "/export/home/r3padm/pid_lsn" }
)

Disk datadisk (
Partition = c3t5d0s2
)

Disk datadisk1 (
Partition = c3t5d1s2
)

Disk datadisk2 (
Partition = c3t5d2s2
)

IPMultiNIC ip (
Address = "10.28.2.100"
NetMask = "255.255.255.0"
MultiNICResName = nica
)

Mount datafile (
MountPoint = "/sapmnt/R3P"
BlockDevice = "/dev/dsk/c3t5d0s3"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile1 (
MountPoint = "/oracle/R3P"
BlockDevice = "/dev/dsk/c3t5d2s0"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile2 (
MountPoint = "/oracle/R3P/origlogA"
BlockDevice = "/dev/dsk/c3t5d0s0"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile3 (
MountPoint = "/oracle/R3P/mirrlogA"
BlockDevice = "/dev/dsk/c3t5d1s0"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile4 (
MountPoint = "/usr/sap/R3P"
BlockDevice = "/dev/dsk/c3t5d0s4"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile5 (
MountPoint = "/oracle/R3P/origlogB"
BlockDevice = "/dev/dsk/c3t5d0s1"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile6 (
MountPoint = "/oracle/R3P/sapreorg"
BlockDevice = "/dev/dsk/c3t5d0s5"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile7 (
MountPoint = "/oracle/R3P/mirrlogB"
BlockDevice = "/dev/dsk/c3t5d1s1"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

Mount datafile8 (
MountPoint = "/oracle/R3P/saparch"
BlockDevice = "/dev/dsk/c3t5d1s3"
FSType = ufs
MountOpt = rw
FsckOpt = "-y"
)

MultiNICA nica (
Device @sapprd = { eri0 = "10.28.2.101", qfe2 = "10.28.2.101" }
Device @sapqas = { ce0 = "10.28.2.102", qfe2 = "10.28.2.102" }
NetMask = "255.255.255.0"
)

Oracle Oracle (
Sid = R3P
Owner = orar3p
Home = "/oracle/R3P/817_64"
Pfile = "/oracle/R3P/817_64/dbs/initR3P.ora"
EnvFile = "/oracle/R3P/.dbenv_sapprd.csh"
)

Oracle requires datafile
Oracle requires datafile1
Oracle requires datafile2
Oracle requires datafile3
Oracle requires datafile4
Oracle requires datafile5
Oracle requires datafile6
Oracle requires datafile7
Oracle requires datafile8
SAPapp requires lsnrctl
datafile requires datadisk
datafile1 requires datadisk2
datafile2 requires datadisk
datafile2 requires datafile1
datafile3 requires datadisk1
datafile3 requires datafile1
datafile4 requires datadisk
datafile5 requires datadisk
datafile5 requires datafile1
datafile6 requires datadisk
datafile6 requires datafile1
datafile7 requires datadisk1
datafile7 requires datafile1
datafile8 requires datadisk1
datafile8 requires datafile1
ip requires nica
lsnrctl requires Oracle
lsnrctl requires ip

---------------------------------------------------------------------

The /usr/bin/startPRD script:

su - r3padm -c "startsap r3"
DW=dw.sapR3P_DVEBMGS00
CO=co.sapR3P_DVEBMGS00
SE=se.sapR3P_DVEBMGS00
MS=ms.sapR3P_DVEBMGS00
PID_SAPSTART=`ps -ef|grep sapstart|grep -v grep|awk '{print $2}'`
PID_SAP_DW=`ps -ef|grep $PID_SAPSTART|grep -v grep|grep -v sapstart|grep $DW|awk '{print $2}'`
PID_SAP_CO=`ps -ef|grep $PID_SAPSTART|grep -v grep|grep -v sapstart|grep $CO|awk '{print $2}'`
PID_SAP_SE=`ps -ef|grep $PID_SAPSTART|grep -v grep|grep -v sapstart|grep $SE|awk '{print $2}'`
PID_SAP_MS=`ps -ef|grep $PID_SAPSTART|grep -v grep|grep -v sapstart|grep $MS|awk '{print $2}'`

echo $PID_SAP_DW > /export/home/r3padm/pid_sap_dw
echo $PID_SAP_CO > /export/home/r3padm/pid_sap_co
echo $PID_SAP_SE > /export/home/r3padm/pid_sap_se
echo $PID_SAP_MS > /export/home/r3padm/pid_sap_ms

---------------------------------------------------------------------

The /usr/bin/stopPRD script:

su - r3padm -c "stopsap r3"

----------------------------------------------------------------------

So I monitor the resource through PID files. I think I should have left the "clean" field blank.

The SAP enterprise agent would certainly be preferable if I had a copy of it. I wanted to use the Application agent temporarily, and it has been working for some time, so I really did not expect such a fault. Could you tell me how to improve my solution?

Thanks

Roger_Davis
Level 3
Hi, I've seen things like this in the past. From your engine log, the issue you are having appears to be two-fold. Firstly, your bundled Application agent is not starting, so it cannot perform any operations on your Application-type resources. Check its status with haagent -display Application on both cluster nodes to see whether it is started or stopped on each. Also make sure the Application agent directories in /opt/VRTSvcs/bin/Application are the same on both nodes; if the agent starts on one node and not on the other, copy the good files over from the working node.

The broken online dependency is clear-cut: you can always break a dependency by manually onlining processes outside of VCS. The VCS engine normally onlines resources in dependency order and cannot break its own dependencies internally. What would normally happen is that VCS would detect this and attempt to clean, but your agent is broken, so this does not happen. One quick fix I would suggest, as a one-off VCS cleanup, is to use the GUI to copy the resource reported as online, delete it, stop the process from the command line, then paste the resource back into the group. Ensure the settings are correct (newer versions of the GUI default all pasted resources to critical, whether they were before or not; one to watch out for!).

With regard to SAP agents versus the Application agent: the dedicated agents normally provide easy-to-manage online/offline/monitor/clean entry points, but you should be able to use the Application agent easily, as you have been. We do a similar thing for WebLogic: we use Application and hand-crafted our own specialised monitoring scripts. Please let me know how you get on.

543700257
Level 3
Dear Roger Davis,
Thanks for your response.
I reapplied VCS on the system, and now the symptom is:
although none of the resources (including SAPapp) has been started manually, as soon as VCS starts up, the SAPapp resource is reported online, and it cannot be taken offline manually.
Part of engine_A.log:
TAG_D 2005/04/09 13:26:26 VCS:10322:System sapprd (Node '1') changed state from LOCAL_BUILD to RUNNING

TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/Application/ApplicationAgent for resource type Application successfully started at Sat Apr 9 13:26:27 2005

TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/Disk/DiskAgent for resource type Disk successfully started at Sat Apr 9 13:26:27 2005

TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/IPMultiNIC/IPMultiNICAgent for resource type IPMultiNIC successfully started at Sat Apr 9 13:26:27 2005

TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/Mount/MountAgent for resource type Mount successfully started at Sat Apr 9 13:26:27 2005

TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/MultiNICA/MultiNICAAgent for resource type MultiNICA successfully started at Sat Apr 9 13:26:27 2005
TAG_D 2005/04/09 13:26:27 VCS:10016:Agent /opt/VRTSvcs/bin/Oracle/OracleAgent for resource type Oracle successfully started at Sat Apr 9 13:26:27 2005

TAG_C 2005/04/09 13:26:27 VCS:10518:IpmHandle::send peer exited errno = 32
TAG_C 2005/04/09 13:26:27 VCS:10517:IpmHandle::send _write_errno is 4
TAG_C 2005/04/09 13:26:27 VCS:10526:IpmHandle::recv peer exited errno 131
TAG_E 2005/04/09 13:26:28 VCS:10304:Resource lsnrctl (Owner: unknown, Group: oradb) is offline on sapprd (First probe)
TAG_E 2005/04/09 13:26:28 VCS:10297:Resource SAPapp (Owner: unknown, Group: oradb) is online on sapprd (First probe)
TAG_D 2005/04/09 13:26:28 VCS:10233:Clearing Restart attribute for group oradb on all nodes
TAG_E 2005/04/09 13:26:29 VCS:10304:Resource Oracle (Owner: unknown, Group: oradb) is offline on sapprd (First probe)
TAG_E 2005/04/09 13:26:29 VCS:10297:Resource datafile (Owner: unknown, Group: oradb) is online on sapprd (First probe)
TAG_D 2005/04/09 13:26:29 VCS:10233:Clearing Restart attribute for group oradb on all nodes
TAG_E 2005/04/09 13:26:29 VCS:10297:Resource datafile1 (Owner: unknown, Group: oradb) is online on sapprd (First probe)
TAG_D 2005/04/09 13:26:29 VCS:10233:Clearing Restart attribute for group oradb on all nodes
TAG_E 2005/04/09 13:26:29 VCS:10297:Resource datafile4 (Owner: unknown, Group: oradb) is online on sapprd (First probe)
TAG_D 2005/04/09 13:26:29 VCS:10233:Clearing Restart attribute for group oradb on all nodes
TAG_E 2005/04/09 13:26:29 VCS:10304:Resource datafile8 (Owner: unknown, Group: oradb) is offline on sapprd (First probe)

Please pay attention to the SAPapp line (VCS:10297): SAPapp is already reported online when the first probe occurs. I am sure that at that time the SAP processes were not in the ps list and no PIDs had been written to the monitored target files such as '/export/home/r3padm/pid_sap_dw'.

As for the Application agent, I am not sure that it is broken, since I can manually online and offline another Application resource, lsnrctl, normally.

Is the problem in the monitor part? I have no idea what the real cause is.

Carsten_Hennig
Level 4
Certified
Hi, please check the contents of your PID files before firing up VCS. I would also suggest invalidating the contents of the PID files after stopping SAP. You could do that, for example, by writing "999999" into the PID files.

Regards, Carsten

Roger_Davis
Level 3
Hi 543700257,

You say you've reinstalled VCS? Well, you certainly do not seem to be getting Agent faults anymore, so this appears to have cured one issue :)

Just wondered why you are using Application for your Oracle listener when you have the OracleTypes file available (either the Sqlnet or Netlsnr type, depending on VCS agent version)?

You can also set the resource attribute 'User' for your Application resources inside VCS (as your lsnrctl resource already does), which means you won't need the su inside your start script.

You could also try to set the AutoStart attribute for the entire service group to null, this means that the group will not attempt to online during cluster startup and can help with your troubleshooting, although admittedly, this may not be much help in this case.

When VCS starts, it probes (monitors) all resources using the method(s) specified in the resource config. You have four PID files specified, and if the contents of the PID files match four running processes with those numbers, the resource will be declared online (even if they are the wrong processes, of course). Good practice suggests that the clean and stop scripts either remove these files or purge their contents after each resource stop. During normal operations with VCS started, the monitor cycle does not start until the start script exits. I would also suggest that you check the contents of the PID files against the ps table and see if they match anything.
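
The purge suggested above could look roughly like the following. This is only a sketch, not a tested stop script: the function name invalidate_pid_files and the SENTINEL variable are made up for illustration, and 999999 is simply the placeholder value proposed earlier in this thread, assumed to be higher than any real PID on the system.

```shell
#!/bin/sh
# Sketch only: overwrite each PID file with a sentinel value so that a
# later probe cannot match a stale or blank entry.
# SENTINEL is an assumption (the value suggested in this thread).
SENTINEL=999999

# invalidate_pid_files FILE...  (hypothetical helper name)
invalidate_pid_files() {
    for f in "$@"; do
        echo "$SENTINEL" > "$f" || return 1
    done
}

# Possible use at the end of /usr/bin/stopPRD, after SAP has stopped:
#   su - r3padm -c "stopsap r3"
#   invalidate_pid_files /export/home/r3padm/pid_sap_dw \
#                        /export/home/r3padm/pid_sap_co \
#                        /export/home/r3padm/pid_sap_se \
#                        /export/home/r3padm/pid_sap_ms
```

Writing a sentinel rather than deleting the files keeps them present with a value that should never match a live process.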

You say you can't offline once online - what does your stop script do - for example pkills or a more graceful method?

So, to recap: stop VCS, check and record the contents of the PID files, then start VCS (/etc/init.d/S99vcs start) and see if you get the same issue. If you do, run ps -ef and look for the process numbers from the PID files to see if they match. If so, that's your issue; if not, it could be something else. Please let me know.
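
The PID file check in the recap can be scripted along these lines. It is a rough sketch under assumptions: pid_file_status is a hypothetical helper, not part of VCS, and it only reports what the file contains and whether a process with that number exists. It does not reproduce how the Application agent itself interprets the file.

```shell
#!/bin/sh
# Sketch: classify one PID file as missing, blank, invalid, stale or running.
# pid_file_status is a hypothetical helper, not a VCS command.
pid_file_status() {
    file=$1
    if [ ! -r "$file" ]; then
        echo missing
        return 0
    fi
    # Take the first line and strip spaces and tabs.
    pid=`sed -n 1p "$file" | tr -d ' \t'`
    case "$pid" in
        '')       echo blank ;;    # empty file, or only a newline
        *[!0-9]*) echo invalid ;;  # contains something other than digits
        *)
            # kill -0 only tests for existence; fall back to ps in case
            # the process belongs to another user.
            if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
                echo running
            else
                echo stale
            fi
            ;;
    esac
}

# Possible use before starting VCS:
#   for f in /export/home/r3padm/pid_sap_dw /export/home/r3padm/pid_sap_co \
#            /export/home/r3padm/pid_sap_se /export/home/r3padm/pid_sap_ms; do
#       echo "$f: `pid_file_status $f`"
#   done
```

A "running" result only means some process has that number; whether it is the right SAP process still needs a look at ps -ef.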

543700257
Level 3
Roger Davis and czhe01,

Thanks for your help.

I haven't done any testing on VCS since the last test because it is a 24x7 production system. I will work on it when I do the scheduled offline backup at the end of the month.

Both of you referred to the PID files' content. I read those files and they are certainly not the same as before: when I cat them, the output is blank. So I suspect VCS reads this 'blank' as '0'.

If that is the case, the reasoning would be:

When the VCS Application agent starts up, it probes the status of resources before onlining anything. It reads the PID files and gets a 'blank' result, which it perhaps takes as '0'. VCS then looks at the process list and of course finds PID '0' there, so the status of the Application resource SAPapp is returned as online.


To Roger Davis:

The Oracle agent does include a Sqlnet resource type. In our system, the production application runs on node A; the second node B, which normally stands by, also plays another role (quality system) in the SAP landscape. That is, another SAP application and Oracle database run on it. So our lsnrctl resource contains statements that offline the local listener and online the other listener when needed. This is the original scheme that the SAP supplier provided. I know VCS is powerful enough to handle two or more service groups, but this method seems simpler than configuring group dependencies.

I will try again and post the result once it is working.

543700257
Level 3
My problem is solved.

It was indeed due to the 'blank' PID files. After I wrote something into these files, it was OK again.

Thanks for all your help!
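
For anyone hitting the same problem, one way to keep the start script from ever leaving a blank PID file is sketched below. It is an illustration only: write_pid_file and SENTINEL are hypothetical names, and 999999 is the placeholder value suggested earlier in the thread, assumed never to match a real process.

```shell
#!/bin/sh
# Sketch: write a PID to a file, or a sentinel when the lookup returned
# nothing usable, so the file is never left blank.
# SENTINEL is an assumption (the value suggested in this thread).
SENTINEL=999999

# write_pid_file FILE PID  (hypothetical helper name)
write_pid_file() {
    file=$1
    pid=$2
    case "$pid" in
        # Empty, or anything that is not a single run of digits
        # (for example several PIDs matched at once).
        ''|*[!0-9]*) echo "$SENTINEL" > "$file" ;;
        *)           echo "$pid" > "$file" ;;
    esac
}

# Possible use in /usr/bin/startPRD in place of the plain echo lines:
#   write_pid_file /export/home/r3padm/pid_sap_dw "$PID_SAP_DW"
```

With this guard, a failed ps/grep lookup shows up as the sentinel in the file instead of an empty line.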