Re: VCS 4.1 and Solaris 9 Issue

Sameer_Nirmal · ‎11-12-2007

Hello,

We are having an issue with a 3-node VCS 4.1 cluster running on Solaris 9.

The messages we are getting in the messages file are:
Nov 12 10:33:31 <hostname> AgentFramework[19082]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(<Diskgroup>) - monitor procedure did not complete within the expected time.
Nov 12 10:33:31 <hostname> AgentFramework[19082]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(<Diskgroup>) - monitor procedure did not complete within the expected time.
Nov 12 10:33:31 <hostname> Had[18990]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 (<hostname>) Resource(<Diskgroup>) - monitor procedure did not complete within the expected time.
Nov 12 10:33:31 <hostname> Had[18990]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 (<hostname>) Resource(<Diskgroup>) - monitor procedure did not complete within the expected time.

I guess the "AgentFramework" messages come from the Oracle resource agent and the HAD messages come from the VCS bundled "DiskGroup" agent in this case. I ran "vxexplore" on one of the nodes and am trying to analyze the output.
It seems I could get more detailed information by enabling debug on the Oracle and DiskGroup agents.
Can anyone here help me with how to turn on debug for these agents?
And, if needed, how do I increase their timeout values from the defaults?

Preliminary observation shows that the monitor scripts log these messages whenever there is heavy activity on the Oracle database. The diskgroups are stored on an EMC DMX box.

Let me know if more information is needed.

Thanks in advance.

Gene_Henriksen · ‎11-21-2007

The default monitor timeout is 60 seconds. You should try to find out why the "vxdg list" command used to monitor a DG takes longer than 60 seconds. Perhaps your HBA is not functioning as well as it should, or your CPU is being flooded with work. I would suspect the SAN area rather than the CPU, since you do not have the problem with other agents such as Mount.
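
One way to check this from the OS side is to time the command outside of VCS and log how long each run takes. The following is only a sketch: the function name time_cmd, the 30-second warning threshold, and the log path are assumptions made for illustration, and it expects ksh or bash rather than the Solaris 9 /bin/sh.

```shell
#!/bin/ksh
# Sketch only: run a command, measure how long it takes, and flag slow runs.
# time_cmd and the thresholds used below are illustrative, not part of VCS.

# Usage: time_cmd WARN_SECONDS COMMAND [ARGS...]
time_cmd() {
    warn_secs=$1
    shift
    # ksh and bash keep a running SECONDS counter; `date +%s` is only a
    # fallback for shells without it (Solaris 9 date does not support %s).
    start=${SECONDS:-$(date +%s)}
    "$@" >/dev/null 2>&1
    rc=$?
    end=${SECONDS:-$(date +%s)}
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$warn_secs" ]; then
        status=SLOW
    else
        status=OK
    fi
    # Timestamp in the same style as syslog, to ease comparison with messages.
    echo "$(date '+%b %d %H:%M:%S') elapsed=${elapsed}s rc=$rc $status"
}

# Example (commented out so the sketch can be sourced safely):
# sample "vxdg list" once a minute and warn at 30 s, half the 60 s default.
# while :; do time_cmd 30 vxdg list; sleep 60; done >> /var/tmp/vxdg_timing.log
```

If the SLOW entries line up with the times of the V-16-1-13027 errors, that points at the command itself being slow rather than at the agent. If raising the limit turns out to be necessary as a stopgap, MonitorTimeout is a type-level attribute and can, to my knowledge, be changed with "hatype -modify DiskGroup MonitorTimeout <seconds>" after making the configuration writable with "haconf -makerw".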

Sameer_Nirmal · ‎11-21-2007

Thanks for your response Gene!

Yes, I agree. We are trying to find the cause and suspect something is wrong on the storage side, or maybe on the Oracle side. Since the problem is intermittent, we are going to enable agent debug (DiskGroup/Application/Oracle) to get more insight. Looking at the OS stats, it seems that high "wio" occurs during specific intervals of time, which may be when we see the monitor errors. The system load average seems okay (around 7 for a V1280 server in the 3-node cluster).

The EMC folks are saying everything is all right on the SAN end. We are trying to gather information that would let us show it is a SAN problem, maybe a RAID issue, etc. Do you know of anything we can do from the OS side to get more insight into the issue?

We have asked the DBA to run STATSPACK, but we suspect the report will just show high "wio" or "waiting".
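
One thing that can be done from the OS side is to count the monitor-timeout messages per minute, so they can be lined up against the "wio" figures for the same intervals. This is a rough sketch, not a tested tool: the function name summarise_timeouts is made up for illustration, and on Solaris it may be safer to use nawk in place of awk.

```shell
# Sketch: count VCS monitor-timeout messages (V-16-1-13027) per minute.
# Reads syslog-format text on stdin; summarise_timeouts is an illustrative name.
summarise_timeouts() {
    awk '/V-16-1-13027/ {
        # Fields 1-3 are month, day and HH:MM:SS; keep only HH:MM.
        minute = $1 " " $2 " " substr($3, 1, 5)
        count[minute]++
    }
    END {
        for (m in count) print m, count[m]
    }' | sort
}

# Example (commented out; /var/adm/messages is the usual Solaris location):
# summarise_timeouts < /var/adm/messages
```

For the four lines quoted at the top of this thread it would print "Nov 12 10:33 4". Note that the sort is a plain text sort, so it is not chronological across months. The resulting minutes can then be compared with "sar -u" or "iostat -xn" samples taken over the same period.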
