Solved: Hello, Looking at new

bonny6 · ‎09-15-2010

Hi all,

I recently ran into an Oracle issue. I'm getting the following error message:

VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.

My question is: why does this error appear?

It is also causing a failover of the servers, and the database changes to the FAULTED state.

The DBA's solution was to adjust the interval/timeout values, which might help, but I want to analyze this problem and understand why it is happening on my system.

I have attached logs and useful information.

Can someone help me with this?

Thx ,

Ted_Summers · ‎09-15-2010

What version?

Reply with output of:

rpm -aqi |grep VRTS

You may have encountered a known issue with an agent, but we would need more data.

Regards,

Ted

Marianne · ‎09-15-2010

MonitorTimeout can be adjusted - the default timeout is 60 seconds.

Please consider this advice in VCS Admin Guide:

You can also adjust how often VCS monitors various functions by modifying their associated attributes. The attributes MonitorTimeout, OnlineTimeout, and OfflineTimeout indicate the maximum time (in seconds) within which the monitor, online, and offline functions must complete or else be terminated. The default for the MonitorTimeout attribute is 60 seconds. The defaults for the OnlineTimeout and OfflineTimeout attributes is 300 seconds.

For best results, Symantec recommends measuring the time it takes to bring a resource online, take it offline, and monitor before modifying the defaults. Issue an online or offline command to measure the time it takes for each action. To measure how long it takes to monitor a resource, fault the resource and issue a probe, or bring the resource online outside of VCS control and issue a probe.

bonny6 · ‎09-16-2010

Hi Ted,

I've attached the VRTS version output in the file.

Marianne, I don't see anywhere that says "The defaults for the OnlineTimeout and OfflineTimeout attributes is 300 seconds", apart from running the test of bringing the resource online and offline. I don't understand what you mean.

Also, it seems that after this error the resource changes to the FAULTED state.

Gaurav_S · ‎09-16-2010

Hello,

First I would ask: what type of resource is this mdsuOracleLog_lv? Is it a logical volume or an Oracle resource? You can tell this from main.cf.

Secondly, from the logs it appears that most of the resources are showing the same messages, i.e. the monitor procedure failing. So what I understand here is that multiple agents are facing the same issue, so it is not a single-agent issue. It appears to me that the entire server itself is busy.

What is the CPU/memory/IO usage when you see these messages?

What Marianne is saying above is about adjusting MonitorTimeout (not OnlineTimeout or OfflineTimeout, which are 300 sec). MonitorInterval and MonitorTimeout are 60 sec. You can tune MonitorTimeout to a higher value so that the agent gets more time to complete the monitor cycle, but as I mentioned above, I believe the server is struggling for resources. Check if the server is busy.

Gaurav

bonny6 · ‎09-16-2010

Hi Gaurav,

This answers your first question:

    LVMLogicalVolume mdsuOracleLog_lv (
        LogicalVolume = mdsuOracleLog_lv
        VolumeGroup = mdsu_vg

I've also attached some errors I got recently.

I will check the CPU and memory to see if there was an overload on the system.

Can you please tell me what each parameter means and what it does?

Oracle OnlineTimeout 300

Oracle OfflineTimeout 300

Oracle OfflineMonitorInterval 300

Oracle MonitorTimeout 60

Oracle MonitorInterval 60

Gaurav_S · ‎09-16-2010

Hello,

Looking at the new messages, it's DB related; however, it's a timeout again, which makes me think of system overload again.

Regarding the attribute definitions, have a look at the VCS User's Guide for 5.0MP3 AIX, page 656. It lists all attributes and their definitions.

VCS users guide can be found here:

http://sfdoccentral.symantec.com/Storage_Foundation_HA_50MP3_AIX.html

OR

https://vos.symantec.com/documents

I think these definitions are also available in the man page of hatype (I don't remember the man section).

Gaurav

bonny6 · ‎09-16-2010

Hi,

Thx for the guide. I also noticed that the LVMLogicalVolume agent was terminated, and I know that if LVMLogicalVolume goes down it will cause the failover to occur. I didn't find anything in the sar files pointing to a CPU or memory problem. My only problem is to determine why the failover occurred in the first place, and why the APPDB went into the FAULTED state. My only lead now is an overload, but I didn't find anything regarding this.

2010/09/11 10:23:07 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT

2010/09/11 10:23:07 VCS NOTICE V-16-1-53026 Agent LVMLogicalVolume ipm connection still valid

2010/09/11 10:23:07 VCS NOTICE V-16-1-53027 Waiting one more try for ipm connection for agent LVMLogicalVolume to go away

2010/09/11 10:23:18 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT

2010/09/11 10:23:18 VCS NOTICE V-16-1-53026 Agent LVMLogicalVolume ipm connection still valid

2010/09/11 10:23:18 VCS NOTICE V-16-1-53030 Termination request sent to LVMLogicalVolume agent process with pid 18032

2010/09/11 10:23:18 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/LVMLogicalVolume/LVMLogicalVolumeAgent for resource type LVMLogicalVolume
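
A quick worked check on the timestamps above (a sketch only; the helper name secs_between is made up for illustration). The first warning at 10:23:07 comes 135 seconds after the last alive message at 10:20:52. That is just past the 130-second AgentReplyTimeout default mentioned later in this thread, which would be consistent with the engine giving up on the agent's heartbeat rather than with the logical volume itself failing. Treat that reading as an assumption, not a diagnosis.

```shell
# secs_between HH:MM:SS HH:MM:SS
# Prints the number of seconds from the first time to the second (same day).
secs_between() {
    echo "$1 $2" | awk -F'[: ]' '{ print ($4*3600 + $5*60 + $6) - ($1*3600 + $2*60 + $3) }'
}

secs_between 10:20:52 10:23:07   # gap before the first V-16-1-10023 warning
secs_between 10:20:52 10:23:18   # gap before the termination request
```

The two calls print 135 and 146.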

Gaurav_S · ‎09-18-2010

I see that you are using version 5.0MP3RP2.

I was just looking at the VCS release notes for 5.1 here:

http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vcs_notes.pdf

I found this on page 50/51

=====================================================================

Known issues (Means this issue is still open & not fixed yet)

Poor agent performance and inability to heartbeat to the engine

If the system has more than 200 configured resources, the agent may not get enough CPU cycles to function properly. This can prevent the agent from producing a heartbeat synchronously with the engine. If you notice poor agent performance and an agent's inability to heartbeat to the engine, check for the following symptoms.

Navigate to /var/VRTSvcs/diag/agents/ and look for files that resemble:

FFDC_AGFWMain_729_agent_type.log

FFDC_AGFWTimer_729_agent_type.log

core

FFDC_AGFWSvc_729_agent_type.log

agent_typeAgent_stack_729.txt

Where agent_type is the type of agent, for example Application or FileOnOff. If you find these files, perform the next step.

Navigate to /var/VRTSvcs/log/ and check the engine_*.log file for messages that resemble:

2009/10/06 15:31:58 VCS WARNING V-16-1-10023 Agent agent_type not sending alive messages since Tue Oct 06 15:29:27 2009
2009/10/06 15:31:58 VCS NOTICE V-16-1-53026 Agent agent_type ipm connection still valid
2009/10/06 15:31:58 VCS NOTICE V-16-1-53030 Termination request sent to agent_type agent process with pid 729

Workaround: If you see that both of the above criteria are true, increase the value of the AgentReplyTimeout attribute value. (Up to 300 seconds or as necessary.)
[1853285]

===================================================================================

As you can see, the messages look similar to what you are seeing. Perform the above check to see if you have any core/log files. If yes, then apply the workaround mentioned above.

As this is an open issue, it might be fixed by a patch in an upcoming version.
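
The two checks from the release note could be scripted roughly like this. It is a sketch, not an official tool: the function names are made up, and the paths in the usage comment are simply the ones quoted above.

```shell
# count_heartbeat_msgs LOGFILE...
# Counts engine log lines carrying the three message IDs quoted in the release note.
count_heartbeat_msgs() {
    cat "$@" 2>/dev/null | grep -c -E 'V-16-1-(10023|53026|53030)' || true
}

# has_ffdc_files DIR
# Succeeds if DIR holds FFDC_* logs, an agent stack file, or a core file.
has_ffdc_files() {
    dir="$1"
    [ -d "$dir" ] || return 1
    for f in "$dir"/FFDC_* "$dir"/*_stack_*.txt "$dir"/core; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage on a cluster node (paths as given in the release note):
#   has_ffdc_files /var/VRTSvcs/diag/agents && echo "FFDC files present"
#   count_heartbeat_msgs /var/VRTSvcs/log/engine_*.log
```

If the directory check fails but the message count is non-zero, only one of the two criteria is met, so the release note's workaround may or may not apply.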

Gaurav

bonny6 · ‎09-19-2010

Hi,

Thx, but for some reason I don't have the path

/var/VRTSvcs/diag/agents/

but these are still exactly the errors I've got.

I will try the workaround to see if it helps.

Thx again.

Ted_Summers · ‎09-20-2010

So I searched on the "system messages" about extents you sent and found this:

http://oracle.su/docs/11g/server.112/e10595/schema002.htm

If I read this correctly, as long as the DB has a query waiting for it, additional queries / db transactions wouldn't take place.

So- if our Oracle monitor wants to touch or read a table in the db, then it won't be able to and you would get timeouts... because our monitor cycle can't complete.

If someone else has another take on these messages and what they mean with the new information I provided, then certainly let me know. I am intrigued by this scenario...

Also- please provide the output of both of these commands:

ls -l /etc/VRTSvcs/conf/types.cf

ls -l /etc/VRTSvcs/conf/config/types.cf

Sometimes types.cf doesn't get updated in the config directory, which can cause all kinds of unexpected behavior.

Gaurav_S · ‎09-20-2010

Hi Ted,

Yep, agreed that it could be a database issue, but I am wondering why LVMLogicalVolume would have a problem with that. As we can see above, the agent is not sending alive messages, which makes me think more about the performance side.

Let me know if you think otherwise.

Gaurav

Ted_Summers · ‎09-20-2010

I believe bonny6's servers may have more than one issue.

The LVM-related part may be performance, but the DB getting suspended might possibly cause some other undesired system load.

My opinion is bonny6 should start a case with Technical Support and send full VRTSexplorers from all nodes for review.

This doesn't seem like it can be diagnosed with little data snippets....

Regards,

Ted

PS- a wrong "types.cf" file can cause all manner of weird behavior, so I always recommend checking it.

I have seen mounts not mount, DGs not import, applications not start, or intermittent timeouts, all due to an incorrect types.cf being in place.

rregunta · ‎09-23-2010

Hello,

You didn't mention since when you have been getting this issue. If it started appearing after an installation, upgrade, or patch update, then I would be interested in seeing the output of the commands below:

# ls -l /etc/VRTSvcs/conf/*.cf

# ls -l /etc/VRTSvcs/conf/config/*.cf

# ls -l /etc/VRTSagents/ha/conf/Oracle/*.cf

Regards

Rajesh

bonny6 · ‎09-26-2010

Hi all ,

First of all, I've attached a file with the requested output for all the VRTS paths.

Second, I didn't perform any upgrade or patch update on this system. The problem occurred without any system change.

Ted, what do you mean by the wrong "types.cf"? I did not change anything in the system or in the "types.cf" files before this error appeared.

I will also read about the Oracle problem and see if it is relevant.

I also have a question regarding the first solution Gaurav recommended.

If I increase the AgentReplyTimeout attribute value to 300 sec (the default was 130 sec), can you please let me know the impact of this change?

PS: Is there a timeline for a permanent fix for this issue?

Thanks and Regards .

Gaurav_S · ‎09-26-2010

To answer your questions..

For AgentReplyTimeout, the definition is: the number of seconds the engine waits to receive a heartbeat from the agent before restarting the agent.

So if your server is busy or there is a real bug in the software, HAD will wait a little longer for the agent to respond before taking the action of restarting the agent. The positive side is that there is a chance the agent will respond within 300 sec (130 sec might be a little short for the agent). The negative side is that if the agent really hangs, HAD will be a little delayed in taking the necessary action. I would recommend at least giving it a try and observing the system for a day.
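
For reference, the change itself could look roughly like the sketch below. The hatype and haconf commands are the usual VCS CLI for type-level attributes, but check them against the documentation for your version. The function names, the choice of the LVMLogicalVolume type in the example, and the DRY_RUN switch are assumptions added here for illustration; by default the script only prints what it would run.

```shell
# DRY_RUN=1 (the default here, an assumption for safety) prints the commands
# instead of running them; set DRY_RUN=0 on a cluster node to apply the change.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_agent_reply_timeout TYPE SECONDS
set_agent_reply_timeout() {
    type_name="$1"
    seconds="$2"
    run hatype -display "$type_name" -attribute AgentReplyTimeout  # show current value
    run haconf -makerw                                             # open config read-write
    run hatype -modify "$type_name" AgentReplyTimeout "$seconds"   # apply the new timeout
    run haconf -dump -makero                                       # save and set read-only
}

# Example (prints four commands in dry-run mode):
#   set_agent_reply_timeout LVMLogicalVolume 300
```

Since AgentReplyTimeout is set per resource type, the same call would be repeated for each agent type that shows the "not sending alive messages" warning.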

To answer the other question, a wrong types.cf again relates to an update of the agent/software. For example, if you upgraded the software from 4.1 to 5.0, it is quite possible that some agents were rewritten in 5.0, so the resource type definitions will change. In such cases, during the upgrade, the new types.cf file is by default kept in the /etc/VRTSvcs/conf/ directory, and to solve the issue, types.cf from the conf directory needs to be moved to the config directory. But to me, that doesn't look to be the case here, since you haven't done any upgrade. Just to be sure, compare

/etc/VRTSvcs/conf/config/OracleTypes.cf       with 
/etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf

I can see a small difference in size, but that is possible since OracleTypes.cf under the config directory is active and in use. Just check whether there is any value/line difference.
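
One way to do that comparison, as a minimal sketch (the function names same_cf and report_cf are made up here; cmp and diff are the standard tools):

```shell
# same_cf FILE_A FILE_B
# Succeeds when both files exist and are byte-for-byte identical.
same_cf() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# report_cf FILE_A FILE_B
# Prints a one-line verdict, followed by the diff output if the files differ.
report_cf() {
    if same_cf "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "DIFFERENT: $1 $2"
        diff "$1" "$2" 2>&1 || true
    fi
}

# Usage with the paths mentioned in this thread:
#   report_cf /etc/VRTSvcs/conf/config/OracleTypes.cf /etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf
#   report_cf /etc/VRTSvcs/conf/types.cf /etc/VRTSvcs/conf/config/types.cf
```

A difference is not automatically a problem, since the active copy under config can carry local changes; the point is to read the diff and see whether attribute definitions differ.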

Gaurav

bonny6 · ‎09-26-2010

Hi,

After comparing both types.cf files, there is no difference between them.

I also have these errors:

2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.
2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuDbData_lv) - monitor procedure did not complete within the expected time.
2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuData_lv) - monitor procedure did not complete within the expected time
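
To see at a glance which resources hit the monitor timeout and how often, the engine log can be summarized with something like the sketch below. The function name is made up; it relies only on the V-16-2-13027 line format shown above.

```shell
# timeouts_per_resource LOGFILE...
# Prints "count resource" for every resource that logged V-16-2-13027,
# with the most frequent resource first.
timeouts_per_resource() {
    cat "$@" |
        grep 'V-16-2-13027' |
        sed -n 's/.*Resource(\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Usage on a cluster node (log location as quoted earlier in the thread):
#   timeouts_per_resource /var/VRTSvcs/log/engine_*.log
```

If several unrelated resources time out in the same second, as in the lines above, that points away from any single resource and toward something shared, such as the host or the storage underneath.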

Ted, regarding what you said about the DB: "as long as the DB has a query waiting for it, additional queries / db transactions wouldn't take place. So- if our Oracle monitor wants to touch or read a table in the db, then it won't be able to and you would get timeouts... because our monitor cycle can't complete."

Neither server has 200 resources, or even half that number, that might cause a memory or CPU problem. The server has 12G of memory.

I didn't find any sign of overload, or of any upgrade or change in the system, that might cause this to happen. There seems to be no trace of a lead.

I think the big problem here is to figure out why the DB and LVM monitors can't finish within their expected timeouts. Maybe changing the Oracle and LVM agent timeouts will fix the problem, but that will not explain why all of this occurred in the first place.

Gaurav_S · ‎09-26-2010

I would recommend testing AgentReplyTimeout first. Whether or not that helps, since you want to know the root cause of why this is happening in the first place, open a case with Symantec Technical Support. You will need to provide them with debug output from the agent (you need to enable LogDbg for the agent), from HAD (add some debug tags to the VCS engine), plus truss output and thread lists. That would surely help you get the answer you are looking for.

Unfortunately, analyzing these outputs would be far too tedious on these forums.

To contact Symantec support, use the link below:

http://www.symantec.com/business/support/index?page=home

On the right side, you will find "Contact Support" and the option to open a case via the web. There will be multiple options.

Gaurav
