09-15-2010 08:21 AM
Hi all,
I recently ran into an Oracle issue. I'm getting the following error message:
VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.
My question is: why does this error appear?
It is also causing a failover of the servers, and the database goes into the FAULTED state.
The DBA's solution was to adjust the interval/timeout values, which might help, but I want to analyze this problem and understand why it is happening on my system.
I have attached logs and useful information.
Can someone help me with this?
Thx,
09-15-2010 10:10 AM
What version?
Reply with output of:
rpm -aqi |grep VRTS
You may have encountered a known issue with an agent, but we would need more data.
Regards,
Ted
09-15-2010 02:37 PM
MonitorTimeout can be adjusted - the default timeout is 60 seconds.
Please consider this advice in the VCS Admin Guide:
You can also adjust how often VCS monitors various functions by modifying their associated attributes. The attributes MonitorTimeout, OnlineTimeout, and OfflineTimeout indicate the maximum time (in seconds) within which the monitor, online, and offline functions must complete or else be terminated. The default for the MonitorTimeout attribute is 60 seconds. The defaults for the OnlineTimeout and OfflineTimeout attributes are 300 seconds.
For best results, Symantec recommends measuring the time it takes to bring a resource online, take it offline, and monitor before modifying the defaults. Issue an online or offline command to measure the time it takes for each action. To measure how long it takes to monitor a resource, fault the resource and issue a probe, or bring the resource online outside of VCS control and issue a probe.
09-16-2010 01:41 AM
Hi Ted,
I've attached the VRTS versions in the file.
Marianne, I don't see anything there that says the defaults for the OnlineTimeout and OfflineTimeout attributes are 300 seconds, other than the advice to test bringing the resource online and offline. I don't understand what you mean.
Also, it seems that after this error the resource changes to the FAULTED state.
09-16-2010 02:12 AM
Hello,
First I would ask, what type of resource is mdsuOracleLog_lv? Is it a logical volume or an Oracle resource? You can tell this from main.cf.
Secondly, from the logs it appears that most of the resources are showing the same messages, i.e. the monitor procedure failing. So my understanding is that multiple agents are facing the same issue, which means it is not a single-agent issue. It appears to me that the entire server itself is busy.
What is the CPU/memory/I/O usage when you see these messages?
What Marianne is saying above is about adjusting MonitorTimeout (not OnlineTimeout or OfflineTimeout, which are 300 sec). MonitorInterval and MonitorTimeout are 60 sec. You can tune MonitorTimeout to a higher value so that the agent gets more time to complete the monitor cycle, but as I mentioned above, I believe the server is struggling for resources. Check whether the server is busy.
Gaurav
09-16-2010 03:13 AM
Hi Gaurav,
This is for your first question:
LVMLogicalVolume mdsuOracleLog_lv (
LogicalVolume = mdsuOracleLog_lv
VolumeGroup = mdsu_vg
I've also attached some errors I got recently.
I will check the CPU and memory to see if there was an overload on the system.
Can you please tell me what each parameter means and what it does?
Oracle OnlineTimeout 300
Oracle OfflineTimeout 300
Oracle OfflineMonitorInterval 300
Oracle MonitorTimeout 60
Oracle MonitorInterval 60
09-16-2010 03:47 AM
Hello,
Looking at the new messages, it's DB related; however, it is again a timeout, which makes me think of system overload again...
Regarding the attribute definitions, have a look at the VCS User's Guide for 5.0MP3 AIX, page 656. It lists all attributes and their definitions.
The VCS User's Guide can be found here:
http://sfdoccentral.symantec.com/Storage_Foundation_HA_50MP3_AIX.html
OR
https://vos.symantec.com/documents
I think these definitions are also available in the man page of hatype (I don't remember the man section).
Gaurav
09-16-2010 08:19 AM
Hi,
Thanks for the guide. I also noticed that the LVMLogicalVolume agent was terminated, and I know that if LVMLogicalVolume goes down it will cause a failover. I didn't find anything in the sar files pointing to a CPU or memory problem. My only problem is to determine why the failover occurred in the first place, and why the APPDB went into the FAULTED state. My only lead now is an overload, but I didn't find anything to support this.
2010/09/11 10:23:07 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT
2010/09/11 10:23:07 VCS NOTICE V-16-1-53026 Agent LVMLogicalVolume ipm connection still valid
2010/09/11 10:23:07 VCS NOTICE V-16-1-53027 Waiting one more try for ipm connection for agent LVMLogicalVolume to go away
2010/09/11 10:23:18 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT
2010/09/11 10:23:18 VCS NOTICE V-16-1-53026 Agent LVMLogicalVolume ipm connection still valid
2010/09/11 10:23:18 VCS NOTICE V-16-1-53030 Termination request sent to LVMLogicalVolume agent process with pid 18032
2010/09/11 10:23:18 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/LVMLogicalVolume/LVMLogicalVolumeAgent for resource type LVMLogicalVolume
09-18-2010 02:58 AM
I see that you are using version 5.0MP3RP2.
I was just looking at the VCS Release Notes for 5.1 here:
http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vcs_notes.pdf
I found this on page 50/51:
=====================================================================
Known issues (Means this issue is still open & not fixed yet)
Poor agent performance and inability to heartbeat to the engine
If the system has more than 200 configured resources, the agent may not get enough CPU cycles to function properly. This can prevent the agent from producing a heartbeat synchronously with the engine. If you notice poor agent performance and an agent's inability to heartbeat to the engine, check for the following symptoms.
Navigate to /var/VRTSvcs/diag/agents/ and look for files that resemble:
FFDC_AGFWMain_729_agent_type.log
FFDC_AGFWTimer_729_agent_type.log
core
FFDC_AGFWSvc_729_agent_type.log
agent_typeAgent_stack_729.txt
Where agent_type is the type of agent, for example Application or FileOnOff. If you find these files, perform the next step.
Navigate to /var/VRTSvcs/log/ and check the engine_*.log file for messages that resemble:
2009/10/06 15:31:58 VCS WARNING V-16-1-10023 Agent agent_type not sending alive messages since Tue Oct 06 15:29:27 2009
2009/10/06 15:31:58 VCS NOTICE V-16-1-53026 Agent agent_type ipm connection still valid
2009/10/06 15:31:58 VCS NOTICE V-16-1-53030 Termination request sent to agent_type agent process with pid 729
Workaround: If you see that both of the above criteria are true, increase the value of the AgentReplyTimeout attribute value. (Up to 300 seconds or as necessary.)
[1853285]
===================================================================================
As you can see, the messages look similar to what you are seeing. Perform the above check to see whether you have any core/log files; if yes, then apply the above-mentioned workaround.
As this is an open issue, it might be fixed by a patch in an upcoming version.
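The two checks from the release note could be scripted roughly as below. This is only a sketch: the file-name patterns are taken from the quoted text, the function names are made up for illustration, and the helper returns an empty list if the diag directory does not exist.

```python
import re
from pathlib import Path

# File-name shapes quoted in the release note; the pid and agent type vary.
DIAG_PATTERNS = [
    re.compile(r"^FFDC_AGFW(Main|Timer|Svc)_(\d+)_(.+)\.log$"),
    re.compile(r"^(.+)Agent_stack_(\d+)\.txt$"),
]


def find_agent_diag_files(diag_dir):
    """Return sorted names of files in diag_dir that match the patterns above."""
    root = Path(diag_dir)
    if not root.is_dir():
        # The directory may simply not exist on this system.
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and any(rx.match(p.name) for rx in DIAG_PATTERNS)
    )


def engine_log_has_termination(lines):
    """True if any line carries V-16-1-53030 (termination request sent to an agent)."""
    return any("V-16-1-53030" in line for line in lines)


# First check: look for diagnostic files (path taken from the release note).
print(find_agent_diag_files("/var/VRTSvcs/diag/agents"))
```

For the second check, feed the lines of the engine log under /var/VRTSvcs/log/ to `engine_log_has_termination`. The workaround in the release note applies only if both checks come back positive.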
Gaurav
09-19-2010 03:57 AM
Hi,
Thanks, but for some reason I don't have the path
/var/VRTSvcs/diag/agents/
but these are still exactly the errors I got.
I will try the workaround to see if it helps.
Thanks again.
09-20-2010 08:46 AM
So I searched on the "system messages" about extents you sent and found this:
http://oracle.su/docs/11g/server.112/e10595/schema002.htm
If I read this correctly, as long as the DB has a query waiting for it, additional queries / db transactions wouldn't take place.
So- if our Oracle monitor wants to touch or read a table in the db, then it won't be able to and you would get timeouts... because our monitor cycle can't complete.
If someone else has another take on these messages and what they mean with the new information I provided, then certainly let me know. I am intrigued by this scenario...
Also- please provide the output of both of these commands:
ls -l /etc/VRTSvcs/conf/types.cf
ls -l /etc/VRTSvcs/conf/config/types.cf
Sometimes types.cf doesn't get updated in the config directory, and that can cause all kinds of unexpected behavior.
09-20-2010 09:30 AM
Hi Ted,
Yep, I agree that it could be a database issue, but I am wondering why LVMLogicalVolume would have a problem with that. As we can see above, the agent is not sending alive messages, which makes me think more about the performance side...
Let me know if you think otherwise.
Gaurav
09-20-2010 10:44 AM
I believe bonny6's servers may have more than one issue.
The LVM-related one may be performance, but the DB getting suspended might possibly cause some other undesired system load.
My opinion is that bonny6 should open a case with Technical Support and send full VRTSexplorers from all nodes for review.
This doesn't seem like something that can be diagnosed from small data snippets.
Regards,
Ted
PS- A wrong "types.cf" file can cause all manner of weird behavior, so I recommend always checking it.
I have seen mounts not mount, DGs not import, applications not start, or intermittent timeouts, all due to an incorrect types.cf being in place.
09-23-2010 03:15 AM
Hello,
You didn't mention how long you have been seeing this issue. If it started appearing after an installation, upgrade, or patch update, then I would be interested to see the outputs below:
09-26-2010 03:46 AM
Hi all,
First of all, I've attached a file with the requested output for all the VRTS paths.
Second, I didn't perform any upgrade or patch update on this system. The problem occurred without any system change.
Ted, what do you mean by a wrong "types.cf"? I did not change anything in the system or in the "types.cf" files before this error appeared.
I will also read about the Oracle problem and see if it is relevant.
I also have a question regarding the first solution Gaurav recommended.
If I increase the AgentReplyTimeout attribute value to 300 sec (the default was 130 sec), can you please let me know the impact of this change?
PS: Is there a timeline for a permanent fix for this issue?
Thanks and regards.
09-26-2010 04:37 AM
To answer your questions..
For AgentReplyTimeout, the definition is: the number of seconds the engine waits to receive a heartbeat from the agent before restarting the agent.
So if your server is busy or there is a real bug in the software, HAD will wait a little longer for the agent to respond before taking the action of restarting the agent. The positive side is that there is a chance the agent will respond within 300 sec (130 sec might be a little short for the agent). The negative side is that, if the agent really hangs, HAD will be a little delayed in taking the necessary action. I would recommend at least giving it a try and observing the system for a day.
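The V-16-1-10023 lines posted earlier in this thread (2010/09/11 10:23) make the trade-off concrete. Here is a rough sketch that works out how long the agent had been silent when each warning was logged and compares that with 130 sec and 300 sec. It assumes both timestamps on a line are in the same local time zone. The "restart once the silence exceeds the timeout" rule is a simplification of what HAD actually does, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches the V-16-1-10023 warning as it appears in the posted engine log.
ALIVE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS WARNING V-16-1-10023 "
    r"Agent (\S+) not sending alive messages since "
    r"\w{3} (\d{1,2} \w{3} \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)"
)


def silence_seconds(line):
    """Return (agent, seconds of silence) for a V-16-1-10023 line, else None."""
    m = ALIVE_RE.match(line)
    if m is None:
        return None
    logged = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
    # Assumption: the "since" time is in the same time zone as the log timestamp.
    since = datetime.strptime(m.group(3), "%d %b %Y %I:%M:%S %p")
    return m.group(2), int((logged - since).total_seconds())


def would_restart(silence, agent_reply_timeout):
    """Simplified model: the engine acts once silence exceeds the timeout."""
    return silence > agent_reply_timeout


SAMPLE = [
    "2010/09/11 10:23:07 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not "
    "sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT",
    "2010/09/11 10:23:18 VCS WARNING V-16-1-10023 Agent LVMLogicalVolume not "
    "sending alive messages since Sat 11 Sep 2010 10:20:52 AM MVT",
]

for line in SAMPLE:
    agent, secs = silence_seconds(line)
    print(agent, secs, "s silent;",
          "exceeds 130:", would_restart(secs, 130),
          "exceeds 300:", would_restart(secs, 300))
```

On those two lines the silence works out to 135 and 146 seconds, just over the 130 sec default and well under 300 sec.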
To answer the other queries: a wrong types.cf again relates to an update of the agent/software. For example, if you upgraded the software from 4.1 to 5.0, it is quite possible that some agents were rewritten in 5.0, so the resource definitions will change. In such cases, during the upgrade, the new types.cf file is by default kept in the /etc/VRTSvcs/conf/ directory, and to solve the issue, types.cf from the conf directory needs to be moved to the config directory. But to me that doesn't look to be the case, since you haven't done any upgrade. Just to be sure, compare
/etc/VRTSvcs/conf/config/OracleTypes.cf with /etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf
I can see a small difference in size, but that is possible since the OracleTypes.cf under the config directory is active and in use. Just check whether there are any value/line differences.
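If you prefer to script the comparison, a minimal sketch using Python's standard difflib is below. The function name is made up for illustration. It ignores trailing whitespace, which is one possible reason for a size difference without a real line difference.

```python
import difflib
from pathlib import Path


def cf_diff(active_path, shipped_path):
    """Return unified-diff lines between two .cf files, ignoring trailing whitespace."""
    active = [line.rstrip() for line in Path(active_path).read_text().splitlines()]
    shipped = [line.rstrip() for line in Path(shipped_path).read_text().splitlines()]
    return list(
        difflib.unified_diff(
            active,
            shipped,
            fromfile=str(active_path),
            tofile=str(shipped_path),
            lineterm="",
        )
    )
```

Calling `cf_diff("/etc/VRTSvcs/conf/config/OracleTypes.cf", "/etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf")` returns an empty list when the two files have the same content line for line. The same helper works for the two types.cf paths Ted asked about.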
Gaurav
09-26-2010 08:52 AM
Hi,
After comparing both types.cf files, there is no difference between them.
I also have these errors:
2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.
2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuDbData_lv) - monitor procedure did not complete within the expected time.
2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuData_lv) - monitor procedure did not complete within the expected time
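One way to see whether the timeouts cluster in time is to group the V-16-2-13027 lines by timestamp and system. The sketch below is illustrative only, and the function name is made up. Several resources timing out in the same second is consistent with a system-wide stall rather than a single faulty resource, but it does not prove one.

```python
import re
from collections import defaultdict

# Matches the monitor-timeout error as it appears in the posted engine log.
TIMEOUT_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS ERROR V-16-2-13027 "
    r"\((\S+)\) Resource\(([^)]+)\) - monitor procedure did not complete"
)


def timeouts_by_timestamp(lines):
    """Map (timestamp, system) to the list of resources whose monitor timed out."""
    groups = defaultdict(list)
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            groups[(m.group(1), m.group(2))].append(m.group(3))
    return dict(groups)


SAMPLE = [
    "2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.",
    "2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuDbData_lv) - monitor procedure did not complete within the expected time.",
    "2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuData_lv) - monitor procedure did not complete within the expected time",
]

for (stamp, system), resources in sorted(timeouts_by_timestamp(SAMPLE).items()):
    print(stamp, system, len(resources), resources)
```

Run against the full engine log, this shows at a glance whether every timeout burst hits all the logical volumes together or only one resource at a time.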
Ted, regarding what you said about the DB: "as long as the DB has a query waiting for it, additional queries / db transactions wouldn't take place. So- if our Oracle monitor wants to touch or read a table in the db, then it won't be able to and you would get timeouts... because our monitor cycle can't complete."
Neither server has 200 resources, or even half that many, that might cause a memory or CPU crash. The server has 12 GB of memory.
I didn't find any sign of overload, nor any upgrade or change in the system that might cause this to happen. It seems there is no trace of a lead.
I think the big problem here is to figure out why the DB and LVM monitors can't finish within their expected timeouts. Maybe changing the Oracle and LVM agent timeouts will fix the problem, but that will not explain why all of this occurred in the first place.
09-26-2010 09:24 PM
I would recommend testing AgentReplyTimeout first. In any case, whether or not that helps, since you want to know the root cause of why this is happening in the first place, open a case with Symantec Technical Support. You will need to provide them with debug output from the agent (you need to enable LogDbg for the agent) and from HAD (add some debug tags to the VCS engine), plus truss output and thread lists. That would surely help you get the answer you are looking for.
Unfortunately, analyzing these outputs would be far too tedious on these forums...
To contact Symantec support, use the link below:
http://www.symantec.com/business/support/index?page=home
On the right side you will find Contact Support; open a case via the web. There are multiple options.
Gaurav