Forum Discussion

bonny6
Level 3
15 years ago
Solved

Problem with VRTS: Oracle alarm

Hi all,

I recently ran into an Oracle issue and am getting the following error message:

VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.

My question is: why can this error appear?

It is also causing a failover of the servers, and the database goes into the FAULTED state.

The DBA's solution was to adjust the interval/timeout values, which might help, but I want to analyze this problem and understand why it is happening on my system.

 

I have attached logs and useful information.

I would appreciate it if someone could help me with this.

Thanks,

 

 

  • Hello,

     

    Looking at the new messages, it's DB related; however, it is a timeout again, which makes me think of system overload again...

    Regarding the attribute definitions, have a look at the VCS User's Guide for 5.0MP3 AIX, page 656. It lists all attributes and their definitions.

    The VCS User's Guide can be found here:

    http://sfdoccentral.symantec.com/Storage_Foundation_HA_50MP3_AIX.html

    OR

    https://vos.symantec.com/documents

     

    I think these definitions are also available in the man page of hatype (I don't remember the man section).
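
As a rough sketch of how the current values could be checked from the command line: the helper below loops over a few timing-related type attributes and prints each one with `hatype -display`. The function name `show_timeout_attrs` is made up for illustration, and the choice of attributes (MonitorInterval, MonitorTimeout, FaultOnMonitorTimeouts, AgentReplyTimeout) is only an assumption about which values are relevant to this thread.

```shell
# Hypothetical helper: print timing-related attributes of one VCS resource type.
# The attribute list is an assumption; adjust it to what you want to inspect.
show_timeout_attrs() {
    type_name="$1"
    for attr in MonitorInterval MonitorTimeout FaultOnMonitorTimeouts AgentReplyTimeout; do
        # One attribute per call keeps the output easy to read.
        hatype -display "$type_name" -attribute "$attr"
    done
}
```

For example, `show_timeout_attrs LVMLogicalVolume` or `show_timeout_attrs Oracle` would list the values for the two agent types mentioned in this thread.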

     

    Gaurav

  • Hi Ted,

    Yep, I agree that it could be a database issue, but I am wondering why LVMLogicalVolume would have a problem with that. As we can see above, the agent is not sending alive messages, which makes me think more about the performance side...

    Let me know if you think otherwise.

     

    Gaurav

  • I believe bonny6's servers may have more than one issue.

    The LVM-related part may be performance, but the DB getting suspended might possibly cause some other undesired system load.

    My opinion is that bonny6 should open a case with Technical Support and send full VRTSexplorer output from all nodes for review.

    This doesn't seem like something that can be diagnosed from small data snippets.

    Regards,

    Ted

     

    PS: a wrong "types.cf" file can cause all manner of weird behavior, so I always recommend checking it.

    I have seen mounts not mount, DGs not import, applications not start, and intermittent timeouts, all due to an incorrect types.cf being in place.

  • Hello,

    You didn't mention how long you have been seeing this issue. If it started after an installation, upgrade, or patch update, I would be interested in seeing the output of the commands below:

     

    # ls -l /etc/VRTSvcs/conf/*.cf
    # ls -l /etc/VRTSvcs/conf/config/*.cf
    # ls -l /etc/VRTSagents/ha/conf/Oracle/*.cf
     
    Regards
    Rajesh
  • Hi all ,

    First of all, I've attached a file with the requested output for all the VRTS paths.

    Second, I didn't perform any upgrade or patch update on this system. The problem occurred without any system change.

    Ted, what do you mean by a wrong "types.cf"? I did not change anything in the system or in the "types.cf" files before this error appeared.

    I will also read about the Oracle problem and see if it is relevant.

    I also have a question regarding the first solution Gaurav recommended.

    If I increase the AgentReplyTimeout attribute value to 300 sec (the default was 130 sec), can you please let me know the impact of this change?

     

    PS: Is there a timeline for a permanent fix for this issue?

    Thanks and regards.
     

  • To answer your questions..

    For AgentReplyTimeout, the definition is: "The number of seconds the engine waits to receive a heartbeat from the agent before restarting the agent."

    So if your server is busy or there is a real bug in the software, HAD will wait a little longer for the agent to respond before restarting it. The positive side is that the agent may recover within 300 sec (130 sec might be a little short for the agent). The negative side is that if the agent really hangs, HAD will be slower to take the necessary action. I would recommend at least giving it a try and observing the system for a day.
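
For reference, a minimal sketch of how such a change might be applied, assuming it is made at the type level for one agent. The wrapper name `set_agent_reply_timeout` is hypothetical; it only chains `haconf -makerw`, `hatype -modify`, and `haconf -dump -makero`, and it does no validation of its arguments.

```shell
# Hypothetical wrapper: set AgentReplyTimeout for one resource type and save the config.
set_agent_reply_timeout() {
    type_name="$1"
    seconds="$2"
    haconf -makerw                      # open the cluster configuration for writing
    hatype -modify "$type_name" AgentReplyTimeout "$seconds"
    rc=$?                               # remember whether the modify step succeeded
    haconf -dump -makero                # write the configuration out and make it read-only again
    return "$rc"
}
```

For example, `set_agent_reply_timeout Oracle 300` would apply the 300 sec value discussed here to the Oracle type. Reverting is the same call with the previous value (130 in this thread).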

    To answer your other questions: a wrong types.cf again relates to an update of the agent or software. For example, if you upgraded the software from 4.1 to 5.0, it is quite possible that some agents were rewritten in 5.0, so the resource definitions will have changed. In such cases, during the upgrade, the new types.cf file is by default kept in the /etc/VRTSvcs/conf/ directory, and to solve the issue, types.cf from the conf directory needs to be moved to the config directory. To me, that doesn't look to be the case here, since you haven't done any upgrade. Just to be sure, compare

    /etc/VRTSvcs/conf/config/OracleTypes.cf       with 
    /etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf 
    

    I can see a small difference in size, but that is possible since the OracleTypes.cf under the config directory is the active one in use. Just check whether any values or lines differ.
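
A minimal sketch of that comparison using plain `diff`. The function name `compare_types_cf` is hypothetical; it prints "identical" when the two files match and otherwise prints "different" followed by the differing lines.

```shell
# Hypothetical helper: compare an active types file with the shipped copy.
compare_types_cf() {
    diff "$1" "$2" >/dev/null 2>&1
    case $? in
        0) echo "identical" ;;
        1) echo "different"
           diff "$1" "$2" ;;            # show the differing lines
        *) echo "error: could not read one of the files" >&2
           return 2 ;;
    esac
}
```

With the two paths named above, the call would be `compare_types_cf /etc/VRTSvcs/conf/config/OracleTypes.cf /etc/VRTSagents/ha/conf/Oracle/OracleTypes.cf`.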

     

    Gaurav

  • Hi,

    After comparing both types.cf files, there is no difference between them.

    I also have these errors:

    2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuOracleLog_lv) - monitor procedure did not complete within the expected time.
    2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuDbData_lv) - monitor procedure did not complete within the expected time.
    2010/09/11 10:27:59 VCS ERROR V-16-2-13027 (mdsu1a) Resource(mdsuData_lv) - monitor procedure did not complete within the expected time
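
One thing that can be checked in the engine log is whether these timeouts cluster at the same instant across several resources, as the three lines above do. A small sketch, assuming the log format shown above (date and time in the first two fields); the function name `timeouts_per_second` is hypothetical. Several resources timing out in the same second may hint at something host-wide rather than a single resource, though that is only a heuristic.

```shell
# Hypothetical helper: count V-16-2-13027 monitor timeouts per timestamp.
timeouts_per_second() {
    awk '/V-16-2-13027/ { count[$1 " " $2]++ }
         END { for (ts in count) print ts, count[ts] }' "$1" | sort
}
```

It could be run against the engine log, for example `timeouts_per_second /var/VRTSvcs/log/engine_A.log` (the usual default location; adjust the path if yours differs).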

    Ted, regarding what you said about the DB: "As long as the DB has a query waiting for it, additional queries / DB transactions wouldn't take place. So, if our Oracle monitor wants to touch or read a table in the DB, then it won't be able to and you would get timeouts, because our monitor cycle can't complete."

    Neither server has 200 resources, or even half that many, that might cause a memory or CPU crash. The server has 12 GB of memory.

    I didn't find any sign of overload, or any upgrade or change in the system that might cause this to happen. There seems to be no trace of a lead.

    I think the big problem here is to figure out why the DB and LVM monitors can't finish within their expected timeouts. Maybe changing the Oracle and LVM agent timeouts will fix the problem, but it will not explain why all of this occurred in the first place.

     

    I would recommend testing AgentReplyTimeout first. Whether or not that helps, since you want to know the root cause of why this is happening in the first place, open a case with Symantec Technical Support. You will need to provide them with debug output from the agent (you need to enable LogDbg for the agent) and from HAD (add some debug tags to the VCS engine), plus truss output and thread lists. That would surely help you get the answer you are looking for.

    Unfortunately, analyzing these outputs on these forums could be far too tedious.

    To contact Symantec support, use the link below:

    http://www.symantec.com/business/support/index?page=home

    On the right side, you will find "Contact Support", where you can open a case via the web. Multiple options are available.

     

    Gaurav