Forum Discussion

sparmar
15 years ago

VCS nodes keep rebooting


Hi

I wonder if you kind people can help me again.

I have a 3-node cluster on Sun x4240 servers, on which I have installed VCS v5.0.
There are only about 9 service groups created on them, which just contain mounts and volumes, so there is no real load on them.

The issue I am seeing is that, at random, one server in the cluster drops off the network, and then I can't access it via the console as root.
This lasts for about 15 minutes, then it fixes itself, and then another server does the same.


I have noticed that the heartbeat connections go first.


My cluster setup is:
Red Hat 5.4 x86
VCS v5.0 RP3
Heartbeats on eth1 and eth3 (100 Mb, full duplex)
All the servers are built exactly the same with no variation.


Has anyone come across this before?


Thanks

Sparmar



4 Replies

  • Check /var/log/messages. Look for the section prior to the server booting again.

    We have recently seen the situation described in this TechNote:
    http://seer.entsupport.symantec.com/docs/184301.htm

    The solution was to track down and troubleshoot the process that was causing CPU usage to spike and leaving the system unresponsive.
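
A rough sketch of that log check, in case it helps. The `context_before` function name and its defaults are assumptions for illustration, not something from the TechNote; the marker pattern should be whatever line your syslog writes at startup.

```shell
# Print the lines leading up to each match of a marker pattern in a log file.
# Hypothetical helper: the name, defaults and usage are illustrative only.
context_before() {
    pattern="$1"                         # regex that marks the (re)boot in the log
    logfile="${2:-/var/log/messages}"    # log file to search
    lines="${3:-50}"                     # lines of context to show before each match
    grep -B "$lines" -e "$pattern" "$logfile"
}
```

For example, `context_before 'syslogd.*restart' /var/log/messages 100` would show the 100 lines before each syslogd restart, assuming that is how a boot shows up in your log.
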
  • Is your VCS version stated correctly, or is it VCS 5.0MP3RP3?

    I believe that the VCS 5.0 base version was not supported on RHEL 5.


    Please give us the following outputs:

    #rpm -aq |grep VRTSvcs
    #had -version

    Also, when you say that you cannot access the console once you lose the network, does it affect all the hosts at the same time? I mean, are you able to connect to any other node, either on the console or through ssh/telnet?

    How is it configured in your network? Did you check from the network side?

    Thanks,
    Mandar

  • Hi

    Here's the output from had -version:

    Engine Version=5.0
    PSTAMP: Veritas-5.0MP3-07/16/08-02:01:00


    And the output from rpm -aq | grep VRTSvcs:

    VRTSvcs-5.0.30.00-MP3_GENERIC
    VRTSvcsvr-5.0.30.00-MP3_GENERIC
    VRTSvcsag-5.0.30.00-MP3_RHEL5
    VRTSvcsor-5.0.30.00-MP3_RHEL5
    VRTSvcs-5.0.30.00-MP3_RHEL5
    VRTSvcsdr-5.0.30.00-MP3_RHEL5
    VRTSvcsmn-5.0.30.00-MP3_GENERIC


    I have installed RP3 as well.

    There do seem to be a lot of LLT errors in the messages logs on the servers that drop off the network and the console.

    I've also checked with our network guys, and there don't seem to be any issues with the switch or the network, so I figure it must be down to the software, or I've not installed something correctly.




    Thanks

    sparmar
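
One way to see whether those LLT errors point at a particular heartbeat link is to count them per interface. This is only a sketch: the `llt_events_per_link` name is made up here, and it assumes the LLT lines in the log contain the text "LLT" and name the interface in parentheses, such as (eth1) or (eth3). Adjust the patterns if your log lines look different.

```shell
# Count LLT log lines per heartbeat interface, busiest interface first.
# Hypothetical helper; assumes LLT lines mention the NIC as "(ethN)".
llt_events_per_link() {
    logfile="${1:-/var/log/messages}"
    grep 'LLT' "$logfile" \
        | grep -o '(eth[0-9][0-9]*)' \
        | tr -d '()' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{print $2, $1}'    # print as "interface count"
}
```

Running `llt_events_per_link /var/log/messages` on each node prints one line per interface with its count. If one heartbeat interface has far more LLT messages than the other, that link (NIC, cable or switch port) is worth a closer look.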







  • Just to let you all know, the issue was a faulty network card for one of the heartbeats, which has now been replaced.
    There was also an issue with a PCI card that connects via a fibre cable as a media server for NetBackup, which seemed to hang the servers on reboot. (It keeps scanning down the lpfc.)

    So, it looks as if it was hardware related.

    Many thanks for all the input in helping me get to some resolution.


    Sparmar
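
For anyone chasing a similar heartbeat problem, the kernel's per-interface error counters are a quick first check for a failing card. The sketch below reads them from sysfs; the `nic_error_counters` name and the choice of counters are assumptions for illustration, and the second argument only exists so the function can be pointed at a different directory.

```shell
# Print selected error counters for one network interface from sysfs.
# Hypothetical helper; counters missing on a given kernel are skipped.
nic_error_counters() {
    iface="$1"
    base="${2:-/sys/class/net}"
    for counter in rx_errors tx_errors rx_crc_errors rx_dropped tx_dropped; do
        file="$base/$iface/statistics/$counter"
        if [ -r "$file" ]; then
            printf '%s %s %s\n' "$iface" "$counter" "$(cat "$file")"
        fi
    done
}
```

Calling `nic_error_counters eth1` and `nic_error_counters eth3` on each node, then again a few minutes later, shows whether the error counts on one heartbeat interface are climbing while the other stays flat.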