Forum Discussion

IT-SYSMIKE
11 years ago

V-35-410: Cluster Server not running on local node on Solaris

Hello,

We had a hardware failure, and after restarting the server we could not reach our mount points. We tried to start the cluster with hastart, but nothing was started, and we keep getting the error in the title above.

Kindly assist in resolving this issue.

 

 

  • Hello

    It looks like your cluster is set to use I/O fencing, but fencing is not configured correctly.

    Refer to the logs below:

    2013/11/28 11:11:46 VCS NOTICE V-16-1-52006 UseFence=SCSI3. Fencing is enabled
    2013/11/28 11:11:46 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
    2013/11/28 11:12:01 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...

    2013/11/28 11:12:16 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying..

     

    Check whether your main.cf contains the line below:

    UseFence = SCSI3

    # cat /etc/VRTSvcs/conf/config/main.cf |grep -i usefence

     

    If that line exists in main.cf, the cluster is intended to use fencing, and fencing is not configured correctly.
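As a rough sketch of the same check in script form, the function below prints the UseFence value from a main.cf-style file, or nothing if the attribute is absent. The function name usefence_value is hypothetical and not part of VCS. It assumes a POSIX sed; on Solaris 10 the /usr/xpg4/bin version is probably the safer choice.

```shell
# Sketch only: usefence_value is a hypothetical helper, not a VCS command.
# $1: path to a main.cf-style file
#     (on a cluster node: /etc/VRTSvcs/conf/config/main.cf)
# Prints the value assigned to UseFence, or nothing if it is not set.
usefence_value() {
    sed -n 's/^[[:space:]]*UseFence[[:space:]]*=[[:space:]]*\([A-Za-z0-9]*\).*/\1/p' "$1"
}
```

Running usefence_value /etc/VRTSvcs/conf/config/main.cf on the affected node should print SCSI3 if the cluster expects SCSI-3 fencing.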

    I/O fencing protects data in cluster split-brain situations.

     

    Refer to the VCS admin guide and see the article on how to configure I/O fencing. If you do not intend to use I/O fencing (which is not recommended), you can stop the cluster, remove the entry from main.cf, and start the cluster again.

     

    Link to documentation

     

    https://sort.symantec.com/documents

    I/O fencing link for VCS 5.1 on Solaris

    https://sort.symantec.com/public/documents/sf/5.1/solaris/html/vcs_admin/ch_admin_fencing.html#760094

     

    G

10 Replies

  • Hi,

    Is this a CFS server? And what exactly was the hardware failure? A local hard disk failure could lead to data loss that affects the VCS configuration.

    Please paste the contents of the files below:

    /etc/VRTSvcs/conf/sysname

    /etc/llttab

    And the output of the commands below:

    lltstat -nvv active

    gabconfig -a

  • Hello stinsong,

    Thanks for your response.

    The hardware failure caused a shared mount point to become unavailable. Yes, it is a CFS server. Currently we can see the filesystem on one node, but it is not coming up on the other node. When we try to start it, we get a new error:

    VCS ERROR V-16-1-10600 Cannot connect to VCS engine.

     
  • Hello,

    "Cannot connect to VCS engine" means your "had" process has not started or is not running.

    For VCS to run, you need to ensure that components like LLT, GAB and fencing (if configured) are running. Please paste the output of:

    # lltconfig

    # lltstat -vvn | head -10

    # gabconfig -a

    # modinfo | egrep 'gab|llt|vxfen'

    # had -version

    # uname -a

     

    When you say that nothing was started: assuming it's a Unix system, are your rc scripts all OK? i.e.

    /etc/rc2.d/S70llt

    /etc/rc2.d/S92gab

    /etc/rc3.d/S99vcs

    If the services are configured under SMF, are the SMF services in the online state?

     

    G

  • Hello,

    These are the results:

    root@ap1.gf.net # lltconfig
    LLT is running
    root@ap1.gf.net # lltstat -vvn | head -10
    LLT node information:
        Node                 State    Link  Status  Address
       * 0 ap1          OPEN    
                                      igb2   UP      00:21:28:BB:40:3C
                                      igb3   UP      00:21:28:BB:40:3D
         1 ap2          OPEN    
                                      igb2   UP      00:21:28:BB:0F:04
                                      igb3   UP      00:21:28:BB:0F:05
         2                   CONNWAIT
                                      igb2   DOWN    
     
    root@ap1.gf.net # gabconfig -a
    GAB Port Memberships
    ===============================================================
    Port a gen   68f501 membership 01
    Port d gen   68f506 membership 01
     
    root@ap1.gf.net # df -ah
    Filesystem             size   used  avail capacity  Mounted on
    rpool/ROOT/s10s_u9wos_14a
                           274G    26G   238G    10%    /
    /devices                 0K     0K     0K     0%    /devices
    ctfs                     0K     0K     0K     0%    /system/contract
    proc                     0K     0K     0K     0%    /proc
    mnttab                   0K     0K     0K     0%    /etc/mnttab
    swap                    53G   504K    53G     1%    /etc/svc/volatile
    objfs                    0K     0K     0K     0%    /system/object
    sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
    /platform/sun4v/lib/libc_psr/libc_psr_hwcap2.so.1
                           264G    26G   238G    10%    /platform/sun4v/lib/libc_psr.so.1
    /platform/sun4v/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                           264G    26G   238G    10%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
    fd                       0K     0K     0K     0%    /dev/fd
    swap                    53G    72K    53G     1%    /tmp
    swap                    53G    72K    53G     1%    /var/run
    swap                    53G     0K    53G     0%    /dev/vx/dmp
    swap                    53G     0K    53G     0%    /dev/vx/rdmp
    applprod1              150G    40G   109G    27%    /applprod1
    applprod2               98G    39G    59G    40%    /applprod2
    rpool/export           274G    23K   238G     1%    /export
    rpool/export/home      274G   3.6G   238G     2%    /export/home
    rpool                  274G    97K   238G     1%    /rpool
    -hosts                   0K     0K     0K     0%    /net
    auto_home                0K     0K     0K     0%    /home
    ap1.gf.net:vold(pid2375)
                             0K     0K     0K     0%    /vol
    /dev/odm                 0K     0K     0K     0%    /dev/odm
    root@ap1.gf.net # cfsmount all
      Error: V-35-410: Cluster Server not running on local node: to
     
    root@ap1.gf.net # modinfo | egrep 'gab|llt|vxfen'
    234 7aaea000  2cf88 331   1  llt (LLT 5.1SP1)
    235 7ab0e000  5a338 332   1  gab (GAB device 5.1SP1)
    236 7ab4c000  6a0c8 333   1  vxfen (VRTS Fence 5.1SP1)
     
    root@ap1.gf.net # had -version
    Engine Version    5.1
    Join Version      5.1.10.0
    Build Date        Fri Oct 01 07:30:00 2010
    PSTAMP            5.1.100.000-5.1SP1-2010-09-30_23.30.00
     
    root@ap1.gf.net # uname -a
    SunOS ap1.gf.net 5.10 Generic_147440-19 sun4v sparc sun4v
     
    The scripts below do not exist on our server:

    /etc/rc2.d/S70llt

    /etc/rc2.d/S92gab

    /etc/rc3.d/S99vcs

     

     

     

  • Well 'had' is not running for some reason, but most everything else seems to be...
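To make that reading of the gabconfig output explicit, here is a small sketch that scans saved `gabconfig -a` output for port memberships. In GAB, port a is GAB itself, port b is I/O fencing, and port h is the VCS engine (had). The function names gab_port_present and gab_report are made up for illustration, and the awk call assumes a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris 10, since the legacy /usr/bin/awk lacks -v).

```shell
# Sketch only: hypothetical helpers, not part of VCS or GAB.

# gab_port_present PORT
# Reads `gabconfig -a` style output on stdin and succeeds (exit 0) if a
# "Port <PORT> gen <gen> membership <nodes>" line is present.
gab_port_present() {
    awk -v p="$1" '$1 == "Port" && $2 == p && $5 == "membership" { found = 1 }
                   END { exit !found }'
}

# gab_report FILE
# FILE holds saved `gabconfig -a` output. Reports on the three ports that
# matter for this thread: a (GAB), b (I/O fencing), h (VCS engine, had).
gab_report() {
    for port in a b h; do
        if gab_port_present "$port" < "$1"; then
            echo "port $port: member"
        else
            echo "port $port: missing"
        fi
    done
}
```

Fed the gabconfig output pasted above (ports a and d only), this would report ports b and h as missing, which matches fencing not being configured and 'had' not running.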

    You may not see those rc scripts because llt, gab, and vcs may be under Solaris SMF control on your system. Check your SMF configuration and see when 'had' (vcs) should have been started.

    What run-level is your system in?

    # who -r 

    You may be in a run-level where SMF is not configured to run VCS ('had'), in which case you would get an error like 'Cluster Server not running on local node'.

    Either manually start VCS (via 'hastart') or transition your host to the appropriate run-level. 

    -HTH

     

  • Hello,

    It is at run-level 3.

     

    root@ap1.gf.net # who -r 
       .       run-level 3  Nov 27 15:52     3      0  3

     

  • Have you tried to 'hastart' it yet?

    Make sure to report back any error messages that go into the VCS message log (/opt/VRTSvcs/log/engine_A.log) and the Solaris messages log (/var/adm/messages) after you run 'hastart'.

    Do you have a valid VCS license? If it has expired, then you will get a message in those logs.

    Run 'vxlicrep -s' and provide the output, as well as the relevant output from the message files mentioned above.
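When the engine log is long, a quick summary can help pick out what to post. The sketch below counts CRITICAL and ERROR entries per message ID, assuming the line layout seen earlier in this thread (date, time, "VCS", severity, message ID, text). The function name vcs_log_summary is hypothetical, and a POSIX awk is assumed.

```shell
# Sketch only: vcs_log_summary is a hypothetical helper, not a VCS command.
# Reads engine_A.log style lines on stdin, e.g.
#   2013/11/28 11:11:46 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
# and prints "<count> <severity> <message-id>" for each CRITICAL or ERROR
# message ID, most frequent first.
vcs_log_summary() {
    awk '$3 == "VCS" && ($4 == "CRITICAL" || $4 == "ERROR") { n[$4 " " $5]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}
```

For example: vcs_log_summary < /opt/VRTSvcs/log/engine_A.log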

    -kjb

     

    Also, make sure that "had -version" is the same on both nodes. You have only pasted output from one node above, so we can't confirm.

    And as suggested above, try hastart and let us know the output from engine_A.log.

     

    G

  • Please check your SMF services for any issues.

    # svcs -a | egrep 'vxfen|vcs|llt|gab'

    online         Oct_17   svc:/system/vxfen:default
    online         Oct_17   svc:/system/llt:default
    online         Oct_17   svc:/system/gab:default
    online         Oct_17   svc:/system/vcs:default
     
    If any service is not online, check the reason why:
    #svcs -xv vxfen
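A minimal sketch of that check as a filter: it reads `svcs -a` style output and prints the FMRI of any llt, gab, vxfen or vcs service whose state is not "online". The function name smf_not_online is hypothetical, the three-column layout (STATE, STIME, FMRI) is taken from the listing above, and a POSIX awk is assumed.

```shell
# Sketch only: smf_not_online is a hypothetical helper, not an SMF command.
# Reads `svcs -a` style output on stdin (columns: STATE STIME FMRI) and
# prints the FMRI of each llt/gab/vxfen/vcs service that is not online.
smf_not_online() {
    awk '$3 ~ /^svc:\/system\/(llt|gab|vxfen|vcs):/ && $1 != "online" { print $3 }'
}
```

For example: svcs -a | smf_not_online. Each FMRI it prints can then be passed to svcs -xv to see why the service is not online.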
     
    You can also check the SMF logs for more details.
     
    # more /var/svc/log/system-vxfen\:default.log
     
    If fencing is not coming up, you can configure it with the vxfenconfig command.
     
    # vxfenadm -d
    # vxfenconfig
     
    Thanks,
    Venkat