Solved: Cluster resources going down

Sridhar_sri · ‎03-29-2010

Hi Gurus,
Following is my main.cf file:

include "types.cf"
include "prodTypes.cf"

cluster prod_SAN (
   UserNames = { administrator = eLIfIHhIIiIK }
   ClusterAddress = "10.77.210.17"
   Administrators = { administrator }
   )

system lca-t2000 (
   )

system prod-t1000-5 (
   )

group App_Service (
   SystemList = { lca-t2000 = 0, prod-t1000-5 = 1 }
   AutoStartList = { lca-t2000, prod-t1000-5 }
   )

   DiskGroup datadg_san (
        DiskGroup = datadg
        )

   IP App_IP (
        Device @lca-t2000 = e1000g0
        Device @prod-t1000-5 = bge0
        Address = "10.77.210.7"
        NetMask = "255.255.255.192"
        )

   prod prodAgent (
        Critical = 0
        )

   Mount cscopx_Mount (
        MountPoint = "/data/prod40_SAN/"
        BlockDevice = "/dev/vx/dsk/datadg/cscopx"
        FSType = vxfs
        FsckOpt = "-y"
        )

   Mount varcscopx_Mount (
        MountPoint = "/var/adm/CSCOpx"
        BlockDevice = "/dev/vx/dsk/datadg/varcscopx"
        FSType = vxfs
        FsckOpt = "-y"
        )

   Proxy NIC_proxy (
        TargetResName = csgnic
        )

   Volume cscopx (
        Volume = cscopx
        DiskGroup = datadg
        )

   Volume varcscopx (
        Volume = varcscopx
        DiskGroup = datadg
        )

   App_IP requires NIC_proxy
   prodAgent requires App_IP
   prodAgent requires cscopx_Mount
   prodAgent requires varcscopx_Mount
   cscopx requires datadg_san
   cscopx_Mount requires cscopx
   varcscopx requires datadg_san
   varcscopx_Mount requires varcscopx

   // resource dependency tree
   //
   //   group App_Service
   //   {
   //   prod prodAgent
   //        {
   //        IP App_IP
   //            {
   //            Proxy NIC_proxy
   //            }
   //        Mount cscopx_Mount
   //            {
   //            Volume cscopx
   //                {
   //                DiskGroup datadg_san
   //                }
   //            }
   //        Mount varcscopx_Mount
   //            {
   //            Volume varcscopx
   //                {
   //                DiskGroup datadg_san
   //                }
   //            }
   //        }
   //   }

group ClusterService (
   SystemList = { lca-t2000 = 0, prod-t1000-5 = 1 }
   AutoStartList = { lca-t2000, prod-t1000-5 }
   OnlineRetryInterval = 120
   )

   IP webip (
        Device @lca-t2000 = e1000g0
        Device @prod-t1000-5 = bge0
        Address = "10.77.210.7"
        NetMask = "255.255.255.192"
        )

   NIC csgnic (
        Device @lca-t2000 = e1000g0
        Device @prod-t1000-5 = bge0
        )

   NotifierMngr ntfr (
        SmtpServer = "email.cn.com"
        SmtpRecipients = { "mail@cisco.com" = Information }
        )

   VRTSWebApp VCSweb (
        Critical = 0
        AppName = cmc
        InstallDir = "/opt/VRTSweb/VERITAS"
        TimeForOnline = 5
        RestartLimit = 3
        )

   VCSweb requires webip
   ntfr requires csgnic
   webip requires csgnic

   // resource dependency tree
   //
   //   group ClusterService
   //   {
   //   VRTSWebApp VCSweb
   //        {
   //        IP webip
   //            {
   //            NIC csgnic
   //            }
   //        }
   //   NotifierMngr ntfr
   //        {
   //        NIC csgnic
   //        }
   //   }

After configuring my cluster, all resources are up and running. Here I have proxied the NIC resource. After running for a few hours, I could see there was some problem with the NIC, and my resources went down (the parent resource, prodAgent, is called for offline). When I went through various sites, I found this link, http://seer.entsupport.symantec.com/docs/324321.htm, where a possible cause of the error is described. But as far as I can see, I have configured everything properly.

From engine_A.log:

2010/03/25 12:24:37 VCS WARNING V-16-10001-7505 (lms-t1000-5) NIC:csgnic:monitor:(28105369) is less than or equal to (28105369): Resource is offline
2010/03/25 12:24:38 VCS INFO V-16-1-10307 Resource csgnic (Owner: unknown, Group: ClusterService) is offline on lms-t1000-5 (Not initiated by VCS)
2010/03/25 12:24:38 VCS NOTICE V-16-1-10300 Initiating Offline of Resource ntfr (Owner: unknown, Group: ClusterService) on System lms-t1000-5
2010/03/25 12:24:38 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VCSweb (Owner: unknown, Group: ClusterService) on System lms-t1000-5
2010/03/25 12:24:50 VCS INFO V-16-6-15004 (lms-t1000-5) hatrigger:Failed to send trigger for resfault; script doesn't exist

Can anyone help me with this?

With regards,
Sri

g_lee · ‎03-29-2010

Sridhar_sri,

The error in the engine log indicates the NIC went offline, i.e. this isn't the config error described in the technote you linked in your post.

2010/03/25 12:24:37 VCS WARNING V-16-10001-7505 (lms-t1000-5) NIC:csgnic:monitor:(28105369) is less than or equal to (28105369): Resource is offline
2010/03/25 12:24:38 VCS INFO V-16-1-10307 Resource csgnic (Owner: unknown, Group: ClusterService) is offline on lms-t1000-5 (Not initiated by VCS)

Do you see any errors in /var/adm/messages on prod-t1000-5 for the NIC (bge0)?
(assuming lms-t1000-5 is an alias for prod-t1000-5 - the names should match in the config though so you should look at sorting this out as well!)

Also, regarding the configuration: although what you have will work, it still doesn't match the configuration described in the technote, as you don't have a separate parallel group for the NIC resource + Phantom - you are currently using the NIC in the (failover) ClusterService service group.
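
As a rough illustration of the "separate parallel group for the NIC + Phantom" layout, the NIC could move into its own group along these lines. This is only a sketch: the group name NIC_grp and the resource name nic_phantom are made up for the example, and the system and device names are copied from the posted main.cf. Because resource dependencies cannot cross groups, the resources in ClusterService that currently require csgnic (webip, ntfr) would then have to depend on a Proxy resource instead, the way App_IP already does in App_Service.

```
// Hypothetical parallel group: runs on both systems at once
group NIC_grp (
   SystemList = { lca-t2000 = 0, prod-t1000-5 = 1 }
   AutoStartList = { lca-t2000, prod-t1000-5 }
   Parallel = 1
   )

   // The single real NIC resource; other groups point at it with Proxy
   NIC csgnic (
        Device @lca-t2000 = e1000g0
        Device @prod-t1000-5 = bge0
        )

   // Phantom lets a group holding only persistent resources report a state
   Phantom nic_phantom (
        )
```

With this arrangement the interface is monitored once per system, and each failover group only mirrors that state through its own Proxy.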

regards,
g_lee

Sridhar_sri · ‎03-29-2010

Hi Lee,
As you assumed, prod-t1000-5 is lms-t1000-5. Before sharing the file in this forum I manually edited a few names, just for security purposes. But I didn't look into the server's log messages for the time when the NIC resource went down.

So, as you say, what I have will work; then why has the NIC resource gone down? What could be the probable cause? I read on a web page that this can sometimes happen when the server is too busy to respond, so that the monitor fails within its interval and a failover may happen. Could that be the cause?

Regards,
Sri

Marianne · ‎03-30-2010

Try adding the optional attribute NetworkHosts to the NIC resource. Add the IP address of the default router.
Extract from Bundled Agent Guide:
List of hosts on the network that are pinged to determine if the network connection is alive. Enter the IP address of the host, instead of the host name, to prevent the monitor from timing out. DNS causes the ping to hang. If more than one network host is listed, the monitor returns ONLINE if at least one of the hosts is alive.
If you do not specify network hosts, the monitor tests the NIC by sending pings to the broadcast address on the NIC.
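
For illustration, the NIC resource from the posted main.cf with NetworkHosts added might look like the fragment below. The address 10.77.210.1 is only a placeholder that happens to fall in the same subnet as the posted IP and netmask; it is an assumption, not the real default router, so substitute your own.

```
   NIC csgnic (
        Device @lca-t2000 = e1000g0
        Device @prod-t1000-5 = bge0
        // Placeholder address: replace with the default router's IP
        NetworkHosts = { "10.77.210.1" }
        )
```

Listing a second reachable host as well should make the check less dependent on a single device, since the monitor returns ONLINE if at least one listed host answers.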

You did not mention your VCS and O/S version? We recently experienced a similar NIC problem (ifconfig would hang straight after the virtual IP was added to the NIC). It was fixed by installing latest Solaris 10 patches and drivers.

g_lee · ‎03-30-2010

Sri,

Re: why the NIC went down, this is why I said to check /var/adm/messages for any errors at the time. It was not clear whether you'd already checked this, and you still don't appear to have looked at it to see if there were any corresponding OS-level NIC issues at the same time.

> So as you say , what you have will work, then why the NIC resource has gone
> down ? what could be the probable cause ? . I read in a webpage as this can
> happen some time when server is very busy in responding . so that the monitor
> interval fails and hence failover may happen ? will that be a cause ?

I said the configuration would work in the sense that you don't have multiple NIC resources configured for the same interface, i.e. you're not using the same interface in multiple/duplicate resources (so at least it's not a duplicate monitoring issue). If there is a problem with the physical NIC, though, it will still fail, which is why you were asked to check the system messages.

It is also possible that it's a monitoring/server-busy issue. However, you haven't provided enough information or investigated any other possibilities to be able to confirm this is the cause (i.e. you haven't ruled out anything else).

This is why the suggestion was made to check the messages file. Marianne's suggestions are also worth looking at if you continue seeing the issue (i.e. setting NetworkHosts, and looking at patching the OS/VCS if you're not on the latest/recent patches).

g_lee

Sridhar_sri · ‎03-30-2010

The VCS version that I am using is VCS 5.0 HA. Following are the Solaris server details:

bash-3.00# cat /etc/release
                       Solaris 10 8/07 s10s_u4wos_12b SPARC
           Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
                            Assembled 16 August 2007

Marianne · ‎03-31-2010

Hopefully you have installed VCS and Solaris patches since those base installs?
16 August 2007 is an awfully long time ago - Sun releases patch updates more or less once a month...
See https://vos.symantec.com/patch/matrix for SF/HA patch updates.

Have you had a look at /var/adm/messages for NIC errors as g_lee suggested?

vcs_man · ‎03-31-2010

Hi Sridhar,

As mentioned earlier by Marianne, adding the NetworkHosts attribute could be a proper solution for your issue.

Since you are getting the following error...
2010/03/25 12:24:37 VCS WARNING V-16-10001-7505 (lms-t1000-5) NIC:csgnic:monitor:(28105369) is less than or equal to (28105369): Resource is offline
2010/03/25 12:24:38 VCS INFO V-16-1-10307 Resource csgnic (Owner: unknown, Group: ClusterService) is offline on lms-t1000-5 (Not initiated by VCS)

This means the packet count did not increase between two monitor runs of the NIC resource: the current count was less than or equal to the previous one. Also, please check whether there are any packet drops in your network.

Here is an explanation of how VCS monitors a NIC resource.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When VCS starts, the NIC agent runs its monitor against the network interface (say, eri0).

The monitor runs "netstat -in" and looks for the output relevant to that interface (e.g. eri0). The Ipkts field is read and stored in a file. Subsequently, each time the monitor runs, VCS:

- pings the broadcast address of the network (and/or any NetworkHosts configured in the main.cf file)

- runs netstat -in

- checks the current Ipkts against the Ipkts count from the previous run (which was stored in a file)

- If the current value is greater than the previous value, we are receiving packets on this interface, so we are connected to a good network and the NIC resource is online. If it is less than or equal to the previous value, the monitor reports the resource as offline, which is the V-16-10001-7505 message in your log.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Thanks,
Mandar

Sridhar_sri · ‎04-02-2010

Thanks Mandar & Marianne for your brief explanation. I have now increased the timeout value for my resources from 60 seconds (the default) to 180 seconds. I haven't seen any such issues for the past 36 hours.

I will also enable the NetworkHosts attribute and observe the status.

Regards,
Sri
