01-29-2009 03:18 PM
Operating System - Solaris 9
Storage Foundation HA for Oracle - 5.0 MP1 RP5
Hardware - Sun V890
NIC - 2 X Quad CE Network Interfaces
I'm testing a VCS installation, and I need some expert input to identify why the IP failed to come up on the standby interface "ce4" in my IPMP group after a reboot of the switch connected to the primary interface.
I have included a diagram of how my servers are physically connected to the network infrastructure.
http://www.scribd.com/doc/11501852/Veritas-Cluster-IPMP
Problem Description:
/var/adm/messages File
Jan 24 23:12:17 prodnode-n1 genunix: [ID 408789 kern.warning] WARNING: ce0: fault detected external to device; service degraded
Jan 24 23:12:17 prodnode-n1 genunix: [ID 451854 kern.warning] WARNING: ce0: xcvr addr:0x01 - link down
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 215189 daemon.error] The link has gone down on ce0
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 594170 daemon.error] NIC failure detected on ce0 of group oracle_NICB
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 832587 daemon.error] Successfully failed over from NIC ce0 to NIC ce4
Jan 24 23:12:19 prodnode-n1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (ce0) node 1 in trouble
Jan 24 23:12:26 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 8 sec (4226787)
Jan 24 23:12:27 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 9 sec (4226787)
Jan 24 23:12:28 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 10 sec (4226787)
Jan 24 23:12:29 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 11 sec (4226787)
Jan 24 23:12:30 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 12 sec (4226787)
Jan 24 23:12:31 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 13 sec (4226787)
Jan 24 23:12:32 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 14 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 15 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 2 (ce0) node 1 expired
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 Thread(3) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 (prodnode-n1) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-5005 IPMultiNICB:oracle_IPB:online:Error in configuring IP address
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13071 Thread(3) Resource(oracle_IPB): reached OnlineRetryLimit(0).
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10303 Resource oracle_IPB (Owner: unknown, Group: listener_SG) is FAULTED (timed out) on sys prodnode-n1
Host Networking Files:
(prodnode-n1)# cat hostname.ce0
prodnode-n1-ce0 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover up \
addif prodnode-n1 netmask 255.255.255.0 broadcast + failover up
(prodnode-n1)# cat hostname.ce1
prodnode-n1-tsm
(prodnode-n1)# cat hostname.ce4
prodnode-n1-ce4 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover standby up
# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 20.46.222.219 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:43:f1:58
ce0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.221 netmask ffffff00 broadcast 20.46.222.255
ce0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.225 netmask ffffff00 broadcast 20.46.222.255
ce1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 20.46.214.37 netmask ffffff00 broadcast 20.46.214.255
ether 0:14:4f:43:f1:59
ce4: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 4
inet 20.46.222.220 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:44:9:b4
(prodnode-n1)# cat /etc/hosts
# Internet host table
#
127.0.0.1 localhost
20.46.222.221 prodnode-n1 prodnode-n1.mycompany.com loghost
#
20.46.222.1 defaultrouter
#
20.46.222.225 prodnode11
#
20.46.222.219 prodnode-n1-ce0
20.46.222.220 prodnode-n1-ce4
#
20.46.214.37 prodnode-n1-tsm
#
Main.cf Entries:
IPMultiNICB oracle_IPB (
BaseResName = oracle_NICB
Address = "20.46.222.225"
NetMask = "255.255.255.0"
)
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
DefaultRouter = "20.46.222.1"
Failback = 1
)
01-30-2009 04:05 AM
Hi,
Based on the VCS Bundled Agents Reference Guide, it seems DefaultRouter and Failback are optional attributes only for MultiNICB base mode, not for the multipathing mode you are using (UseMpathd = 1).
ftp://exftpp.symantec.com/pub/support/products/ClusterServer_UNIX/283871.pdf
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
DefaultRouter = "20.46.222.1"
Failback = 1
)
In multipathing mode, failback is controlled entirely by in.mpathd; see the default "FAILBACK=yes" setting in the /etc/default/mpathd file. The following MultiNICB configuration should be enough:
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
)
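For reference, the stock /etc/default/mpathd on Solaris 9 looks roughly like this (FAILURE_DETECTION_TIME is in milliseconds):
(prodnode-n1)# cat /etc/default/mpathd
# Time taken by mpathd to detect NIC failure in ms.
FAILURE_DETECTION_TIME=10000
# Failback is enabled by default. To disable failback turn off this option
FAILBACK=yes
# By default only interfaces configured as part of multipathing groups are tracked.
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes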
You can easily check whether the error is VCS- or mpathd-related by stopping VCS with "hastop -all -force" and then watching mpathd's behavior when you unplug the primary ce0 interface. The virtual IP failover should be handled entirely by mpathd.
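For example (a rough sketch; if_mpadm can offline the NIC in software if unplugging the cable isn't practical):
(prodnode-n1)# hastop -all -force   # stop VCS on all nodes, leave resources as they are
(prodnode-n1)# if_mpadm -d ce0      # offline ce0 (or physically unplug it)
(prodnode-n1)# ifconfig -a          # the virtual IP should now be plumbed on ce4
(prodnode-n1)# if_mpadm -r ce0      # reattach ce0 when done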
Regards
Manuel
01-30-2009 06:51 AM
Manuel,
Not sure if this is an issue. The IPMP group name is defined as PROD in the /etc/hostname.ce* files, but it shows up as the MultiNICB resource name oracle_NICB in the "ifconfig -a" output (see the hostname.ce* and ifconfig -a listings above).
I'll test out your earlier recommendations on my lab servers.
Thanks
02-02-2009 02:47 AM
Hi,
>The IPMP group name is defined as PROD in the /etc/hostname.ce* files but it is showing up as the MultiNICB name oracle_NICB in the "ifconfig -a" output.
This works as designed when the optional GroupName attribute is not used. Just add the following to your MultiNICB resource oracle_NICB to get rid of the inconsistent naming:
GroupName = "PROD"
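If the cluster is already running, the change can be made online (a sketch, assuming you want it written back to the config):
(prodnode-n1)# haconf -makerw
(prodnode-n1)# hares -modify oracle_NICB GroupName PROD
(prodnode-n1)# haconf -dump -makero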
> I'll test out your earlier recommendations on my lab servers.
Ok.
Regards
Manuel
02-04-2009 11:17 AM
Here is my main.cf for IPMP with MultiNICB and IPMultiNICB. I have my hostname.* files set up the same way as you, with IPMP in active/passive. Your group name in the ifconfig output doesn't match your config files; it looks like someone set up IPMP manually and used the wrong group name. You will need to fix that, then configure your resources as below.
Just a word of warning about IPMP and VCS, something that burned me. By default, IPMP probes the gateway address with a failure detection time of 10 seconds. If you are running Cisco HSRP to protect your gateway, the default HSRP hold time is also 10 seconds. In the event of a router failure, IPMP will declare a group failure because it can't probe your gateway, your MultiNICB will fault on both nodes, and your cluster will be toast.
To get around this you can change your HSRP hold timer to less than 10 seconds, raise your IPMP failure detection time above 10 seconds, or add host routes to the HSRP addresses of the switches as well as the default gateway. To find your HSRP addresses:
[root@bub log]# snoop -d dmfe0 udp from port 1985 and to port 1985
Set your host routes:
[root@bub /]# route add -host 172.18.0.2 172.18.0.2 -static
[root@bub /]# route add -host 172.18.0.3 172.18.0.3 -static
You will need an rc script to make it permanent.
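Something along these lines (a sketch using the example HSRP addresses above; adjust the IPs and script name to your network):
[root@bub /]# cat /etc/rc2.d/S99hsrproutes
#!/sbin/sh
# Pin static host routes to the HSRP physical router addresses so IPMP
# probes still have live targets if one router dies (example IPs).
/usr/sbin/route add -host 172.18.0.2 172.18.0.2 -static
/usr/sbin/route add -host 172.18.0.3 172.18.0.3 -static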
Main.cf configuration
-----------------------------
[root@vcsbox config]# perl -00 -wnle 'print if /IPMultiNICB|MultiNICB/' main.cf
IPMultiNICB pmip (
BaseResName = pmnic
Address = "1.2.3.4"
NetMask = "255.255.254.0"
)
MultiNICB pmnic (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device @vcsbox = { bge0 = 0, ce0 = 1 }
Device @vcsbox2 = { bge0 = 0, ce0 = 1 }
GroupName = prod_ipmp
)
pmip requires pmnic
02-06-2009 07:17 AM
The issue I encountered was caused by the RestartLimit of the MultiNICB resource being set to 0 (the default).
There is a warning note on page 48 of the Bundled Agents Reference Guide that applies to my case. I do not understand why an important setting like this defaults to 0.
ftp://exftpp.symantec.com/pub/support/products/ClusterServer_UNIX/283871.pdf
Note: On Solaris systems, Symantec recommends that you set the RestartLimit for IPMultiNIC resources to a greater-than-zero value. This helps to prevent the spurious faulting of IPMultiNIC resources during local failovers of MultiNICA. A local failover is an interface-to-interface failover of MultiNICA. See the VCS User’s Guide for more information.
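For anyone hitting the same thing, raising the limit is quick (a sketch; RestartLimit is a static attribute, so it has to be overridden per resource, the choice of oracle_IPB here matches the resource that faulted in my logs, and the value 2 is just an example):
(prodnode-n1)# haconf -makerw
(prodnode-n1)# hares -override oracle_IPB RestartLimit   # expose the static attribute on this resource
(prodnode-n1)# hares -modify oracle_IPB RestartLimit 2   # any value > 0 per the note above
(prodnode-n1)# haconf -dump -makero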