01-29-2009 03:18 PM
Operating System - Solaris 9
Storage Foundation HA for Oracle - 5.0 MP1 RP5
Hardware - Sun V890
NIC - 2 X Quad CE Network Interfaces
I'm testing a VCS installation, and I need some expert input to identify why the IP failed to come up on the standby interface "ce4" in my IPMP group after a reboot of the switch connected to the primary interface.
I have included a diagram of how my servers are physically connected to the network infrastructure.
http://www.scribd.com/doc/11501852/Veritas-Cluster-IPMP
Problem Description:
/var/adm/messages File
Jan 24 23:12:17 prodnode-n1 genunix: [ID 408789 kern.warning] WARNING: ce0: fault detected external to device; service degraded
Jan 24 23:12:17 prodnode-n1 genunix: [ID 451854 kern.warning] WARNING: ce0: xcvr addr:0x01 - link down
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 215189 daemon.error] The link has gone down on ce0
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 594170 daemon.error] NIC failure detected on ce0 of group oracle_NICB
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 832587 daemon.error] Successfully failed over from NIC ce0 to NIC ce4
Jan 24 23:12:19 prodnode-n1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (ce0) node 1 in trouble
Jan 24 23:12:26 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 8 sec (4226787)
Jan 24 23:12:27 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 9 sec (4226787)
Jan 24 23:12:28 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 10 sec (4226787)
Jan 24 23:12:29 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 11 sec (4226787)
Jan 24 23:12:30 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 12 sec (4226787)
Jan 24 23:12:31 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 13 sec (4226787)
Jan 24 23:12:32 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 14 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 15 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 2 (ce0) node 1 expired
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 Thread(3) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 (prodnode-n1) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-5005 IPMultiNICB:oracle_IPB:online:Error in configuring IP address
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13071 Thread(3) Resource(oracle_IPB): reached OnlineRetryLimit(0).
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10303 Resource oracle_IPB (Owner: unknown, Group: listener_SG) is FAULTED (timed out) on sys prodnode-n1
Host Networking Files:
(prodnode-n1)# cat hostname.ce0
prodnode-n1-ce0 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover up \
addif prodnode-n1 netmask 255.255.255.0 broadcast + failover up
(prodnode-n1)# cat hostname.ce1
prodnode-n1-tsm
(prodnode-n1)# cat hostname.ce4
prodnode-n1-ce4 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover standby up
# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 20.46.222.219 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:43:f1:58
ce0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.221 netmask ffffff00 broadcast 20.46.222.255
ce0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.225 netmask ffffff00 broadcast 20.46.222.255
ce1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 20.46.214.37 netmask ffffff00 broadcast 20.46.214.255
ether 0:14:4f:43:f1:59
ce4: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 4
inet 20.46.222.220 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:44:9:b4
(prodnode-n1)# cat /etc/hosts
# Internet host table
#
127.0.0.1 localhost
20.46.222.221 prodnode-n1 prodnode-n1.mycompany.com loghost
#
20.46.222.1 defaultrouter
#
20.46.222.225 prodnode11
#
20.46.222.219 prodnode-n1-ce0
20.46.222.220 prodnode-n1-ce4
#
20.46.214.37 prodnode-n1-tsm
#
Main.cf Entries:
IPMultiNICB oracle_IPB (
BaseResName = oracle_NICB
Address = "20.46.222.225"
NetMask = "255.255.255.0"
)
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
DefaultRouter = "20.46.222.1"
Failback = 1
)
01-30-2009 04:05 AM
Hi,
Based on the VCS Bundled Agents Reference Guide, it seems DefaultRouter and Failback are optional attributes only for MultiNICB base mode, not for the multipathing mode you are using (UseMpathd = 1).
ftp://exftpp.symantec.com/pub/support/products/ClusterServer_UNIX/283871.pdf
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
DefaultRouter = "20.46.222.1"
Failback = 1
)
In multipathing mode, failback is controlled entirely by in.mpathd; see the default "FAILBACK=yes" setting in the /etc/default/mpathd file. The following MultiNICB configuration should be enough:
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
)
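For reference, the stock /etc/default/mpathd on Solaris 9 looks roughly like this (FAILURE_DETECTION_TIME is in milliseconds):
(prodnode-n1)# cat /etc/default/mpathd
# Time taken by mpathd to detect NIC failure in ms.
FAILURE_DETECTION_TIME=10000
# Failback is enabled by default. To disable failback turn off this option
FAILBACK=yes
# By default only interfaces configured as part of multipathing groups are tracked.
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes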
You can easily check whether the error is VCS- or mpathd-related by stopping VCS with "hastop -all -force" and then watching mpathd's behavior when you unplug the primary ce0 interface. The virtual IP failover should be handled entirely by mpathd.
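For example (a rough sketch; if_mpadm can offline the NIC in software if unplugging the cable isn't practical):
(prodnode-n1)# hastop -all -force   # stop VCS on all nodes, leave resources as they are
(prodnode-n1)# if_mpadm -d ce0      # offline ce0 (or physically unplug it)
(prodnode-n1)# ifconfig -a          # the virtual IP should now be plumbed on ce4
(prodnode-n1)# if_mpadm -r ce0      # reattach ce0 when done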
Regards
Manuel
01-30-2009 06:51 AM
Manuel,
Not sure if this is an issue. The IPMP group name is defined as PROD in the /etc/hostname.ce* files, but it shows up as the MultiNICB resource name oracle_NICB in the "ifconfig -a" output (see the hostname.ce* and ifconfig -a listings above).
I'll test out your earlier recommendations on my lab servers.
Thanks
02-02-2009 02:47 AM
Hi,
>The IPMP group name is defined as PROD in the /etc/hostname.ce* files but it is showing up as the MultiNICB name oracle_NICB in the "ifconfig -a" output.
This works as designed when the optional GroupName attribute is not used. Just add the following to your MultiNICB resource oracle_NICB to get rid of the inconsistent naming:
GroupName = "PROD"
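If the cluster is already running, the change can be made online (a sketch, assuming you want it written back to the config):
(prodnode-n1)# haconf -makerw
(prodnode-n1)# hares -modify oracle_NICB GroupName PROD
(prodnode-n1)# haconf -dump -makero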
> I'll test out your earlier recommendations on my lab servers.
Ok.
Regards
Manuel
02-04-2009 11:17 AM
Here is my main.cf for IPMP with MultiNICB and IPMultiNICB. I have my hostname.* files set up the same way as you, with IPMP in active/passive. Your group name in the ifconfig output doesn't match your config files; it looks like someone set up IPMP manually and used the wrong group name. You will need to fix that, then configure your resources as below.
Just a word of warning about IPMP and VCS, something that burned me. By default, IPMP probes the gateway address with a failure detection time of 10 seconds. If you are running Cisco HSRP to protect your gateway, the default HSRP hold time is also 10 seconds. In the event of a router failure, IPMP will declare a group failure because it can't probe your gateway, your MultiNICB will fault on both nodes, and your cluster will be toast.
To get around this you can change your HSRP hold timer to less than 10 seconds, raise your IPMP failure detection time above 10 seconds, or add host routes to the HSRP addresses of the switches as well as the default gateway. To find your HSRP addresses:
[root@bub log]# snoop -d dmfe0 udp from port 1985 and to port 1985
Set your host routes:
[root@bub /]# route add -host 172.18.0.2 172.18.0.2 -static
[root@bub /]# route add -host 172.18.0.3 172.18.0.3 -static
You will need an rc script to make it permanent.
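Something along these lines (a sketch using the example HSRP addresses above; adjust the IPs and script name to your network):
[root@bub /]# cat /etc/rc2.d/S99hsrproutes
#!/sbin/sh
# Pin static host routes to the HSRP physical router addresses so IPMP
# probes still have live targets if one router dies (example IPs).
/usr/sbin/route add -host 172.18.0.2 172.18.0.2 -static
/usr/sbin/route add -host 172.18.0.3 172.18.0.3 -static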
Main.cf configuration
-----------------------------
[root@vcsbox config]# perl -00 -wnle 'print if /IPMultiNICB|MultiNICB/' main.cf
IPMultiNICB pmip (
BaseResName = pmnic
Address = "1.2.3.4"
NetMask = "255.255.254.0"
)
MultiNICB pmnic (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device @vcsbox = { bge0 = 0, ce0 = 1 }
Device @vcsbox2 = { bge0 = 0, ce0 = 1 }
GroupName = prod_ipmp
)
pmip requires pmnic
02-06-2009 07:17 AM
The issue I encountered was caused by the RestartLimit of the MultiNICB resource being set to 0 (the default).
There is a warning note on page 48 of the Bundled Agents Reference Guide that applies to my case. I do not understand why an important setting like this defaults to 0.
ftp://exftpp.symantec.com/pub/support/products/ClusterServer_UNIX/283871.pdf
Note: On Solaris systems, Symantec recommends that you set the RestartLimit for IPMultiNIC resources to a greater-than-zero value. This helps to prevent the spurious faulting of IPMultiNIC resources during local failovers of MultiNICA. A local failover is an interface-to-interface failover of MultiNICA. See the VCS User’s Guide for more information.
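For anyone hitting the same thing, raising the limit is quick (a sketch; RestartLimit is a static attribute, so it has to be overridden per resource, the choice of oracle_IPB here matches the resource that faulted in my logs, and the value 2 is just an example):
(prodnode-n1)# haconf -makerw
(prodnode-n1)# hares -override oracle_IPB RestartLimit   # expose the static attribute on this resource
(prodnode-n1)# hares -modify oracle_IPB RestartLimit 2   # any value > 0 per the note above
(prodnode-n1)# haconf -dump -makero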