Need help: VCS & IPMP configuration issue on Solaris 9
Operating System - Solaris 9
Storage Foundation HA for Oracle - 5.0 MP1 RP5
Hardware - Sun V890
NICs - 2 x quad-port CE network interfaces
I'm testing a VCS installation and need some expert input in identifying why the IP failed to come up on the standby interface (ce4) in my IPMP group after a reboot of the switch connected to the primary interface.
I have included a diagram of how my servers are physically connected to the network infrastructure.
http://www.scribd.com/doc/11501852/Veritas-Cluster-IPMP
Problem Description:
- The primary network switch (switch A) was rebooted and was offline for about 4.5 minutes.
- Network interface ce0, plugged into switch A, went offline.
- The Solaris IPMP daemon "in.mpathd" failed over from ce0 to ce4, but the IP address could not be configured on ce4.
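As a side note, this failover can be reproduced without rebooting a switch: Solaris ships if_mpadm(1M) to detach an interface from its IPMP group. A sketch for this host (interface names and the failover address are taken from the configuration below; this is an administrative test, not part of the original incident):

```shell
# Detach ce0 from its IPMP group; in.mpathd should move the
# failover address (20.46.222.221 on this host) over to ce4.
if_mpadm -d ce0

# Confirm the address now appears on a ce4 logical interface.
ifconfig -a

# Reattach ce0; with FAILBACK=yes in /etc/default/mpathd the
# address should migrate back to ce0.
if_mpadm -r ce0
```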
/var/adm/messages File:
Jan 24 23:12:17 prodnode-n1 genunix: [ID 408789 kern.warning] WARNING: ce0: fault detected external to device; service degraded
Jan 24 23:12:17 prodnode-n1 genunix: [ID 451854 kern.warning] WARNING: ce0: xcvr addr:0x01 - link down
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 215189 daemon.error] The link has gone down on ce0
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 594170 daemon.error] NIC failure detected on ce0 of group oracle_NICB
Jan 24 23:12:17 prodnode-n1 in.mpathd[1903]: [ID 832587 daemon.error] Successfully failed over from NIC ce0 to NIC ce4
Jan 24 23:12:19 prodnode-n1 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (ce0) node 1 in trouble
Jan 24 23:12:26 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 8 sec (4226787)
Jan 24 23:12:27 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 9 sec (4226787)
Jan 24 23:12:28 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 10 sec (4226787)
Jan 24 23:12:29 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 11 sec (4226787)
Jan 24 23:12:30 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 12 sec (4226787)
Jan 24 23:12:31 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 13 sec (4226787)
Jan 24 23:12:32 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 14 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce0) node 1 inactive 15 sec (4226787)
Jan 24 23:12:33 prodnode-n1 llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 2 (ce0) node 1 expired
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13067 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource became OFFLINE unexpectedly, on its own.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 Thread(3) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13073 (prodnode-n1) Resource(oracle_IPB) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 1) the resource.
Jan 24 23:12:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-5005 IPMultiNICB:oracle_IPB:online:Error in configuring IP address
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 Thread(3) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(3) Resource(oracle_IPB) - clean completed successfully.
Jan 24 23:13:49 prodnode-n1 AgentFramework[4684]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13071 Thread(3) Resource(oracle_IPB): reached OnlineRetryLimit(0).
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13066 (prodnode-n1) Agent is calling clean for resource(oracle_IPB) because the resource is not up even after online completed.
Jan 24 23:13:49 prodnode-n1 Had[4339]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10303 Resource oracle_IPB (Owner: unknown, Group: listener_SG) is FAULTED (timed out) on sys prodnode-n1
Host Networking Files:
(prodnode-n1)# cat hostname.ce0
prodnode-n1-ce0 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover up \
addif prodnode-n1 netmask 255.255.255.0 broadcast + failover up
(prodnode-n1)# cat hostname.ce1
prodnode-n1-tsm
(prodnode-n1)# cat hostname.ce4
prodnode-n1-ce4 netmask 255.255.255.0 broadcast + \
group PROD deprecated -failover standby up
(prodnode-n1)# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 20.46.222.219 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:43:f1:58
ce0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.221 netmask ffffff00 broadcast 20.46.222.255
ce0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 20.46.222.225 netmask ffffff00 broadcast 20.46.222.255
ce1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 20.46.214.37 netmask ffffff00 broadcast 20.46.214.255
ether 0:14:4f:43:f1:59
ce4: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 4
inet 20.46.222.220 netmask ffffff00 broadcast 20.46.222.255
groupname oracle_NICB
ether 0:14:4f:44:9:b4
(prodnode-n1)# cat /etc/hosts
# Internet host table
#
127.0.0.1 localhost
20.46.222.221 prodnode-n1 prodnode-n1.mycompany.com loghost
#
20.46.222.1 defaultrouter
#
20.46.222.225 prodnode11
#
20.46.222.219 prodnode-n1-ce0
20.46.222.220 prodnode-n1-ce4
#
20.46.214.37 prodnode-n1-tsm
#
Main.cf Entries:
IPMultiNICB oracle_IPB (
BaseResName = oracle_NICB
Address = "20.46.222.225"
NetMask = "255.255.255.0"
)
MultiNICB oracle_NICB (
UseMpathd = 1
MpathdCommand = "/usr/lib/inet/in.mpathd -a"
Device = { ce0 = "", ce4 = "" }
DefaultRouter = "20.46.222.1"
Failback = 1
)
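For reference, the fix I eventually applied (described below) shows up in main.cf as a per-resource override of the static RestartLimit attribute. This is a sketch, assuming the limit belongs on the IPMultiNICB resource oracle_IPB, which is the resource that faulted in the logs; the value 2 is illustrative:

```
IPMultiNICB oracle_IPB (
    BaseResName = oracle_NICB
    Address = "20.46.222.225"
    NetMask = "255.255.255.0"
    RestartLimit = 2
    )
```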
The issue I encountered was caused by the MultiNICB RestartLimit being set to 0 (the default).
There is a warning note on page 48 of the agents guide that applies to my case. I do not understand why an important attribute like this defaults to 0.
ftp://exftpp.symantec.com/pub/support/products/ClusterServer_UNIX/283871.pdf
Note: On Solaris systems, Symantec recommends that you set the RestartLimit for IPMultiNIC resources to a greater-than-zero value. This helps to prevent the spurious faulting of IPMultiNIC resources during local failovers of MultiNICA. A local failover is an interface-to-interface failover of MultiNICA. See the VCS User's Guide for more information.
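For anyone hitting the same thing: under VCS 5.0 the attribute can be raised online with the ha commands. A sketch, again assuming the limit goes on the IPMultiNICB resource oracle_IPB that faulted above (the value 2 is illustrative):

```shell
haconf -makerw                            # open the cluster configuration read-write
hares -override oracle_IPB RestartLimit   # make the static attribute overridable per-resource
hares -modify oracle_IPB RestartLimit 2   # allow restart attempts before faulting
haconf -dump -makero                      # write main.cf and close the configuration
```

Afterwards, "hares -value oracle_IPB RestartLimit" should report the new value.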