We have a problem with MultiNICA resources failing on one node when the other node is disconnected or has failed. For example, if we disconnect the NIC e1000g1 from the switch on node A, then e1000g1 on node B also fails.
We have a 1+1 symmetric cluster setup with two T5220 nodes.
Each node is connected to the LAN with 3 NICs: 1 for operation and maintenance and 2 for traffic. The 2 traffic NICs (e1000g1, nxge1) are handled by Veritas, and only one of them is active at a time; the other is redundant.
The Network service group looks like this in main.cf:
group Network (
    SystemList = { node1 = 0, node2 = 1 }
    Parallel = 1
    AutoStartList = { node1, node2 }
    )
    MultiNICA Multi-NIC (
        Device @node1 = { e1000g1 = "10.240.204.197",
            nxge1 = "10.240.204.197" }
        Device @node2 = { e1000g1 = "10.240.204.198",
            nxge1 = "10.240.204.198" }
        NetMask = "255.255.255.240"
        RetestInterval = 2
        RouteOptions = "default 10.240.204.193"
        IfconfigTwice = 1
        )
    Phantom Network_Phantom (
        )
    // resource dependency tree
    //
    //  group Network
    //  {
    //  MultiNICA Multi-NIC
    //  Phantom Network_Phantom
    //  }
On both nodes, e1000g1 and nxge1 are connected to 2 separate switches.
A typical service group using this interface looks like:
group SentinelLM (
    SystemList = { node1 = 0, node2 = 1 }
    AutoStartList = { node1, node2 }
    FailOverPolicy = Load
    Load = 5
    )
    IPMultiNIC SentinelLM_ip (
        Address = "10.240.204.200"
        NetMask = "255.255.255.240"
        MultiNICResName = Multi-NIC
        IfconfigTwice = 1
        )
    Process SentinelLM (
        PathName = "/application/sentinel/bin/lserv"
        Arguments = "-s /application/sentinel/bin/lservrc"
        )
    Proxy SentinelLM_nic (
        TargetResName = Multi-NIC
        )
    requires group ApplicationDG online local firm
    SentinelLM requires SentinelLM_ip
    SentinelLM_ip requires SentinelLM_nic
    // resource dependency tree
    //
    //  group SentinelLM
    //  {
    //  Process SentinelLM
    //      {
    //      IPMultiNIC SentinelLM_ip
    //          {
    //          Proxy SentinelLM_nic
    //          }
    //      }
    //  }
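As a sanity check on the addressing in the two groups above, here is a minimal Python sketch (not part of the cluster configuration; the variable names and labels are only illustrative) that derives the subnet and broadcast address from the NetMask and confirms that the base addresses, the virtual IP and the default route all sit in the same subnet:

```python
import ipaddress

# Values copied from main.cf; the dictionary labels are illustrative only.
NETMASK = "255.255.255.240"
ADDRESSES = {
    "node1 base": "10.240.204.197",
    "node2 base": "10.240.204.198",
    "SentinelLM_ip": "10.240.204.200",
    "default route": "10.240.204.193",
}

# Derive the subnet from one base address and the netmask.
# strict=False lets us pass a host address instead of the network address.
network = ipaddress.ip_network(
    f"{ADDRESSES['node1 base']}/{NETMASK}", strict=False
)

print("subnet:   ", network)
print("broadcast:", network.broadcast_address)

# Report whether each configured address falls inside that subnet.
for label, addr in ADDRESSES.items():
    inside = ipaddress.ip_address(addr) in network
    print(f"{label:14} {addr:15} in subnet: {inside}")
```

The broadcast address it derives is the one the monitor reports pinging in the engine log, so the addressing itself looks consistent.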
We are running tests in which we disconnect the cables from both interfaces on one node to verify that it fails over to the other node.
However, when we disconnect both cables from node1, the same interface also fails on node2. This is what we get in engine_A.log:
2009/10/09 13:47:03 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:03 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:03 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:03 VCS ERROR V-16-10001-6018 (node2) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:04 VCS WARNING V-16-10001-6019 (node2) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:04 VCS ERROR V-16-10001-6014 (node2) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:05 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:05 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:05 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:05 VCS WARNING V-16-10001-6007 (node2) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1
2009/10/09 13:47:07 VCS INFO V-16-10001-6008 (node2) MultiNICA:Multi-NIC:monitor:Sleeping 2 seconds
2009/10/09 13:47:09 VCS WARNING V-16-10001-6009 (node2) MultiNICA:Multi-NIC:monitor:Pinging Broadcast address 10.240.204.207 on Device e1000g1, iteration 1
2009/10/09 13:47:10 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:10 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:10 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:10 VCS ERROR V-16-10001-6018 (node1) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:11 VCS WARNING V-16-10001-6019 (node1) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:11 VCS ERROR V-16-10001-6014 (node1) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:12 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:12 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:12 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:12 VCS WARNING V-16-10001-6007 (node1) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1
As you can see, even though we disconnected node1, the first node to report a failed network resource is node2.
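To make the ordering easier to see in longer logs, here is a rough Python sketch that picks out the first "Device ... FAILED" message (ID V-16-10001-6004) per node. The sample lines are copied from the excerpt above; the regular expression and the function name are my own assumptions about the line layout, not anything provided by VCS.

```python
import re

# A few lines copied from the engine_A.log excerpt.
LOG = """\
2009/10/09 13:47:03 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:04 VCS WARNING V-16-10001-6019 (node2) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:10 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:11 VCS WARNING V-16-10001-6019 (node1) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
"""

# Assumed layout: timestamp, "VCS", severity, message ID, (node), message.
LINE = re.compile(r"^(\S+ \S+) VCS (\w+) (V-\d+-\d+-\d+) \((\w+)\) (.*)$")


def first_failure_per_node(text):
    """Return {node: (timestamp, message)} for each node's first 6004 line."""
    first = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip continuation lines such as the raw ifconfig error
        stamp, _severity, msg_id, node, message = m.groups()
        if msg_id.endswith("-6004") and node not in first:
            first[node] = (stamp, message)
    return first


# The timestamps are fixed-width, so sorting them as strings is chronological.
failures = first_failure_per_node(LOG)
for node, (stamp, message) in sorted(failures.items(), key=lambda kv: kv[1][0]):
    print(stamp, node, message)
```

Run against the lines above, it lists node2 before node1, which matches what we see when reading the log by hand.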
We are also seeing a situation where all cables are connected and the MultiNICA resource fails on the active node when we stop/start the passive node. I believe this could be related.
Any theories on what is happening here?