MultiNICA failed issue

Manger · 10-13-2009

We have a problem with MultiNICA resources failing on one node when the other node is disconnected or has failed. For example, if we disconnect the NIC e1000g1 from the switch on node A, e1000g1 on node B also fails.

We have a 1+1 symmetric cluster setup with two T5220 nodes.

Each node is connected to the LAN with 3 NICs: 1 for operation and maintenance and 2 for traffic. The 2 traffic NICs (e1000g1, nxge1) are handled by Veritas, and only one of them is active at a time; the other is redundant.

The Network service group looks like this in main.cf:

group Network (
    SystemList = { node1 = 0, node2 = 1 }
    Parallel = 1
    AutoStartList = { node1, node2 }
    )

    MultiNICA Multi-NIC (
        Device @node1 = { e1000g1 = "10.240.204.197",
             nxge1 = "10.240.204.197" }
        Device @node2 = { e1000g1 = "10.240.204.198",
             nxge1 = "10.240.204.198" }
        NetMask = "255.255.255.240"
        RetestInterval = 2
        RouteOptions = "default 10.240.204.193"
        IfconfigTwice = 1
        )

    Phantom Network_Phantom (
        )

    // resource dependency tree
    //
    //    group Network
    //    {
    //    MultiNICA Multi-NIC
    //    Phantom Network_Phantom
    //    }

On both nodes, e1000g1 and nxge1 are connected to 2 separate switches.

Typically, a service group using this interface looks like:

group SentinelLM (
    SystemList = { node1 = 0, node2 = 1 }
    AutoStartList = { node1, node2 }
    FailOverPolicy = Load
    Load = 5
    )

    IPMultiNIC SentinelLM_ip (
        Address = "10.240.204.200"
        NetMask = "255.255.255.240"
        MultiNICResName = Multi-NIC
        IfconfigTwice = 1
        )

    Process SentinelLM (
        PathName = "/application/sentinel/bin/lserv"
        Arguments = "-s /application/sentinel/bin/lservrc"
        )

    Proxy SentinelLM_nic (
        TargetResName = Multi-NIC
        )

    requires group ApplicationDG online local firm
    SentinelLM requires SentinelLM_ip
    SentinelLM_ip requires SentinelLM_nic

    // resource dependency tree
    //
    //    group SentinelLM
    //    {
    //    Process SentinelLM
    //        {
    //        IPMultiNIC SentinelLM_ip
    //            {
    //            Proxy SentinelLM_nic
    //            }
    //        }
    //    }

We are running tests in which we disconnect the cables from both interfaces on one node to check that services fail over to the other node.

But when we disconnect both cables from node1, the same interface also fails on node2. This is what we're getting in engine_A.log:

2009/10/09 13:47:03 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:03 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:03 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:03 VCS ERROR V-16-10001-6018 (node2) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:04 VCS WARNING V-16-10001-6019 (node2) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:04 VCS ERROR V-16-10001-6014 (node2) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:05 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:05 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:05 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:05 VCS WARNING V-16-10001-6007 (node2) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1
2009/10/09 13:47:07 VCS INFO V-16-10001-6008 (node2) MultiNICA:Multi-NIC:monitor:Sleeping 2 seconds
2009/10/09 13:47:09 VCS WARNING V-16-10001-6009 (node2) MultiNICA:Multi-NIC:monitor:Pinging Broadcast address 10.240.204.207 on Device e1000g1, iteration 1
2009/10/09 13:47:10 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:10 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:10 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:10 VCS ERROR V-16-10001-6018 (node1) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:11 VCS WARNING V-16-10001-6019 (node1) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:11 VCS ERROR V-16-10001-6014 (node1) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:12 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:12 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:12 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:12 VCS WARNING V-16-10001-6007 (node1) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1

As you can see, even though we disconnected node1, the first node to report a failed network resource is node2.

We are also seeing a situation where all cables are connected and the MultiNICA resource fails on the active node when we stop/start the passive node. I believe this could be related.

Any theory of what is happening here?

M__Braun · 10-15-2009

Hi,

Could it be that the networks you are using contain no other hosts besides the cluster nodes and the router? Routers are usually configured not to reply to broadcast pings, which the MultiNICA agent uses to test the NIC. If the other node's NIC is disconnected, nothing is left to answer the broadcast pings.

I would try adding the NetworkHosts attribute. It should point to the router or any other host in the network. Here is the excerpt from the VCS Bundled Agents Guide:

NetworkHosts

The list of hosts on the network that are pinged to determine if the network connection is alive. Enter the IP address of the host, instead of the host name, to prevent the monitor from timing out; DNS causes the ping to hang. If this attribute is unspecified, the monitor tests the NIC by pinging the broadcast address on the NIC. If more than one network host is listed, the monitor returns online if at least one of the hosts is alive.

Type and dimension: string-vector

Example: "128.93.2.1", "128.97.1.2"
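
As a rough sketch of what that could look like in your main.cf: the only change to your existing MultiNICA resource is the NetworkHosts line. I am assuming here that the default gateway from your RouteOptions line (10.240.204.193) answers ordinary unicast pings; if it does not, substitute any other always-on host in the same subnet.

    MultiNICA Multi-NIC (
        Device @node1 = { e1000g1 = "10.240.204.197",
             nxge1 = "10.240.204.197" }
        Device @node2 = { e1000g1 = "10.240.204.198",
             nxge1 = "10.240.204.198" }
        NetMask = "255.255.255.240"
        // Assumption: the gateway replies to ICMP echo requests
        NetworkHosts = { "10.240.204.193" }
        RetestInterval = 2
        RouteOptions = "default 10.240.204.193"
        IfconfigTwice = 1
        )

Listing a second host as well would make the check less dependent on a single device, since the monitor reports online as long as at least one listed host is alive.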

I hope this helps.

Regards

Manuel
