17 years ago

MultiNIC failed and resource went offline

I need your help troubleshooting this issue. One of the servers in our cluster had a failed MultiNICA resource (all servers run Solaris 8, patch level 117350-46, and VCS 3.5P2). The service group was frozen, which had been done earlier for unrelated maintenance, so it did not fail over. The resource faulted, and when we logged in to the server via the console we could not ping any other server on the same network. This server has quad-port qfe cards.

The link_status, link_mode and link_speed parameters all showed 1, meaning the link was up at 100 Mbps full duplex. I checked the switch side and it also showed the port as connected, with no alignment, FCS, or other errors.
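For reference, here is a minimal sketch of how those three values map to a link state on a qfe port. The function name decode_qfe_link is made up for illustration, and the ndd commands in the comments assume instance 0 (qfe0); adjust the instance for the port being checked.

```shell
#!/bin/sh
# Sketch: translate qfe ndd values into words.
# On the live host the values would come from something like:
#   ndd -set /dev/qfe instance 0        # select qfe0 (assumed instance)
#   ndd -get /dev/qfe link_status       # likewise link_speed, link_mode
# decode_qfe_link is a hypothetical helper, not a Solaris or VCS command.
decode_qfe_link() {
    status=$1
    speed=$2
    mode=$3
    case "$status" in
        1) s="up" ;;
        0) s="down" ;;
        *) s="unknown" ;;
    esac
    case "$speed" in
        1) sp="100Mbps" ;;
        0) sp="10Mbps" ;;
        *) sp="unknown" ;;
    esac
    case "$mode" in
        1) d="full" ;;
        0) d="half" ;;
        *) d="unknown" ;;
    esac
    echo "link=$s speed=$sp duplex=$d"
}

# The values reported in this post: all three parameters read 1.
decode_qfe_link 1 1 1
```

Note that these values only reflect what the driver believes about the physical link; they do not prove that traffic is actually passing.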

I do not see any message in the system log saying that the link went down.

engine_A.log showed the following:

TAG_E 2008/01/31 18:12:35 VCS:10307:Resource mnic_3000 (Owner: unknown, Group: sg300) is offline on fisher
        (Not initiated by VCS.)

No user initiated this. The auth logs show nobody logged in at that time, and our authentication audit system also shows no user taking the resource offline when the issue happened. Besides, if a user had initiated it, the log would have recorded the command that was run and the user who ran it.

What is interesting is that this is a teamed interface [qfe0 + qfe4]. If qfe0 was bad or defunct, why did qfe4 not take over? Had it done so, the connection would have stayed active.

MultiNICA mnic_3000 (
                NetworkHosts @kronos = { "", "", "", "" }
                NetworkHosts @helios = { "", "", "", "" }
                NetworkHosts @fisher = { "", "", "", "" }
                Device @kronos = { qfe0 = "", qfe2 = "" }
                Device @helios = { qfe0 = "", qfe2 = "" }
                Device @fisher = { qfe0 = "", qfe4 = "" }
                RouteOptions = "default"
                NetMask = ""

I do not see any other information from the system or the cluster, so I am asking for your help in debugging this issue. Any pointers would be greatly appreciated. I fixed this temporarily by unplumbing and re-plumbing the interface and then clearing the fault.
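The temporary fix was roughly the sequence below. This is a sketch only: I am using qfe0 as the interface here as an assumption, and the run and replumb_and_clear helpers are just illustrative wrappers so that the commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the workaround: unplumb/plumb the interface, then clear the fault.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: device, VCS resource, and system are passed in.
replumb_and_clear() {
    dev=$1
    res=$2
    sys=$3
    run ifconfig "$dev" unplumb          # tear down the interface
    run ifconfig "$dev" plumb            # bring the interface back
    run hares -clear "$res" -sys "$sys"  # clear the VCS fault on that node
}

# qfe0 is assumed; mnic_3000 and fisher are the names from the log above.
replumb_and_clear qfe0 mnic_3000 fisher
```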

