Solved: Failover issues after SAN disc loss.

James_Latimer · 08-17-2011

Storage failed on one node of the HP4500 (Lefthand) SAN. A Solaris 10 clustered system running VCS 5.0 and using ZFS lost view of the storage, although the resource was not marked bad in the VCS GUI. The Solaris system was not failed over by VCS. Database applications on it stopped working. Failing over "manually" via the VCS GUI did not seem to have an effect. The "active" (but broken) Solaris node was reset to force the system onto the previously-inactive Solaris node.

From the time of the issue becoming apparent, these (first few) messages appear in the VCS logs:

TAG_B 17-Aug-2011 09:46:05 V-16-2-13027 (sys_name) Resource(poolname) - monitor procedure did not complete within the expected time.

TAG_B 17-Aug-2011 09:52:06 V-16-2-13210 (sys_name) Agent is calling clean for resource(poolname) because 4 successive invocations of the monitor procedure did not complete within the expected time.

TAG_E 17-Aug-2011 09:52:07 V-16-2-13068 (sys_name) Resource(poolname) - clean completed successfully.

TAG_B 17-Aug-2011 09:53:08 V-16-2-13077 (sys_name) Agent is unable to offline resource(poolname). Administrative intervention may be required.

TAG_E 17-Aug-2011 09:53:08 V-16-6-15004 (sys_name) hatrigger:Failed to send trigger for resnotoff; script doesn't exist

Is it possible to tinker with some sort of time interval value in VCS so that transient storage outages are allowed more time to resolve themselves? Are there any other useful suggestions?

Regards,

TonyGriffiths · 08-18-2011

Hi,

The message "monitor procedure did not complete within the expected time" essentially means that the routine used to determine the resource state could not run to completion within the allotted time.

The VCS resource type attribute that defines this threshold is MonitorTimeout.

For example: hatype -display Mount | grep MonitorTimeout

By default, VCS tolerates four consecutive monitor timeouts before invoking the clean routine. This is controlled by the FaultOnMonitorTimeouts attribute.
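As a rough illustration of how the two attributes combine, the sketch below estimates the delay between the first monitor timeout and the call to clean. It assumes that each further timed-out cycle lasts about MonitorInterval + MonitorTimeout seconds, and that both values are 60 seconds; check the real values with hatype. The function name clean_delay is made up for this example.

```shell
#!/bin/sh
# Estimate the seconds between the first monitor timeout and the call to clean.
# Assumption: after the first timeout, each further timed-out cycle lasts
# roughly MonitorInterval + MonitorTimeout seconds.
clean_delay() {
    monitor_interval=$1     # seconds between monitor invocations
    monitor_timeout=$2      # seconds a monitor may run before it times out
    fault_on_timeouts=$3    # FaultOnMonitorTimeouts
    echo $(( (fault_on_timeouts - 1) * (monitor_interval + monitor_timeout) ))
}

# Assumed values of 60 s / 60 s and four tolerated timeouts
clean_delay 60 60 4
```

With those assumed values the result is 360 seconds, which is consistent with the roughly six-minute gap between the 09:46:05 and 09:52:06 messages in the log excerpt. Raising MonitorTimeout or FaultOnMonitorTimeouts lengthens that window accordingly.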

Whilst you can investigate increasing the timeout to tolerate the event, you should also look into why the monitor could not run in the allotted time. Did the event cause the system to slow down or hang? Were I/Os hung in the disk drivers waiting to time out?
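If you do decide to raise the timeout, the change could look something like the sketch below. The resource type name Zpool and the value of 120 seconds are assumptions for illustration only; use the type your pool resource actually belongs to and a value that suits your storage. The run and raise_monitor_timeout helpers are invented for this example, and by default the script only prints the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: raise MonitorTimeout for one resource type.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

raise_monitor_timeout() {
    res_type=$1       # resource type, e.g. Zpool (assumed name)
    new_timeout=$2    # new MonitorTimeout in seconds
    run haconf -makerw &&                                         # open the configuration for writing
    run hatype -modify "$res_type" MonitorTimeout "$new_timeout" &&
    run haconf -dump -makero                                      # save and return to read-only
}

raise_monitor_timeout Zpool 120
```

Set DRY_RUN=0 only once the printed commands look right for your cluster. Changing the attribute at type level affects every resource of that type.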

cheers

TonyGriffiths · ‎08-18-2011

Hi,