Re: VCS Halts system when Disk Groups are disabled

Magesh · ‎12-02-2008

I have a VCS 5.0 MP3 setup on RHEL 5.2 servers. I have configured a simple service group with a Veritas disk group, a volume, and a mounted filesystem. When the paths to the disks go down for some reason while I/O is running on the filesystem, the disk group is disabled and VCS halts the system. I see the following messages in /var/log/messages:

Dec 2 14:28:25 sqa-rhel5-80 kernel: end_request: I/O error, dev sdg, sector 53728
Dec 2 14:28:25 sqa-rhel5-80 kernel: sd 0:0:0:6: SCSI error: return code = 0x08000002
Dec 2 14:28:25 sqa-rhel5-80 kernel: sdg: Current: sense key: Aborted Command
Dec 2 14:28:25 sqa-rhel5-80 kernel: Add. Sense: No additional sense information
Dec 2 14:28:25 sqa-rhel5-80 kernel:
Dec 2 14:28:25 sqa-rhel5-80 kernel: end_request: I/O error, dev sdg, sector 54753
Dec 2 14:28:52 sqa-rhel5-80 Had[9211]: VCS CRITICAL V-16-1-1541 (172.21.164.80) DiskGroup:dg1:monitor:Disk group (dg1) is *DISABLED* on the system . Halting the system.
Dec 2 14:28:53 sqa-rhel5-80 kernel: md: stopping all md devices.
Dec 2 14:28:54 sqa-rhel5-80 kernel: Synchronizing SCSI cache for disk sdaj:

Once this happens, the Service group fails over to the other node successfully and I am able to resume the IO from the failover node.

I tried searching the forum and support docs for this error but could not find anything.

My queries:

- Is this expected behaviour? When the disk group fails or goes into a disabled state for some reason, will VCS halt the system?

- If this is a genuine issue, is there any fix or workaround available?

- Any other suggestions to avoid such a scenario?

Mayank_Vasa · ‎12-02-2008

Hello Magesh:

This is behaviour by design. The reason for halting the system is to ensure that a failover takes place and that there is no data corruption from two hosts trying to write to the shared storage. The DiskGroup resource has an attribute called PanicSystemOnDGLoss, which is 1 by default. Please refer to the VCS documentation (Bundled Agents Reference Guide) for details of this attribute. If you set it to 0 and the hosts lose access to the storage or the DG gets disabled (writes start failing), no failover will occur.
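
For illustration only, here is a minimal sketch of the command sequence typically used to change a resource attribute in VCS, wrapped in a small Python helper that just builds and prints the commands. The resource name dg1_res is hypothetical (only the disk group name dg1 appears in the log above), and the helper function itself is made up for this example. Please verify the attribute's semantics for your release in the Bundled Agents Reference Guide before running anything like this on a cluster.

```python
def panic_on_dg_loss_commands(resource, value):
    """Build the VCS commands that set PanicSystemOnDGLoss on one resource.

    `resource` is the name of a DiskGroup resource (hypothetical in the
    usage below); `value` is the attribute value, 0 or 1 as discussed
    in this thread.
    """
    return [
        # Open the cluster configuration for writing.
        ["haconf", "-makerw"],
        # Change the attribute on the DiskGroup resource.
        ["hares", "-modify", resource, "PanicSystemOnDGLoss", str(value)],
        # Save the configuration and make it read-only again.
        ["haconf", "-dump", "-makero"],
    ]


if __name__ == "__main__":
    # "dg1_res" is an assumed resource name, not taken from the thread.
    for cmd in panic_on_dg_loss_commands("dg1_res", 0):
        print(" ".join(cmd))
```

The helper only prints the commands so they can be reviewed first; running them is left to the administrator.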

Regards,

+ Mayank.

Magesh · ‎12-02-2008

Hi Mayank,

Thanks a lot for a quick response.

I have one more query. With the PanicSystemOnDGLoss attribute set to 0, in case of a momentary loss of the paths to the back-end shared storage, will VCS automatically enable the DG once the paths are back? Or will VCS mark the DG as faulted and fail over the service group to the other node?

- Magesh

Mayank_Vasa · ‎12-02-2008

Hi Magesh:

The state of the DG is maintained by the Volume Manager. VCS just queries the state and uses it to take the appropriate decision. From the app's perspective, a disabled DG means data writes are not making it to disk and eventually the app will start receiving write errors (there are a few layers in between such as volume manager & filesystem). In order to keep the app's service available, VCS halts the faulted node and moves the service group to the other node.
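
As a rough illustration of "querying the state" from Volume Manager, the sketch below parses the text printed by `vxdg list` and reports whether a given disk group is disabled. This is not how the DiskGroup agent is implemented (the thread does not show that); it assumes the usual `NAME STATE ID` column layout, with the state field possibly holding several comma-separated flags. The function names are made up for this example.

```python
import subprocess


def parse_vxdg_list(output):
    """Map disk group name -> set of state flags from `vxdg list` text."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the assumed "NAME STATE ID" header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        states[fields[0]] = set(fields[1].lower().split(","))
    return states


def is_disabled(output, group):
    """Return True only if `group` is listed and carries the disabled flag."""
    flags = parse_vxdg_list(output).get(group)
    return flags is not None and "disabled" in flags


if __name__ == "__main__":
    # Requires VxVM on the host; dg1 is the disk group from the log above.
    result = subprocess.run(
        ["vxdg", "list"], capture_output=True, text=True, check=True
    )
    print("dg1 disabled:", is_disabled(result.stdout, "dg1"))
```

A check like this only reads the state; re-enabling a disabled disk group is a Volume Manager operation and is outside what this sketch covers.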

If you are also using the dynamic multipathing (DMP) feature of VxVM, it does storage path management besides other things. Specifically it provides path failover.

Hope this clarifies things,

+ Mayank.
