Cluster Disk timed out, then failed event ID 1069

hatif · ‎08-23-2010

Hi.
I have a six-node Microsoft cluster running Symantec Storage Foundation with SP1. Recently I added a new node. Whenever I do a failover, the disk resources time out and fail, and then I have to bring the disk resources online manually. Two error messages are reported: one for the timeout, the second for the failure.

Error #1 (Event ID 1045) - Cluster resource <DiskResourceName> timed out. If the pending timeout is too short for this resource, consider increasing the pending timeout value
Error #2 (Event ID 1069) - Cluster resource '<DiskName>' in Resource Group '<GroupName> Group' failed.

Your help would be appreciated.

Thanks
Hatif

Marianne · ‎08-23-2010

Which version of Storage Foundation?
Which O/S?
Multiple paths to storage? If so, which DSM modules installed?
StorPort patches/hotfixes up to date?
Are all clustered nodes installed/configured identically? Same firmware/driver versions on HBAs? Same O/S patches? Same SFW version, patches, DSM?

Are there any Vx... errors in Event Viewer?


David_Honeycutt · ‎08-24-2010

Hi Hatif,

The product version (e.g. SFW 5.1 SP1) and platform version (e.g. Windows Server 2008 x64) are important to include when posting, as they help narrow down a solution. There are some known issues with VMDg resources under SFW 5.1 SP1. I would recommend checking each of the hotfixes listed on the Veritas Operations Services link below to determine which apply. There are also some private fixes that are not available in the list, so if you have applied the fixes that seem to be related but the issue still occurs, please open a support case with the Symantec Storage Foundation for Windows & Cluster Support Team.

Veritas Operations Services Hot fix for Veritas Storage Foundation for Windows
https://vos.symantec.com/patch/searchmatrix/21/2/1

hatif · ‎08-25-2010

Hi Marianne and David,
I am sorry for not providing enough information about this issue. Here is the information about the server and the Veritas software.

OS: Windows 2003 Ent. SP2 32-bit running MS cluster for file servers.
VERITAS Software: Veritas Storage Foundation 5.1 with SP1 and using DMP.
HBA: Emulex LPe11000
Storport driver: Latest installed, with firmware from the Emulex website.
Veritas Storage foundation patches applied:
- sfw-win-Hotfix_5_1_10003_584_1923059-patches
- sfw-win-Hotfix_5_1_10007_584_1959339-patches
- sfw-win-Hotfix_5_1_10016_584_1992881-patches

Currently 5 of the cluster servers are older HP blade G3 servers running Veritas 5.1 MP2. One server is a DL380 G5. I have not updated the blades to Veritas 5.1 SP1 yet because our plan is to replace all blade servers with DL380 G5 servers one by one.

So the Veritas patch level is not the same across nodes, but the new node was added to the cluster successfully. Disk failover works fine for some groups. I did not find any Vx errors in the event log.

Hope this helps.

Thank you.

David_Honeycutt · ‎08-27-2010

Hi Hatif,

Based on the list of Hotfixes, you should be up-to-date, unless I am missing one. You may want to go ahead and open a Symantec Support Team Case on this issue.

You mentioned that you added a new MSCS cluster node to an existing 6-node cluster. Have you attempted to manually import / deport the disk group(s) on the newly added cluster node outside MSCS clustering? I would recommend this to try to isolate the issue. If the disk group(s) can be imported / deported manually using the Veritas Enterprise Administrator (VEA), you may simply try to reinstall the Microsoft Clustering Option (or install it, if it is not already installed) using the Symantec Product Installer, accessed from the Add/Remove Programs list by selecting Change on (Server Components).

jlockley · ‎08-28-2010

The third fix is not generally available, which implies that you've had some issues in this area before that needed a support case? These cluster errors are fairly generic, but I think there should be something more from VxSvc in the application event log. Check for disk group import, volume arrival, and maybe errors about the disk group arriving. Typically the error relates to volume arrival, as you have seen in the fix sfw-win-Hotfix_5_1_10016_584_1992881.

In addition to the isolation testing suggested by David, you can try having the cluster online the disk resources one by one.

A typical failure scenario is where you have many disk groups and volumes in many service groups, and when the cluster tries to online them all at once, some of them time out. If you online one service group or disk group resource at a time, it should succeed. Is it the service groups with fewer disk group resources that succeed?
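
To test the one-at-a-time idea, something like the following Python sketch could be used. It is only an illustration: the resource names are hypothetical placeholders, the helper names are made up for this example, and it assumes the Windows Server 2003 `cluster.exe` command line with its `/online` and `/wait` options. It brings each resource online in turn and stops at the first one that fails.

```python
import subprocess
from typing import Callable, Optional, Sequence


def online_command(resource: str, wait_seconds: int) -> list[str]:
    """Build a cluster.exe command that onlines one resource and waits for it."""
    return ["cluster", "resource", resource, "/online", f"/wait:{wait_seconds}"]


def online_one_by_one(
    resources: Sequence[str],
    wait_seconds: int = 300,
    run: Optional[Callable[[list[str]], int]] = None,
) -> list[tuple[str, bool]]:
    """Online resources sequentially; stop at the first failure.

    `run` takes a command list and returns an exit code. It defaults to
    running the command for real, and can be replaced for a dry run.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode

    results: list[tuple[str, bool]] = []
    for resource in resources:
        ok = run(online_command(resource, wait_seconds)) == 0
        results.append((resource, ok))
        if not ok:
            break  # leave the remaining resources untouched for investigation
    return results


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def print_only(cmd: list[str]) -> int:
        print(" ".join(cmd))
        return 0

    # "DG1" and "DG2" are placeholder names for the disk group resources.
    online_one_by_one(["DG1", "DG2"], wait_seconds=300, run=print_only)
```

If every resource comes online when started this way but some time out during a normal failover, that would point toward the "too many at once" scenario rather than a problem with any single disk group.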

Workarounds include increasing the pending timeout or creating dependencies between the disk group resources so that each one has time to come online.
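
As a rough sketch of both workarounds, the Python below only builds the `cluster.exe` command lines so they can be reviewed before anything is run. The resource names and the timeout value are hypothetical, the helper names are invented for this example, and it assumes the Windows Server 2003 `cluster.exe` syntax, where `PendingTimeout` is a resource property expressed in milliseconds and `/adddep` adds a dependency.

```python
from typing import Sequence


def pending_timeout_command(resource: str, timeout_ms: int) -> list[str]:
    """Command to set the PendingTimeout property (assumed to be in milliseconds)."""
    return ["cluster", "resource", resource, "/prop", f"PendingTimeout={timeout_ms}"]


def dependency_chain_commands(resources: Sequence[str]) -> list[list[str]]:
    """Commands that make each resource depend on the one listed before it.

    With such a chain, the cluster onlines the resources in list order
    instead of starting them all at once.
    """
    return [
        ["cluster", "resource", later, f"/adddep:{earlier}"]
        for earlier, later in zip(resources, resources[1:])
    ]


if __name__ == "__main__":
    # Placeholder names; substitute the real disk group resource names.
    disk_groups = ["DG1", "DG2", "DG3"]

    # Example value only: 360000 ms is 6 minutes.
    for name in disk_groups:
        print(" ".join(pending_timeout_command(name, 360000)))

    for cmd in dependency_chain_commands(disk_groups):
        print(" ".join(cmd))
```

Dependencies can only be created between resources in the same group, and the dependent resource may need to be offline when the dependency is added, so a chain like this is best set up in a maintenance window.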

James.
