We have been trying to implement a two-node SQL Server cluster on Windows 2008 R2 for the last few days. So far we have successfully completed the Windows cluster part and have tested quorum failover and basic single-disk failover.
Our challenge is the shared storage between the two nodes: we have been given 30 LUNs from the SAN, but we need a single drive. Since merging the LUNs at the SAN level is not an option for us, and MSCS does not understand dynamic disk groups created with the Windows Disk Management snap-in, we have to use Veritas Storage Foundation for Windows (SFW) 5.1.
We have successfully installed SFW 5.1 and created a cluster dynamic disk group and a volume on this disk group. The volume is visible in Windows Explorer without any issue.
To add this disk group to the cluster we followed the steps below:
1. In "Failover Cluster Manager" we created an "Empty Service or Application".
2. We added a "Volume Manager Disk Group" resource to the application.
3. We right-clicked the resource and selected "Bring this resource online"; the disk group came online.
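For what it's worth, the same steps can also be scripted with cluster.exe on Windows 2008 R2 instead of clicking through Failover Cluster Manager. A sketch, assuming the group and resource names below are placeholders (only the resource type name "Volume Manager Disk Group" comes from the thread):

```shell
rem Create an empty service or application (a cluster group) - name is an example
cluster group "SFW-DG-Group" /create

rem Add a Volume Manager Disk Group resource to that group - resource name is an example
cluster res "VMDg-Resource" /create /group:"SFW-DG-Group" /type:"Volume Manager Disk Group"

rem Bring the resource online
cluster res "VMDg-Resource" /online
```

Scripting it this way makes the test repeatable when you are rebooting nodes back and forth.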
To test the failover we rebooted the first node. The disk group failed over to the second node without any issue.
To bring the disk group back to node one we restarted the second node; this time, however, the disk group could not come online on the first node, and its status was shown as "Failed".
Since then we have tried rebooting both nodes alternately and refreshing and rescanning the disks, but nothing has brought the resource back online.
We have repeated all the steps from the start, with the same result.
Where exactly do you see the 'failed' status - is it in the Cluster GUI?
I'm not familiar with MSCS - is there a log that indicates the reason for the failure?
Are the resources (disk group and volume) still online on the 2nd node?
Please supply output of the following on both nodes (from cmd - please remember to 'run as administrator' when opening cmd):
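(The actual list of commands requested was not preserved in this thread. As an assumption only, not the original list, the diagnostics typically asked for in this situation look something like:)

```shell
rem List dynamic disk groups and their import state (SFW CLI)
vxdg list

rem List the disks SFW can see on this node
vxdisk list

rem List cluster resources and their current states
cluster res
```

Run on both nodes, these show whether SFW and the cluster agree about where the disk group is imported.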
You mentioned that you are using SFW 5.1 but did not say whether you are running SP1 for SFW 5.1. If not, you should be, because it contains many fixes for Windows Failover Cluster. You can get SFW 5.1 SP1 from the FileConnect website.
Stop the Cluster service on both nodes. Make sure that you can import and deport the clustered dynamic disk group and move it from node to node without any issues.
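That manual import/deport check can be done from the command line as well as from VEA; a sketch using the SFW vxdg utility, where the disk group name is a placeholder:

```shell
rem Stop the cluster service first, so the cluster does not fight over the disk group
net stop clussvc

rem Deport the disk group on the node that currently owns it (name is an example)
vxdg -gCluster_DG deport

rem Then, on the other node, import it
vxdg -gCluster_DG import
```

If this hand-off fails outside the cluster, the problem is at the SFW/storage layer rather than in the cluster resource.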
You did not mention whether the disk groups you created were clustered disk groups. With the disk group imported, check in VEA whether the disk group shows "(Clustered)" when selected. If the disk group shows "(Secondary)", you will need to deport the disk group and then, while importing it in VEA, select the check box option to make it a clustered disk group.
If everything is working fine at this point, re-enable the Cluster service and start it on both nodes.
In Cluster Administrator, display the properties of the Veritas Dynamic Disk Group resource and go to the Parameters tab. This tab has an attribute for the disk group name. Ensure that this disk group name matches the disk group name in VEA (case matters).
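The same check can be made from the command line: cluster.exe can dump a resource's private properties, which include the disk group name the resource points at. The resource name below is a placeholder:

```shell
rem Display the private properties of the disk group resource,
rem including the disk group name it references (resource name is an example)
cluster res "VMDg-Resource" /priv
```

Compare the value shown there, character for character, with the name in VEA.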
From there, if everything still looks fine but failover is not working as expected, you might want to open a case with Symantec Technical Support. We will need a set of VxExplorer logs to determine what is going on; VxExplorer gathers logs related to SFW and the OS, including the cluster.log for Windows Failover Cluster.
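On Windows 2008 R2 the cluster.log that VxExplorer collects is not written continuously; it has to be generated from the cluster's event trace. If you want to look at it yourself before opening a case, a sketch:

```shell
rem Generate cluster.log on the cluster nodes
rem (it is written under %windir%\Cluster\Reports)
cluster log /g
```

The entries around the time the resource went "Failed" usually name the failing step.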
We are running SFW 5.1 SP1 already.
We have applied the suggested patch, but the issue persists.
The disk groups are clustered dynamic disk groups, and the disk group name in Cluster Administrator is also correct.
We have also deported and imported the disk groups manually, which works fine.
One point we have observed over the last two days is that in VEA on the active cluster node the disk group is sometimes shown as imported and sometimes as deported. Sometimes, even when the disk group is imported, the volume is shown in a stopped state.
If we refresh and then rescan the disks, the disk group is shown as imported and the volume state becomes healthy; after this we are able to bring the resource online. However, this operation takes hours, sometimes 4–5 hours.
The other point to note is that after an MSCS cluster failover, the volume is shown correctly in Windows Explorer on the active node but fails to show as online in the MSCS cluster screen.
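The refresh-and-rescan done in VEA can also be driven from the SFW command line, which makes it easier to time and to repeat on both nodes. A sketch, assuming the SFW vxassist keywords below behave as in the 5.1 CLI:

```shell
rem Refresh SFW's view of the disks and volumes on this node
vxassist refresh

rem Rescan the storage buses for device changes
vxassist rescan
```

If the command-line rescan also takes hours, that points at slow device discovery against the 30 LUNs rather than at VEA itself.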