
Can't get file share working - DiskRes problem? (VCS6, HP 3PAR, MS MIO)

Para
Level 2

I'm new to VCS so please bear with me.

I'm running VCS 6.0 on W2K8 Server Enterprise R2 with an HP 3PAR SAN and Microsoft Multipath I/O.  I'm trying to get a file share working, but I can't get it online.

I have a 3-node cluster and I ran the Add Resource Group wizard, choosing the FileShareGroup template.  I am able to bring the NIC, IP and LANMAN resources online.  I can move these resources to any node and back and they work fine.  The problem is with the DiskRes.

We are using an HP 3PAR SAN and the Microsoft Multipath I/O that comes with W2K8.  When I add the signature of one of the exported volumes, I can bring the resource online on a node, but then less than a minute later the node reboots itself.  After the reboot, Cluster Explorer shows DISKRES as Offline on the first node but Faulted on the other two nodes.

There are no recent log entries in DiskRes_A.txt.

Here's the entry on engine_A.txt that shows that it was online at one point:
2012/06/19 14:11:32 VCS NOTICE V-16-1-10301 Initiating Online of Resource FS_DISKRES (Owner: Unspecified, Group: MYGROUP) on System MYNODE1
2012/06/19 14:11:32 VCS INFO V-16-1-10298 Resource FS_DISKRES (Owner: Unspecified, Group: MYGROUP) is online on MYNODE1 (VCS initiated)

I can post the logs after it rebooted if that will help.

On the node that just rebooted, running C:\Program Files\Veritas\cluster server\bin\getdrive.bat I get:
Could not gather all the disk info. Error : 170

Sure enough, the disk that was skipped was the one I was trying to get online.  When I bring up Windows Disk Manager, it says the disk has to be initialized, but when I try, it says the resource is in use.
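
Decoding that error code with net helpmsg (assuming 170 here is the standard Windows error code) points the same way:

   C:\>net helpmsg 170

   The requested resource is in use.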

When I run getdrive.bat on the second node, the output for that drive is:
Harddisk Number  = 1
Harddisk Type    = Basic Disk
Disk Signature   = 2264237497
Valid Partitions = 1
Access Test      = FAILED
 

What am I doing wrong?
 


7 Replies

Wally_Heim
Level 6
Employee

Hi Rim,

The first thing that you might want to do is run getdrive from all 3 nodes to ensure that each one sees the disks with the same Disk Signature.  If not, you should do a rescan on them in Disk Manager so all three servers see the disks the same way.
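
For example, on each node (illustrative output modeled on what you posted; a healthy disk should show the same signature everywhere and pass the access test):

   C:\>"C:\Program Files\Veritas\cluster server\bin\getdrive.bat"

   Harddisk Number  = 1
   Harddisk Type    = Basic Disk
   Disk Signature   = 2264237497
   Valid Partitions = 1
   Access Test      = PASSED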

From there, it is a matter of how you want to proceed.  I would recommend simplifying the environment by going down to a single path on all nodes and removing the Microsoft MPIO feature.  Then test to make sure that you can mount the partition on each server, one at a time.  Do not mount the partition on more than one server at a time, as this can lead to data corruption.

Once you are able to mount the partition on each node manually, move to VCS and use the DiskRes and Mount resources to see if you can put a reservation on the drive (DiskRes) and mount the partition from VCS.

If everything is working so far, then move forward with adding Microsoft's MPIO back into the configuration.

If all else fails, we are available in Symantec Technical Support  to assist you 24x7x365.

Thank you,

Wally

Para
Level 2

I removed MS MPIO and rescanned the disks, and the getdrive results were consistent across the three nodes.
I remembered that I had set the SAN policy on each node to bring all disks online on boot, so I changed that via DISKPART with "san policy=offlineshared".  I also manually set all the SAN disks offline on each node.  On each node, I was then able to bring the disk online (one node at a time) either as a drive letter (E:) or as a local mount path (C:\MYPATH).

I moved to VCS and I was able to mount the SAN disk with DiskRes on each node.  

When I tried to online the Mount on NODE1, it FAULTED.  I offlined DiskRes on NODE1 and onlined it on NODE2, and then I was able to online the Mount on NODE2 as well.  But when I tried to online the FileShare on NODE2, it FAULTED:
Resource FS_FILESHARE (Owner: Unspecified, Group: MYGROUP) is FAULTED on sys NODE2

Now, when I try to online any of the resources, I get: "Cannot online: resource's group is frozen waiting for dependency to be satisfied".  But all the resources are Offline and Not Waiting (except NIC which is Online).

How do I unfreeze the resource?
 

Wally_Heim
Level 6
Employee

Hi Rim,

Each resource type has a log that is stored in the %vcs_home%\log folder.  The most recent one is named <agent type>_A.txt.  For example, the Mount resource's log would be Mount_A.txt and the FileShare resource's log would be FileShare_A.txt.  These logs have debug information in them that will point you to why the resource was not able to come online during its online process.  Most of the messages are in a readable format, so in most cases you can determine what the problem is.

In the same location is a cluster-wide log called engine_A.txt.  It logs all cluster operations and would show the "Cannot online: resource group is frozen" type of messages.

You can also run "hastatus -sum" from a command prompt to get a summary of the current cluster state.  This can point you to a resource that is not probing or some other issue.
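
The output looks something like this (trimmed and illustrative, using the names from this thread):

   C:\>hastatus -sum

   -- SYSTEM STATE
   -- System               State          Frozen

   A  MYNODE1              RUNNING        0

   -- GROUP STATE
   -- Group         System      Probed    AutoDisabled    State

   B  MYGROUP       MYNODE1     Y         N               OFFLINE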

To unfreeze a service group you can right click on it in the Java GUI and select Unfreeze from the popup menu.
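
From the command line, the equivalent (using the group name from this thread) is "hagrp -unfreeze MYGROUP".  If the group was frozen persistently, the configuration has to be made writable first:

   C:\>haconf -makerw
   C:\>hagrp -unfreeze MYGROUP -persistent
   C:\>haconf -dump -makero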

If you need more one-on-one assistance, please open a Symantec Technical Support case.  We are here to help you with any problems you may have with our software.

Thank you,

Wally

Para
Level 2

When I right-click on the resource group, the Unfreeze option is greyed out, which seems contradictory to the error message I was getting.  Also, hastatus -sum shows all three nodes as running and not frozen, and the resource group as having been probed on all three nodes (but offline on all nodes).

Mount_A.txt shows:

2012/06/19 19:03:22 VCS ERROR V-16-10051-8018 Mount:FS_MOUNT:monitor:Failed to create the Volume object for DiskNo = 5, PartitionNo = 1. Error : 110
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:*** Start of debug information dump for troubleshooting ***
    LibLogger.cpp:VLibThreadLogQueue::Dump[206]
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:Number of valid partition on Disk (5) are 1.
    LibDisk.cpp:VLibDisk::GetNumberOfValidPartitions[942]
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:Mount path C: is not a reparse point
    LibStorage.cpp:VLibStorage::IsSuitablePath[859]
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:(2) IOCTL_MOUNTMGR_QUERY_POINTS failed
    LibStorage.cpp:VLibStorage::QueryMountManager[660]
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:QueryMountManager() failed. Invalid volume information specified.
    LibVolume.cpp:VLibVolume::Open[268]
2012/06/19 19:03:22 VCS DBG_21 V-16-50-0 Mount:FS_MOUNT:monitor:*** End of debug information dump for troubleshooting ***
    LibLogger.cpp:VLibThreadLogQueue::Dump[217]
2012/06/19 19:11:42 VCS INFO V-16-10051-30003 Mount:FS_MOUNT:imf_register:Un-registering with IMF for offline monitoring
2012/06/19 19:11:57 VCS ERROR V-16-10051-8018 Mount:FS_MOUNT:monitor:Failed to create the Volume object for DiskNo = 5, PartitionNo = 1. Error : 110
 

Thanks for your help.  I will submit a case.
 

Wally_Heim
Level 6
Employee

Hi Rim,

Error 110 is a Windows error, which means:

   C:\>net helpmsg 110

   The system cannot open the device or file specified.
 

Can you provide the Mount resource configuration from the main.cf?  The main.cf is in the %vcs_home%\conf\config\ folder.

I'm thinking that you have the partition number incorrectly defined.  It's been a while since I've touched basic disk resources in a cluster, but I seem to remember that the partition numbers start at 0 and not 1.  So if you only have 1 partition on the drive, then the PartitionNo attribute should be set to 0.

Thank you,

Wally

Para
Level 2

From main.cf:

    Mount FS_MOUNT (
        MountPath = "C:\\MY\\PATH"
        PartitionNo = 1
        Signature = 2264237497
        )
 

According to Veritas Cluster Server Bundled Agents Reference Guide:

"The partition on the disk configured for mounting.  Note that the base index for the partition number is 1. Default is 0."

I'm pretty sure I could not bring that resource online until I set the PartitionNo = 1.
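
If it helps, the attribute can also be checked from the command line (output abbreviated; FS_MOUNT is my actual resource name):

   C:\>hares -display FS_MOUNT -attribute PartitionNo

   #Resource    Attribute      System    Value
   FS_MOUNT     PartitionNo    global    1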

Rim
Not applicable
Accepted Solution

Here's what I needed to do:

In Windows DiskManagement on each node:

  • Put the SAN disk online
  • Set up the local mount point
  • Remove the drive letter
  • Make partition active if it isn't already
  • Take it offline - IMPORTANT!

For DiskRes:

  • Run C:\Program Files\Veritas\cluster server\bin\getdrive.bat
  • Note the signature and enter it in properties

For Mount:

  • Since we are using MountPath (instead of drive letters), enter the mount path
  • Change PartitionNo from 0 to 1
  • Enter same signature as DiskRes

For FileShare:

  • Since we are using MountPaths (instead of drive letters), use a subdirectory in the SAN drive as the PathName.  I was trying to use "\" as the PathName but that doesn't work for MountPaths.
  • Set the ShareName (a main.cf sketch of all three resources follows this list)
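
Putting it together, the relevant main.cf entries end up looking roughly like this.  Only the Mount block is copied from my real config; the DiskRes and FileShare attribute names and example values are a sketch from memory, so double-check them against the Bundled Agents Reference Guide:

    DiskRes FS_DISKRES (
        Signatures = { 2264237497 }
        )

    Mount FS_MOUNT (
        MountPath = "C:\\MY\\PATH"
        PartitionNo = 1
        Signature = 2264237497
        )

    FileShare FS_FILESHARE (
        PathName = "\\SHARED"
        ShareName = MYSHARE
        LanmanResName = FS_LANMAN
        MountResName = FS_MOUNT
        )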

 

While troubleshooting, I ended up un-exporting the SAN volume from all nodes and then checking that the settings above were correct.  But it didn't seem to work until I restarted VCS on all nodes via (run command prompt as admin):

hastop -all

It wouldn't let me stop the nodes until I saved and closed the config, so I did that via (run command prompt as admin):

haconf -dump -makero

I was then able to stop the nodes and then restart them via (run command prompt as admin):

hastart -all

 

I'm not sure if it was the restart or the saving/closing of the config that got it to work, but it's working now.