Quorum disks not online, looking for the best way to fix

infinitiguy
Level 4

Hi,

We had some major maintenance last weekend that involved taking down our SAN and re-IPing a lot of stuff. One of our VCS clusters came online without its quorum disks. I unmounted the shared storage on the secondary node so services can't start there, but now I'm looking for the best way to resolve this without having to restart the cluster. It's a MySQL cluster, so I don't want to restart all of the applications that depend on it if I don't have to. We use iSCSI to connect to the quorum drives, and the issue was that after the IP changed, iSCSI failed to connect to the LUNs. This was resolved after we got the DBs back online, so now iSCSI can connect, but I don't know how to fix VCS.
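Since the iSCSI re-connection was the trigger here, a quick sanity check that the coordinator LUNs are actually reachable again (before touching VCS at all) might look like the sketch below; the target IQN and portal are placeholders, not values from this environment.

# List established iSCSI sessions to confirm the targets are logged in
iscsiadm -m session

# If a target is missing, log back in (IQN and portal are placeholders)
# iscsiadm -m node -T <target-iqn> -p <portal-ip> --login

# Confirm the LUNs are visible as block devices
fdisk -l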

 

Has this ever happened to anyone, and do you have any recommendations in terms of steps? Is this something I can do without affecting MySQL?

problem cluster

[root@linsql01u35 ~]# vxfenadm -d

VXFEN vxfenadm ERROR V-11-2-1101 Open failed for device: /dev/vxfen


working cluster
[root@linsvn02u32 ~]# vxfenadm -d

I/O Fencing Cluster Information:
================================

 Fencing Protocol Version: 201
 Fencing Mode: SCSI3
 Fencing SCSI3 Disk Policy: dmp
 Cluster Members:  

          0 (linsvn01u32.prod.domain.com)
        * 1 (linsvn02u32.prod.domain.com)

 RFSM State Information:
        node   0 in state  8 (running)
        node   1 in state  8 (running)


Gaurav_S
Moderator
   VIP    Certified

Can you paste the output below from both of the nodes?

# gabconfig -a

 

Also, what are the VCS & OS versions?

 

 

Gaurav

infinitiguy
Level 4

VCS 5.0.3, I believe. OS is RHEL 5.2 64-bit.

I included output from the broken cluster, as well as a working cluster, for comparison. I didn't set these up, so I'm unsure what exactly port b refers to (it's the one missing on the SQL cluster).

sql cluster - not working

 

[root@linsql01u35 ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   f8df01 membership 01
Port h gen   f8df04 membership 01
 
 
[root@linsql02u35 ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   f8df01 membership 01
Port h gen   f8df04 membership 01
[root@linsql02u35 ~]# 
 
svn cluster - working
[root@linsvn01u32 ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   8fea04 membership 01
Port b gen   8fea03 membership 01
Port h gen   8fea05 membership 01
 
[root@linsvn01u32 ~]# 
 
[root@linsvn02u32 ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   8fea04 membership 01
Port b gen   8fea03 membership 01
Port h gen   8fea05 membership 01
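For context when reading these two outputs: the GAB port letters map to VCS kernel components, so the missing port b on the SQL cluster is the fencing driver itself. The mapping below is the standard one from the VCS documentation; double-check it against your release.

# Port a - GAB membership
# Port b - I/O fencing driver (vxfen)
# Port h - VCS engine (had)
gabconfig -a     # no port b line => fencing never registered on this node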

Marianne
Level 6
Partner    VIP    Accredited Certified

I was trying to figure out what quorum disks were doing in a VCS cluster....

So, it seems the issue is with fencing disks.

Can the disks be seen using 'vxdisk -o alldgs list'?

Run 'vxdctl enable' to rescan before doing vxdisk list.

Please also check contents of /etc/vxfendg

If vxdisk list does not show the fencing diskgroup name (from the previous command) in deported state for all coordinator disks, you will have to investigate and ensure the disks are seen at this level.
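Putting those checks together, a minimal sketch of the sequence to run on each node (the diskgroup name vxfencoorddg is the one that shows up later in this thread):

vxdctl enable            # rescan so VxVM picks up the re-connected iSCSI LUNs
vxdisk -o alldgs list    # coordinator disks should be online, with their diskgroup shown in brackets (deported)
cat /etc/vxfendg         # name of the coordinator diskgroup fencing expects (vxfencoorddg here)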

infinitiguy
Level 4

My apologies. My total experience with VCS is about 10 days :) I'm trying to come up to speed. Other people on my team have referred to them as quorum disks.

What will vxdctl enable actually do? Will it cause any issue with the running application or cause any harm? Do I have to run it on both VCS nodes? Same question for vxdisk list. Currently linsql01u35 is the active node (running MySQL) and linsql02u35 is the passive node, not doing anything.

If vxdisk list does show the disks in a deported state, I assume that is good, and then we could get them re-registered somehow?

 

The working and broken clusters (all nodes) all have this in vxfendg.

 

[root@linsql02u35 data]# cat /etc/vxfendg 
vxfencoorddg

Gaurav_S
Moderator
   VIP    Certified

The vxdctl enable command will rescan all the disks on the server... It would be good to run it on both nodes... However, if you see 3 coordinator disks on each node, then the disks may not be the issue... The vxdisk list command will just print how many disks are available on the server... no harm.

gabconfig -a shows that the fencing module is not loaded... & since port h is already loaded, it won't be good to just start VCS now. If you want to restart VCS on the SQL cluster, you should shut down cluster services (hastop -all) or, if you want to keep applications running, use hastop -local -force (see the sketch below).
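For clarity, a rough sketch of those two stop/start paths (generic VCS usage, not verified against this particular cluster):

# Option 1: full outage - stop VCS and the applications it manages, on all nodes
hastop -all

# Option 2: keep applications running - stop only the VCS engine on this node,
# leaving the service groups (MySQL etc.) up
hastop -local -force

# Later, once fencing is sorted out, bring the engine back
hastart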

Can you also paste the following outputs from the SQL cluster:

# cat /etc/vxfenmode

# cat /etc/VRTSvcs/conf/config/main.cf | grep -i usefence

 

Gaurav

Marianne
Level 6
Partner    VIP    Accredited Certified

About vxdisk list - it simply lists what is seen.

We need the -o alldgs option because fencing disks must be visible, but not imported.

If fencing/coordinator disks can be seen, we should be able to start fencing:
/etc/init.d/vxfen start

The whole process of configuring fencing is described in the VCS Install Guide: https://sort.symantec.com/public/documents/sf/5.0MP3/linux/pdf/vcs_install.pdf
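As a follow-up to that suggestion, a minimal sketch of starting fencing and confirming it registered (run on the active node first, then the passive one):

/etc/init.d/vxfen start    # start the fencing driver
vxfenadm -d                # both nodes should reach RFSM state "running"
gabconfig -a               # port b should now appear alongside ports a and h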

 

infinitiguy
Level 4

I'm going to run a disk rescan now and see what happens. I restarted the iSCSI service, and fdisk -l shows a bunch of /dev/sda/sdb/sdc disks that appear on our other Veritas servers and aren't filesystem disks that we use, so I have to assume these are the guys I'm looking for... stay tuned.. :)

Output of vxfenmode and main.cf is below. There was no UseFence found inside of main.cf.

 

[root@linsql01u35 ~]# cat /etc/vxfenmode 
#
# vxfen_mode determines in what mode VCS I/O Fencing should work.
#
# available options:
# scsi3      - use scsi3 persistent reservation disks
# customized - use script based customized fencing
# disabled   - run the driver but don't do any actual fencing
#
vxfen_mode=scsi3
 
#
# scsi3_disk_policy determines the way in which I/O Fencing communicates with
# the coordination disks.
#
# available options:
# dmp - use dynamic multipathing
# raw - connect to disks using the native interface
#
scsi3_disk_policy=dmp
 
[root@linsql01u35 ~]# cat /etc/VRTSvcs/conf/config/main.cf | grep -i usefence
[root@linsql01u35 ~]# 
 
[root@linsql02u35 data]# cat /etc/vxfenmode 
#
# vxfen_mode determines in what mode VCS I/O Fencing should work.
#
# available options:
# scsi3      - use scsi3 persistent reservation disks
# customized - use script based customized fencing
# disabled   - run the driver but don't do any actual fencing
#
vxfen_mode=scsi3
 
#
# scsi3_disk_policy determines the way in which I/O Fencing communicates with
# the coordination disks.
#
# available options:
# dmp - use dynamic multipathing
# raw - connect to disks using the native interface
#
scsi3_disk_policy=dmp
 
[root@linsql02u35 data]# cat /etc/VRTSvcs/conf/config/main.cf | grep -i usefence
[root@linsql02u35 data]# 

Gaurav_S
Moderator
Moderator
   VIP    Certified

If the main.cf does not have the line UseFence=SCSI3, that means the cluster was started or set up without the use of I/O Fencing.

If you want to use I/O Fencing, you will need to stop the cluster, add the UseFence attribute, & restart cluster services. Only then will VCS understand that the cluster has to use I/O Fencing, & while importing the diskgroups it will import them with reservations.

Going back to the original problem, I would suggest these steps:

-- Take a downtime (preferred) & shut down the entire SQL cluster.

-- Once port h is gone from the gabconfig -a output & you see no "had" or "hashadow" process running, try to start fencing on both nodes using /etc/init.d/vxfen start

-- Modify the main.cf & add the "UseFence = SCSI3" attribute; refer to the VCS admin guide or users guide for details

-- Start the cluster using the "hastart" command.

In case you do not intend to use fencing, you should use "disabled" mode in /etc/vxfenmode... you can see a sample of the disabled vxfenmode in the /etc/vxfen.d/vxfenmode_disabled file...
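For illustration, here is roughly what the two configuration choices look like; the cluster name below is a placeholder, so check your own main.cf and the install guide before editing anything.

# To enable fencing: in /etc/VRTSvcs/conf/config/main.cf the attribute sits
# inside the cluster definition, something like:
#
#   cluster sqlclus (
#           UseFence = SCSI3
#           )
#
# To run without fencing instead, copy the disabled sample into place:
cp /etc/vxfen.d/vxfenmode_disabled /etc/vxfenmode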

 

Gaurav

infinitiguy
Level 4

Alright... it looks like I'm back online.

I'm not sure if iSCSI was able to see the disks all along, but I switched the config back to what it needed to be and ran a restart. I got an error saying it was already connected to the target, which makes me think that the change I made had no effect. fdisk -l showed all the disks.

 

I then did a vxdctl enable and a vxdisk -o alldgs list on each node and saw 3 disks - online.

Then I ran /etc/init.d/vxfen start on the node running MySQL to get it online as the master, then ran it on the 2nd node (which was running Postgres). vxfenadm -d showed both nodes having joined (it took about 15 seconds for the 2nd node to join), and hastatus -sum showed that Postgres had faulted and moved over to the MySQL node, which I'm a tiny bit surprised by because I thought I had Postgres configured to have the 2nd node as its primary... but maybe because we were restoring fencing, the primary node said that it was master and needed to run all the applications.

 

I am thankful that I didn't decide to run vxfen start on the second node first, as that would've caused MySQL to fail over... which was what I wanted to avoid in the first place!!! :)
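For anyone landing on this thread later, a condensed sketch of the recovery sequence that worked here (it assumes the coordinator LUNs are reachable again, and starts fencing on the active node first):

# On each node: rescan and confirm the coordinator disks are visible
vxdctl enable
vxdisk -o alldgs list

# On the active (MySQL) node first, then on the passive node:
/etc/init.d/vxfen start

# Verify fencing membership and the cluster state
vxfenadm -d
hastatus -sum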