08-12-2013 02:21 AM
Hi all,
I'm running Veritas Storage Foundation Standard HA 5.0MP3 under SUSE Linux Enterprise Server 11 on two Oracle X6270 servers.
There was a power outage, causing an immediate hard shutdown of both servers. After power was restored, the server on which the Oracle service group was active ("db1-hasc") could not boot at all (mainboard failure). The other server ("db2-hasc") booted, but reported during boot that the cluster could not start and that manual seeding might be needed, so I seeded the cluster from the working server db2-hasc using the command gabconfig -x (found after some googling).
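For reference, as far as I understand the membership can be checked with gabconfig -a; with only one node seeded I would expect output roughly like this (the gen numbers will differ, so treat this as approximate):
db2-hasc# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   4a1c0001 membership 0
Port h gen   4a1c0002 membership 0
(port a is GAB itself, port h is had/VCS; "membership 0" means only node 0 is currently a member)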
In the meantime, the failed server db1-hasc was fixed, and the cluster is now working (all service groups online, but on db2-hasc, the one which started successfully after the power outage). No attempt has been made yet to switch any of the service groups over to the db1-hasc server (except the network service groups, which are online).
However, I have noticed some problems with several volumes in disk group "oracledg":
db2-hasc# vxprint -g oracledg
TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0
dg oracledg oracledg - - - - - -
dm oracled01 - - - - NODEVICE - -
dm oracled02 sdb - 335462144 - - - -
v archvol fsgen ENABLED 62914560 - ACTIVE - -
pl archvol-01 archvol DISABLED 62914560 - NODEVICE - -
sd oracled01-02 archvol-01 DISABLED 62914560 0 NODEVICE - -
pl archvol-02 archvol ENABLED 62914560 - ACTIVE - -
sd oracled02-02 archvol-02 ENABLED 62914560 0 - - -
v backupvol fsgen ENABLED 167772160 - ACTIVE - -
pl backupvol-01 backupvol DISABLED 167772160 - NODEVICE - -
sd oracled01-03 backupvol-01 DISABLED 167772160 0 NODEVICE - -
pl backupvol-02 backupvol ENABLED 167772160 - ACTIVE - -
sd oracled02-03 backupvol-02 ENABLED 167772160 0 - - -
v dbovol fsgen ENABLED 62914560 - ACTIVE - -
pl dbovol-01 dbovol DISABLED 62914560 - NODEVICE - -
sd oracled01-01 dbovol-01 DISABLED 62914560 0 NODEVICE - -
pl dbovol-02 dbovol ENABLED 62914560 - ACTIVE - -
sd oracled02-01 dbovol-02 ENABLED 62914560 0 - - -
db2-hasc# vxprint -htg oracledg
DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME RVG KSTATE STATE NVOLUME
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO
EX NAME ASSOC VC PERMS MODE STATE
SR NAME KSTATE
dg oracledg default default 0 1265259474.12.db1-HASc
dm oracled01 - - - - NODEVICE
dm oracled02 sdb auto 65536 335462144 -
v archvol - ENABLED ACTIVE 62914560 SELECT - fsgen
pl archvol-01 archvol DISABLED NODEVICE 62914560 CONCAT - WO
sd oracled01-02 archvol-01 oracled01 62914560 62914560 0 - NDEV
pl archvol-02 archvol ENABLED ACTIVE 62914560 CONCAT - RW
sd oracled02-02 archvol-02 oracled02 62914560 62914560 0 sdb ENA
v backupvol - ENABLED ACTIVE 167772160 SELECT - fsgen
pl backupvol-01 backupvol DISABLED NODEVICE 167772160 CONCAT - WO
sd oracled01-03 backupvol-01 oracled01 125829120 167772160 0 - NDEV
pl backupvol-02 backupvol ENABLED ACTIVE 167772160 CONCAT - RW
sd oracled02-03 backupvol-02 oracled02 125829120 167772160 0 sdb ENA
v dbovol - ENABLED ACTIVE 62914560 SELECT - fsgen
pl dbovol-01 dbovol DISABLED NODEVICE 62914560 CONCAT - WO
sd oracled01-01 dbovol-01 oracled01 0 62914560 0 - NDEV
pl dbovol-02 dbovol ENABLED ACTIVE 62914560 CONCAT - RW
sd oracled02-01 dbovol-02 oracled02 0 62914560 0 sdb ENA
Does anyone have ideas on how to recover the disabled plexes/subdisks? Or which other commands to run to ascertain the current state of the cluster, so I can get a clearer picture of what is wrong and which steps to take to remedy the problem?
If so, I would appreciate any tips/suggestions you can share.
The physical disks seem fine (no errors reported in ILOM diagnostics).
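In case it helps, these are the status commands I have gathered from the docs so far (not sure this is the complete set, so corrections are welcome):
db2-hasc# hastatus -sum // VCS system and service group states
db2-hasc# gabconfig -a // GAB port memberships
db2-hasc# lltstat -nvv // LLT link states towards the peer node
db2-hasc# vxdisk -oalldgs list // disk and diskgroup visibility
db2-hasc# vxprint -htg oracledg // detailed volume/plex/subdisk states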
Thanks,
/Hrvoje
08-12-2013 04:08 AM
Hi,
as you can see in the vxprint output, one of the disks is missing:
dg oracledg oracledg - - - - - -
dm oracled01 - - - - NODEVICE - -
dm oracled02 sdb - 335462144 - - - -
So all plexes that reside on this disk are disabled with NODEVICE state as well.
Please check your hardware disk configuration and verify that the disk is visible and accessible at the OS level; if the disk is still not seen by VxVM after that, you can refresh DMP.
First I would start with:
# fdisk -l
# vxdisk -oalldgs list
Also grep for scsi and udev messages in the syslog.
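For example (assuming syslog goes to the default /var/log/messages on SLES):
# grep -i scsi /var/log/messages
# grep -i udev /var/log/messages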
Please post the outputs here if you need further help.
Thanks,
Dan
08-12-2013 06:12 AM
Hi Dan,
thanks for the tips.
Checking the syslog, I do see some (I suppose subdisk-related) errors (please check the attached log).
Grepping for either udev or scsi does not produce any hits in the syslog.
I also attached the output of fdisk -l .
Also, here is the requested vxdisk output:
db1-HASc:/var/log # vxdisk -oalldgs list
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk - (oracledg) online
sdc auto:none - - online invalid
sdd auto:cdsdisk - (sunasdg) online
sde auto:cdsdisk - (sunasdg) online
From the documentation, this last output seems to indicate that the physical disk is OK but is not under VxVM control ("online invalid")?
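(As a side note: I assume the per-disk details, including flags and any private region info, can be inspected with vxdisk list on the device itself, e.g.
db1-HASc:/var/log # vxdisk list sdc
but please correct me if that is not the right way.)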
Regards,
/Hrvoje
08-12-2013 07:39 AM
Hello Hrvoje,
It seems that the private region has been corrupted or the disk has been physically replaced.
I guess sda is your boot disk, and sdc is your missing disk.
Looking at the fdisk output, there are no partitions on it anymore.
As this is a mirrored volume, you can simply initialize the disk, re-add it to the diskgroup, and resync the mirror.
I have performed the steps on my test system (to reproduce the issue I deleted the VxVM partitions):
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:none - - online invalid
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
- - disk_0 testdg failed was:disk_0
[root@Server101 ~]# /opt/VRTS/bin/vxdisksetup -i disk_0
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:cdsdisk - - online
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
- - disk_0 testdg failed was:disk_0
It is important to use the -k option, to tell VxVM that you want to add the disk in place of the failed disk:
[root@Server101 ~]# vxdg -g testdg -k adddisk disk_0
[root@Server101 ~]#
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:cdsdisk disk_0 testdg online
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
[root@Server101 ~]# vxprint -htg testdg
dg testdg default default 23000 1376316721.50.Server102
dm disk_0 disk_0 auto 65536 2027264 -
dm disk_1 disk_1 auto 65536 2027264 -
v testvol1 - ENABLED ACTIVE 2025472 SELECT - fsgen
pl testvol1-01 testvol1 DISABLED RECOVER 2025472 CONCAT - WO
sd disk_0-01 testvol1-01 disk_0 0 2025472 0 disk_0 ENA
pl testvol1-02 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_1-01 testvol1-02 disk_1 0 2025472 0 disk_1 ENA
Recover in the background:
[root@Server101 ~]# vxrecover -b
Check vxtask (with the monitor option you can watch the progress):
[root@Server101 ~]# vxtask list
TASKID PTID TYPE/STATE PCT PROGRESS
162 PARENT/R 0.00% 1/0(1) VXRECOVER
163 162 ATCOPY/R 19.01% 0/2025472/385024 PLXATT testvol1 testvol1-01 testdg
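If you prefer to watch the resync continuously, vxtask also has a monitor mode (as I understand it, it keeps printing progress updates as the task proceeds; press Ctrl-C to stop watching):
[root@Server101 ~]# vxtask monitor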
Once finished, your volume should be in good shape again:
dg testdg default default 23000 1376316721.50.Server102
dm disk_0 disk_0 auto 65536 2027264 -
dm disk_1 disk_1 auto 65536 2027264 -
v testvol1 - ENABLED ACTIVE 2025472 SELECT - fsgen
pl testvol1-01 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_0-01 testvol1-01 disk_0 0 2025472 0 disk_0 ENA
pl testvol1-02 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_1-01 testvol1-02 disk_1 0 2025472 0 disk_1 ENA
08-13-2013 06:22 AM
Hi Dan,
thank you very much for the effort!
I will check this on the system and will let you know the result - hopefully it goes fine.
08-13-2013 10:43 PM
Hi Hrvoje,
Apart from the missing-disk issue above, I think you should restart GAB/VCS or reboot the OS so that the GAB instances seed with each other and VCS becomes fully functional after the recovery. Because db2-hasc came up only while db1-hasc was down, and GAB was started with the -x option (which does not seed with peer nodes), you need to do so.
The simple way is to reboot db2-hasc, but that requires stopping the applications. The other way is to freeze all service groups on db2-hasc and restart GAB/had:
1. hagrp -freeze <group name>
2. hastop -local -force // stops had on the local node without stopping service groups or applications
3. gabconfig -U // stops GAB
4. sh /etc/gabtab // starts GAB
5. hastart
6. hagrp -unfreeze <group name>
After that, you can switchover or failover the service group as normal.
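To verify afterwards, something like this should do (assuming the standard tools are in the PATH):
# gabconfig -a // both nodes should now appear in the port a and port h memberships
# hastatus -sum // both systems should report RUNNING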
08-14-2013 01:13 AM
Hi Stinsong,
that is not necessary. Starting GAB with -x does a force seed if the other nodes in the cluster are not available.
But once the other nodes come up, they will join the cluster normally. There is no need to reboot or restart GAB.
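For reference, normal seeding is driven by /etc/gabtab, which on a two-node cluster typically contains something like:
/sbin/gabconfig -c -n2
i.e. GAB waits for 2 nodes before seeding - which is exactly why a single surviving node needs the manual -x seed.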
08-14-2013 07:08 AM
Hi Dan,
(I know you cannot guarantee it, but) would there be any risk to the existing configuration in case the procedure fails? The servers are live, so the recovery should cause minimal downtime (if any).
[edit]: I have noticed another thing: db1-hasc was the server whose mainboard was replaced, while db2-hasc was fine (the cluster was started from it using gabconfig -x).
When running vxdisk list command on db1-hasc I get this:
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk - - online
sdc auto:none - - online invalid
sdd auto:cdsdisk - - online
sde auto:cdsdisk - - online
However when running the same command on db2-hasc:
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk oracled02 oracledg online
sdc auto:none - - online invalid
sdd auto:cdsdisk sunasd01 sunasdg online
sde auto:cdsdisk sunasd02 sunasdg online
- - oracled01 oracledg failed was:sdc
i.e. the output is similar to yours above. Should I run the commands you indicated on db2-hasc or db1-hasc?
Thanks,
/Hrvoje
08-14-2013 07:33 AM
Hi Hrvoje,
there shouldn't be any risk to the existing config, as you are just mirroring the existing volume.
You can perform the operation online; if you have heavy I/O on the volume, you might see a slight performance decrease during the resync.
You need to perform the steps on the node which has the diskgroup imported - so, per your outputs, db2-hasc.
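You can double-check which node has it imported with vxdg list - it only shows the diskgroups imported on the node where you run it:
db2-hasc# vxdg list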
You could also first try using vxreattach:
https://sort.symantec.com/public/documents/sf/5.0/linux/html/vxvm_tshoot/ts_ch_hardwarerecovery_vm10.html
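If I remember the options correctly (please verify against the man page), a minimal vxreattach sequence would be:
db2-hasc# vxreattach -c sdc // only checks whether the disk can be reattached to its old diskgroup
db2-hasc# vxreattach -br sdc // reattach and recover the plexes in the background
where sdc is taken from your output above.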
09-19-2013 01:10 PM
Hi Daniel & all,
following up on this thread, just to let you know that we ended up calling Veritas emergency support about this. In the meantime, one service group on the db1-hasc server went to "faulted" state, and there were also some network switch restarts after that; these two events seem to have caused a failure of the db1-hasc server, and although Veritas on the other server (db2-hasc) was in state "running", the other shared disk groups somehow got corrupted, so neither server could import them successfully. All this led to a complete failure of the whole cluster.
The fine folks from Veritas emergency support connected to the system and, after some troubleshooting, resyncing, and fsck-ing of the file systems in the diskgroups (required because many inodes were missing), they managed to get the system up and running.
I want to thank you once again for your help and suggestions - although I didn't get a chance to test them, the system is now recovered and finally working properly.
Cheers,
/Hrvoje