08-12-2013 02:21 AM
Hi all,
I'm running Veritas Storage Foundation Standard HA 5.0MP3 under SUSE Linux Enterprise Server 11 on two Oracle X6270 servers.
There was a power outage, causing an immediate hard shutdown of both servers. After power was restored, the server on which the Oracle service group was active ("db1-hasc") could not boot at all (mainboard failure). The other server ("db2-hasc") booted, but reported during boot that the cluster could not start and that manual seeding might be needed, so I seeded the cluster from the working server db2-hasc using the command gabconfig -x (found after some googling).
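For reference, as far as I understand the membership can be checked with gabconfig -a; with only one node seeded I would expect output roughly like this (the gen numbers will differ, so treat this as approximate):
db2-hasc# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen   4a1c0001 membership 0
Port h gen   4a1c0002 membership 0
(port a is GAB itself, port h is had/VCS; "membership 0" means only node 0 is currently a member)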
In the meantime, the failed server db1-hasc was fixed, and the cluster is now working (all service groups online, but on db2-hasc, the one which started successfully after the power outage). No attempt has been made yet to switch any of the service groups over to the db1-hasc server (except the network service groups, which are online).
However, I have noticed some problems with several volumes in disk group "oracledg":
db2-hasc# vxprint -g oracledg
TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0
dg oracledg oracledg - - - - - -
dm oracled01 - - - - NODEVICE - -
dm oracled02 sdb - 335462144 - - - -
v archvol fsgen ENABLED 62914560 - ACTIVE - -
pl archvol-01 archvol DISABLED 62914560 - NODEVICE - -
sd oracled01-02 archvol-01 DISABLED 62914560 0 NODEVICE - -
pl archvol-02 archvol ENABLED 62914560 - ACTIVE - -
sd oracled02-02 archvol-02 ENABLED 62914560 0 - - -
v backupvol fsgen ENABLED 167772160 - ACTIVE - -
pl backupvol-01 backupvol DISABLED 167772160 - NODEVICE - -
sd oracled01-03 backupvol-01 DISABLED 167772160 0 NODEVICE - -
pl backupvol-02 backupvol ENABLED 167772160 - ACTIVE - -
sd oracled02-03 backupvol-02 ENABLED 167772160 0 - - -
v dbovol fsgen ENABLED 62914560 - ACTIVE - -
pl dbovol-01 dbovol DISABLED 62914560 - NODEVICE - -
sd oracled01-01 dbovol-01 DISABLED 62914560 0 NODEVICE - -
pl dbovol-02 dbovol ENABLED 62914560 - ACTIVE - -
sd oracled02-01 dbovol-02 ENABLED 62914560 0 - - -
db2-hasc# vxprint -htg oracledg
DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME RVG KSTATE STATE NVOLUME
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO
EX NAME ASSOC VC PERMS MODE STATE
SR NAME KSTATE
dg oracledg default default 0 1265259474.12.db1-HASc
dm oracled01 - - - - NODEVICE
dm oracled02 sdb auto 65536 335462144 -
v archvol - ENABLED ACTIVE 62914560 SELECT - fsgen
pl archvol-01 archvol DISABLED NODEVICE 62914560 CONCAT - WO
sd oracled01-02 archvol-01 oracled01 62914560 62914560 0 - NDEV
pl archvol-02 archvol ENABLED ACTIVE 62914560 CONCAT - RW
sd oracled02-02 archvol-02 oracled02 62914560 62914560 0 sdb ENA
v backupvol - ENABLED ACTIVE 167772160 SELECT - fsgen
pl backupvol-01 backupvol DISABLED NODEVICE 167772160 CONCAT - WO
sd oracled01-03 backupvol-01 oracled01 125829120 167772160 0 - NDEV
pl backupvol-02 backupvol ENABLED ACTIVE 167772160 CONCAT - RW
sd oracled02-03 backupvol-02 oracled02 125829120 167772160 0 sdb ENA
v dbovol - ENABLED ACTIVE 62914560 SELECT - fsgen
pl dbovol-01 dbovol DISABLED NODEVICE 62914560 CONCAT - WO
sd oracled01-01 dbovol-01 oracled01 0 62914560 0 - NDEV
pl dbovol-02 dbovol ENABLED ACTIVE 62914560 CONCAT - RW
sd oracled02-01 dbovol-02 oracled02 0 62914560 0 sdb ENA
Does anyone have ideas on how to recover the disabled plexes/subdisks? Or which other commands to run to ascertain the current state of the cluster, so I can get a clearer picture of what is wrong and which steps to take to remedy the problem?
If so, I would appreciate any tips/suggestions you can share.
The physical disks seem fine (no errors reported in ILOM diagnostics).
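In case it helps, these are the status commands I have gathered from the docs so far (not sure this is the complete set, so corrections are welcome):
db2-hasc# hastatus -sum // VCS system and service group states
db2-hasc# gabconfig -a // GAB port memberships
db2-hasc# lltstat -nvv // LLT link states towards the peer node
db2-hasc# vxdisk -oalldgs list // disk and diskgroup visibility
db2-hasc# vxprint -htg oracledg // detailed volume/plex/subdisk states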
Thanks,
/Hrvoje
08-12-2013 04:08 AM
Hi,
as you can see in the vxprint output, one of the disks is missing:
dg oracledg oracledg - - - - - -
dm oracled01 - - - - NODEVICE - -
dm oracled02 sdb - 335462144 - - - -
So all plexes that reside on this disk are disabled with NODEVICE state as well.
Please check your hardware disk configuration and verify that the disk is visible and accessible at the OS level; if the disk is still not seen by VxVM after that, you can refresh DMP.
First I would start with:
# fdisk -l
# vxdisk -oalldgs list
Also grep for scsi and udev messages in the syslog.
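For example (assuming syslog goes to the default /var/log/messages on SLES):
# grep -i scsi /var/log/messages
# grep -i udev /var/log/messages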
Please post the outputs here if you need further help.
Thanks,
Dan
08-12-2013 06:12 AM
Hi Dan,
thanks for the tips.
Checking the syslog, I do see some (I suppose subdisk-related) errors (please check the attached log).
Grepping for either udev or scsi does not produce any hits in the syslog.
I also attached the output of fdisk -l .
Also, here is the requested vxdisk output:
db1-HASc:/var/log # vxdisk -oalldgs list
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk - (oracledg) online
sdc auto:none - - online invalid
sdd auto:cdsdisk - (sunasdg) online
sde auto:cdsdisk - (sunasdg) online
From the documentation, this last output seems to indicate that the physical disk is OK but is not under VxVM control ("online invalid")?
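(As a side note: I assume the per-disk details, including flags and any private region info, can be inspected with vxdisk list on the device itself, e.g.
db1-HASc:/var/log # vxdisk list sdc
but please correct me if that is not the right way.)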
Regards,
/Hrvoje
08-12-2013 07:39 AM
Hello Hrvoje,
It seems that the private region has been corrupted or the disk has been physically replaced.
I guess sda is your boot disk, and sdc is your missing disk.
Looking at the fdisk output, there are no partitions on it anymore.
As this is a mirrored volume, you can simply initialize the disk, re-add it to the diskgroup, and resync the mirror.
I have performed the steps on my test system (to reproduce the issue I deleted the VxVM partitions):
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:none - - online invalid
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
- - disk_0 testdg failed was:disk_0
[root@Server101 ~]# /opt/VRTS/bin/vxdisksetup -i disk_0
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:cdsdisk - - online
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
- - disk_0 testdg failed was:disk_0
It is important to use the -k option, to tell VxVM that you want to add the disk in place of the failed disk:
[root@Server101 ~]# vxdg -g testdg -k adddisk disk_0
[root@Server101 ~]#
[root@Server101 ~]# vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk_0 auto:cdsdisk disk_0 testdg online
disk_1 auto:cdsdisk disk_1 testdg online
disk_2 auto:cdsdisk - - online
disk_3 auto:cdsdisk - - online
disk_4 auto:cdsdisk - - online
disk_5 auto:cdsdisk - - online
disk_6 auto:cdsdisk - - online
disk_7 auto:cdsdisk - - online
disk_8 auto:cdsdisk - - online
disk_9 auto:cdsdisk - - online
sda auto:none - - online invalid
[root@Server101 ~]# vxprint -htg testdg
dg testdg default default 23000 1376316721.50.Server102
dm disk_0 disk_0 auto 65536 2027264 -
dm disk_1 disk_1 auto 65536 2027264 -
v testvol1 - ENABLED ACTIVE 2025472 SELECT - fsgen
pl testvol1-01 testvol1 DISABLED RECOVER 2025472 CONCAT - WO
sd disk_0-01 testvol1-01 disk_0 0 2025472 0 disk_0 ENA
pl testvol1-02 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_1-01 testvol1-02 disk_1 0 2025472 0 disk_1 ENA
Recover in the background:
[root@Server101 ~]# vxrecover -b
Check vxtask (with the monitor option you can watch the progress):
[root@Server101 ~]# vxtask list
TASKID PTID TYPE/STATE PCT PROGRESS
162 PARENT/R 0.00% 1/0(1) VXRECOVER
163 162 ATCOPY/R 19.01% 0/2025472/385024 PLXATT testvol1 testvol1-01 testdg
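If you prefer to watch the resync continuously, vxtask also has a monitor mode (as I understand it, it keeps printing progress updates as the task proceeds; press Ctrl-C to stop watching):
[root@Server101 ~]# vxtask monitor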
Once finished, your volume should be in good shape again:
dg testdg default default 23000 1376316721.50.Server102
dm disk_0 disk_0 auto 65536 2027264 -
dm disk_1 disk_1 auto 65536 2027264 -
v testvol1 - ENABLED ACTIVE 2025472 SELECT - fsgen
pl testvol1-01 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_0-01 testvol1-01 disk_0 0 2025472 0 disk_0 ENA
pl testvol1-02 testvol1 ENABLED ACTIVE 2025472 CONCAT - RW
sd disk_1-01 testvol1-02 disk_1 0 2025472 0 disk_1 ENA
08-13-2013 06:22 AM
Hi Dan,
thank you very much for the effort!
I will check this on the system and will let you know the result - hopefully it goes fine.
08-13-2013 10:43 PM
Hi Hrvoje,
Apart from the missing-disk issue above, I think you should restart GAB/VCS or reboot the OS so that the GAB instances seed with each other and VCS becomes fully functional after the recovery. Because db2-hasc came up only while db1-hasc was down, and GAB was started with the -x option (which does not seed with peer nodes), you need to do so.
The simple way is to reboot db2-hasc, but that requires stopping the applications. The other way is to freeze all service groups on db2-hasc and restart GAB/had:
1. hagrp -freeze <group name>
2. hastop -local -force // stops had on the local node without stopping service groups or applications
3. gabconfig -U // stops GAB
4. sh /etc/gabtab // starts GAB
5. hastart
6. hagrp -unfreeze <group name>
After that, you can switchover or failover the service group as normal.
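To verify afterwards, something like this should do (assuming the standard tools are in the PATH):
# gabconfig -a // both nodes should now appear in the port a and port h memberships
# hastatus -sum // both systems should report RUNNING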
08-14-2013 01:13 AM
Hi Stinsong,
that is not necessary. Starting GAB with -x does a force seed if the other nodes in the cluster are not available.
But once the other nodes come up, they will join the cluster normally. There is no need to reboot or restart GAB.
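For reference, normal seeding is driven by /etc/gabtab, which on a two-node cluster typically contains something like:
/sbin/gabconfig -c -n2
i.e. GAB waits for 2 nodes before seeding - which is exactly why a single surviving node needs the manual -x seed.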
08-14-2013 07:08 AM
Hi Dan,
(I know you cannot guarantee it, but) would there be any risk to the existing configuration in case the procedure fails? The servers are live, so the recovery should cause minimal downtime (if any).
[edit]: I have noticed another thing: db1-hasc was the server whose mainboard was replaced, while db2-hasc was fine (the cluster was started from it using gabconfig -x).
When running vxdisk list command on db1-hasc I get this:
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk - - online
sdc auto:none - - online invalid
sdd auto:cdsdisk - - online
sde auto:cdsdisk - - online
However when running the same command on db2-hasc:
DEVICE TYPE DISK GROUP STATUS
sda auto:none - - online invalid
sdb auto:cdsdisk oracled02 oracledg online
sdc auto:none - - online invalid
sdd auto:cdsdisk sunasd01 sunasdg online
sde auto:cdsdisk sunasd02 sunasdg online
- - oracled01 oracledg failed was:sdc
i.e. the output is similar to yours above. Should I run the commands you indicated on db2-hasc or db1-hasc?
Thanks,
/Hrvoje
08-14-2013 07:33 AM
Hi Hrvoje,
there shouldn't be any risk to the existing config, as you are just mirroring the existing volume.
You can perform the operation online; if you have heavy I/O on the volume, you might see a slight performance decrease during the resync.
You need to perform the steps on the node which has the diskgroup imported - so, per your outputs, db2-hasc.
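You can double-check which node has it imported with vxdg list - it only shows the diskgroups imported on the node where you run it:
db2-hasc# vxdg list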
You could also first try using vxreattach:
https://sort.symantec.com/public/documents/sf/5.0/linux/html/vxvm_tshoot/ts_ch_hardwarerecovery_vm10.html
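If I remember the options correctly (please verify against the man page), a minimal vxreattach sequence would be:
db2-hasc# vxreattach -c sdc // only checks whether the disk can be reattached to its old diskgroup
db2-hasc# vxreattach -br sdc // reattach and recover the plexes in the background
where sdc is taken from your output above.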
09-19-2013 01:10 PM
Hi Daniel & all,
following up on this thread, just to let you know that we ended up calling Veritas emergency support about this. In the meantime, one service group on the db1-hasc server went to "faulted" state, and there were also some network switch restarts after that; these two events seem to have caused a failure of the db1-hasc server, and although Veritas on the other server (db2-hasc) was in state "running", the other shared disk groups somehow got corrupted, so neither server could import them successfully. All this led to a complete failure of the whole cluster.
The fine folks from Veritas emergency support connected to the system and, after some troubleshooting, resyncing, and fsck-ing of the file systems in the diskgroups (required because many inodes were missing), they managed to get the system up and running.
I want to thank you once again for your help and suggestions - although I didn't get a chance to test them, the system is now recovered and finally working properly.
Cheers,
/Hrvoje