Neobane
10 years ago

Replacing/Restoring Failed Drive

Yesterday, I had two drives in my storage SAN fail.  The SAN monitor reports that all the physical drives are fine.

bash-2.03# vxdisk list | grep AMS
AMS_WMS0_0   auto:cdsdisk    remote996006  remote9960   online
AMS_WMS0_1   auto:cdsdisk    remote996004  remote9960   online
AMS_WMS0_2   auto:cdsdisk    remote996013  remote9960   online
AMS_WMS01_0  auto            -            -            error
AMS_WMS01_1  auto            -            -            error
AMS_WMS01_2  auto:cdsdisk    remote996012  remote9960   online
AMS_WMS012_0 auto:cdsdisk    remote996008  remote9960   online
AMS_WMS012_1 auto:cdsdisk    remote996002  remote9960   online
AMS_WMS012_2 auto:cdsdisk    remote996011  remote9960   online
AMS_WMS0123_0 auto:cdsdisk    remote996007  remote9960   online
AMS_WMS0123_1 auto:cdsdisk    remote996001  remote9960   online
AMS_WMS0123_2 auto:cdsdisk    remote996010  remote9960   online
-            -         remote996003 remote9960   failed was:AMS_WMS01_1
-            -         remote996005 remote9960   failed was:AMS_WMS01_0
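To pick the problem records out of a long listing, a small filter like the one below may help. It is only a sketch: `problem_disks` is a made-up helper name, and it assumes the five-column layout shown above, with the status in the fifth column. It reads the listing on stdin, so it can be tried without VxVM installed.

```shell
#!/bin/sh
# Print "STATUS NAME" for every record whose STATUS column is not "online".
# For failed disk media records the device column is "-", so the DISK column
# ($3) is printed instead of the device name ($1).
problem_disks() {
    awk 'NF >= 5 && $1 != "DEVICE" && $5 != "online" {
        print $5, ($1 == "-" ? $3 : $1)
    }'
}

# Sample input trimmed from the listing above.
problem_disks <<'EOF'
AMS_WMS0_0   auto:cdsdisk    remote996006  remote9960   online
AMS_WMS01_0  auto            -            -            error
AMS_WMS01_1  auto            -            -            error
-            -         remote996003 remote9960   failed was:AMS_WMS01_1
-            -         remote996005 remote9960   failed was:AMS_WMS01_0
EOF
```

On a live system the same function could presumably be fed with `vxdisk list | problem_disks`.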

The devices on lines 5/6 are new.  They are a result of my attempts to repair the two failed drives at the bottom.

So far I've unmounted the volumes associated with the drives, but there are 11 more mounts attached to the disk group.  Being new, I'm not sure which processes and systems are using those volumes, and I'm reluctant to unmount them at the moment.

I followed several guides trying to determine the cause of the problem and/or to restore the two disks.  Here are some of the results of my efforts so far.

> vxprint -htg remote9960

DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         NVOLUME      KSTATE   STATE
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO

dg remote9960   default      default  126000   1032204684.1590.aeneas

dm remote996001 AMS_WMS0123_1 auto    2048     1677616896 -
dm remote996002 AMS_WMS012_1 auto     2048     1677616896 -
dm remote996003 -            -        -        -        NODEVICE
dm remote996004 AMS_WMS0_1   auto     2048     1677616896 -
dm remote996005 -            -        -        -        NODEVICE
dm remote996006 AMS_WMS0_0   auto     2048     1677616896 -
dm remote996007 AMS_WMS0123_0 auto    2048     1258212096 -
dm remote996008 AMS_WMS012_0 auto     2048     1258212096 -
dm remote996010 AMS_WMS0123_2 auto    2048     2094367888 -
dm remote996011 AMS_WMS012_2 auto     2048     2094367888 -
dm remote996012 AMS_WMS01_2  auto     2048     2094367888 -
dm remote996013 AMS_WMS0_2   auto     2048     2094367888 -

v  rem-01       -            DISABLED ACTIVE   1921724416 SELECT  -        fsgen
pl rem-01-02    rem-01       DISABLED NODEVICE 1921843200 STRIPE  4/64     RW
sd remote996004-01 rem-01-02 remote996004 36096 480460800 0/0     AMS_WMS0_1 ENA
sd remote996003-01 rem-01-02 remote996003 36096 480460800 1/0     -        NDEV
sd remote996002-01 rem-01-02 remote996002 36096 480460800 2/0     AMS_WMS012_1 ENA
sd remote996001-01 rem-01-02 remote996001 36096 480460800 3/0     AMS_WMS0123_1 ENA

v  rem-02       -            DISABLED ACTIVE   1921724416 SELECT  -        fsgen
pl rem-02-02    rem-02       DISABLED NODEVICE 1921843200 STRIPE  4/64     RW
sd remote996004-02 rem-02-02 remote996004 480496896 480460800 0/0 AMS_WMS0_1 ENA
sd remote996003-02 rem-02-02 remote996003 480496896 480460800 1/0 -        NDEV
sd remote996002-02 rem-02-02 remote996002 480496896 480460800 2/0 AMS_WMS012_1 ENA
sd remote996001-02 rem-02-02 remote996001 480496896 480460800 3/0 AMS_WMS0123_1 ENA

v  rem-03       -            DISABLED ACTIVE   1000341504 SELECT  -        fsgen
pl rem-03-01    rem-03       DISABLED NODEVICE 1000396800 STRIPE  2/64     RW
sd remote996006-02 rem-03-01 remote996006 106826496 419443200 0/0 AMS_WMS0_0 ENA
sd remote996008-03 rem-03-01 remote996008 1176381696 80755200 0/419443200 AMS_WMS012_0 ENA
sd remote996005-02 rem-03-01 remote996005 106826496 419443200 1/0 -        NDEV
sd remote996007-03 rem-03-01 remote996007 1176381696 80755200 1/419443200 AMS_WMS0123_0 ENA

v  rem-04       -            DISABLED ACTIVE   1921724416 SELECT  -        fsgen
pl rem-04-02    rem-04       DISABLED NODEVICE 1921843200 STRIPE  4/64     RW
sd remote996004-03 rem-04-02 remote996004 960957696 480460800 0/0 AMS_WMS0_1 ENA
sd remote996003-03 rem-04-02 remote996003 960957696 480460800 1/0 -        NDEV
sd remote996002-03 rem-04-02 remote996002 960957696 480460800 2/0 AMS_WMS012_1 ENA
sd remote996001-03 rem-04-02 remote996001 960957696 480460800 3/0 AMS_WMS0123_1 ENA

v  rem-08       -            ENABLED  ACTIVE   2097152000 SELECT  rem-08-01 fsgen
pl rem-08-01    rem-08       ENABLED  ACTIVE   2097177600 STRIPE  2/64     RW
sd remote996008-01 rem-08-01 remote996008 36096 1048588800 0/0    AMS_WMS012_0 ENA
sd remote996007-01 rem-08-01 remote996007 36096 1048588800 1/0    AMS_WMS0123_0 ENA

v  rem-30       -            DISABLED ACTIVE   1887436800 SELECT  -        fsgen
pl rem-30-01    rem-30       DISABLED NODEVICE 1887436800 STRIPE  2/64     RW
sd remote996006-03 rem-30-01 remote996006 526269696 943718400 0/0 AMS_WMS0_0 ENA
sd remote996005-03 rem-30-01 remote996005 526269696 943718400 1/0 -        NDEV

v  rem-40       -            ENABLED  ACTIVE   2097152000 SELECT  rem-40-01 fsgen
pl rem-40-01    rem-40       ENABLED  ACTIVE   2097254400 STRIPE  4/64     RW
sd remote996013-01 rem-40-01 remote996013 36096 524313600 0/0     AMS_WMS0_2 ENA
sd remote996012-01 rem-40-01 remote996012 36096 524313600 1/0     AMS_WMS01_2 ENA
sd remote996011-01 rem-40-01 remote996011 36096 524313600 2/0     AMS_WMS012_2 ENA
sd remote996010-01 rem-40-01 remote996010 36096 524313600 3/0     AMS_WMS0123_2 ENA

v  rem-41       -            ENABLED  ACTIVE   2097152000 SELECT  rem-41-01 fsgen
pl rem-41-01    rem-41       ENABLED  ACTIVE   2097254400 STRIPE  4/64     RW
sd remote996013-02 rem-41-01 remote996013 524349696 524313600 0/0 AMS_WMS0_2 ENA
sd remote996012-02 rem-41-01 remote996012 524349696 524313600 1/0 AMS_WMS01_2 ENA
sd remote996011-02 rem-41-01 remote996011 524349696 524313600 2/0 AMS_WMS012_2 ENA
sd remote996010-02 rem-41-01 remote996010 524349696 524313600 3/0 AMS_WMS0123_2 ENA

v  rem-42       -            ENABLED  ACTIVE   2097152000 SELECT  rem-42-01 fsgen
pl rem-42-01    rem-42       ENABLED  ACTIVE   2097254400 STRIPE  4/64     RW
sd remote996013-03 rem-42-01 remote996013 1048663296 524313600 0/0 AMS_WMS0_2 ENA
sd remote996012-03 rem-42-01 remote996012 1048663296 524313600 1/0 AMS_WMS01_2 ENA
sd remote996011-03 rem-42-01 remote996011 1048663296 524313600 2/0 AMS_WMS012_2 ENA
sd remote996010-03 rem-42-01 remote996010 1048663296 524313600 3/0 AMS_WMS0123_2 ENA

v  rem-43       -            ENABLED  ACTIVE   2085427200 SELECT  rem-43-01 fsgen
pl rem-43-01    rem-43       ENABLED  ACTIVE   2085427200 STRIPE  4/64     RW
sd remote996013-04 rem-43-01 remote996013 1572976896 521356800 0/0 AMS_WMS0_2 ENA
sd remote996012-04 rem-43-01 remote996012 1572976896 521356800 1/0 AMS_WMS01_2 ENA
sd remote996011-04 rem-43-01 remote996011 1572976896 521356800 2/0 AMS_WMS012_2 ENA
sd remote996010-04 rem-43-01 remote996010 1572976896 521356800 3/0 AMS_WMS0123_2 ENA

v  rimg02       -            DISABLED ACTIVE   944793600 SELECT   -        fsgen
pl rimg02-01    rimg02       DISABLED NODEVICE 944793600 STRIPE   4/64     RW
sd remote996004-04 rimg02-01 remote996004 1441418496 236198400 0/0 AMS_WMS0_1 ENA
sd remote996003-04 rimg02-01 remote996003 1441418496 236198400 1/0 -       NDEV
sd remote996002-04 rimg02-01 remote996002 1441418496 236198400 2/0 AMS_WMS012_1 ENA
sd remote996001-04 rimg02-01 remote996001 1441418496 236198400 3/0 AMS_WMS0123_1 ENA
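The vxprint output above is long, so here is a rough sketch of how the affected volumes could be paired with the disk that is missing underneath them. `volumes_missing_disks` is a hypothetical helper name. It assumes the `vxprint -ht` record layout shown above: `v` lines name the volume, and `sd` lines carry the disk media name in the fourth column and NDEV in the last column when the device is gone.

```shell
#!/bin/sh
# Print "VOLUME DISK" for every subdisk whose MODE column is NDEV.
# Reads `vxprint -ht` output on stdin, so it runs without VxVM installed.
volumes_missing_disks() {
    awk '
        $1 == "v" { vol = $2 }                         # remember the current volume
        $1 == "sd" && $NF == "NDEV" { print vol, $4 }  # $4 is the DISK column
    '
}

# Sample input trimmed from the output above.
volumes_missing_disks <<'EOF'
v  rem-01       -            DISABLED ACTIVE   1921724416 SELECT  -        fsgen
pl rem-01-02    rem-01       DISABLED NODEVICE 1921843200 STRIPE  4/64     RW
sd remote996004-01 rem-01-02 remote996004 36096 480460800 0/0     AMS_WMS0_1 ENA
sd remote996003-01 rem-01-02 remote996003 36096 480460800 1/0     -        NDEV
v  rem-08       -            ENABLED  ACTIVE   2097152000 SELECT  rem-08-01 fsgen
pl rem-08-01    rem-08       ENABLED  ACTIVE   2097177600 STRIPE  2/64     RW
sd remote996008-01 rem-08-01 remote996008 36096 1048588800 0/0    AMS_WMS012_0 ENA
EOF
```

Because every plex shown here is a stripe with no mirror, one NDEV subdisk is enough to disable the whole volume, which matches the DISABLED state of the volumes listed.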

> vxdiskadm

Select an operation to perform: 5

Select a removed or failed disk [<disk>,list,q,?] remote996003
  VxVM  ERROR V-5-2-1985 No devices are available as replacements for remote996003.

 

Select a removed or failed disk [<disk>,list,q,?] remote996005
  VxVM  ERROR V-5-2-1985 No devices are available as replacements for remote996005.

I attempted to reattach the failed disks:

bash-2.03# /etc/vx/bin/vxreattach -c remote996003
VxVM vxdisk ERROR V-5-1-537 Device remote996003: Not in the configuration
VxVM vxdisk ERROR V-5-1-558 Disk remote996003: Disk not in the configuration

bash-2.03# /etc/vx/bin/vxreattach -c remote996005
VxVM vxdisk ERROR V-5-1-537 Device remote996005: Not in the configuration
VxVM vxdisk ERROR V-5-1-558 Disk remote996005: Disk not in the configuration
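One possible reading of the V-5-1-537 messages is that vxreattach was handed disk media names (remote996003, remote996005) where it looks for device access names. That is an assumption on my part, based on my recollection of the vxreattach man page. The sketch below only derives the pairing from the "failed was:" records of `vxdisk list`. `former_devices` is a made-up helper name.

```shell
#!/bin/sh
# Print "DISKNAME FORMER_DEVICE" for every failed disk media record,
# taking the former device from the "was:" field of `vxdisk list` output.
former_devices() {
    awk '$5 == "failed" && $6 ~ /^was:/ {
        sub(/^was:/, "", $6)   # strip the "was:" prefix
        print $3, $6
    }'
}

# Sample input copied from the vxdisk listing earlier in this post.
former_devices <<'EOF'
AMS_WMS01_1  auto            -            -            error
-            -         remote996003 remote9960   failed was:AMS_WMS01_1
-            -         remote996005 remote9960   failed was:AMS_WMS01_0
EOF
```

Even with the access name, a reattach cannot be expected to succeed while the device itself is still in the error state.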

bash-2.03# vxdisk clearimport AMS_WMS01_1
VxVM vxdisk ERROR V-5-1-531 Device AMS_WMS01_1: clearimport failed:
        Disk device is offline

I think this part is where I created the two duplicate devices.

From this point, I'm going to have to step back and seek out guidance before I cause further problems.  

 

 

  • Hello,

    This doesn't look right. Are we sure that the OS is able to see the disk correctly?

    If the OS is able to see the disk correctly (you can access the label from the OS), can you try running:

    # vxdctl enable

    This will rescan your devices at the Veritas layer.

    I am still not convinced, though, that there is no issue at the storage layer.

    vxreattach and the rest will work like a charm once the devices are accessible.

    If you run # vxdisk list AMS_WMS01_1, do you see that the paths to the disks are enabled?

     

    G

  • This is the response I received when I ran both commands.  The storage monitor shows all the drives are good, which is what has been confusing me.  VEA disabled both disks, but there are no alerts or alarms for a bad drive.  I even went over to the SAN and checked the indicator lights (all green).

    # vxdctl enable
    # vxdisk list AMS_WMS01_1
    Device:    AMS_WMS01_1
    devicetag: AMS_WMS01_1
    type:      auto
    flags:     online error private autoconfig
    errno:     Device path not valid
    Multipathing information:
    numpaths:   2
    c6t50060E801062A4E0d1s2 state=disabled  type=primary
    c8t50060E801062A4E2d1s2 state=disabled  type=secondary

     

  • My turn to be confused...

    You say:

    The devices on lines 5/6 are new.  

    Physically new/replaced luns/drives?

    If these are new devices, you will need to initialize the disks and add them to the disk group with the -k option.
    (We can provide detailed steps if we know that these are indeed new devices.)

    If devices are new and need to be replaced in VxVM, you will need to restore the data from backup.
    (Striped volumes without any redundancy in VxVM.) 

  • They are 'new' only in the console.  I have no idea how I created them or how they got there.  I suspect they were created when I ran the 'clearimport' command.  It wasn't my intent to create them.  At this point, I'm not sure if they need to be there to get the two offline drives back up or I need to remove them as unnecessarily created items.

  • It seems there is still a problem at the OS level with device access.

    What is the VxVM version?

    What does 'vxdisk -e list |grep AMS' show as device name?

    If same as 'vxdisk list AMS_WMS01_1', see what this command says:

    prtvtoc /dev/rdsk/c6t50060E801062A4E0d1s2

    or 

    prtvtoc /dev/rdsk/c8t50060E801062A4E2d1s2

    Curious to know what happened yesterday:

    Yesterday, I had two drives in my storage SAN fail.  

    Any errors in /var/adm/messages? 

  • Thanks to the help of Shane in Symantec support, we found and resolved the problem.

    #> vxdmpadm listctlr  all

    CTLR-NAME       ENCLR-TYPE      STATE      ENCLR-NAME
    =====================================================
    c1              Disk            ENABLED      Disk
    c8              AMS_WMS         ENABLED      AMS_WMS012
    c6              AMS_WMS         DISABLED     AMS_WMS012
    c8              AMS_WMS         ENABLED      AMS_WMS0
    c6              AMS_WMS         DISABLED     AMS_WMS0
    c8              AMS_WMS         ENABLED      AMS_WMS01
    c6              AMS_WMS         DISABLED     AMS_WMS01
    c8              AMS_WMS         ENABLED      AMS_WMS0123
    c6              AMS_WMS         DISABLED     AMS_WMS0123
    c8              GENESIS         ENABLED      GENESIS0

     

    The c8 controller was disabled because its Fibre Channel port was bad.  After we replaced the port, we ran:

    #> vxdctl enable

    This rescanned everything and brought the c8 controller back online.  From there, we just had to unmount the other filesystems that were on the device, deport (export) the entire disk group, then reimport it.

    We did have to fix the plex as well, but as far as others searching for possible solutions to similar problems, this will hopefully get them a long way towards a fast fix.
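For anyone following the same path, the order of operations reported in this thread can be sketched as a dry run. Nothing below is executed against VxVM: the script only prints the commands so the plan can be reviewed first. `enabled_paths` and `recovery_plan` are hypothetical helper names, and /rem01 and /rem02 are made-up mount points. The final `vxvol ... startall` step is my assumption and is not stated in the thread, and the plex fix is left out because the thread does not say what it involved.

```shell
#!/bin/sh
# Count DMP paths reported as state=enabled in `vxdisk list <device>` output
# read from stdin. A result of 0 means the device is still unreachable.
enabled_paths() {
    awk '/state=enabled/ { n++ } END { print n + 0 }'
}

# Print (do not run) the recovery steps for disk group $1.
# Any remaining arguments are the mount points to unmount first.
recovery_plan() {
    dg=$1
    shift
    echo "vxdctl enable"            # rescan devices after the hardware fix
    for mnt in "$@"; do
        echo "umount $mnt"          # release every filesystem in the group
    done
    echo "vxdg deport $dg"
    echo "vxdg import $dg"
    echo "vxvol -g $dg startall"    # assumption: not stated in the thread
}

# Hypothetical mount points, for illustration only.
recovery_plan remote9960 /rem01 /rem02
```

A reasonable precaution would be to check `vxdisk list AMS_WMS01_1 | enabled_paths` first and go on to the deport only once it reports at least one enabled path.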