
Server crashing due to inconsistency in /opt (managed by VxVM)

Socrates
Level 2

We have a Sun server running Solaris 10 and Veritas Cluster Server. The RAID volumes on the server (/, swap, /opt, /var, /usr) are managed by VxVM, and UFS is grown on all these volumes.

Lately the system has been crashing due to an inconsistency in the /opt filesystem. Upon reboot we ran fsck on /opt multiple times and booted the system to multiuser mode, but the system crashes again once the cluster is up. The following is the panic message:

 

panic[cpu1]/thread=3000d19a6c0: alloccgblk: can't find blk in cyl, pos:0, i:377, fs:/opt bno: 300
 
 
000002a102c50fb0 ufs:real_panic_v+60 (0, 19017f8, 2a102c51250, 30003bea000, 0, 600080e7d40)
  %l0-3: 000006000832a000 0000000000090000 000006000e5d64c0 0000000000000300
  %l4-7: 0000000000000180 0000000000000000 0000000000000064 0000000001826c00
000002a102c51060 ufs:ufs_fault_v+c8 (600085c7180, 19017f8, 2a102c51250, 6000d957648, 60006b2a2a8, 0)
  %l0-3: 000006000832a000 0000000000090000 000006000e5d64c0 0000000000000300
  %l4-7: 0000000000000180 0000000000000000 0000060006b2a200 0000000000000000
000002a102c51110 ufs:ufs_fault+1c (600085c7180, 19017f8, 0, 179, 6000832a0d4, 300)
  %l0-3: 000006000832a000 0000000000090000 000006000e5d64c0 0000000000000300
  %l4-7: 0000000000000180 0000000000000479 0000000000000179 000006000832a560
000002a102c511c0 ufs:alloccgblk+4c8 (1901400, 6000e5d6000, 0, 6000d957648, 2188, 0)
  %l0-3: 000006000832a000 0000000000090000 000006000e5d64c0 0000000000000300
  %l4-7: 0000000000000180 0000000000000479 0000000000000179 000006000832a560
000002a102c51270 ufs:alloccg+144 (90000, 60006b2a2a8, 662188, 2000, 90255, 6000e5d64c0)
  %l0-3: 000006000e5d6000 000006000d957648 000006000832a2d8 0000000000000880
  %l4-7: 0000000000000088 0000060006b2a200 000006000832a000 0000000000090255
000002a102c51320 ufs:hashalloc+24 (6000e8db878, 88, 662188, 2000, 122e8d0, 2a102c51480)
  %l0-3: 0000060006b2a200 000006000e8db878 000006000832a000 0000000000000003
  %l4-7: 0000060006b2a200 0000000000002000 0000000000000088 0000000000000088
000002a102c513d0 ufs:alloc+128 (0, 662188, 34d4f00, 2a102c51690, 600004040b8, 6000832a000)
  %l0-3: 0000060006b2a200 000006000e8db878 0000000001e6a130 0000000000000003
  %l4-7: 0000000000000000 0000000000002000 0000000000002000 0000000000000010
000002a102c51490 ufs:bmap_write+c40 (0, 2000, 2a102c515e8, 10, 0, 6000e8db878)
  %l0-3: 0000060006b2a200 0000000000000000 0000000000662188 000002a102c515e8
  %l4-7: 000000000000001c 0000000000661f48 0000000000000007 000006000d237d10
000002a102c516a0 ufs:wrip+448 (0, 2a102c51a98, ffffffffff, 2000, 6000e8db878, 8000)
  %l0-3: 0000000000026000 0000000000000001 0000000000000000 0000000000000000
  %l4-7: 0000060006b2a2a8 0000000000028000 0000000000000000 0000000000002000
000002a102c51810 ufs:ufs_write+580 (6000e8cdb80, 2a102c51a98, 8, 60006b2a248, 1, 6000e8db878)
  %l0-3: 000006000e8db898 000006000e8db958 000006000e8db960 0000000000000001
  %l4-7: 00000000019004f4 000006000e8db9b8 0000060006b2a200 0000000000000000
000002a102c51930 genunix:fop_write+20 (6000e8cdb80, 2a102c51a98, 8, 600004040b8, 0, 123ed74)
  %l0-3: 0000000000002000 000006000e8cdb80 0000000000000000 000000000104db10
  %l4-7: 0000000000002000 0000000000026000 0000000000000008 000000000000210a
000002a102c519e0 genunix:write+268 (1, 8058, 600155cb008, 2000, 210a, 1)
  %l0-3: 0000000000000000 000006000e8cdb80 0000000000000000 000000000104db10
  %l4-7: 0000000000002000 0000000000026000 0000000000000008 000000000000210a
 
syncing file systems... [1] 34 [1] 28 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 [1] 5 done (not all i/o completed)
 
From the vxprint -th output I see that one parameter of the two plexes of opt differs, as shown below:
 
dm rootdisk     c0t0d0s2     auto     20351    143328960 -
dm rootmirror   c0t1d0s2     auto     9919     143328960 -
 
v  opt          -            ENABLED  SYNC     110796288 ROUND    -        fsgen
pl opt-01       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootdisk-03  opt-01       rootdisk 32532672 110796288 0        c0t0d0   ENA
pl opt-02       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootmirror-05 opt-02      rootmirror 32512320 110796288 0      c0t1d0   ENA
 
 
Please help me with a fix. 
 
 
5 REPLIES

arangari
Level 5

What is the error given by VCS?

Socrates
Level 2

Hi Amit,

I don't see any error coming from VCS. After we ran fsck on /opt and repaired it, a few resources weren't online when VCS was started. But the main problem is the system crashing: every time, opt goes to the "needs sync" state.

If you require any specific output I can produce it immediately. As you can see below, it is unable to read from /opt and it crashes. And /opt is managed by VxVM with UFS grown over it.

 

panic[cpu1]/thread=3000d19a6c0: alloccgblk: can't find blk in cyl, pos:0, i:377, fs:/opt bno: 300

Socrates
Level 2

Also, I forgot to mention that there are no SCSI errors or anything else from the internal disks; please see the iostat output:

sd1       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: FUJITSU  Product: MAY2073RCSUN72G  Revision: 0501 Serial No: 0727S0C038 
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd2       Soft Errors: 1 Hard Errors: 0 Transport Errors: 22 
Vendor: MATSHITA Product: CD-RW  CW-8124   Revision: DZ15 Serial No:  
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 1 Predictive Failure Analysis: 0 
sd3       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: FUJITSU  Product: MAY2073RCSUN72G  Revision: 0501 Serial No: 0727S0C0FD 
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

g_lee
Level 6

It sounds like this is a server crash issue (i.e. the root issue is with the /opt UFS filesystem, which is on a VxVM volume, not with VCS itself), so I am moving this from the Cluster Server forum to the Storage Foundation forum (although, from the messages provided so far, it appears the error may be related to the UFS filesystem rather than the volume).

For the server crash issue - do you have crash dumps enabled? If so, please provide the following mdb output for more information about the crash:

# mdb -k unix.0 vmcore.0
> ::stack
> ::msgbuf
> ::panicinfo

Refer to the mdb man page for additional options.
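If you are not sure whether crash dumps are being saved at all, check that savecore is enabled before looking for unix.0/vmcore.0 files. A minimal sketch (the dumpadm output below is illustrative sample text, not from your system; the device and directory names are assumptions):

```shell
# Illustrative `dumpadm` output (Solaris 10 style); device and paths are examples only.
dumpadm_out='      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/myhost
  Savecore enabled: yes'

# On the server itself you would pipe the real command instead: dumpadm | awk ...
echo "$dumpadm_out" | awk -F': *' '/Savecore enabled/ { print $2 }'
```

If this prints "no", enable it with `dumpadm -y`; the dump will then be written to the savecore directory on the next panic.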

Regarding the supposed opt size discrepancy: the sizes are consistent. With the -t option, the length is the 6th field, as shown in the header key:

V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE

so based on your output:

dm rootdisk     c0t0d0s2     auto     20351    143328960 -
dm rootmirror   c0t1d0s2     auto     9919     143328960 -
 
v  opt          -            ENABLED  SYNC     110796288 ROUND    -        fsgen
pl opt-01       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootdisk-03  opt-01       rootdisk 32532672 110796288 0        c0t0d0   ENA
pl opt-02       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootmirror-05 opt-02      rootmirror 32512320 110796288 0      c0t1d0   ENA
 
the length of the volume, plexes and subdisks is consistent (all are 110796288). The difference you mentioned is the subdisk disk offset, which is the offset in sectors where the subdisk starts on the device. This can vary due to factors such as different disk geometry, different options used at disk setup, or volumes created in a different order; it would not cause a panic or any issue with the volume.
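If it helps, this consistency check can be done mechanically on any vxprint -th paste. A small sketch, assuming the standard column layout shown in the header key above (LENGTH is the 6th field on v/pl/sd records):

```shell
# Print record type, name and LENGTH (field 6) for the volume, plex and subdisk records.
vxprint_out='v  opt          -            ENABLED  SYNC     110796288 ROUND    -        fsgen
pl opt-01       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootdisk-03  opt-01       rootdisk 32532672 110796288 0        c0t0d0   ENA
pl opt-02       opt          ENABLED  ACTIVE   110796288 CONCAT   -        RW
sd rootmirror-05 opt-02      rootmirror 32512320 110796288 0      c0t1d0   ENA'

# On a live system: vxprint -th opt | awk '...'
echo "$vxprint_out" | awk '$1 == "v" || $1 == "pl" || $1 == "sd" { print $1, $2, $6 }'
```

All five lines print the same length, 110796288, which is the point: only the disk offsets (field 5 on the sd lines) differ.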

Marianne
Moderator

"UFS is grown on all these volumes."

Not a good idea and not supported.

https://sort.symantec.com/public/documents/sf/5.0/solaris/html/vxvm_admin/ag_ch_disks_vm23.html

You cannot grow or shrink any volume (rootvol, usrvol, varvol, optvol, swapvol, and so on) that is associated with an encapsulated root disk. This is because these volumes map to physical partitions on the disk, and these partitions must be contiguous.
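As a concrete illustration with the numbers from the vxprint output earlier in the thread (and assuming the 143328960 in the dm rootdisk line is the public region length in sectors, per the standard vxprint -th layout), the opt subdisk on rootdisk runs exactly to the end of the disk, so even aside from the encapsulation restriction there is no adjacent space to extend into:

```shell
# rootdisk-03: DISKOFFS 32532672 + LENGTH 110796288 (sectors, from vxprint -th)
awk 'BEGIN {
    end    = 32532672 + 110796288   # first sector past the end of the opt subdisk
    publen = 143328960              # public region length from the dm rootdisk line
    printf "subdisk end: %d, public region: %d\n", end, publen
    print (end == publen ? "subdisk runs to the end of the disk" : "free space remains")
}'
```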