Solved: BasicDisk going down / up every few minutes

Andy_Welburn · ‎01-30-2018

Hey Kids!

Old guy here with an *old* system on its last legs that's causing me a little grief before it's shown the door permanently & me being me I just want to get to the bottom of it before our final farewells.

Master / Media server Solaris 9
NetBackup Version 6.5.6

We only have a few old OS's backed up to Basic Disk storage & every now & again backups fail (inc. Catalog) with "Disk Volume is down(2074)". They retry, more often than not successfully, 10 minutes or so later but every now & again failures are constant until a service restart or system re-boot, or latterly I could only get backups to work with an nbrbutil -resetAll.

The storage is simply an NFS mount from one of our NetApp filers on the master/media that has been successfully used for the last few years, but only recently has started to display 'issues'.

As it currently stands:
# nbdevquery -listdv -stype BasicDisk -U
Disk Pool Name      : ERYCSV02_Disk
Disk Type           : BasicDisk
Disk Volume Name    : Internal_16
Disk Media ID       : @aaaah
Total Capacity (GB) : 3072.00
Free Space (GB)     : 1624.73
Use%                : 47
Status              : UP
Flag                : OkOnRoot
Flag                : ReadOnWrite
Flag                : AdminUp
Flag                : InternalUp
Num Read Mounts     : 0
Num Write Mounts    : 1
Cur Read Streams    : 0
Cur Write Streams   : 0

bperror -disk shows entries such as:
1517281124 1 1536 8 cream 0 0 0 *NULL* nbemm Volume ERYCSV02_Disk:Internal_16 marked down, Storage server cream
1517281872 1 1536 8 cream 0 0 0 *NULL* nbemm Volume cream:ERYCSV02_Disk:Internal_16 marked up

and the volume is marked down / up frequently, as the following time stamps convey:
1517281124 = Tue Jan 30 02:58:44 2018
1517281872 = Tue Jan 30 03:11:12 2018
1517282547 = Tue Jan 30 03:22:27 2018
1517283367 = Tue Jan 30 03:36:07 2018
1517284040 = Tue Jan 30 03:47:20 2018
1517284383 = Tue Jan 30 03:53:03 2018
1517285714 = Tue Jan 30 04:15:14 2018
1517285758 = Tue Jan 30 04:15:58 2018
1517287107 = Tue Jan 30 04:38:27 2018
1517287868 = Tue Jan 30 04:51:08 2018
1517289141 = Tue Jan 30 05:12:21 2018
1517289208 = Tue Jan 30 05:13:28 2018
1517289816 = Tue Jan 30 05:23:36 2018
1517290568 = Tue Jan 30 05:36:08 2018
1517291245 = Tue Jan 30 05:47:25 2018
1517291519 = Tue Jan 30 05:51:59 2018
1517292136 = Tue Jan 30 06:02:16 2018
1517292755 = Tue Jan 30 06:12:35 2018
1517294084 = Tue Jan 30 06:34:44 2018
1517294358 = Tue Jan 30 06:39:18 2018
1517295021 = Tue Jan 30 06:50:21 2018
1517295083 = Tue Jan 30 06:51:23 2018
1517295753 = Tue Jan 30 07:02:33 2018
1517296025 = Tue Jan 30 07:07:05 2018
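
For anyone wanting to turn the raw bperror -disk lines into outage lengths without converting each time stamp by hand, here is a minimal Python sketch. It assumes the first field of each line is a Unix epoch time and that the message text contains "marked down" or "marked up", as in the two sample lines above; the helper names (parse_events, outages, fmt) are made up for illustration and times are printed in UTC, which matches the conversions listed in this post.

```python
from datetime import datetime, timezone

# The two sample lines quoted in this post (output of bperror -disk).
SAMPLE = """\
1517281124 1 1536 8 cream 0 0 0 *NULL* nbemm Volume ERYCSV02_Disk:Internal_16 marked down, Storage server cream
1517281872 1 1536 8 cream 0 0 0 *NULL* nbemm Volume cream:ERYCSV02_Disk:Internal_16 marked up
"""

def parse_events(text):
    """Return (epoch, state) tuples for lines that mark a volume down or up."""
    events = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines or anything without a leading epoch
        if "marked down" in line:
            events.append((int(fields[0]), "down"))
        elif "marked up" in line:
            events.append((int(fields[0]), "up"))
    return events

def outages(events):
    """Pair each 'down' with the next 'up' and return (start, end, seconds)."""
    result, down_at = [], None
    for ts, state in sorted(events):
        if state == "down" and down_at is None:
            down_at = ts
        elif state == "up" and down_at is not None:
            result.append((down_at, ts, ts - down_at))
            down_at = None
    return result

def fmt(ts):
    """Format an epoch time in UTC, in the same style as the list above."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%a %b %d %H:%M:%S %Y")

for start, end, secs in outages(parse_events(SAMPLE)):
    print(f"down {fmt(start)} -> up {fmt(end)} ({secs} s)")
```

Feeding it the full bperror -disk output instead of SAMPLE should give one line per down/up pair, which makes it easier to see whether the outages cluster around a fixed interval.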

There is certainly something 'weird' going on, although I'm not sure for how long, as things have been running very smoothly for some years without need for interaction. Additionally, I get "error connecting to oprd on cream: network protocol error(39)" when initially connecting to the STU via the Admin Console (remember that?) .... but that's probably another story, as it does connect once the pop-up has been dismissed.

If I've missed any info that you feel may be useful, let me know as I've been out of the loop for sooooo long. But if you guys have any clue as to where to look further it would be much appreciated.

Nicolai · ‎01-30-2018

Hey - an old friend has shown up

Take a look in OID 220 - dps - disk polling service.

vxlogview - if you still remember :D

Marianne · ‎01-30-2018

Hello 'Stranger'!

My suspicion would be NFS.
Have you perhaps had a look at messages file for errors/clues?

Andy_Welburn · ‎01-30-2018

& nothing stranger than me eh?!!

Yeah, been looking at the NFS and filer side of things (even the phases of the moon), but to all intents & purposes the mount is still visible & accessible during times of duress (mine!)

It's actually 'down' now as we speak. Can see all of its contents (.img, .info) & absolutely nothing in /var/adm/messages. "Problems" report just duplicates the output of bperror, so nothing new there. Same with "All Log Entries"

..... and ..... it's just come back up again ~4 mins later.

More of an annoyance than anything else given my 'estate' is now only a mere 5 servers (one essentially defunct, another a print server hosting a serially attached line printer, a desktop & 2 Linux boxes that are probably backed up elsewhere or are probably now not being used but haven't been decomm'd or I haven't been notified!).

Maybe it's just me that doesn't want to let go of the finer things in life!

Andy_Welburn · ‎01-30-2018

vxlogview - jeesh, where's that b**** manual?

Thanks guys. Will dig deeper.

Glad you're still around .... wish I could be too, but soooooo out of my depth now.

EDIT
Who needs the manual when "Google is your friend" (good times)
Essentially the same info (but at least no need to bpdbm -ctime the output!):

30/01/18 10:02:19.732 [Warning] V-220-62 volume is DOWN due to storage server related error: ERYCSV02_Disk:Internal_16 cream
30/01/18 10:04:32.082 [Info] V-220-60 volume is UP: ERYCSV02_Disk:Internal_16 cream

Andy_Welburn · ‎01-30-2018

Ooh, ooh, ooh! Nicolai I could kiss you but think the missus might get jealous!

vxlogview for OID220 led me to this: https://www.veritas.com/support/en_US/article.000041813
which led me to:
vxlogview for OID222 which gave me ...... nothing.

But, BUT, the link also stated "The bpstsinfo process will log in to the Admin log".
So a quick look in here gave me .... nothing. Well nothing (significant) to do with DPS and bpstsinfo anyway.

However, what I did notice were quite a few delays trying to connect to some very old SERVER entries for our oppos who no longer work here & so their SERVERs (i.e. PCs that were used to access the Admin console) have now been decommissioned. I have now (about 30 minutes ago) removed all of these from bp.conf (did not restart services) and:
- have not had the DSU marked as down (yet, touch wood, step on a cat or whatever)
- can connect to the STU via the Admin Console without getting the oprd 39 connection error

So two for one it seems, Nicolai!!

Been meaning to get rid of those entries for a while now ... I remember they could cause all sorts of issues back in the day, such as long load times in the Admin Console, but they're often overlooked as they can be 'hidden' in all sorts of places.
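
If it helps anyone else hunting for the same thing, here is a rough Python sketch for spotting SERVER entries that no longer resolve. It assumes the usual "SERVER = hostname" line format in bp.conf and treats "name does not resolve" as a stand-in for "decommissioned", which is only an approximation (a host can still resolve and be unreachable). The function names and the host name old-admin-pc are hypothetical, used purely for illustration.

```python
import socket

def server_entries(bp_conf_text):
    """Return the host names from 'SERVER = name' lines, in file order."""
    names = []
    for line in bp_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SERVER":
            names.append(value.strip())
    return names

def resolves(name):
    """True if the name resolves to an address on this machine."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False

def stale_servers(bp_conf_text, resolver=resolves):
    """Return SERVER entries for which the resolver reports failure."""
    return [name for name in server_entries(bp_conf_text) if not resolver(name)]

# Demo with made-up content and a stub resolver, so no DNS lookup is needed.
# 'old-admin-pc' is a hypothetical decommissioned admin workstation.
SAMPLE = "SERVER = cream\nSERVER = old-admin-pc\nCLIENT_NAME = cream\n"
print(stale_servers(SAMPLE, resolver=lambda name: name == "cream"))
```

To run it for real, read the contents of bp.conf (normally /usr/openv/netbackup/bp.conf on a UNIX master) and pass the text to stale_servers() with the default resolver. Anything it lists is only a candidate for removal; check each one by hand first.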

Awesome job as always guys. Love you to bits for the stellar work you continue to do.

Nicolai · ‎01-30-2018

Glad I could help an old friend :D
