Solved: Drive Troubleshooting

H_Sharma · ‎11-10-2014

Hello Experts,

We are experiencing an issue with our tape drives. We have 8 drives, but only 5 of them are writing, and jobs are sitting in the queue in Activity Monitor.

We did the following troubleshooting.

We are using a Windows 2008 R2 server as our master server.

1:- All drives are showing in the OS (vmoprcmd, tpconfig, scan -changer).

2:- All drives are present in Device Manager.

3:- Robtest shows that 5 drives have tapes and 3 are empty.

4:- Ran nbrbutil -listActiveDriveJobs, but it shows no job activity for the 3 drives.

Somebody told me to use nbrbutil with -reset drive or -release drive so that it would release all the allocations.

1:- What are these allocations, and why do they happen?

2:- What is the difference between reset drive and release drive?

3:- What is the best practice to avoid this?

4:- There is no alert when a drive is not writing. What should we do to find out that drives are idle while other jobs are in the queue?

Thanks,

RiaanBadenhorst · ‎11-10-2014

Hi

You can run 'nbrbutil -dump'

This will show you all the allocations that the resource broker has for your domain at that time. If there are no backups running and you find that there are allocations for any drives then you've got some allocations that need to be released.

An allocation is basically a reservation that gets created when a job is kicked off and needs a tape and a tape drive. So policy A runs and needs to write to tape. NBRB then says: allocate MEDIA1 and put it in DRIVE1. This is what the dump will show you. Sometimes these allocations get orphaned and then they are stuck there. Because they are still there (for DRIVE1 and MEDIA1, for instance), no new backups can use the drive or media.

That's an oversimplified explanation of it.

Marianne · ‎11-10-2014

Please show us what you see.
vmoprcmd will show drive status
tpconfig will show us NBU config (not OS)
scan -changer only shows robot info and drives seen by the robot (not OS).

Show us output of these commands:

vmoprcmd
tpconfig -l
scan

Also run 'nbrbutil -dump' and post the bottom part that shows 'MDS Allocations'.

Michael_G_Ander · ‎11-10-2014

What is the number of drives configured in the storage unit(s) ?

The standard questions: have you checked 1) what has changed, 2) the manual, and 3) whether there are any tech notes or VOX posts regarding the issue?

H_Sharma · ‎11-13-2014

Hi Experts,

Please let me know how we can make sure we get an alert when drives are not writing. Is there any way?

mph999 · ‎11-13-2014

Further to the answers by the outstanding Riaan and the absolutely amazing Marianne ...

If a drive is reserved by NetBackup, it should show in the nbrbutil -dump output.

For example:

MDS allocations in EMM:

MdsAllocation: allocationKey=540 jobType=1 mediaKey=4000003 mediaId=A00001 driveKey=2000006 driveName=HP.ULTRIUM1-SCSI.000 drivePath=/dev/rmt/0cbn stuName=tape masterServerName=womble mediaServerName=womble ndmpTapeServerName= diskVolumeKey=0 mountKey=0 linkKey=0 fatPipeKey=0 scsiResType=1 serverStateFlags=1

Once the backup finishes and nbrb deallocates the resources, this line should disappear.
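To check the dump for leftover allocations without reading it by eye, something like the following could work. It is only a sketch in Python, assuming the output looks like the MdsAllocation line quoted above (space-separated key=value pairs with no spaces inside a value). The function names are made up for illustration and are not part of NetBackup.

```python
import re

def parse_mds_allocations(dump_text):
    """Return one dict of key=value fields per 'MdsAllocation:' line.

    Assumes the layout of the sample line in this thread: fields are
    separated by spaces and values contain no spaces. Empty values
    (e.g. 'ndmpTapeServerName=') become empty strings.
    """
    allocations = []
    for line in dump_text.splitlines():
        line = line.strip()
        if not line.startswith("MdsAllocation:"):
            continue
        allocations.append(dict(re.findall(r"(\w+)=(\S*)", line)))
    return allocations

def allocated_drives(dump_text):
    """Return the set of drive names that currently hold an allocation."""
    return {a["driveName"] for a in parse_mds_allocations(dump_text)
            if a.get("driveName")}
```

Save the output of nbrbutil -dump to a file while no backups are running and pass its contents to allocated_drives(). Any drive name that comes back is a candidate for an orphaned allocation.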

Now, I have seen cases where a system has x number of drives, let's say 8, and only 5 are being used (which is what you have). You might expect to see 3 drives showing MDS allocations all the time (meaning only the other 5 can be used). However, what I have seen is NO drives under MDS allocations when no jobs are running, and then, when jobs are run, a maximum of 5 drives showing under MDS allocations.

The strange thing was that even though these 3 drives weren't showing under MDS allocations, it was found that nbrbutil -resetAll did resolve the issue. So the long and short of this story is: run this anyway, as it may fix things.

As to why it's an issue, I have never been able to determine for sure. I suspect that for some reason a flag was not cleared in the NBDB database. There are a couple of entries in the devices NBDB table that are called

48 ,"AllocationHostKey" unsigned int NOT NULL DEFAULT 0
49 ,"AllocationState" unsigned int NOT NULL DEFAULT 0

I will hazard a guess that these don't get unset for some reason, but do when nbrbutil -resetAll is run.

I will add that this situation is quite rare. I've only seen it a couple of times, and I'm not aware that it came back again for anyone who reported it.

If things still don't work, you'll need to set vxlogs 111 (emm) and 143 (mds) and run vxlogview on them using vxlogview -p 51216 -i 111 -d all -t 00:30:00 (for the last 30 minutes), then look around the area where you see a line like the one below (search for drive_list). This line lists how many drives are available, and if it's 5 and not 8 then we have a potential problem.

Example (only one drive on this system so this line is correct)

11/11/14 09:51:27.462 [Debug] NB 51216 mds 143 PID:25805 TID:1 File ID:111 [jobid=476] 2 [sql_select_ready_drives] drive_list has 1 drives

In this case I'd look through the log and see if it lists some reason the other drives aren't in use (SCSI reservation issues can cause this), in which case I'd just power cycle the drives (not a soft reset). If that fails, I'd delete and reconfigure the devices, removing them all (nbemmcmd -deletealldevices -allrecords) to clear out any potential issue in the database.
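Searching for drive_list by hand gets tedious in a long log, so here is a rough Python sketch that pulls the job ID and drive count out of every line shaped like the example above. The regular expression is based only on that one sample line, so treat it as an assumption. The function names are invented for this post.

```python
import re

# Matches lines like:
# ... [jobid=476] 2 [sql_select_ready_drives] drive_list has 1 drives
DRIVE_LIST_RE = re.compile(r"\[jobid=(\d+)\].*drive_list has (\d+) drives")

def drive_list_counts(log_text):
    """Return (job_id, drive_count) for every drive_list line found."""
    results = []
    for line in log_text.splitlines():
        match = DRIVE_LIST_RE.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2))))
    return results

def short_lists(log_text, expected_drives):
    """Return only the entries that list fewer drives than expected."""
    return [(job, count) for job, count in drive_list_counts(log_text)
            if count < expected_drives]
```

Feed it the saved vxlogview output with expected_drives=8. Entries that report 5 drives are the ones worth reading around in the log. A short list is only a hint, not proof of a fault.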

mph999 · ‎11-13-2014

OpsCenter shows drive utilisation, so that is one way. The only other way I can think of is to monitor the bptm log and ensure that all drives appear in there within a given time frame.

Marianne · ‎11-13-2014

It is almost impossible to try and help you when you do not respond for 3 days.

I asked 3 days ago: 'show us what you see'.

There are no 'built-in' alerts for drives not in use and there are tons of good reasons why all drives may not be in use, e.g.
STU config containing fewer drives than the total number of physical drives
Not enough jobs to use all drives
Not enough media available for all drives
Limits per policy or clients
Orphaned device allocations
etc.
etc.

If you have good scripting skills, you can possibly use the OS scheduler to run a script at regular intervals during the peak backup window (e.g. every 15 min) that uses vmoprcmd to check drive usage, checks for queued jobs with bpdbjobs, and sends an email alert when there are queued jobs and not all drives are occupied.
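The decision part of such a script is small. Below is a minimal Python sketch of only that part. It assumes you have already obtained three numbers: how many jobs are queued (from bpdbjobs), how many drives are in use (from vmoprcmd), and how many drives exist. Parsing the command output and sending the email are left out because they depend on your version and mail setup. The function name is hypothetical.

```python
def idle_drive_alert(queued_jobs, drives_in_use, total_drives):
    """Return an alert message, or None when nothing looks wrong.

    An alert is raised only when jobs are waiting AND at least one
    drive is idle, which is the condition described above.
    """
    idle = total_drives - drives_in_use
    if queued_jobs > 0 and idle > 0:
        return (f"{queued_jobs} job(s) queued while "
                f"{idle} of {total_drives} drive(s) idle")
    return None
```

For the situation in this thread, idle_drive_alert(4, 5, 8) would return a message (4 queued jobs is an illustrative number), while a run with all 8 drives busy returns None. Keep in mind the legitimate reasons listed above for idle drives, so an alert is a prompt to look and not proof of a fault.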

mph999 · ‎11-14-2014

ha ha ... yes, vmoprcmd would also show drive usage (as Marianne points out). However, the problem with this is that if a drive is used and then finishes its job, you may miss the activity in vmoprcmd. Hence the log file(s), which show what happened in the past.

The bptm log is the way to go. A bit of scripting and it's quite doable.

For example, these lines show the drive being used

11:05:49.613 [7848] <2> io_open: file /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM1-SCSI.000 successfully opened (mode 2)
11:05:49.613 [7848] <2> write_backup: media id A00001 mounted on drive index 0, drivepath /dev/rmt/0cbn, drivename HP.ULTRIUM1-SCSI.000, copy 1

or these ...

11:05:49.613 [7848] <2> send_job_file: job ID 489, ftype = 3 msg len = 133, msg = LOG 1415963149 4 bptm 7848 media id A00001 mounted on drive index 0, drivepath /dev/rmt/0cbn, drivename HP.ULTRIUM1-SCSI.000, copy 1

11:05:49.613 [7848] <4> report_throughput: VBRT 1 7848 1 1 HP.ULTRIUM1-SCSI.000 A00001 0 1 0 0 0 (bptm.c.18468)
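As a starting point for that scripting, here is a small Python sketch that collects the drive names from write_backup lines like the one above and reports which expected drives never appeared. The pattern is derived only from that sample line, so it is an assumption about the log format. The function names are invented, and the list of expected drive names is something you would supply yourself (from tpconfig, for example).

```python
import re

# Matches: write_backup: media id A00001 mounted on drive index 0,
#          drivepath /dev/rmt/0cbn, drivename HP.ULTRIUM1-SCSI.000, copy 1
MOUNT_RE = re.compile(
    r"write_backup: media id (\S+) mounted on drive index \d+, "
    r"drivepath \S+, drivename ([^,\s]+)"
)

def drives_seen(bptm_text):
    """Return the set of drive names that wrote a backup in this log text."""
    return {match.group(2) for match in MOUNT_RE.finditer(bptm_text)}

def silent_drives(bptm_text, expected):
    """Return the expected drive names that never appear, sorted."""
    return sorted(set(expected) - drives_seen(bptm_text))
```

Pass in only the part of the log that covers the time frame you care about. With more than one media server, concatenate the bptm logs from all of them before calling silent_drives().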

Martin

Marianne · ‎11-14-2014

And if more than one media server, ALL bptm logs will have to be searched...

mph999 · ‎11-14-2014

That's a good point also ...

Guess OpsCenter is the best way then.

H_Sharma · ‎11-17-2014

Hi Marianne,

Sorry for the delayed response, and thanks for highlighting the issues. I understand them now.

1:- We have OpsCenter. Where is the option? Can we configure an alert for this?

2:- I am not familiar with scripting :( So the question now is, who can help with the scripting part? :(