Solved: NetBackup 7.6.0.1 status 12 on duplication jobs

m_karampasis · ‎05-28-2015

Dear all,

We are experiencing the following issue on a specific media server most of the duplication jobs are failing with the following status:

05/22/2015 10:06:23 - requesting resource LCM_nbumsrv2_ALL_LTO
05/22/2015 10:14:32 - Info nbrb (pid=27590880) Limit has been reached for the logical resource LCM_nbumsrv2_ALL_LTO
05/27/2015 18:16:47 - granted resource LCM_nbumsrv2_ALL_LTO
05/27/2015 18:16:48 - begin Duplicate
05/27/2015 18:16:48 - Error bpduplicate (pid=37880058) cannot open /usr/openv/var/global/nbstserv/bid_file_rs_1432278383_0_0, A file or directory in the path name does not exist.
05/27/2015 18:16:48 - end Duplicate; elapsed time 0:00:00
05/27/2015 18:16:48 - started process RUNCMD (pid=37880058)
file open failed (12)

Thank you in regards.

Marianne · ‎05-29-2015

You need nbstserv logs for 22 and 23 May. Looking in today's log won't help.

Were all of the jobs that you show in the attachment in queued or active state for all of those days?

SIX days in this case!!

05/23/2015 21:33:09 - requesting resource LCM_nbumsrv2_ALL_LTO
05/23/2015 21:37:24 - Info nbrb (pid=27590880) Limit has been reached for the logical resource LCM_nbumsrv2_ALL_LTO
05/29/2015 05:57:49 - begin Duplicate

What is weird here is that Martin's testing showed that SLP duplication failed with an error when no resources are available.

Do you have other duplication jobs in queued or active state waiting for resources?

You really need to have a good look at duplication resources - you should never have jobs waiting for tape drives this long.

As per my post of yesterday - work through the SLP Best Practice Guide to assess and tune your environment.
All duplication jobs should be finished before next backup window starts.

Handy NetBackup Links

View solution in original post

Marianne · ‎05-28-2015

Why would the file not exist? I can see a similar list of 'bid-files' on our inhouse Appliance.
The timestamp in the filename equates to:
1432278383 = Fri May 22 09:06:23 2015

Is this the date that the backup was taken?

Tell us more about the source - what type of STU?

What type of duplication? Manual or SLP?

Handy NetBackup Links

mikebounds · ‎05-28-2015

Run the following:

cd /usr/openv/var/global/nbstserv
df -h .
head *

This will check the filesystem is not full which may be preventing these files from being written, but if this is not the issue, then each of these files should contain one line which is the backup image id following by a number (0 in all my files). The backup image id should refer to the images currently being backed up, so if you look at active duplication jobs then the file in the "File List" for these should match the backup image id in the files in /usr/openv/var/global/nbstserv

Mike

Marianne · ‎05-28-2015

Any reason why duplication job only received resources 5 days later?

05/22/2015 10:06:23 - requesting resource LCM_nbumsrv2_ALL_LTO
05/22/2015 10:14:32 - Info nbrb (pid=27590880) Limit has been reached for the logical resource LCM_nbumsrv2_ALL_LTO
05/27/2015 18:16:47 - granted resource LCM_nbumsrv2_ALL_LTO

I cannot say what the exact purpose of files in /usr/openv/var/global/nbstserv are, but I can see that there are no files older than 3 days and that they contain image id's. (We do not have anything Active at the moment and have 7 files in this folder.)

The only reference I can find is the SLP cheatsheet: http://www.symantec.com/docs/TECH170086
that references the folder without any explanation.
Probably something to do with SLP sessions/timing or something to that effect.
I will ask an expert to comment...

Handy NetBackup Links

mikebounds · ‎05-28-2015

Hi Marianne , I wonder if your SLPs are not complete which is why you still have files in /usr/openv/var/global/nbstserv even though duplication job has finshed:

I have 8 tape drives and tape drives are ONLY used for duplications and there are only 7 running at the moment and I have 7 files in /usr/openv/var/global/nbstserv and the image id's correspond exactly to the running jobs. The last completed duplication job finished at 8:06 this morning which is less than 3 hours ago and the imageid for this is not in /usr/openv/var/global/nbstserv

In our environment the images are replicated as well as dupicated, but the replication is much quicker, so they have all completed, so I wonder if the image ids you see in /usr/openv/var/global/nbstserv have finished their duplication, but there are still not complete as there have some other COPY that has not completed - can you run:

nbstlutil list -U -rt I -backupid image_id

for one of the image ids in your files to see if the image record "Storage Lifecycle State" is complete?

Thanks

Mike

Marianne · ‎05-28-2015

Correct Mike. All SLPs completed.

Snippets from a 3-day-old file/image:

Backup Time                   : 1432542817 (2015/05/25 10:33:37)
...
 Storage Lifecycle State       : 3 (COMPLETE)
 Storage Lifecycle Is Inactive : false
 Time In Process               : 1432543086 (2015/05/25 10:38:06)
....
 Created Date Time             : 1432542817 (2015/05/25 10:33:37)

I think in above case that SLP secondary operation have a lifespan that times out at some point in time.
In this case it seems that there were no tape resources for 5 days.

It would be better if the duplication job simply failed with a timeout error.

I have asked one of our resident experts to have a look at this discussion....

Handy NetBackup Links

m_karampasis · ‎05-28-2015

Hello and thank you for your responses,

Answering your questions:

@Mike

File system is not full:

root@master:/usr/openv/var/global/nbstserv# df -g .
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/openvlv 1521.00 865.17 44% 469550 1% /usr/openv
root@master:/usr/openv/var/global/nbstserv#

Attached you will find the head * command output

@Marianne

The reasson that it takes more than 3 days to duplicate is due to a very heavy load environment (Duplication queue = 80 TB).

Thank you in regards.

mph999 · ‎05-28-2015

The files are the BID files, used to batch up the images.

For example, on my system I ran a coupe of backups to an SLP. nbstserv then added both images to the bidfile

bid_file_rs_1432807896_0_0

womble_1432807198 0
womble_1432807199 0

When the duplication job is kicked off, this gets submitted to nbpem.

Regards,

Martin

mph999 · ‎05-28-2015

... nbsterv log is the place to look to see what is going on (or not as the case maybe)...

nbstserv is vx log 226

vxlogcfg -a -p 51216 -o 226 - s DebugLevel=6 -s DiagnosticLevel=6

vxlogview -p 51216 -o 226 -d all -t 00:45:00 >/tmp/226.txt

(adjust time depending on how much log you want to see, relative to when you run the command). As a guess, 45 mins should be enough, so set log levels, leave for 46 mins, then run the vxlogview command,

(Lower log levels by using 1 and 1 or 0 and 0 or what levels you want).

Marianne · ‎05-28-2015

Thanks Martin.

Is there some sort of a timeout after which BID files are removed?

I can see that on our in-house test system that there is nothing older than 3 days.

In this case, the duplication job only received resources 5 days later and BID file was gone.

The solution will obviously be to secure sufficient resources for duplication along with performance tuning, but I would expect SLP to timeout rather than fail because BID file is gone.

Handy NetBackup Links

mph999 · ‎05-28-2015

Good question, and I don't know the answer to that, will see what I can find out ...

I would have though nbstserv recalculates and so recreates the bid files, but maybe not.

Marianne · ‎05-28-2015

In this case vxlogview will have to go back further as the missing BID file was created on 22 May.

Handy NetBackup Links

mph999 · ‎05-28-2015

Just to finish the story, when bpduplicate is run (which is wat actually makes the duplication), from admin log we can see the bid file passed to the command.

11:11:38.769 [5018] <2> bpduplicate: VERBOSE = 5
11:11:38.789 [5018] <2> bpduplicate: INITIATING (VERBOSE = 5): NetBackup 7.6.1 created: 2015021210
11:11:38.789 [5018] <2> logparams: -jm -jobid 498 -dcn 2 -dstunit tape -dp NetBackup -owner *ANY* -rl 10 -fail_on_error 0 -cn 1 -priority 0 -lcm -wcopt SFN -mtd 25600 -Bidfile /usr/openv/var/glob
al/nbstserv/bid_file_rs_1432807896_0_0

m_karampasis · ‎05-28-2015

Dear all,

We have changed ath the master server the log level:

root@master:/# vxlogcfg -a -p 51216 -o 226 -s DebugLevel=6 -s DiagnosticLevel=6
The configuration settings were added successfully.

And we are waiting for the next failed duplication job, although because we have cjanges the majority of ous SLP's to the other media servers which are not failing this will maybe take time..

Thank you.

mph999 · ‎05-28-2015

Hmm ...

My duplication job fails due to lack of resources (no tapes available on my test system) ...

Looking in the folder, nbstserv is creating a new bid file for each attempt :

-rw------- 1 root root 40 May 28 11:11 bid_file_rs_1432807896_0_0
-rw------- 1 root root 40 May 28 11:16 bid_file_rs_1432808196_0_0
-rw------- 1 root root 40 May 28 11:21 bid_file_rs_1432808497_0_0
-rw------- 1 root root 40 May 28 11:26 bid_file_rs_1432808797_0_0
-rw------- 1 root root 40 May 28 11:31 bid_file_rs_1432809097_0_0

... so, I don't think this is a xx days timeout issue.

nbstserv log (226) is the place to look.

Marianne · ‎05-28-2015

Just take care when logging levels are increased - keep an eye on /usr/openv disk space.

My gut feel is that the BID files are being removed after a certain amount of days/hours, but the real issue in your environment is with resources.

Check backlog often and ensure that the duplications can be completed before next backup window starts.

Have a look at the SLP Best Practice Guide (not sure if there is a newer one for 7.6) : http://symantec.com/docs/HOWTO73205

Handy NetBackup Links

Marianne · ‎05-28-2015

My duplication job fails due to lack of resources .....

Wondering why the job in this post seems to be hanging around for 5 days?

Or maybe a failed job was restarted? (not sure if it is possible for SLP duplication jobs...)

Handy NetBackup Links

mikebounds · ‎05-28-2015

Hi Marianne,

We are using NBU 7.5 and I don't have a "Storage Lifecycle Is Inactive" field and I don't understand what this means. If SLP is complete, then I would think SLP has finished and so no longer acitve, so it should be "inactive" so "inactive" should be true, but you show as false meaning the SLP is active (I hate double negatives !). So I don't know what this means, but perhaps that is why your files have not being deleted yet, if they are still active - if you run nbstlutil for a image id that is NOT in /usr/openv/var/global/nbstserv, does this show as inactive = true ?

Mike

mikebounds · ‎05-28-2015

m.karampasis, for the job that failed in your opening post, can you get the image id(s) from the file list and then run:

/opt/openv/netbackup/bin/admincmd/nbstlutil list -U -rt IC -backupid image_id

for these and post output to see if there are any clues in this as to what happened. I agree with Marianne that it probably timed out as you seem to have a lot of image ids in your /usr/openv/var/global/nbstserv directoty.

Mike

Marianne · ‎05-28-2015

Hi Mike

There were a couple of enhancements in 7.6 where SLPs can have schedules (similar to backup schedules with open windows).
Before 7.6, if SLPs needed to be halted during certain periods, the SLP or STU had to be deactivated (nbstlutil inactive -lifecycle name ) and reactivated (nbstlutil active ...) using something like cron.

I think this field in 7.6 output is just a further enhancement to display SLP status.

Handy NetBackup Links

VOX

NetBackup 7.6.0.1 status 12 on duplication jobs