Disk staging to tape causes status 196 on regular policies

JeanB
Level 4

Weekly full backups are staged to an EMC Data Domain basic disk staging storage unit.  Staging to a 6-drive LTO4 robot is triggered by a schedule at 3 PM, which takes more than the scheduled window to complete.  Tape-only policies that are scheduled at 11 PM fail with status 196 because all drives are occupied by the staging.  How can I ensure that the staging has lower priority than regular backup policies?  Is this recommended?   I want to avoid tweaking policy priorities, which are all at 0.  The staging schedule has a priority of 90000, which makes no sense to me.


Marianne
Moderator
Partner    VIP    Accredited Certified
Why start staging only at 3pm?
Best to schedule staging fairly often. Like every 2-3 hours.

I was thinking of setting it to every hour with a 24-hour window.  We started having these problems when we transferred more than 3 TB of backups to this staging disk unit.  Duplication takes ages.  How can the size of the pending duplication be evaluated?  Also, as we are not using SLPs, I understand that the high and low water marks are not used. They are set at 90% and 75%, and the unit is currently 72% full.  My understanding is that all backups done to this storage unit since the last staging run will be duplicated, regardless of the water marks.
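
One rough way to measure what is waiting to be staged, a sketch assuming bpimagelist's long (-L) listing includes a "Kilobytes:" field: adjust -hoursago to cover the period since the last staging run, and filter by client or policy if the master also writes to other storage units.

  # Sum the size of all images written in the last 24 hours:
  bpimagelist -hoursago 24 -L | awk '/Kilobytes:/ { kb += $2 } END { printf "%.1f GB\n", kb / 1024 / 1024 }'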

Nicolai
Moderator
Partner    VIP   

You can speed up tape transfer by setting SIZE_DATA_BUFFERS = 262144 and NUMBER_DATA_BUFFERS = 256. This tweak can increase performance by up to 50% compared to the default values.

Stage during the day and back up during the night. Mixing the workloads will affect performance.
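
On UNIX media servers these are plain one-line files under the NetBackup config directory; a minimal sketch (standard install path assumed):

  # Run on each media server; values per the suggestion above.
  # New values take effect for jobs started after the files exist.
  echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
  echo 256 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

Note the memory cost: 256 buffers x 256 KB = 64 MB per drive stream, so check the media server has RAM for all concurrent streams, and that 262144 is a block size the tape drives support.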

Nicolai's proposal of not mixing workloads somewhat contradicts Marianne's suggestion of duplicating as often as possible.  Our solution is set up to back up at night and run the duplications during the day, but I'm currently facing a situation where the weekend fulls take several days to complete, running right into Monday night, which creates a lot of queueing, and I'm struggling to understand where the bottlenecks are and what to change.

I played around with priorities only to confirm that once a job is queued, its priority is not affected by higher-priority schedules.

In essence, my goal is to have my storage units (2 Data Domains + two 6-drive robots) available at night for differentials. Currently this is not the case, as the weekend's huge load of fulls creates endless duplication to tape, which stresses the DD units, the tape drives, and the media servers.

I could change the staging schedule to "every hour" during the day, say from 8 AM to 8 PM, 5 days a week (Tuesday to Saturday)... since we are only staging full backups, which run on Sunday and Monday... my worry is that the week's first staging at 8 AM Tuesday morning will take 24 hours to complete, as it will attempt to duplicate the entire weekend's full load... back to square one!

 

X2
Moderator
   VIP   

This might take the thread off topic, but why only weekends for FULLs? Can't the normal FULLs be spread across weeknights and the larger FULLs done on weekends?

We do have FULLs spread out during the week, starting at 1700 hrs. Larger FULLs (file shares) are usually done once a month with weekly CUMMs. We are able to do this probably because it is a small environment: about 1000 VMs, about 30 NDMP backups, 20 file share policies (with lots of data/change), and about 200-300 client-based backups. The storage setup is similar, though: 2 DDs (2 copies) and then duplication to LTO7 tape for data which does not deduplicate well.

Edit: Forgot to mention that the media servers are connected to the DDs and the tape library via SAN FC connections.

Marianne
Moderator
Partner    VIP    Accredited Certified

Let's get back to basics - 

You have backup requirement for xxx amount of data.
You have duplication requirement for xx amount of data.
You have x hours in which to complete these tasks.
You have resources capable of reading data at ?/sec. (this includes read speed from client disk and disk STU)
You have resources capable of writing data at ?/sec (including network transfer rate)

If you don't know what each of these points amounts to, then you have your homework cut out for you.

How long do the backups to disk take to complete? What transfer rates are you seeing?
How long do duplications take to complete? At what transfer rate?
If you are not seeing all 6 tape drives writing duplications at more than 100 MB/sec each, then you need to look at the entire data path.
If you are using a single master/media server, you need to verify that this server has sufficient resources for each of the processes involved in backups and duplications.
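
A quick worked example with illustrative numbers: duplicating 18 TB within a 12-hour window requires about 18 x 1024 x 1024 MB / 43,200 sec = 437 MB/sec aggregate, or roughly 73 MB/sec per drive across 6 drives - comfortably within LTO4's ~120 MB/sec native rate, provided the read side can feed the drives that fast.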

About disk staging - the HWM does not play a role in duplication scheduling; that is the task of the DSSU schedule that you configure.
HWM is what starts disk cleanup.

Herewith some reading matter about disk staging: 

DOCUMENTATION: Description of NetBackup Disk Staging Relocation Behavior 
http://www.veritas.com/docs/000030287

 

Disk Staging Storage Unit (DSSU) cleanup behavior  
http://www.veritas.com/docs/000036495

 

Nicolai
Moderator
Partner    VIP   

Both X2's and Marianne's posts are very valid and basically capture the essence of ensuring backup windows.

You really need to figure out how good or bad duplication speeds are.  Please look at the NUMBER_DATA_BUFFERS/SIZE_DATA_BUFFERS tuning I suggested before. Resolving the duplication backlog you have may be as simple as that.

How to configure the buffer settings is documented in the "NetBackup Backup Planning and Performance Tuning Guide".

 

 

I need to identify the problem that causes one of the media servers to take more than 43 hours to duplicate 18 TB (still not completed as I'm writing). The other media server (identical hardware) took 26 hours for 12 TB, most of which was spent on one 4 TB job.
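
For scale, those figures work out to roughly: 18 TB in 43 hours = 18 x 1024 x 1024 MB / 154,800 sec = 122 MB/sec aggregate, and 12 TB in 26 hours = 134 MB/sec - in other words, each media server is averaging barely one LTO4 drive's worth of native throughput across all of its drives combined.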

I will look into SIZE_DATA_BUFFERS, as all values are at default.  But first I want to attempt to pinpoint the problem, as all this hardware has been running fine for years... although we have been increasing the load steadily over the last year, and perhaps we weren't paying close enough attention.  Getting the big picture isn't always obvious.

Nicolai
Moderator
Partner    VIP   

Increase the default value (32) for NUMBER_DATA_BUFFERS to 256

Look at the speed increase example given in: http://www.mass.dk/netbackup-guides/netbackup-buffer-tuning-2/

Setting NUMBER_DATA_BUFFERS may not resolve all the problems, but the basics are then covered.
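
To verify the change took effect, the bptm log on the media server shows the values in use once a new job starts (standard log.MMDDYY naming; bptm logging must be enabled):

  # Both the buffer count and the buffer size are logged at job start:
  grep -i "data buffer" /usr/openv/netbackup/logs/bptm/log.`date +%m%d%y`

The Activity Monitor job details also show a line such as "using 256 data buffers" once the change is active.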

 


@JeanB wrote:

I need to identify the problem that causes one of the media servers to take more than 43 hours to duplicate 18 TB (still not completed as I'm writing). The other media server (identical hardware) took 26 hours for 12 TB, most of which was spent on one 4 TB job.


You need to check the real speed of NBU - try making a test task for a 1 GB file (or, as in my case, one as small as 1 MB -
https://vox.veritas.com/t5/NetBackup/Netbackip-8-and-tape-HPE-StoreEver-MSL6480/m-p/837608#M229185 )
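
A minimal way to set up such a test (policy and schedule names here are placeholders; the policy needs a User Backup schedule for a user-directed backup):

  # 1 GB of non-compressible test data - tape compression would otherwise
  # inflate the apparent speed:
  dd if=/dev/urandom of=/tmp/testfile bs=1M count=1024
  # Back up just that file and wait for completion:
  bpbackup -p Test_Policy -s UserSched -w /tmp/testfile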

 

 

Thanks for all the replies.

I will start with the buffer adjustments, as duplications are scheduled for tomorrow and the impact should be clearly visible (I hope).

Otherwise, I noticed that our drives were upgraded to LTO-4 some years ago, but many LTO-3 tapes are still being re-inserted into the robots, and it appears (I cannot tell from the NB GUI as all tapes are typed as DLT) that we have a mix of LTO-3 and LTO-4 in production. I'm not sure what the impact is, but it certainly does not help.  Tapes are sent off site and recalled regularly, and the operators do not bother to retire the LTO-3 tapes.  I hope we are not running LTO-4 drives with LTO-3 tapes at LTO-3 speeds...

I have a box full of unused LTO-4 tapes and will force them into service.  Any suggestion on how to go about this will be appreciated.

>>(I cannot tell from the NB GUI as all tapes are typed as DLT)

Most tapes are usually labeled nnnnL3 / nnnnL4 - the barcode suffix shows the generation. Check Media and Device Management > Media > Robots,

or right-click the robot (TLD) and choose "Inventory".
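
If the GUI only shows the media type, the barcodes can also be pulled from the command line; a sketch assuming vmquery's usual per-volume output with "media ID" and "barcode" fields:

  # Dump all volume records and show media IDs alongside barcodes;
  # the suffix (L3/L4) reveals the cartridge generation:
  vmquery -a | egrep -i "media id|barcode"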

Nicolai
Moderator
Partner    VIP   

Using an LTO-3 tape in an LTO-4 tape drive will cause the LTO-4 drive to behave as an LTO-3 drive - including tape write speed.

I can't see any option other than visual inspection of returning tapes to weed out the LTO-3 tapes. The number of GB a tape contains can be an indication, as LTO-3 holds 400 GB vs LTO-4's 800 GB uncompressed.
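
If the barcode suffixes are reliable (see the previous reply), the LTO-3 media could also be taken out of rotation in bulk by suspending them, so no new backups or duplications are written to them while existing images stay restorable. A sketch, assuming the vmquery output format shown earlier - verify the field labels before running:

  # Suspend every tape whose barcode ends in L3. Assumes "media ID:" and
  # "barcode:" field labels in vmquery -a output.
  for m in $(vmquery -a | awk -F': *' '
      tolower($1) ~ /media id/ { id = $2 }
      tolower($1) ~ /barcode/ && $2 ~ /L3$/ { print id }'); do
      bpmedia -suspend -m "$m"
  done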

Why don't duplication jobs display kb/sec information?

Also, why are the Activity Monitor job logs from last week's duplications lost?  I keep 10 days of logs in my Activity Monitor.  I know I can access the info from OpsCenter, but need I say that I don't like that GUI, especially for job logs!

 

Marianne
Moderator
Partner    VIP    Accredited Certified

There are 2 processes involved in duplication - a read process from the source and a write process to tape - and Activity Monitor does not report a transfer rate for either.

The only way to check both read and write speeds is to find the bptm PIDs for each of these in Activity Monitor and then trace each in the bptm log (a level 3 log will be sufficient).
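
In practice that means grepping the media server's bptm log for the two PIDs; the buffer-wait counters that bptm prints at job end show which side is the bottleneck. A sketch using the PID and date from the job log later in this thread:

  # Trace one duplication's bptm process (PID and date from job details):
  grep "\[55566\]" /usr/openv/netbackup/logs/bptm/log.112217
  # At completion bptm logs buffer-wait totals, e.g.:
  #   "waited for full buffer N times"  - tape writer starved (slow read side)
  #   "waited for empty buffer N times" - reader throttled (slow write side)
  grep -i "waited for" /usr/openv/netbackup/logs/bptm/log.112217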

Where/how have you configured Activity Monitor for 10 days?
The default is 3 days.

Nicolai
Moderator
Partner    VIP   

Use the time between the "begin read" and "end read" messages of a duplication job to calculate the transfer speed.

If using a 50 GB fragment size:

6:30 min = 128 MB/sec
10:00 min = 83 MB/sec
20:00 min = 40 MB/sec
40:00 min = 20 MB/sec

Take more samples to make sure you are getting an accurate picture.
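
The arithmetic is simply fragment size divided by elapsed time; for example:

  # MB/sec = fragment size in MB / elapsed seconds, e.g. 50 GB in 10:00 min:
  awk 'BEGIN { printf "%.0f MB/sec\n", (50 * 1000) / (10 * 60) }'   # -> 83 MB/sec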

KEEP_JOBS_HOURS = 240 in the master's bp.conf gives 10 days of Activity Monitor history, but the duplication job details are not kept that long. Why?
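
A possible explanation, assuming nothing else is pruning the jobs database: completed duplications count as successful jobs, and their retention is governed by the separate KEEP_JOBS_SUCCESSFUL_HOURS entry, which stays at its much shorter default unless raised to match, e.g. in the master's bp.conf:

  KEEP_JOBS_HOURS = 240
  KEEP_JOBS_SUCCESSFUL_HOURS = 240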

Below are the details of one such job log from the Activity Monitor:

Nov 21, 2017 3:12:38 PM - requesting resource media_server1-SL500-1
Nov 21, 2017 3:12:38 PM - begin Duplicate
Nov 21, 2017 3:12:39 PM - awaiting resource media_server1-SL500-1. No drives are available.
Nov 21, 2017 3:14:42 PM - awaiting resource media_server1-SL500-1. Waiting for resources.
Reason: Drives are in use, Media server: media_server1,
Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A,
Volume Pool: AUTO, Storage Unit: media_server1-SL500-1, Drive Scan Host: N/A,
Disk Pool: N/A, Disk Volume: N/A
Nov 21, 2017 3:15:57 PM - awaiting resource media_server1-SL500-1. No drives are available.
Nov 21, 2017 3:43:00 PM - awaiting resource media_server1-SL500-1. Waiting for resources.
Reason: Drives are in use, Media server: media_server1,
Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A,
Volume Pool: AUTO, Storage Unit: media_server1-SL500-1, Drive Scan Host: N/A,
Disk Pool: N/A, Disk Volume: N/A
Nov 21, 2017 3:43:49 PM - awaiting resource media_server1-SL500-1. No drives are available.
Nov 21, 2017 3:43:52 PM - awaiting resource media_server1-SL500-1. Waiting for resources.
Reason: Drives are in use, Media server: media_server1,
Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A,
Volume Pool: AUTO, Storage Unit: media_server1-SL500-1, Drive Scan Host: N/A,
Disk Pool: N/A, Disk Volume: N/A
Nov 21, 2017 3:44:11 PM - awaiting resource media_server1-SL500-1. Maximum job count has been reached for the storage unit.
Nov 21, 2017 3:44:24 PM - awaiting resource media_server1-SL500-1. Waiting for resources.
Reason: Drives are in use, Media server: media_server1,
Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A,
Volume Pool: AUTO, Storage Unit: media_server1-SL500-1, Drive Scan Host: N/A,
Disk Pool: N/A, Disk Volume: N/A
Nov 21, 2017 3:47:30 PM - awaiting resource media_server1-SL500-1. No drives are available.
Nov 22, 2017 1:22:51 AM - Info bptm (pid=55566) start
Nov 22, 2017 1:22:51 AM - started process bptm (pid=55566)
Nov 22, 2017 1:22:51 AM - granted resource VL0043
Nov 22, 2017 1:22:51 AM - granted resource HP.ULTRIUM4-SCSI.005
Nov 22, 2017 1:22:51 AM - granted resource media_server1-SL500-1
Nov 22, 2017 1:22:52 AM - Info bptm (pid=55566) start backup
Nov 22, 2017 1:22:52 AM - Info bpdm (pid=55599) started
Nov 22, 2017 1:22:52 AM - started process bpdm (pid=55599)
Nov 22, 2017 1:22:52 AM - Info bpdm (pid=55599) reading backup image
Nov 22, 2017 1:22:52 AM - Info bpdm (pid=55599) using 256 data buffers
Nov 22, 2017 1:22:53 AM - Info bptm (pid=55566) media id VL0043 mounted on drive index 3, drivepath /dev/nst4, drivename HP.ULTRIUM4-SCSI.005, copy 2
Nov 22, 2017 1:22:53 AM - Info bptm (pid=55566) INF - Waiting for positioning of media id VL0043 on server media_server1 for writing.
Nov 22, 2017 1:23:36 AM - begin reading
Nov 22, 2017 5:36:14 AM - current media VL0043 complete, requesting next media Any
Nov 22, 2017 5:36:14 AM - current media -- complete, awaiting next media Any. Waiting for resources.
Reason: Drives are in use, Media server: media_server1,
Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A,
Volume Pool: AUTO, Storage Unit: media_server1-SL500-1, Drive Scan Host: N/A,
Disk Pool: N/A, Disk Volume: N/A
Nov 22, 2017 5:36:59 AM - Info bptm (pid=55566) Waiting for mount of media id VL0149 (copy 2) on server media_server1.
Nov 22, 2017 5:36:59 AM - started process bptm (pid=55566)
Nov 22, 2017 5:36:59 AM - mounting VL0149
Nov 22, 2017 5:36:59 AM - Info bptm (pid=55566) INF - Waiting for mount of media id VL0149 on server media_server1 for writing.
Nov 22, 2017 5:36:59 AM - granted resource VL0149
Nov 22, 2017 5:36:59 AM - granted resource HP.ULTRIUM4-SCSI.005
Nov 22, 2017 5:36:59 AM - granted resource media_server1-SL500-1
Nov 22, 2017 5:37:43 AM - Info bptm (pid=55566) media id VL0149 mounted on drive index 3, drivepath /dev/nst4, drivename HP.ULTRIUM4-SCSI.005, copy 2
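
Applying Nicolai's method to this log: reading began at 1:23:36 AM and the first tape (VL0043) was complete at 5:36:14 AM, about 4 hours 13 minutes later. If VL0043 is an 800 GB LTO4 cartridge written with little compression, that works out to roughly 800,000 MB / 15,158 sec = 53 MB/sec - well below LTO4 native speed. Note also that the job waited for drives from 3:12 PM until 1:22 AM before moving any data at all.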

sclind
Moderator
   VIP   

I will look into SIZE_DATA_BUFFERS, as all values are at default.  But first I want...

I would go for the low-hanging fruit first. These are easy and safe to change.