Forum Discussion

BackupGuy2015
10 years ago

NetBackup duplication from disk to tape using SLP and destaging

Hello all, this is my first post, but I have been reading this forum for some time now. Basically, here is my environment:

3 sites (NetBackup 7.6.1)

A (remote): media server + MSDP

B (HQ): master server; media server (B1) + MSDP, 4 LTO5 tape drives, basic disk for destaging to tape

             media server (B2), basic disk for destaging to tape on B1

C (remote): media server + basic disk

-Backups for remote site A go to MSDP and are then duplicated to B using a scheduled SLP.

-Some backups use B1 as the destination, and some use B2.

-All backups to disk are eventually destaged to tape; this runs hourly.

-All SLPs from the remote site replicate to the B1 MSDP and then duplicate to tape. I have a scheduled SLP set up for this; it only runs off-peak (outside business hours).

As you can see, all backups eventually get written to tape. Because destaging to tape AND duplication from MSDP to tape run in serial, the tape drives get very busy and all jobs end up competing for them.

The problems:

1. As more destaging jobs and SLP jobs compete for tape drive resources, we have to keep increasing the disk size for backups. Backup images are not dumped to tape fast enough, causing backups to fail with "disk storage is full".

2. We are already using job priority on backup jobs. The problem is that the high-priority jobs take the tape drives and keep the low-priority jobs queued; when the next round of backups comes in and needs to write to tape, it goes into the queue as well. This has a snowball effect: eventually more and more duplication and destaging jobs pile up in the queue, but little gets completed.

We have set mail, databases, etc. as high priority, but those backups are also large, so they take a long time to finish writing to tape (hogging the tape drives).

My questions:

1. From my observation, we need more tape drives, but how many more? Is there some sort of formula to calculate this (a best practice of some sort)?

2. Does anyone have a similar environment and setup? What would be your solution or workaround for the tape drive bottleneck?

Thank you everyone for your assistance.

  • True - it is best practice NOT to zone a disk WWPN target and a tape WWPN target to the same server/appliance initiator WWPN.

    It also used to be best practice - for some older HBAs - not to mix tape and disk on the same dual-port HBA, but this is less and less true these days with more modern HBAs. So, most sites do now mix tape and disk on the same server-side dual-port HBA card, just NOT on the same HBA port.

    If your server has two dual-port HBAs, and thus four initiator ports, and you have a dual (resilient) fabric SAN design, then perhaps you can do this:

    server HBA1 port 1 initiator - disk target - fabric A

    server HBA1 port 2 initiator - tape target - fabric B

    server HBA2 port 1 initiator - tape target - fabric A

    server HBA2 port 2 initiator - disk target - fabric B

17 Replies

  • Thanks for the replies, guys. Sorry, it has been a while since I last checked this thread.

    Anyway, as nicolai said, if the disk is slow to read, then sure enough the data is not being sent to tape fast enough, so surely there is a way to tune for the tape drive. As I understand it, there are two configs that need adjusting:

    Tape drive:

    NUMBER_DATA_BUFFERS  = 32

    SIZE_DATA_BUFFERS = 262144

    Disk:

    NUMBER_DATA_BUFFERS  = 32

    SIZE_DATA_BUFFERS = 262144
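    For anyone else reading: these settings are extension-less touch files containing just the number. A sketch of setting the tape pair from an elevated command prompt, assuming the default Windows install path (adjust for your environment):

    rem Tape buffer touch files - new values are picked up by newly started bptm processes
    echo 32 > "C:\Program Files\Veritas\NetBackup\db\config\NUMBER_DATA_BUFFERS"
    echo 262144 > "C:\Program Files\Veritas\NetBackup\db\config\SIZE_DATA_BUFFERS"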

    After changing both NUMBER_DATA_BUFFERS settings, here are the results:

    7/07/2015 2:57:18 AM - begin Duplicate
    7/07/2015 2:57:22 AM - requesting resource per1nbu04-hcart2-robot-tld-0
    7/07/2015 2:57:22 AM - awaiting resource per1nbu04-hcart2-robot-tld-0 - No drives are available
    7/07/2015 4:52:53 AM - granted resource 7490L5
    7/07/2015 4:52:53 AM - granted resource HP.ULTRIUM5-SCSI.001
    7/07/2015 4:52:53 AM - granted resource per1nbu04-hcart2-robot-tld-0
    7/07/2015 4:52:55 AM - Info bpdbm(pid=6044) catalogued 1232 entries          
    7/07/2015 4:52:55 AM - Info bptm(pid=9688) start            
    7/07/2015 4:52:55 AM - started process bptm (9688)
    7/07/2015 4:52:55 AM - Info bptm(pid=9688) start backup           
    7/07/2015 4:52:56 AM - Info bpdm(pid=5552) started            
    7/07/2015 4:52:56 AM - started process bpdm (5552)
    7/07/2015 4:52:56 AM - Info bpdm(pid=5552) reading backup image          
    7/07/2015 4:52:56 AM - Info bpdm(pid=5552) using 32 data buffers         
    7/07/2015 4:52:57 AM - Info bptm(pid=9688) media id 7490L5 mounted on drive index 1, drivepath {4,0,5,0}, drivename HP.ULTRIUM5-SCSI.001, copy 2
    7/07/2015 4:52:57 AM - begin reading
    7/07/2015 4:52:57 AM - Info bptm(pid=9688) INF - Waiting for positioning of media id 7490L5 on server per1nbu04 for writing.
    7/07/2015 5:19:31 AM - Info bptm(pid=9688) waited for full buffer 43488 times, delayed 97032 times    
    7/07/2015 5:19:39 AM - end reading; read time: 0:26:42
    7/07/2015 5:19:39 AM - Info bptm(pid=9688) EXITING with status 0 <----------        
    7/07/2015 5:19:39 AM - Info bpdm(pid=5552) completed reading backup image         
    7/07/2015 5:19:43 AM - end Duplicate; elapsed time: 2:22:25
    the requested operation was successfully completed (0)

    @genericus, we are doing disk staging, which doesn't show throughput... where do you see it?

  • During duplications from the DSSU to tape, open Task Manager, click the Performance tab, click Open Resource Monitor, maximize it, click the Disk tab, and expand the lower "Storage" panel...

    ...if you see Active Time (%) near 100% and a Disk Queue Length > 1.0, then this implies that the disk storage underneath the Windows NTFS drive/volume letter is not able to respond quickly enough to the reading/writing process(es).

    If you see the above, then use "PerfMon.msc" to view the disk queues in more detail, i.e. check for a long (i.e. > 1) read disk queue and/or a long (i.e. > 1) write disk queue.
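    If you prefer a command line check, typeperf can sample the same disk queue counters. A sketch - the _Total instance is just an assumption, so pick your DSSU volume's instance from the list that "typeperf -q PhysicalDisk" prints:

    rem Sample the read and write disk queues every 5 seconds, 12 samples
    typeperf "\PhysicalDisk(_Total)\Avg. Disk Read Queue Length" "\PhysicalDisk(_Total)\Avg. Disk Write Queue Length" -si 5 -sc 12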


    There are some other tips related to checking NTFS volume characteristics here:

    https://www-secure.symantec.com/connect/forums/msdp-msdp-slp-duplications-slow-what-check

  • Sorry I meant:

    Disk:

    NUMBER_DATA_BUFFERS_DISK  = 32

    SIZE_DATA_BUFFERS_DISK = 262144
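    These work the same way as the tape touch files; a sketch for the disk pair, again assuming the default Windows install path:

    rem The _DISK suffix makes these buffer settings apply to disk I/O rather than tape
    echo 32 > "C:\Program Files\Veritas\NetBackup\db\config\NUMBER_DATA_BUFFERS_DISK"
    echo 262144 > "C:\Program Files\Veritas\NetBackup\db\config\SIZE_DATA_BUFFERS_DISK"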

  • When this is seen:

    ...waited for full buffer 43488 times, delayed 97032 times  

    ...this means that of the 43,000 waits, bptm was delayed, on average, about twice per wait. The delay count of 97,000 * 15ms is just over 24 minutes of bptm waiting for data. Reducing the buffer count probably won't improve the situation. The issue is that the data producer (i.e. the media server acting as the DSSU) is unable to move data quickly enough to bptm (the data consumer). There is another layer to this, in that there are usually bptm parent and bptm child processes on the server, but there's probably no need to dig into that layer, because the data just isn't getting to bptm quickly enough.
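    As a rough check of that arithmetic from a command prompt (the 15ms per delay comes from the figures above):

    rem 97,032 delays * 15 ms each, converted to minutes
    set /a 97032 * 15 / 1000 / 60
    rem prints 24 - so roughly 24 of the 26-odd minutes of read time were spent waiting for data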

  • Regardless of what feeds and speeds you get, NUMBER_DATA_BUFFERS should be at least 128 for tape.
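    For example, on a Windows media server (default install path assumed; mind the memory math in the next reply before going higher):

    rem Raise the tape buffer count from 32 to 128
    echo 128 > "C:\Program Files\Veritas\NetBackup\db\config\NUMBER_DATA_BUFFERS"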

  • To determine throughput speed, I have to do the math: 1TB processed in 3 hours, for example, is just under 100MB/sec.
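    For example, from a command prompt, using 1 TB = 1,048,576 MB:

    rem 1 TB moved in 3 hours, expressed in MB/s
    set /a 1048576 / (3 * 3600)
    rem prints 97, i.e. just under 100MB/sec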

    Some devices have limits on how many "restore" jobs can run at once.

    Some vendors categorize output from disk as a restore, even if you are writing to tape.

    Check the media server for memory, since number of buffers * size of buffers * number of jobs = real memory used.
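    A quick worked example from a command prompt (128 buffers and 10 concurrent jobs are purely illustrative numbers):

    rem buffers * buffer size, in MB, then * concurrent jobs
    set /a 128 * 262144 / 1048576 * 10
    rem prints 320 - about 320MB of real memory if 10 such jobs run at once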

     

    Here are my tape media server values (note: buffers can have NDMP, DISK and RESTORE options).

    NDMP must match the filer for best results!

    Now checking NUMBER_DATA_BUFFERS:  128
    Now checking SIZE_DATA_BUFFERS:  262144
    Now checking NUMBER_DATA_BUFFERS_NDMP:   64
    Now checking SIZE_DATA_BUFFERS_NDMP:  262144
    Now checking NET_BUFFER_SZ:  262144
     

  • Hi sdo,

    Thanks for that info. I checked Performance Monitor as per your link, and here are the results while duplication is running:

    Avg. disk queue length: 7

    Avg. read disk queue length: 6

    Avg. write disk queue length: 0.018

    Disk active time: avg > 95%

    I would conclude that the disk is definitely underperforming. The last time I ran the camel tool from NetBackup, it was not even reaching the recommended 130MB/s, only up to 90MB/s.

    Also, how much of an effect is there if the dedupe database and data reside on the same drive letter? And if we want to split the MSDP database and data onto different drive letters, does that only involve changing the configuration, or is there a technote out there?

    Thank you.