
NetBackup duplication from disk to tape using SLP and destaging

BackupGuy2015
Level 3

Hello all, this is my first post, but I have been reading this forum for some time now. Basically, here is my environment:

3 sites (NetBackup 7.6.1)

A (remote). Media server + MSDP

B (HQ). Master; media server (B1) + MSDP, 4 x LTO5 tape drives, basic disk for destaging to tape.

             Media server (B2), basic disk for destaging to tape on B1.

C (remote). Media server + basic disk

- Backups for remote site A go to the local MSDP, then duplicate to B using a scheduled SLP.

- Some backups use B1 as the destination and others use B2.

- All backups to disk eventually destage to tape; this runs hourly.

- All SLPs from the remote sites replicate to the B1 MSDP, then duplicate to tape. I have a scheduled SLP set up for this; it only runs during off-peak (non-business) hours.

As you can see, all backups eventually write to tape. Because destaging to tape AND duplication from MSDP to tape run in serial, the tape drives get very busy and all jobs compete for them.

The problems:

1. As more destaging jobs and SLP jobs compete for tape drive resources, we have to keep increasing the disk size for backups. Backup images are not dumped to tape fast enough, causing backups to fail because the disk storage is full.

2. We are already using job priorities on backup jobs. The problem is that the high-priority jobs use the tape drives and keep the low-priority jobs queued; when the next round of backups comes in and needs to write to tape, it also goes into the queue. So there is a snowball effect: more and more duplication and destaging jobs sit in the queue, but little gets completed.

We have set mail, databases, etc. as high priority, but they also have large backups, so they take a long time to finish writing to tape (hogging the tape drives).

My questions:

1. From my observation, we need more tape drives, but how many more? Is there some sort of formula (or best practice) to calculate this?

2. Does anyone have a similar environment and setup? What would be your solution or workaround for the tape drive bottleneck?

Thank you everyone for your assistance.

 

 


Nicolai
Moderator
Partner    VIP   

You know:

  • how many TB are being backed up by NetBackup
  • the staging window (e.g. from 08:00 to 17:00).

An LTO5 drive writing at 120 MB/sec can write 432 GB per hour (120 MB/sec * 60 * 60). So non-stop writing from 08:00 to 17:00 is about 3.8 TB per drive. Then divide the number of TB being protected by 3.8 TB and you have the number of drives required.

That said, it's a very optimistic value. Different schedules and volume pools reduce the 3.8 TB, plus it's very hard to stream a tape drive at 120 MB/sec continuously. So a better value would be 2.5 TB per drive.
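
As a rough sketch of that math (the 40 TB of protected data and the 9-hour window below are assumed example figures, not numbers from this thread):

import math

# Assumed example inputs - substitute your own figures.
tb_protected = 40.0        # TB that must reach tape within the window (assumption)
window_hours = 9           # staging window, e.g. 08:00 to 17:00
mb_per_sec = 120           # per-drive streaming rate used above

optimistic_tb_per_drive = mb_per_sec * 3600 * window_hours / 1e6   # ~3.9 TB per drive
realistic_tb_per_drive = 2.5                                       # derated value suggested above

drives_needed = math.ceil(tb_protected / realistic_tb_per_drive)
print(f"optimistic: {optimistic_tb_per_drive:.1f} TB/drive, "
      f"drives needed at {realistic_tb_per_drive} TB/drive: {drives_needed}")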

Please also ensure NUMBER_DATA_BUFFERS is set to 256 or more for best tape drive performance.
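
For reference, NUMBER_DATA_BUFFERS and SIZE_DATA_BUFFERS are plain touch files on the media server, each containing a single number (assuming a default Windows install path):

C:\Program Files\Veritas\NetBackup\db\config\NUMBER_DATA_BUFFERS
C:\Program Files\Veritas\NetBackup\db\config\SIZE_DATA_BUFFERS

On a UNIX media server the equivalent directory is /usr/openv/netbackup/db/config. New values take effect for the next job that starts.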

Marianne
Level 6
Partner    VIP    Accredited Certified
Any reason why you don't want to allow duplication to tape 24x7? Disk should ideally be large enough to store at least 2 sets of full backups. The main reason for disk backups is to allow for quick restores. If you need all disk backups to be duplicated to tape and emptied out before the next backup, I don't see any point in writing to disk first. You may as well back up directly to tape and put multiplexing and multistreaming to good use. I am also wondering if your single master/media server has enough physical resources (memory, CPU, HBAs) to stream all the devices attached to it?

BackupGuy2015
Level 3

Thanks @Nicolai for the response. For LTO5, I believe the write speed should be 140 MB/s. After looking at some of the completed duplication jobs, I can see that the tape drive was not optimized:

13/06/2015 4:00:16 PM - Info bptm(pid=5404) media id 7570L5 mounted on drive index 1, drivepath {4,0,5,0}, drivename HP.ULTRIUM5-SCSI.001, copy 2
13/06/2015 4:00:16 PM - Info bptm(pid=5404) INF - Waiting for positioning of media id 7570L5 on server per1nbu04 for writing.
13/06/2015 4:08:06 PM - Info bptm(pid=5404) waited for full buffer 14061 times, delayed 27940 times    

It seems all completed duplication jobs show "waited for full buffer"... currently our settings are as follows:

NUMBER_DATA_BUFFERS  = 64

SIZE_DATA_BUFFERS = 262144

I will try changing NUMBER_DATA_BUFFERS to 256. Do I need to change SIZE_DATA_BUFFERS? It seems this performance issue will need to be addressed first, and then we can see whether the number of tape drives is actually coping with the data being backed up.

@Marianne, regarding not allowing duplication 24x7:

Apparently NetBackup is taking too much WAN bandwidth during the day and affecting production. I have used the bandwidth limiting option in NetBackup, but Symantec support has confirmed that bandwidth limiting for optimized duplication will not limit traffic exactly as configured, because there is other overhead that will be there regardless. For example, if I limit optimized duplication to 5 MB/s (the WAN link is 20 MB/s), then the actual bandwidth usage will be 5 MB/s plus overhead. The actual bandwidth usage was reported by the WAN appliance (Riverbed). Has anyone had experience using bandwidth limiting for optimized duplication? Does it work as configured?

"I am also wondering if your single master/media server has enough physical resources (memory, CPU, HBAs) to stream all the devices attached to it?"

Yes, the media server that has the 4 tape drives attached is configured as follows:

CPU: Intel Xeon 2.27 GHz

Memory: 24 GB

OS: 64-bit Windows 2008 R2

HBA: QLogic, all running point-to-point at 4 Gb

I will try Nicolai's suggestion and go from there. I will post the results after changing the value.

Thank you.

 

 

BackupGuy2015
Level 3

Oh, another thing I found out: there are 2 HBAs on the media server. Each HBA port is shared between disk and a tape drive. Those disks are used for the dedup pool (MSDP). So the configuration is as follows:

HBA1:

Port A: 5 x disk, tape

Port B: 5 x disk, tape

HBA2:

Port A: 5 x disk, tape

Port B: 5 x disk, tape

As a best practice, should I be separating those tape drives onto their own ports?

Thank you.

sdo
Moderator
Partner    VIP    Certified

True - it is best practice to NOT zone disk WWPN target and tape WWPN target to the same server/appliance initiator WWPN.

It also used to be best practice - for some older HBAs - not to mix tape and disk on the same dual-port HBA, but this is less and less true these days with more modern HBAs.  So most sites do now mix tape and disk on the same server-side dual-port HBA card, but just NOT on the same HBA port.

If your server has two dual port HBAs, and thus four initiator ports, and you have dual (resilient) fabric SAN design, then perhaps you can do this:

server HBA1 port 1 initiator - disk target - fabric A

server HBA1 port 2 initiator - tape target - fabric B

server HBA2 port 1 initiator - tape target - fabric A

server HBA2 port 2 initiator - disk target - fabric B

 

BackupGuy2015
Level 3

Thank you for the information, sdo. I will be sure to follow that best practice.

sdo
Moderator
Partner    VIP    Certified

You should not need to change SIZE_DATA_BUFFERS, because the size you have of 256 KB is a typical good value for LTO5 drives.  And if/when changing SIZE_DATA_BUFFERS, one has to be very careful about media being readable at other sites, so my advice (right now) would be to leave SIZE_DATA_BUFFERS at the current setting.

Increasing NUMBER_DATA_BUFFERS could be a good thing, but your bptm process is waiting for a full buffer, which means that the duplication job is not receiving data from the duplication source quickly enough; however, the delay count of 27940 is quite low and not a significant problem - so you may not need to change this.  And remember, changing NUMBER_DATA_BUFFERS can have quite an effect on systems with low free RAM - so always do your calculation of total RAM consumed for buffers = number-of-tape-drives * multiplexing * number-of-data-buffers * size-of-buffers.
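
A minimal sketch of that memory calculation (the drive count and multiplexing factor below are assumptions for illustration):

# Estimate RAM consumed by NetBackup tape buffers on a media server.
number_of_tape_drives = 4      # drives attached to this media server
multiplexing = 1               # duplications typically run unmultiplexed (assumption)
number_data_buffers = 256      # proposed NUMBER_DATA_BUFFERS
size_data_buffers = 262144     # SIZE_DATA_BUFFERS in bytes (256 KB)

ram_bytes = (number_of_tape_drives * multiplexing
             * number_data_buffers * size_data_buffers)
print(f"buffer RAM: {ram_bytes / 1024**2:.0f} MB")   # 4 * 1 * 256 * 256 KB = 256 MB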

BackupGuy2015
Level 3

Thank you, sdo, much appreciated. I will post further results after changing the value.

Nicolai
Moderator
Partner    VIP   

Increasing NUMBER_DATA_BUFFERS for "waited for full buffer" will not help, since data isn't received fast enough from the source for NetBackup to fill a data buffer. NetBackup is saying "I can't send data to tape, I have no full buffers". If NetBackup were saying "waited for empty buffer", increasing NUMBER_DATA_BUFFERS would help.

Genericus
Moderator
   VIP   

If your staging to tape is impacting your production LAN, get off the production network.

Set up a dedicated backup VLAN, and use that.

Set up disk storage, write to that, then go from there to tape.

I use three Data Domains, and we pummel them. I write over 100 TB through my DDs every day - to 17 tape drives.

The 17 includes some spares so I can do restores, run separate long-term backups, etc.

 

Nicolai is correct - do the math. However, test it; I find that we average closer to 100 MB/sec than the maximum throughput. Some systems cannot generate enough throughput.

 

Before we bought the Data Domains, we made EMC prove they could handle the throughput we needed. Some vendors can handle ingest or output, but not both. TEST in your environment!

NetBackup 10.2.0.1 on Flex 5360, duplicating via SLP to Access 3350, duplicating via SLP to LTO8 in SL8500 via ACSLS

BackupGuy2015
Level 3

Thanks for the replies, guys. Sorry, it has been a while since I last checked this thread.

Anyway, as Nicolai said, if the disk is slow to read and the data is not being sent to tape fast enough, then surely there is a way to tune for the tape drive. As I understand it, there are 2 configs that need adjusting:

Tape drive:

NUMBER_DATA_BUFFERS  = 32

SIZE_DATA_BUFFERS = 262144

Disk:

NUMBER_DATA_BUFFERS  = 32

SIZE_DATA_BUFFERS = 262144

After changing both NUMBER_DATA_BUFFERS values, here are the results:

7/07/2015 2:57:18 AM - begin Duplicate
7/07/2015 2:57:22 AM - requesting resource per1nbu04-hcart2-robot-tld-0
7/07/2015 2:57:22 AM - awaiting resource per1nbu04-hcart2-robot-tld-0 - No drives are available
7/07/2015 4:52:53 AM - granted resource 7490L5
7/07/2015 4:52:53 AM - granted resource HP.ULTRIUM5-SCSI.001
7/07/2015 4:52:53 AM - granted resource per1nbu04-hcart2-robot-tld-0
7/07/2015 4:52:55 AM - Info bpdbm(pid=6044) catalogued 1232 entries          
7/07/2015 4:52:55 AM - Info bptm(pid=9688) start            
7/07/2015 4:52:55 AM - started process bptm (9688)
7/07/2015 4:52:55 AM - Info bptm(pid=9688) start backup           
7/07/2015 4:52:56 AM - Info bpdm(pid=5552) started            
7/07/2015 4:52:56 AM - started process bpdm (5552)
7/07/2015 4:52:56 AM - Info bpdm(pid=5552) reading backup image          
7/07/2015 4:52:56 AM - Info bpdm(pid=5552) using 32 data buffers         
7/07/2015 4:52:57 AM - Info bptm(pid=9688) media id 7490L5 mounted on drive index 1, drivepath {4,0,5,0}, drivename HP.ULTRIUM5-SCSI.001, copy 2
7/07/2015 4:52:57 AM - begin reading
7/07/2015 4:52:57 AM - Info bptm(pid=9688) INF - Waiting for positioning of media id 7490L5 on server per1nbu04 for writing.
7/07/2015 5:19:31 AM - Info bptm(pid=9688) waited for full buffer 43488 times, delayed 97032 times    
7/07/2015 5:19:39 AM - end reading; read time: 0:26:42
7/07/2015 5:19:39 AM - Info bptm(pid=9688) EXITING with status 0 <----------        
7/07/2015 5:19:39 AM - Info bpdm(pid=5552) completed reading backup image         
7/07/2015 5:19:43 AM - end Duplicate; elapsed time: 2:22:25
the requested operation was successfully completed (0)

 

@Genericus, we are doing disk staging, which doesn't show throughput... where do you see it?

 

 

sdo
Moderator
Partner    VIP    Certified

During duplications from DSSU to tape, open TaskMgr, click Performance tab, click Open Resource Monitor, maximize, click the Disk tab, expand the lower "Storage" panel...

...if you see Active Time (%) near 100% and a Disk Queue Length > 1.0, then this implies that the disk storage underneath the Windows NTFS drive/volume letter is not able to respond quickly enough to the reading/writing process(es).

If you see the above, then use "perfmon.msc" to view the disk queues in more detail, i.e. check for a long (> 1) read disk queue and/or a long (> 1) write disk queue.


There are some other tips related to checking NTFS volume characteristics here:

https://www-secure.symantec.com/connect/forums/msdp-msdp-slp-duplications-slow-what-check

BackupGuy2015
Level 3

Sorry I meant:

Disk:

NUMBER_DATA_BUFFERS_DISK  = 32

SIZE_DATA_BUFFERS_DISK = 262144

sdo
Moderator
Partner    VIP    Certified

When this is seen:

...waited for full buffer 43488 times, delayed 97032 times  

...this means that across the 43,000 waits, bptm was delayed, on average, about twice per wait.  The delay count of 97,000 * 15 ms is just over 24 minutes of bptm waiting for data.  Reducing the buffer count probably won't improve the situation.  The issue is that the data producer (i.e. the media server acting as the DSSU) is unable to move data quickly enough to bptm (the data consumer).  There is another layer to this, in that there are usually bptm parent and bptm child processes on the server, but there's probably no need to dig into that layer, because the data just isn't getting to bptm quickly enough.
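
As a quick way to do that arithmetic on any bptm log line (a minimal sketch; the 15 ms per delay is the estimate used above):

import re

line = "waited for full buffer 43488 times, delayed 97032 times"

# Pull the wait and delay counts out of the bptm message.
m = re.search(r"waited for (full|empty) buffer (\d+) times, delayed (\d+) times", line)
buffer_type, waits, delays = m.group(1), int(m.group(2)), int(m.group(3))

delay_ms = 15   # approximate length of one bptm delay (the estimate used above)
minutes_waiting = delays * delay_ms / 1000 / 60

print(f"{buffer_type}-buffer waits: {waits}, delays: {delays}, "
      f"~{minutes_waiting:.1f} minutes waiting, ~{delays / waits:.1f} delays per wait")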

Nicolai
Moderator
Partner    VIP   

Regardless of what feeds and speeds you get, NUMBER_DATA_BUFFERS should be at least 128 for tape.

Genericus
Moderator
   VIP   

To determine throughput speed, I have to do the math: 1 TB processed in 3 hours, for example, is just under 100 MB/sec.
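
For example (a minimal sketch using decimal TB and MB):

tb_processed = 1.0      # TB moved by the job
elapsed_hours = 3.0     # elapsed time

mb_per_sec = tb_processed * 1_000_000 / (elapsed_hours * 3600)
print(f"{mb_per_sec:.0f} MB/sec")   # ~93 MB/sec, i.e. just under 100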

Some devices have limits on how many "restore" jobs can run at once.

Some vendors categorize output from disk as a restore, even if you are writing to tape.

 

Check the media server for memory, since number of buffers * size of buffers * number of jobs = real memory used.

 

Here are my tape media server values: (note - buffers can have NDMP, DISK and RESTORE options)

NDMP must match filer for best results!

Now checking NUMBER_DATA_BUFFERS:  128
Now checking SIZE_DATA_BUFFERS:  262144
Now checking NUMBER_DATA_BUFFERS_NDMP:   64
Now checking SIZE_DATA_BUFFERS_NDMP:  262144
Now checking NET_BUFFER_SZ:  262144
 

NetBackup 10.2.0.1 on Flex 5360, duplicating via SLP to Access 3350, duplicating via SLP to LTO8 in SL8500 via ACSLS

BackupGuy2015
Level 3

Hi sdo,

Thanks for that info. I checked Performance Monitor as per your link, and here are the results while duplication is running:

Avg disk queue length: 7

Avg read disk queue length: 6

Avg write disk queue length: 0.018

Disk activity: avg > 95%

I would conclude that the disk is definitely underperforming. The last time I ran the camel tool from NetBackup, it was not even reaching 130 MB/s (the recommended value), only up to 90 MB/s.

Also, how much of an effect is there if the dedupe database and data reside on the same drive letter? And if we want to split the MSDP database and data onto different drive letters, will that only involve changing the configuration, or is there a technote out there?

Thank you.