
Where is the bottleneck in the replication: NetBackup or Data Domain?

techiee
Level 3
Employee Accredited
We are using AIR with Data Domain. 
We cannot and don't want to launch manual replication between DataDomains. 
We are using AIR and don't want to use DataDomain outside of NetBackup. 
 
 
DDOS : 5.4.0.5-393571 
OST plugin version : 2.6.0.2-365315 
 
NetBackup master : SLES 11 SP2 on NBU7.6 FA 
 
 
Within NetBackup, the elapsed time to complete a replication between the Data Domains is at least 2 minutes. However, the actual replication between the DDs takes less than 1 minute.
 
Sample : 
 
29 oct. 2013 09:32:14 - requesting resource 
29 oct. 2013 09:32:14 - granted resource 
29 oct. 2013 09:32:14 - Info nbreplicate (pid=####) Suspend window close behavior is not supported for nbreplicate 
29 oct. 2013 09:32:14 - Info nbreplicate (pid=####) window close behavior: Continue processing the current image 
29 oct. 2013 09:32:14 - started process RUNCMD (pid=####) 
29 oct. 2013 09:32:14 - requesting resource @#### 
29 oct. 2013 09:32:14 - reserving resource @#### 
29 oct. 2013 09:32:14 - resource @#### reserved 
29 oct. 2013 09:32:14 - granted resource MediaID=@####;DiskVolume=;DiskPool=;Path=;StorageServer=;MediaServer= 
29 oct. 2013 09:32:15 - Info bpdm (pid=####) started 
29 oct. 2013 09:32:15 - started process bpdm (pid=####) 
29 oct. 2013 09:32:16 - Info  (pid=####) Using OpenStorage to replicate backup id , media id @####, storage server , disk volume 
29 oct. 2013 09:32:16 - Info  (pid=####) Replicating images to target storage server Unknown, disk volume 
29 oct. 2013 09:34:18 - Info bpdm (pid=####) EXITING with status 0 
29 oct. 2013 09:34:18 - Replicated backup id  successfully 
29 oct. 2013 09:34:18 - Info bpdm (pid=####) started 
29 oct. 2013 09:34:18 - started process bpdm (pid=####) 
29 oct. 2013 09:34:18 - requesting resource @#### 
29 oct. 2013 09:34:18 - granted resource MediaID=@####;DiskVolume=;DiskPool=;Path=;StorageServer= 
29 oct. 2013 09:34:20 - Info agamar (pid=####) Using OpenStorage to replicate backup id , media id @####, storage server  
29 oct. 2013 09:34:20 - Info agamar (pid=####) Replicating images to target storage server Unknown, disk volume 
29 oct. 2013 09:36:21 - Info bpdm (pid=####) EXITING with status 0 
29 oct. 2013 09:36:21 - Replicated backup id  successfully 
 
 
#####:~> bperror -jobid #### -U 
TIME SERVER/CLIENT TEXT 
10/29/2013 09:32:16 ##### Using OpenStorage to replicate backup id #####, media id @####, storage server #####
10/29/2013 09:32:16 ##### Replicating images to target storage server Unknown, disk volume #####
10/29/2013 09:34:18 ##### successfully replicated backup id #####, copy 1, #### Kbytes at 15.000 Kbytes/sec 
 
 
 
If you look at the logs, you will notice that the replication between the DDs took less than a second, while NetBackup reported 2 minutes 2 seconds for the same image (backup id #####).
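To make the 2 minutes 2 seconds figure reproducible, here is a minimal Python sketch that pairs each "Replicating images" line with the next "EXITING with status" line from the job details above. The function name and the pairing rule are my own assumptions for illustration, not part of NetBackup, and the sketch assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# Four lines copied from the job details above (hashed values left as-is).
JOB_LOG = """\
29 oct. 2013 09:32:16 - Info  (pid=####) Replicating images to target storage server Unknown, disk volume
29 oct. 2013 09:34:18 - Info bpdm (pid=####) EXITING with status 0
29 oct. 2013 09:34:20 - Info agamar (pid=####) Replicating images to target storage server Unknown, disk volume
29 oct. 2013 09:36:21 - Info bpdm (pid=####) EXITING with status 0
"""

# Only the HH:MM:SS part is parsed; the date is ignored (same-day assumption).
TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) - (.*)")


def replication_durations(log_text):
    """Return seconds between each 'Replicating images' line and the next exit line."""
    durations = []
    start = None
    for line in log_text.splitlines():
        match = TIME_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if "Replicating images" in message:
            start = stamp
        elif "EXITING with status" in message and start is not None:
            durations.append(int((stamp - start).total_seconds()))
            start = None
    return durations


print(replication_durations(JOB_LOG))  # [122, 121]
```

Both images in the sample spend roughly two minutes between the start of replication and the bpdm exit.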
 
 
ddfs logs for the same backup image:
###:/dd01/log # grep  messages 
Oct 29 09:32:17 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:4:1:: took 0 sec, net throughput 12 KB/s, virtual throughput 378 KB/s, refs_sent 6, refs_features_sent 0, segs_sent 0, segs_features_sent 3, size 65536, vbytes 65536, nbytes 2204 
Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:: took 0 sec, net throughput 1 KB/s, virtual throughput 9 KB/s, refs_sent 1, refs_features_sent 0, segs_sent 0, segs_features_sent 1, size 8192, vbytes 8192, nbytes 1320 
Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:: took 0 sec, net throughput 4 KB/s, virtual throughput 116 KB/s, refs_sent 6, refs_features_sent 0, segs_sent 0, segs_features_sent 4, size 65536, vbytes 65536, nbytes 2756 
Oct 29 09:32:19 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###: took 1 sec, net throughput 664 KB/s, virtual throughput 1729 KB/s, refs_sent 201, refs_features_sent 0, segs_sent 0, segs_features_sent 200, size 1769472, vbytes 1769472, nbytes 680124 
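For comparison with the NetBackup side, the following Python sketch totals the "took N sec", vbytes and nbytes fields of the four filecopy lines above. The sample lines are trimmed to the fields the regular expression reads, and the function name and regular expression are assumptions of mine, not a Data Domain tool.

```python
import re

# The four do_filecopy_send_file lines from above, trimmed to the fields used here.
DDFS_LOG = """\
Oct 29 09:32:17 dd01 ddfs[####]: NOTICE: filecopy ####: do_filecopy_send_file: took 0 sec, size 65536, vbytes 65536, nbytes 2204
Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: filecopy ####: do_filecopy_send_file: took 0 sec, size 8192, vbytes 8192, nbytes 1320
Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: filecopy ####: do_filecopy_send_file: took 0 sec, size 65536, vbytes 65536, nbytes 2756
Oct 29 09:32:19 dd01 ddfs[####]: NOTICE: filecopy ####: do_filecopy_send_file: took 1 sec, size 1769472, vbytes 1769472, nbytes 680124
"""

FILECOPY_RE = re.compile(r"took (\d+) sec.*?vbytes (\d+), nbytes (\d+)")


def summarize_filecopy(log_text):
    """Sum the transfer time and byte counters over all filecopy lines."""
    totals = {"files": 0, "took_sec": 0, "vbytes": 0, "nbytes": 0}
    for line in log_text.splitlines():
        match = FILECOPY_RE.search(line)
        if not match:
            continue
        took, vbytes, nbytes = (int(value) for value in match.groups())
        totals["files"] += 1
        totals["took_sec"] += took
        totals["vbytes"] += vbytes
        totals["nbytes"] += nbytes
    return totals


print(summarize_filecopy(DDFS_LOG))
```

For this sample the Data Domain reports about 1 second of filecopy time in total across the four files, which is the number to set against the roughly 122 seconds seen in the NetBackup job.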
 
 
 cat /usr/openv/var/global/nbcl.conf
SLP.MAX_SIZE_PER_DUPLICATION_JOB = 400 GB
SLP.MAX_TIME_TIL_FORCE_SMALL_DUPLICATION_JOB = 40 minutes
SLP.MIN_SIZE_PER_DUPLICATION_JOB = 50 GB
SLP.MAX_SIZE_PER_BACKUP_REPLICATION_JOB = 50 MB
SLP.JOB_SUBMISSION_INTERVAL = 5 minutes
SLP.IMAGE_PROCESSING_INTERVAL = 5 minutes
 
Due to the secure environment, we cannot share any logs or further environment details. As you may have noticed, I hashed all personal and non-generic information.
 
The link between the Data Domains is 10 Gb.
We have to replicate thousands of very small backup images, and the minimum time to replicate a single backup image using AIR/Data Domain is 2 minutes.
2 minutes is too long for us.
 
When we duplicate many small backup images in a single batch, the duplication is sequential and each image takes 2 minutes.
Our backups are spread throughout the night, and we normally have many images in the file list of each replication job.
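To show why a fixed 2 minutes per image matters at this scale, here is a small Python sketch of the arithmetic. The image count of 1000 is a hypothetical stand-in for "thousands", the 122 seconds comes from the job above, and the 1 second figure is the approximate filecopy time from the Data Domain log.

```python
def batch_hours(image_count, seconds_per_image):
    """Hours needed to replicate images one after another."""
    return image_count * seconds_per_image / 3600


IMAGES = 1000  # hypothetical count; the post only says "thousands"

# Sequential replication at the observed NetBackup pace (122 s per image).
print(f"{batch_hours(IMAGES, 122):.1f} hours")  # 33.9 hours

# Same batch if each image took only the ~1 s the Data Domain reports.
print(f"{batch_hours(IMAGES, 1):.1f} hours")  # 0.3 hours
```

Under these assumptions a sequential batch of 1000 small images would not finish within a single night, even though the data transfer itself is a matter of minutes.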
 
 
Data Domain support comment:
 
When looking at the Data Domain logs, I can see that the ddboost optdup ended properly and that no more optdup operations are logged after that.
At this point I would say NBU and Data Domain don't agree about the ddboost optdup status. Also, although NBU asks to retry the operation, the ddboost optdup runs only once.
In short, I don't see any ddboost filecopy activity for these images at that time on the Data Domain, but I do see NBU logs reporting a "ddpi_remote_fileop() success for operation: 1" operation.
 
 
We want to know where the bottleneck in the replication is: NetBackup or Data Domain?
Any thoughts or suggestions would be very helpful and highly appreciated.
1 ACCEPTED SOLUTION

Accepted Solutions

techiee
Level 3
Employee Accredited

Sorry for the late response.

Thank you, Marianne and Steven, for your timely responses. :)

The issue is still not solved, but we found where the problem is:

- The time difference is due to a HARD CODED sleep period at the end of a function call, found in the bpdm logs.

This may become tunable in a future 7.6.x.x release.


3 REPLIES 3

Marianne
Level 6
Partner    VIP    Accredited Certified

Check bptm logs on both sides.

You should be able to see in these logs what buffer size is used and if the process was waiting for full or empty buffers.

Steven576
Level 1

On your destination Data Domain, run the command below:

ddboost event show

If you see lots of events waiting, try configuring the following on the destination master server:

SLP.IMAGE_PROCESSING_INTERVAL = 10 seconds

The bottleneck looks to be how quickly your destination master server imports the images you are replicating.

 

 
