Solved: want to know where is the bottleneck in the replic...

techiee · ‎12-06-2013

We are using AIR with Data Domain.

We cannot and don't want to launch manual replication between DataDomains.

We are using AIR and don't want to use DataDomain outside of NetBackup.

DDOS : 5.4.0.5-393571

OST plugin version : 2.6.0.2-365315

NetBackup master : SLES 11 SP2 on NBU7.6 FA

Within NetBackup, each replication between the Data Domains takes at least 2 minutes to complete, whereas the actual replication between the DDs takes less than 1 minute.

Sample :

29 oct. 2013 09:32:14 - requesting resource

29 oct. 2013 09:32:14 - granted resource

29 oct. 2013 09:32:14 - Info nbreplicate (pid=####) Suspend window close behavior is not supported for nbreplicate

29 oct. 2013 09:32:14 - Info nbreplicate (pid=####) window close behavior: Continue processing the current image

29 oct. 2013 09:32:14 - started process RUNCMD (pid=####)

29 oct. 2013 09:32:14 - requesting resource @####

29 oct. 2013 09:32:14 - reserving resource @####

29 oct. 2013 09:32:14 - resource @#### reserved

29 oct. 2013 09:32:14 - granted resource MediaID=@####;DiskVolume=;DiskPool=;Path=;StorageServer=;MediaServer=

29 oct. 2013 09:32:15 - Info bpdm (pid=####) started

29 oct. 2013 09:32:15 - started process bpdm (pid=####)

29 oct. 2013 09:32:16 - Info (pid=####) Using OpenStorage to replicate backup id , media id @####, storage server , disk volume

29 oct. 2013 09:32:16 - Info (pid=####) Replicating images to target storage server Unknown, disk volume

29 oct. 2013 09:34:18 - Info bpdm (pid=####) EXITING with status 0

29 oct. 2013 09:34:18 - Replicated backup id successfully

29 oct. 2013 09:34:18 - Info bpdm (pid=####) started

29 oct. 2013 09:34:18 - started process bpdm (pid=####)

29 oct. 2013 09:34:18 - requesting resource @####

29 oct. 2013 09:34:18 - granted resource MediaID=@####;DiskVolume=;DiskPool=;Path=;StorageServer=

29 oct. 2013 09:34:20 - Info agamar (pid=####) Using OpenStorage to replicate backup id , media id @####, storage server

29 oct. 2013 09:34:20 - Info agamar (pid=####) Replicating images to target storage server Unknown, disk volume

29 oct. 2013 09:36:21 - Info bpdm (pid=####) EXITING with status 0

29 oct. 2013 09:36:21 - Replicated backup id successfully

#####:~> bperror -jobid #### -U

TIME SERVER/CLIENT TEXT

10/29/2013 09:32:16 ##### Using OpenStorage to replicate backup id #####, media id @####, storage server #####

10/29/2013 09:32:16 ##### Replicating images to target storage server Unknown, disk volume #####

10/29/2013 09:34:18 ##### successfully replicated backup id #####, copy 1, #### Kbytes at 15.000 Kbytes/sec

If you look at the logs, you will notice that the replication between the DDs took less than a second, whereas NetBackup reported 2 minutes 2 seconds for the same image (backup id #####).
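The idle interval in the job details can be located mechanically. The following is only a minimal sketch: the log excerpt is copied from the first replication above with the date prefix dropped, and the helper name `largest_gap` is hypothetical, not part of any NetBackup tooling.

```python
from datetime import datetime

# Excerpt of the job details above (date prefix dropped, values already hashed).
JOB_LOG = """\
09:32:15 - started process bpdm (pid=####)
09:32:16 - Info (pid=####) Using OpenStorage to replicate backup id
09:32:16 - Info (pid=####) Replicating images to target storage server Unknown
09:34:18 - Info bpdm (pid=####) EXITING with status 0
09:34:18 - Replicated backup id successfully
"""

def largest_gap(log_text):
    """Return (seconds, message_before, message_after) for the longest silent interval."""
    entries = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" - ")
        entries.append((datetime.strptime(stamp, "%H:%M:%S"), message))
    best = (0.0, "", "")
    # Compare each line with the next one and keep the widest gap.
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, m0, m1)
    return best

seconds, before, after = largest_gap(JOB_LOG)
print(f"{seconds:.0f} s idle between '{before}' and '{after}'")
```

On this excerpt the whole delay sits between the "Replicating images" line and the bpdm exit, which is the window to examine in the bpdm logs.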

ddfs logs for the same replication:

###:/dd01/log # grep messages

Oct 29 09:32:17 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:4:1:: took 0 sec, net throughput 12 KB/s, virtual throughput 378 KB/s, refs_sent 6, refs_features_sent 0, segs_sent 0, segs_features_sent 3, size 65536, vbytes 65536, nbytes 2204

Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:: took 0 sec, net throughput 1 KB/s, virtual throughput 9 KB/s, refs_sent 1, refs_features_sent 0, segs_sent 0, segs_features_sent 1, size 8192, vbytes 8192, nbytes 1320

Oct 29 09:32:18 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###:: took 0 sec, net throughput 4 KB/s, virtual throughput 116 KB/s, refs_sent 6, refs_features_sent 0, segs_sent 0, segs_features_sent 4, size 65536, vbytes 65536, nbytes 2756

Oct 29 09:32:19 dd01 ddfs[####]: NOTICE: ####-####: filecopy ####: do_filecopy_send_file: file ###: took 1 sec, net throughput 664 KB/s, virtual throughput 1729 KB/s, refs_sent 201, refs_features_sent 0, segs_sent 0, segs_features_sent 200, size 1769472, vbytes 1769472, nbytes 680124
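As a rough cross-check of the Data Domain side, the four filecopy lines above can be totaled. This is a sketch only: each string below keeps just the `took`, `size`, `vbytes` and `nbytes` fields from the corresponding log line, and the regular expression assumes that field order.

```python
import re

# Fields copied from the four do_filecopy_send_file lines above; other fields trimmed.
DDFS_LINES = [
    "took 0 sec, size 65536, vbytes 65536, nbytes 2204",
    "took 0 sec, size 8192, vbytes 8192, nbytes 1320",
    "took 0 sec, size 65536, vbytes 65536, nbytes 2756",
    "took 1 sec, size 1769472, vbytes 1769472, nbytes 680124",
]

PATTERN = re.compile(r"took (\d+) sec.*?vbytes (\d+), nbytes (\d+)")

total_took = total_vbytes = total_nbytes = 0
for line in DDFS_LINES:
    match = PATTERN.search(line)
    if match is None:
        continue  # skip lines that are not filecopy summaries
    took, vbytes, nbytes = (int(group) for group in match.groups())
    total_took += took
    total_vbytes += vbytes
    total_nbytes += nbytes

print(f"filecopy time reported by ddfs: {total_took} s")
print(f"logical bytes: {total_vbytes}, bytes sent over the wire: {total_nbytes}")
```

The ddfs side accounts for about one second of transfer in total, against the 2 minutes 2 seconds reported by the NetBackup job.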

cat /usr/openv/var/global/nbcl.conf

SLP.MAX_SIZE_PER_DUPLICATION_JOB = 400 GB

SLP.MAX_TIME_TIL_FORCE_SMALL_DUPLICATION_JOB = 40 minutes

SLP.MIN_SIZE_PER_DUPLICATION_JOB = 50 GB

SLP.MAX_SIZE_PER_BACKUP_REPLICATION_JOB = 50 MB

SLP.JOB_SUBMISSION_INTERVAL = 5 minutes

SLP.IMAGE_PROCESSING_INTERVAL = 5 minutes

Due to the secure environment, we can't share any logs or further environmental details. As you may have noticed, I have hashed out all personal and non-generic information.

The link between the Data Domains is 10 Gb.

We have to replicate thousands of very small backup images and the minimum time to replicate a backup image using AIR/Data Domain is 2 minutes.

2 minutes is too long for us.

When we duplicate many small backup images in a single batch, the duplication is sequential and each image takes 2 minutes to duplicate.

Our backups are spread throughout the night, and we normally have many images in the file list of a replication job.
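To put the sequential behaviour in numbers, here is a back-of-the-envelope estimate. The batch size of 1,000 images is a hypothetical figure (we only know it is "thousands"); the 122 seconds per image is the elapsed time seen in the job details above.

```python
SECONDS_PER_IMAGE = 122   # 09:32:16 -> 09:34:18 in the job details
IMAGE_COUNT = 1000        # hypothetical batch size, for illustration only

def sequential_replication_hours(image_count, seconds_per_image):
    """Total wall-clock hours when images are replicated strictly one after another."""
    return image_count * seconds_per_image / 3600

total_seconds = IMAGE_COUNT * SECONDS_PER_IMAGE
hours = sequential_replication_hours(IMAGE_COUNT, SECONDS_PER_IMAGE)
print(f"{IMAGE_COUNT} images x {SECONDS_PER_IMAGE} s = {total_seconds} s (~{hours:.1f} h)")
```

Even at that assumed batch size, the fixed per-image delay alone exceeds a full day, regardless of the 10 Gb link.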

Data Domain support comment:

When looking at the Data Domain logs, I can see that the ddboost optdup ended properly and no more optdup operations are logged after that.

At this point I would say NBU and Data Domain don't agree about the ddboost optdup status. Also, although NBU asks to retry the operation, the ddboost optdup runs only once.

In short, I don't see any ddboost filecopy activity for these images at that time on the Data Domain, but I see the NBU logs reporting "ddpi_remote_fileop() success for operation: 1".

We want to know where the bottleneck in the replication is: NetBackup or Data Domain?

Any thoughts or suggestions would be so helpful & highly appreciated.

techiee · ‎12-19-2013

Sorry for the late response.

Thank you, Marianne & Steven, so much for your timely responses. :)

The issue is still not solved, but we found where the problem is:

- The time difference is due to a HARD CODED sleep period at the end of a function call, found in the bpdm logs.

This may possibly become tunable in a future 7.6.x.x release.


Marianne · ‎12-06-2013

Check bptm logs on both sides.

You should be able to see in these logs what buffer size is used and if the process was waiting for full or empty buffers.


Steven576 · ‎12-09-2013

On your destination Data Domain, run the command below:

ddboost event show

If you see lots of events waiting, try configuring the setting below on the destination master server:

SLP.IMAGE_PROCESSING_INTERVAL = 10 seconds

The bottleneck looks to be how quickly your destination master server is importing the images you are replicating.

