Forum Discussion

Tanmoy1
Level 4
12 years ago

AIR Replication Jobs are getting queued.

 

Over the last two weeks we have had a huge queue of AIR replication jobs, and many of the active jobs are running very slowly. A few of the oldest replications are also failing with error code 227. My questions are:
 
1. Is there any way to improve the performance of the replication jobs, given the environment settings below?
2. Is there any way to control the number of replication jobs that are active at any one time?
 
Please help me...Thanks in advance.
 
Environment Details:
Master / Media Server : Netbackup Appliance 5220 2.5.2 
Netbackup Version : 7.5.0.5
Replication with AIR
 
Replication Bandwidth limitation :
master1:/disk/etc/puredisk # cat agent.cfg | grep bandwidth
# A bandwidth limit, in KiB/sec.
bandwidthlimit=1280
 
SLP parameters:
master1:/usr/openv/netbackup/db/config # cat LIFECYCLE_PARAMETERS 
AUTO_CREATE_IMPORT_SLP = 1 
MAX_GB_SIZE_PER_DUPLICATION_JOB = 100 
MIN_GB_SIZE_PER_DUPLICATION_JOB = 25
 
 
 
Replication failed with 227 (detailed log):
06/17/2013 01:11:48 - requesting resource LCM_stu_disk_master1
06/17/2013 01:11:48 - Info nbrb (pid=21872) Limit has been reached for the logical resource LCM_stu_disk_master1
07/10/2013 02:42:21 - granted resource  LCM_stu_disk_master1
07/10/2013 02:42:23 - started process RUNCMD (pid=6070)
07/10/2013 02:42:24 - Info bpdm (pid=6101) started
07/10/2013 02:42:24 - started process bpdm (pid=6101)
07/10/2013 02:42:24 - requesting resource @aaaac
07/10/2013 02:42:24 - reserving resource @aaaac
07/10/2013 02:42:24 - resource @aaaac reserved
07/10/2013 02:42:24 - granted resource  MediaID=@aaaac;DiskVolume=PureDiskVolume;DiskPool=dp_disk_master1;Path=PureDiskVolume;StorageServer=master1;MediaServer=master1
07/10/2013 02:44:42 - Info master1 (pid=6101) Using OpenStorage to replicate backup id Client1-db_1371393521, media id @aaaac, storage server master1, disk volume PureDiskVolume
07/10/2013 02:44:42 - Info master1 (pid=6101) Replicating images to target storage server hkx1bak03.apac.experian.local, disk volume PureDiskVolume
07/17/2013 11:19:46 - Info master1 (pid=6101) StorageServer=PureDisk:master1; Report=PDDO Stats for (master1): scanned: 24790571 KB, CR sent: 24838491 KB, CR sent over FC: 0 KB, dedup: 0.0%
07/17/2013 11:19:46 - Info bpdm (pid=6101) EXITING with status 0
07/17/2013 11:19:46 - Replicated backup id Client1-db_1371393521 successfully
07/17/2013 11:19:47 - Info bpdm (pid=3444) started
07/17/2013 11:19:47 - started process bpdm (pid=3444)
07/17/2013 11:19:47 - requesting resource @aaaac
07/17/2013 11:19:47 - granted resource  MediaID=@aaaac;DiskVolume=PureDiskVolume;DiskPool=dp_disk_master1;Path=PureDiskVolume;StorageServer=master1;MediaServer=master1
07/17/2013 11:21:29 - Info master1 (pid=3444) Using OpenStorage to replicate backup id Client1-db_1371393625, media id @aaaac, storage server master1, disk volume PureDiskVolume
07/17/2013 11:21:30 - Info master1 (pid=3444) Replicating images to target storage server hkx1bak03.apac.experian.local, disk volume PureDiskVolume
07/19/2013 02:55:00 - Info master1 (pid=3444) StorageServer=PureDisk:master1; Report=PDDO Stats for (master1): scanned: 4 KB, CR sent: 1 KB, CR sent over FC: 0 KB, dedup: 75.0%
07/19/2013 02:55:00 - Info bpdm (pid=3444) EXITING with status 0
07/19/2013 02:55:00 - Error nbreplicate (pid=6070) Failed to update image copy state for BID Client1-db_1371393625, replica copy 102.  EMM error code = 2020005.  Replication WAS successful
no entity was found  (227)
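
As a rough sanity check on the log above, the timestamps and the PDDO stats can be turned into an effective transfer rate and compared against the configured bandwidthlimit. The sketch below is illustrative only: it assumes the KB figure in the PDDO stats line is comparable to the KiB/sec unit in agent.cfg, and it ignores the fact that several active jobs may share the same limit.

```python
from datetime import datetime

# Timestamp format used in the job details above.
FMT = "%m/%d/%Y %H:%M:%S"

def seconds_between(start, end):
    """Elapsed seconds between two job-detail timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# bandwidthlimit from agent.cfg (KiB/sec according to the comment in the file).
BANDWIDTH_LIMIT_KIB_S = 1280

# Time spent waiting for the logical resource LCM_stu_disk_master1.
queue_wait_s = seconds_between("06/17/2013 01:11:48", "07/10/2013 02:42:21")

# First image: from "Replicating images" to the PDDO stats line.
transfer_s = seconds_between("07/10/2013 02:44:42", "07/17/2013 11:19:46")
cr_sent_kb = 24838491  # "CR sent" from the PDDO stats line

# Assumption: KB in the stats line and KiB in agent.cfg are treated as equal.
effective_kb_s = cr_sent_kb / transfer_s
ideal_hours = cr_sent_kb / BANDWIDTH_LIMIT_KIB_S / 3600
limit_mbit_s = BANDWIDTH_LIMIT_KIB_S * 1024 * 8 / 1_000_000

print(f"queue wait:          {queue_wait_s / 86400:.1f} days")
print(f"effective rate:      {effective_kb_s:.1f} KB/s")
print(f"time at full limit:  {ideal_hours:.1f} hours")
print(f"limit in Mbit/s:     {limit_mbit_s:.2f}")
```

With these figures the first image waited about 23 days for LCM_stu_disk_master1 and then moved at roughly 39 KB/s, whereas the limit alone would allow the same data through in about 5.4 hours. That gap suggests the throttle is not the only bottleneck for this job, although the numbers say nothing about the cause.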
 

3 Replies

  • If you throttle it too much, everything can just hang.

    You usually find that if you cancel all the jobs, they start to fly through again for a while once the SLP kicks back in.

    Your bandwidth limit of 1280 may be too small to be useful. Try increasing it (this needs a service restart). Opinions on the unit seem to vary, but going by the KiB/sec comment in agent.cfg, 1280 equates to about 1.25 MB/s.

    Either way, once you have hit this state you need to cancel the jobs and let them fire off again to get any throughput out of the system. Cancelling just the active ones usually lets the queued ones run, though.

    Hope this helps

  • Thanks a lot, Mark, for your comments. We have a Symantec case open for this issue. Based on their investigation of the logs and command outputs we provided, they confirmed that some replication jobs are keeping the catalog busy, which is the cause of the replication queue. They suggested applying SYMC_NBAPP_EEB_ET3105408-2.5.2.0-1.x86_64.rpm on the appliance. After installing the EEB and restarting services, the SLP replication jobs have started queuing once again. It will take some time before I can comment on any improvement. We are still getting some job failures with status 227, which state the following:

    Error nbreplicate (pid=6070) Failed to update image copy state for BID Client1-db_1371393625, replica copy 102.  EMM error code = 2020005.  Replication WAS successful

    no entity was found  (227)

    Can you please tell me what this means? Did the replication complete successfully for all the image IDs, or is it partial?

     

  • You would need to check that image on both sites to see what state it is in and determine whether it is OK.

    You can also try nbstlutil pendimplist on the target site and nbstlutil repllist on the "sending" site to see whether the image is classed as complete.

    Also run nbstlutil stlilist -backupid Client1-db_1371393625 on both sites to see what it says about the image in question.

    Finally, make sure the image appears on the target site when you do a verify in the catalog, and that it appears in the BAR GUI.
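
The checks in the last reply can be sketched as a small helper that pulls the backup ID out of the nbreplicate error line and lists the commands to run on each site. This is only an illustration: the function names and the regular expression are hypothetical, the nbstlutil operations are the ones named in the reply (the replication list operation is spelled repllist here, which is the spelling in the NetBackup commands reference as far as I know), and nothing is executed.

```python
import re
import shlex

# Hypothetical pattern for the nbreplicate error line quoted in this thread.
ERROR_RE = re.compile(
    r"Failed to update image copy state for BID (?P<bid>[^,\s]+), "
    r"replica copy (?P<copy>\d+)\.\s+EMM error code = (?P<emm>\d+)\."
    r"(?P<ok>\s+Replication WAS successful)?"
)

def parse_replication_error(line):
    """Extract the backup ID, copy number and EMM code from the error line."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    return {
        "backup_id": m.group("bid"),
        "copy": int(m.group("copy")),
        "emm_code": int(m.group("emm")),
        # True when the line itself says the data transfer succeeded.
        "data_replicated": m.group("ok") is not None,
    }

def image_check_commands(backup_id):
    """Commands suggested in the reply, grouped by the site to run them on."""
    per_image = ["nbstlutil", "stlilist", "-backupid", backup_id]
    return {
        # Assumption: "repllist" is the operation the reply calls "replist".
        "source": [["nbstlutil", "repllist"], per_image],
        "target": [["nbstlutil", "pendimplist"], per_image],
    }

line = ("Error nbreplicate (pid=6070) Failed to update image copy state for BID "
        "Client1-db_1371393625, replica copy 102.  EMM error code = 2020005.  "
        "Replication WAS successful")
info = parse_replication_error(line)
if info is not None:
    for site, commands in image_check_commands(info["backup_id"]).items():
        for cmd in commands:
            # Print only; run the commands by hand on the relevant master server.
            print(f"[{site}] {shlex.join(cmd)}")
```

The "Replication WAS successful" flag only reflects what the error line claims. The catalog verify and BAR GUI checks on the target site are still needed to confirm that the image is usable.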