Duplication Jobs hung at 99%

ShasRaj_UK
Level 3
Partner

Hi All,

I am struggling with a huge backup and duplication queue, followed by backups failing with status 196.

I have one master server (v7.6.0.4, Linux) connected to two NetBackup appliance media servers (v2.6.0.4).

Each media server has one storage unit with a capacity of 67.199 TB.

 

Looking at the long-running active jobs, I found that 20 duplication jobs had occupied the storage unit on media server01 for a long time. They appear to hang at 99% completion, so the backups were stuck in the queue and then failed. When I cancel those duplication jobs, the backups that were queued at that time start writing.

Max. concurrent jobs on each storage unit is 20.

I am copying the job details below; please help me out.

21/07/2015 15:42:07 - Info bpdm(pid=56598) started            
21/07/2015 15:42:07 - started process bpdm (56598)
21/07/2015 15:42:11 - Info bpdm(pid=56598) requesting nbjm for media         
21/07/2015 16:25:17 - requesting resource LCM_stu_lappsnbua01_dedupe
21/07/2015 16:25:18 - begin Duplicate
21/07/2015 16:25:18 - granted resource LCM_stu_lappsnbua01_dedupe
21/07/2015 16:25:18 - started process RUNCMD (4266)
21/07/2015 16:25:19 - ended process 0 (4266)
21/07/2015 16:25:40 - requesting resource stu_lappsnbua01_dedupe
21/07/2015 16:25:40 - reserving resource @aaaai
21/07/2015 16:25:46 - awaiting resource @aaaai Reason: Disk pool is unavailable, Media Server: N/A, 
     Robot Number: NONE, Robot Type: NONE, Media ID: @aaaai, Drive Name: N/A, 
     Volume Pool: N/A, Storage Unit: N/A, Drive Scan Host: N/A
    
21/07/2015 16:28:02 - awaiting resource stu_lappsnbua01_dedupe - Maximum job count has been reached for the storage unit
21/07/2015 16:29:12 - Info Duplicate(pid=4266) Initiating optimized duplication from @aaaai to @aaaag      
21/07/2015 16:29:12 - Info bpduplicate(pid=4266) Suspend window close behavior is not supported for optimized duplications   
21/07/2015 16:29:12 - Info bpduplicate(pid=4266) window close behavior: Continue processing the current image     
21/07/2015 16:29:12 - reserved resource @aaaai
21/07/2015 16:29:12 - granted resource MediaID=@aaaag;DiskVolume=PureDiskVolume;DiskPool=dp_lappsnbua01_dedupe;Path=PureDiskVolume;StorageServer=lappsnbua01;MediaServer=lappsnbua01
21/07/2015 16:29:12 - granted resource stu_lappsnbua01_dedupe
21/07/2015 16:29:12 - requesting resource @aaaai
21/07/2015 16:29:12 - granted resource MediaID=@aaaai;DiskVolume=PureDiskVolume;DiskPool=dp_lappsnbua02_dedupe;Path=PureDiskVolume;StorageServer=lappsnbua02;MediaServer=lappsnbua01
21/07/2015 17:21:13 - begin writing
21/07/2015 17:21:13 - end writing; write time: 0:00:00
21/07/2015 17:21:17 - begin writing
21/07/2015 17:21:17 - end writing; write time: 0:00:00
21/07/2015 17:21:47 - begin writing
21/07/2015 17:22:24 - Info bpdm(pid=56598) Creating OPT_DUP_END marker          
22/07/2015 08:36:59 - Error bpdm(pid=56598) media manager terminated by parent process       
22/07/2015 08:36:59 - Info lappsnbua01(pid=56598) StorageServer=PureDisk:lappsnbua02; Report=PDDO Stats for (lappsnbua02): scanned: 0 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 0.0%, cache disabled
22/07/2015 09:24:05 - Error bpduplicate(pid=4266) bpduplicate terminated by signal (1)        
22/07/2015 09:24:05 - Error bpduplicate(pid=4266) Duplicate of backupid LAPPWSDCPDB01_1437111038 failed, termination requested by administrator (150).   
22/07/2015 09:24:05 - Error bpduplicate(pid=4266) Status = no images were successfully processed.      
22/07/2015 09:24:05 - end Duplicate; elapsed time: 16:58:47
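
A quick way to see where a job like this stalls is to look for the largest gap between consecutive entries in the job details. The following is a minimal Python sketch, not a NetBackup tool: the function name `largest_gaps` is hypothetical, and it assumes each entry starts with a `DD/MM/YYYY HH:MM:SS` timestamp as in the log above.

```python
import re
from datetime import datetime

# Assumption: every job-detail entry begins with "DD/MM/YYYY HH:MM:SS - ".
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) - (.*)$")

def largest_gaps(job_details, top=3):
    """Return the `top` longest pauses as (gap, entry_before, entry_after)."""
    entries = []
    for line in job_details.splitlines():
        match = ENTRY.match(line.strip())
        if match:  # continuation lines without a timestamp are skipped
            stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
            entries.append((stamp, match.group(2).strip()))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]

# Four entries copied from the job details above.
sample = """\
21/07/2015 17:21:47 - begin writing
21/07/2015 17:22:24 - Info bpdm(pid=56598) Creating OPT_DUP_END marker
22/07/2015 08:36:59 - Error bpdm(pid=56598) media manager terminated by parent process
22/07/2015 09:24:05 - Error bpduplicate(pid=4266) bpduplicate terminated by signal (1)
"""

for gap, before, after in largest_gaps(sample):
    print(gap, "|", before, "->", after)
```

On these four sample entries, the longest pause (15:14:35) falls between the OPT_DUP_END marker and the termination of bpdm, which is where the job appears to sit idle.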

15 REPLIES

ShasRaj_UK
Level 3
Partner

Latest Duplication job details

symantec040
Level 4

In earlier appliance versions, prior to 2.6.x, a rebase process used to run, which could cause this.

 

#./crcontrol --rebasestate

--rebaseoff

If it is on, you can turn it off and check whether the duplication jobs finish.

ShasRaj_UK
Level 3
Partner

Thanks for your response !

I did not find the "--rebasestate" switch:

crcontrol: interactive content router control tool

Usage: crcontrol [-c <config>] [-r <routes>] [-i] [[-l <login>] [-p <password>] -m <mode> --queueinfo --reloadrt --hardsync --drstart <F|I> --drstop <result> --drstate --compacton --compactoff --compactstart [compact_pct [lbound]] --compactstate --dsstat --crccheckon --crccheckoff --crccheckrestart --crccheckstate --processqueue --processqueueinfo --processqueuestatus [--progressinfo] -M <marker>] [--modecontrol <mode>] [--trace <log>] --reloadconfig

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Hi,

 

Are you writing backups to both appliances? Do you have SLP windows configured to control when to back up and when to duplicate?

ShasRaj_UK
Level 3
Partner

Yes, backups are configured on both appliances, and yes, the SLP is configured for 24/7.

Also, both appliances duplicate their images to each other and then to tape.

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

I would really suggest that you create windows for

  1. Backup
  2. Duplication to appliance
  3. Duplication to tape

Depending on the number of jobs you're sending to the appliance, it can really put a strain on it when you're trying to write (backup) and read (duplicate) from it at the same time. That load becomes even higher when you're also duplicating to tape.
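
To illustrate the idea of separating the three activities, here is a minimal Python sketch. The window hours are purely hypothetical and would need to be chosen from your own backup load; the names `WINDOWS`, `allowed_activity` and `windows_overlap` are made up for illustration and do not correspond to any NetBackup setting or API.

```python
# Hypothetical daily windows as (start_hour, end_hour); end is exclusive and
# a window may wrap past midnight. These hours are illustrative assumptions.
WINDOWS = {
    "backup": (18, 2),
    "duplication to appliance": (2, 10),
    "duplication to tape": (10, 18),
}

def in_window(hour, start, end):
    """True if `hour` (0-23) falls inside the window [start, end)."""
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end  # window wraps past midnight

def allowed_activity(hour):
    """Return the activities whose window is open at the given hour."""
    return [name for name, (start, end) in WINDOWS.items()
            if in_window(hour, start, end)]

def windows_overlap():
    """True if any hour of the day has more than one open window."""
    return any(len(allowed_activity(hour)) > 1 for hour in range(24))

for hour in (1, 5, 12, 20):
    print(hour, allowed_activity(hour))
print("overlap:", windows_overlap())
```

The point of the overlap check is simply that the appliance is never asked to take backups and serve duplication reads in the same hour.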

 

I would also reduce the number of streams you're using for duplication to tape by creating a separate storage unit.

 

Finally, I would upgrade to 2.6.1.2 / 7.6.1.2 as the code released in 7.6.1 performs a lot better for duplication.

ShasRaj_UK
Level 3
Partner

Thanks a lot, Riaan, for your valuable post.

I have created an SLP window and updated the duplication job with it, but a few SLP jobs named "SLP_Multiple_Lifecycle" are still being generated. How can I stop them?

I will discuss upgrading NBU with my team and look at the possibility of a second storage unit.

 

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Jobs that originally started with the "old settings" still have those settings. NetBackup uses versions for the SLPs. It makes a new version every time you modify something. So in your case the old version would be using 24x7 and the new ones would use your new SLP window.

 

If you want to modify the old version to also use the new SLP window, you can do it via the CLI using the nbstl command.

ShasRaj_UK
Level 3
Partner

Yesterday I suspended the secondary operations in the SLP for a few hours and let the backups run; as a result, today we are good on backups :)

As soon as I resumed the duplication process, the jobs again occupied all the disk resources and have been active for 19 hours (at 99%). Is there any way to check the progress?

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Reduce the max jobs on the storage unit to 1/2 while doing the duplications, and change it back for backups, until the mess is sorted out and you've got new STUs for duplications.

ShasRaj_UK
Level 3
Partner

I let the duplications run over the weekend, but they did not complete and kept occupying the disk storage unit.

 

24/07/2015 17:45:26 - requesting resource LCM_stu_lappsnbua01_dedupe
24/07/2015 17:45:31 - begin Duplicate
24/07/2015 17:45:31 - granted resource LCM_stu_lappsnbua01_dedupe
24/07/2015 17:45:31 - started process RUNCMD (30421)
24/07/2015 17:45:32 - ended process 0 (30421)
24/07/2015 17:45:52 - requesting resource stu_lappsnbua01_dedupe
24/07/2015 17:45:52 - reserving resource @aaaai
24/07/2015 17:46:02 - awaiting resource stu_lappsnbua01_dedupe - Maximum job count has been reached for the storage unit
24/07/2015 21:28:44 - Info bpdm(pid=83954) started            
24/07/2015 21:28:44 - started process bpdm (83954)
24/07/2015 21:28:45 - Info bpdm(pid=83954) requesting nbjm for media         
24/07/2015 21:31:45 - begin writing
24/07/2015 21:31:45 - end writing; write time: 0:00:00
24/07/2015 21:31:47 - begin writing
24/07/2015 21:32:18 - Info bpdm(pid=83954) Creating OPT_DUP_END marker          
24/07/2015 22:15:58 - reserved resource @aaaai
24/07/2015 22:15:58 - granted resource MediaID=@aaaag;DiskVolume=PureDiskVolume;DiskPool=dp_lappsnbua01_dedupe;Path=PureDiskVolume;StorageServer=lappsnbua01;MediaServer=lappsnbua01
24/07/2015 22:15:58 - granted resource stu_lappsnbua01_dedupe
24/07/2015 22:15:58 - requesting resource @aaaai
24/07/2015 22:15:59 - Info Duplicate(pid=30421) Initiating optimized duplication from @aaaai to @aaaag      
24/07/2015 22:15:59 - Info bpduplicate(pid=30421) Suspend window close behavior is not supported for optimized duplications   
24/07/2015 22:15:59 - Info bpduplicate(pid=30421) window close behavior: Continue processing the current image     
24/07/2015 22:15:59 - granted resource MediaID=@aaaai;DiskVolume=PureDiskVolume;DiskPool=dp_lappsnbua02_dedupe;Path=PureDiskVolume;StorageServer=lappsnbua02;MediaServer=lappsnbua01
27/07/2015 08:19:49 - Error bpdm(pid=83954) media manager terminated by parent process       
27/07/2015 08:19:49 - Info lappsnbua01(pid=83954) StorageServer=PureDisk:lappsnbua02; Report=PDDO Stats for (lappsnbua02): scanned: 0 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 0.0%, cache disabled
27/07/2015 09:07:13 - Error bpduplicate(pid=30421) bpduplicate terminated by signal (1)        
27/07/2015 09:07:13 - Error bpduplicate(pid=30421) Duplicate of backupid LAPVWEPHTDB101_1437193077 failed, termination requested by administrator (150).   
27/07/2015 09:07:13 - Error bpduplicate(pid=30421) Status = no images were successfully processed.      
27/07/2015 09:07:13 - end Duplicate; elapsed time: 63:21:42
termination requested by administrator(150)

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Hi,

 

Are you able to see how many images are stuck in this manner? It is (was) possible to get corrupt images and those would cause the duplications to not work. I've seen that on 7.6.0.x before.

 

Is it just a specific bunch, or are they piling up, with more and more each day?
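
One rough way to answer that from saved job details is to tally the backup IDs named in the failure lines. This is a minimal Python sketch, assuming the `Duplicate of backupid ... failed` wording seen in the job details in this thread; the function name `failed_backup_ids` is hypothetical and not part of NetBackup.

```python
import re
from collections import Counter

# Assumption: failures are reported as "Duplicate of backupid <id> failed".
FAILED = re.compile(r"Duplicate of backupid (\S+) failed")

def failed_backup_ids(job_details):
    """Count how often each backup ID appears in a failure line."""
    return Counter(FAILED.findall(job_details))

# Two failure lines copied from the job details posted in this thread.
sample = """\
22/07/2015 09:24:05 - Error bpduplicate(pid=4266) Duplicate of backupid LAPPWSDCPDB01_1437111038 failed, termination requested by administrator (150).
27/07/2015 09:07:13 - Error bpduplicate(pid=30421) Duplicate of backupid LAPVWEPHTDB101_1437193077 failed, termination requested by administrator (150).
"""

counts = failed_backup_ids(sample)
for backup_id, times in counts.most_common():
    print(backup_id, times)
print("distinct images:", len(counts))
```

If the same few IDs keep recurring, that would suggest specific problem images; a steadily growing set of distinct IDs would suggest a general backlog instead.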

symantec040
Level 4

Could you please get us the output of --processqueueinfo from both the source and target appliances?

ShasRaj_UK
Level 3
Partner

Answering Riaan: I will check and let you know about the images; there are many. It seems the number is growing every day.

Answering symantec040: below is the output from the two appliances.

****snbua01:/usr/openv/pdde/pdcr/bin # ./crcontrol --processqueueinfo
Busy   : no
Pending: no

****snbua02:/usr/openv/pdde/pdcr/bin # ./crcontrol --processqueueinfo
Busy   : no
Pending: no


symantec040
Level 4

I suggest you check the output of #nbstlutil report

and also, from the master server, the output of #bpps -x |grep "bpdbm\|nbstserv"

I am wondering whether there is a lag in bpdm which could be causing this delay.

Also, do you know when the services were last bounced on the master server and the appliances?

tq