Forum Discussion

Raaavan
Level 5
10 years ago

getting Info nbjm (pid=xxxx) Waiting in NetBackup scheduler work queue on server <master server> has been updated.

Hi,

I am facing a serious issue in my environment with RMAN Oracle backups: newly generated streams remain in the queue for a long time even though NetBackup resources are available.

01/07/2015 23:19:45 - requesting resource nbu_master.NBU_POLICY.MAXJOBS.RMAN_FCRM10DB_ARCHIVE
<= skipped =>
01/07/2015 23:45:45 - Info nbjm (pid=13731) Waiting in NetBackup scheduler work queue on server xxxxxxxx

It waited approximately 25 minutes. With a backup of 50 streams, each stream waiting this long results in a very large backup window. Please suggest how to proceed.

I found the tech notes below but cannot relate them to this issue:
http://www.symantec.com/business/support/index?pag...
http://www.symantec.com/business/support/index?pag...
 

Master server - Solaris 10 / NetBackup 7.5.0.6

Total RMAN policies - 800

Thanks 

  • There are many reasons why jobs are queued, such as:

    MPX settings for STU or policy was reached
    All drives are busy
    Available media in use
    Max jobs per client was reached
    Max jobs per policy was reached
    etc...

    The above log extract looks as if Max Jobs per Policy was reached.

    Best to have a look at the queued jobs in Activity Monitor to see the reason for them being queued.
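
    As a quick command-line alternative to Activity Monitor, something like this lists the queued jobs (just a sketch, assuming the default install path on the master):

    # Show jobs currently in the Queued state
    /usr/openv/netbackup/bin/admincmd/bpdbjobs -report | grep -i queued

    # More detail per job, useful for digging into what each job is waiting on
    /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -all_columns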

    Another possibility is that the Resource Broker on the master server is struggling to keep up with the number of jobs.

    Have a look at these performance tuning docs:

    NetBackup 7.6 Best Practices: Optimizing Performance

    Updated NetBackup Backup Planning and Performance Tuning Guide for Release 7.5 and Release 7.6

  • Personally I would get the following:

    Personally I would ask for mds, nbpem, jm, rb and emm at Debug/Diagnostic level 6 and 6 in raw format. These should cover a restart and be left running until the system slows and the problem is seen.

    In fact, nbrb should only be at debug 4 (hope I've remembered that right) as any higher could cause issues itself.

    So, 111, 116, 117, 118 + 143

    143 logs into 111, so only 4 to collect.

    vxlogcfg -a -p 51216 -o 143 -s DebugLevel=6 -s DiagnosticLevel=6

    vxlogcfg -a -p 51216 -o 111 -s DebugLevel=6 -s DiagnosticLevel=6

    vxlogcfg -a -p 51216 -o 116 -s DebugLevel=6 -s DiagnosticLevel=6 

    vxlogcfg -a -p 51216 -o 117 -s DebugLevel=6 -s DiagnosticLevel=6

    vxlogcfg -a -p 51216 -o 118 -s DebugLevel=4 -s DiagnosticLevel=6

    After the logs have been enabled, leave the system running long enough that a job starts and the delay is seen.

    Collect the nbjm, nbpem, nbrb and nbemm logs.

    The nbrb log would be the place to start; the raw log needs to be run through vxlogview with the -d all option.

    Look for lines containing

    [ResBroker_i::doEvaluation]

    as these should show how long the evaluation cycle is taking.
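
    For example, something like this could pull those lines out of the raw nbrb log (a sketch, assuming the default path; 51216 is the NetBackup product ID and 118 the nbrb originator, as used above):

    # Format the raw unified log and show only the evaluation-cycle lines
    /usr/openv/netbackup/bin/vxlogview -p 51216 -o 118 -d all | grep "ResBroker_i::doEvaluation"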

    Use the same commands with DebugLevel=1 to set the levels back to their defaults.

    There are numerous possible causes, but the above may give more clues and, if nothing else, will give an insight into other potential performance issues.

    Is there any nbrb/ nbemm tuning applied ?

    I'd also stop services and see if there are any 'stale' files in /usr/openv/netbackup/db/media/drives and /usr/openv/netbackup/db/media/tpreq (these files would be on the media server).
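
    A rough way to check that (a sketch, assuming the default paths and that NetBackup can safely be stopped on the media server):

    # Stop NetBackup on the media server first
    /usr/openv/netbackup/bin/bp.kill_all

    # With services down, any leftover files here are candidates for 'stale' entries
    ls -l /usr/openv/netbackup/db/media/drives
    ls -l /usr/openv/netbackup/db/media/tpreq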

  • MPH999 is on to something.

    If EMM has too little memory to play with, it can cause ridiculously long resource handling in general. You might consider setting the EMM memory configuration to Large:

    Run /usr/openv/db/bin/dbadm

    Select

    2) Database Space and Memory Management

    4) Adjust Memory Settings,

    3) Large.

    A restart of NetBackup is required.
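
    For reference, the restart from the command line could look roughly like this (a sketch, assuming the default paths on the master):

    # Stop and start all NetBackup daemons so the new EMM memory setting takes effect
    /usr/openv/netbackup/bin/bp.kill_all
    /usr/openv/netbackup/bin/bp.start_all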

    That said, Marianne also has a very valid point, but with the information available right now we cannot tell what is wrong. Consider also increasing the policy priority to 99999; this will let the Oracle jobs come first in the queue.
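
    If the priority is to be set from the command line, something like this might work (a sketch, assuming the default admincmd path; RMAN_FCRM10DB_ARCHIVE is just the policy name taken from the job log above):

    # Raise the job priority of one Oracle policy to 99999
    /usr/openv/netbackup/bin/admincmd/bpplinfo RMAN_FCRM10DB_ARCHIVE -modify -priority 99999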

  • Dear All

    Apologies for the delay in replying.


    Tech support suggested that I remove the offline servers from EMM and also the unwanted entries from bp.conf [a proper decommission; 18 was the count at that time], since while scanning for a resource request NBRB tries to contact each entry in bp.conf, and that takes a significant amount of time.
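
    For anyone in a similar situation, the cleanup can be done roughly like this (a sketch, assuming the default paths; "old-media-server" is just a placeholder for the host being decommissioned):

    # List the hosts EMM currently knows about
    /usr/openv/netbackup/bin/admincmd/nbemmcmd -listhosts

    # Remove a decommissioned media server from EMM
    /usr/openv/netbackup/bin/admincmd/nbemmcmd -deletehost -machinename old-media-server -machinetype media

    The matching SERVER = old-media-server entry then also needs to be removed from /usr/openv/netbackup/bp.conf on the master.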

    I did that; the issue is not totally resolved, but there is an improvement.


    I still have 10 offline servers. In addition, another team is doing a POC and servers are not being gracefully added to and removed from EMM. I am trying to get the count below 4-5, but there are some application dependencies (switch-backs/switch-overs, LDOM servers, etc.), so it has become a difficult task to achieve.


    In parallel, I am looking into Oracle backup tuning, since this environment has 723 Oracle-type and 584 MS-SQL-Server-type policies, which is quite a large number. There are some flaws on the DBA side, since RMAN and NetBackup repository maintenance is not done (CROSSCHECK is not run and obsolete backup pieces build up over time), and there is also the "%t" factor in the RMAN FORMAT template (NBU 7.0).

    I also observed that the minimum OS patch level required to run NetBackup 7.5.0.6 on the Solaris box was missing; we will patch the master to the latest patch level very soon.

    I can't see any rule of thumb to get rid of this.

    @Nicola - Checked; there is enough memory.

    @MPH999 - I have not collected these logs yet; I'll check and update on this.

    @Marianne - I have cross-checked all the settings you suggested and they are appropriate.

  • The issue has started recurring. At this moment I don't have any offline media servers in EMM, but allocation is still getting delayed.

    I have 300+ RMAN backups running during production hours. I have observed that although the DBAs are using the RMAN template that comes with NBU,

    they have additional statements like

    DELETE NOPROMPT ARCHIVELOG ALL BACKED UP 2 TIMES TO 'SBT_TAPE' COMPLETED BEFORE 'SYSDATE -2';

    and other maintenance commands like CROSSCHECK, etc.

    I suspect these requests must be hitting EMM and making it busy, which might be causing the allocation delay, but I am not sure. I need help understanding what appears in the bprd or bpdbm logs when a maintenance statement is run by RMAN.
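
    For reference, the legacy bprd and bpdbm logs can be enabled on the master roughly like this (a sketch, assuming default paths; the daemons may need a restart to pick up the new verbosity):

    # Legacy debug logs are only written if the log directories exist
    mkdir -p /usr/openv/netbackup/logs/bprd
    mkdir -p /usr/openv/netbackup/logs/bpdbm

    # Raise legacy log verbosity via bp.conf (5 is the most detailed), if not already set
    echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf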

  • Have a look at this post dated 12 Jan :

    https://www-secure.symantec.com/connect/forums/getting-info-nbjm-pidxxxx-waiting-netbackup-scheduler-work-queue-server-master-server-has-bee#comment-10794331 

    Have you enabled and collected logs as suggested?

    And what is your answer to this question from the same post?

    Is there any nbrb/ nbemm tuning applied ? 

    Have you gone through Performance Tuning Guides such as these?

    NetBackup Backup Planning and Performance Tuning Guide  

    NetBackup 7.6 Best Practices: Optimizing Performance 
    (Will be valid for 7.5 as well).

  • I have enabled the logs today; there was a space crunch earlier, so the unified logs have now been moved to another filesystem. I will share the details after today's production hours.

    Answer to the question:

    Is there any nbrb/ nbemm tuning applied ?

    I don't think it is tuned explicitly; please find the conf file contents below:

    cat /usr/openv/var/global/nbrb.conf
    SECONDS_FOR_EVAL_LOOP_RELEASE = 180
    RESPECT_REQUEST_PRIORITY = 0
    DO_INTERMITTENT_UNLOADS = 1
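
    For reference, the resource broker's current request and allocation state can also be dumped for inspection (a sketch, assuming the default admincmd path; the output is verbose):

    # Dump nbrb's request/allocation state and list the jobs it considers active
    /usr/openv/netbackup/bin/admincmd/nbrbutil -dump
    /usr/openv/netbackup/bin/admincmd/nbrbutil -listActiveJobs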
     

  • PFA logs

    JOB details
    07/15/2015 14:40:51 - Info nbjm (pid=15859) starting backup job (jobid=10051329) for client IHYDIBMPF-C3B4-CL7, policy RMAN_XXY_WLY, schedule VTL_MLY
    07/15/2015 14:40:51 - Info nbjm (pid=15859) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=10051329, request id:{63F0A0B4-2AD1-11E5-B45F-002128825C08})
    07/15/2015 14:40:51 - requesting resource DD01_XXY_VTL
    07/15/2015 14:40:51 - requesting resource hydt5440-nbms.NBU_CLIENT.MAXJOBS.IHYDIBMPF-C3B4-CL7
    07/15/2015 14:40:51 - requesting resource hydt5440-nbms.NBU_POLICY.MAXJOBS.RMAN_XXY_WLY
    07/15/2015 14:42:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:44:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:46:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:48:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:50:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:52:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:54:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:56:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 14:58:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 15:00:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 15:02:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 15:04:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 15:06:51 - Info nbjm (pid=15859) Waiting in NetBackup scheduler work queue on server hydt5440-nbms
    07/15/2015 15:07:56 - granted resource  hydt5440-nbms.NBU_CLIENT.MAXJOBS.IHYDIBMPF-C3B4-CL7
    07/15/2015 15:07:56 - granted resource  hydt5440-nbms.NBU_POLICY.MAXJOBS.RMAN_XXY_WLY
    07/15/2015 15:07:56 - granted resource  QL0106
    07/15/2015 15:07:56 - granted resource  IBM.ULTRIUM-TD3.514
    07/15/2015 15:07:56 - granted resource  DD01_XXY_VTL
    07/15/2015 15:07:57 - estimated 0 kbytes needed
    07/15/2015 15:07:57 - Info nbjm (pid=15859) started backup (backupid=IHYDIBMPF-C3B4-CL7_1436953076) job for client IHYDIBMPF-C3B4-CL7, policy RMAN_XXY_WLY, schedule VTL_MLY on storage unit DD01_XXY_VTL
    07/15/2015 15:07:59 - Info bpbrm (pid=27621) starting bptm
    07/15/2015 15:07:59 - Info bpbrm (pid=27621) Started media manager using bpcd successfully
    07/15/2015 15:07:59 - started process bpbrm (pid=27621)
    07/15/2015 15:08:01 - Info bpbrm (pid=27621) IHYDIBMPF-C3B4-CL7 is the host to backup data from
    07/15/2015 15:08:01 - Info bpbrm (pid=27621) telling media manager to start backup on client
    07/15/2015 15:08:01 - Info bptm (pid=27633) using 262144 data buffer size
    07/15/2015 15:08:01 - Info bptm (pid=27633) using 128 data buffers
    07/15/2015 15:08:02 - Info bpbrm (pid=27621) spawning a brm child process
    07/15/2015 15:08:02 - Info bpbrm (pid=27621) child pid: 27643
    07/15/2015 15:08:02 - Info bptm (pid=27633) start backup
    07/15/2015 15:08:02 - Info bptm (pid=27633) Waiting for mount of media id QL0106 (copy 1) on server XXY.
    07/15/2015 15:08:02 - mounting QL0106
    07/15/2015 15:08:03 - Info bpbrm (pid=27621) sending bpsched msg: CONNECTING TO CLIENT FOR IHYDIBMPF-C3B4-CL7_1436953076
    07/15/2015 15:08:03 - Info bpbrm (pid=27621) listening for client connection
    07/15/2015 15:08:03 - connecting
    07/15/2015 15:08:06 - Info bptm (pid=27642) setting receive network buffer to 524288 bytes
    07/15/2015 15:08:07 - Info bpbrm (pid=27621) INF - Client read timeout = 32000
    07/15/2015 15:08:07 - Info bpbrm (pid=27621) accepted connection from client
    07/15/2015 15:08:07 - connected; connect time: 0:00:00
    07/15/2015 15:08:11 - Info bptm (pid=27633) media id QL0106 mounted on drive index 66, drivepath /dev/rmt/91cbn, drivename IBM.ULTRIUM-TD3.514, copy 1
    07/15/2015 15:08:12 - mounted QL0106; mount time: 0:00:10
    07/15/2015 15:08:24 - positioning QL0106 to file 157
    07/15/2015 15:08:25 - Info bpbkar (pid=9961730) Backup started
    07/15/2015 15:08:25 - positioned QL0106; position time: 0:00:01
    07/15/2015 15:08:25 - begin writing
    07/15/2015 15:08:25 - Info bpbrm (pid=27621) Sending the file list to the client
    07/15/2015 15:08:25 - Info bpbkar (pid=9961730) dbclient(pid=9961730) wrote first buffer(size=262144)
    07/15/2015 15:08:26 - Info bpbkar (pid=9961730) done. status: 0
    07/15/2015 15:08:30 - Info bpbrm (pid=27621) media manager for backup id IHYDIBMPF-C3B4-CL7_1436953076 exited with status 0: the requested operation was successfully completed
    07/15/2015 15:08:31 - end writing; write time: 0:00:06
    the requested operation was successfully completed  (0)

    Logs including nbrb -

    As per the log, the total evaluation time taken by nbrb is 25 minutes, even though DD VTL tape drives were available. Please see the subsequent vmoprcmd output from the respective media server and the allocation details. I request you to check why nbrb takes so long to satisfy such a request.

    07/15/15 14:42:28.778 V-118-227 [ResBroker_i::requestResources] received resource request ID {63F0A0B4-2AD1-11E5-B45F-002128825C08}, priority 0, secondary priority 26,278, description THE_BACKUP_JOB-10051329-{63F0A0B4-2AD1-11E5-B45F-002128825C08}
    07/15/15 14:42:28.779 [ResBroker_i::requestResources] RequestSeq:
    Request=0 provider=MPXProvider resourcename=MpxRequest-10051329 MPXGroupRequest maxmpx=2
        SubRequest=0 provider=DriveOperationProvider resourcename=DD01_XXY_VTL userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=DD01_XXY_VTL mediaPool=QUATERLY_VTL retentionLevel=0 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=IHYDIBMPF-C3B4-CL7 shareGroup=*ANY* isLifeCycle=true NDMPHostName= } }
    Request=1 provider=NamedResourceProvider resourcename=hydt5440-nbms.NBU_CLIENT.MAXJOBS.IHYDIBMPF-C3B4-CL7 CountedResourceRequest { max=200 }
    Request=2 provider=NamedResourceProvider resourcename=hydt5440-nbms.NBU_POLICY.MAXJOBS.RMAN_XXY_WLY CountedResourceRequest { max=2147483647 }
    07/15/15 14:42:28.780 [ResBroker_i::evaluateNow] scheduling evaluation immediately
    07/15/15 14:42:28.780 [ResBroker_i::evaluateNow] evaluation cycle is in progress
    07/15/15 15:07:54.022 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {63F0A0B4-2AD1-11E5-B45F-002128825C08}

    07/15/15 15:07:56.628 [EMMProvider::setDriveAllocated] persistDriveInfo: status=0, DriveInfo=DriveInfo: allocationKey=92850418 driveName=IBM.ULTRIUM-TD3.514 driveKey=2019618 STU=DD01_XXY_VTL mediaServer=XXY mediaId=QL0106 abstime=0 delay=0 state=inuse driveIsReusable=1 ejectUponUnload=0 jobtype=1 pendingRequestWasAssigned=0 unloadType=0 flags=0 reallocating=0 deallocating=3 mediaRequestDelayAbsTime=0 externalFile= waitingRequest={00000000-0000-0000-0000-000000000000} pendingGuid={00000000-0000-0000-0000-000000000000} waitingAllocation=0

    STU details
    Label:                DD01_XXY_VTL
    Storage Unit Type:    Media Manager
    Host Connection:      XXY
    Number of Drives:     30
    On Demand Only:       no
    Max MPX/drive:        2
    Density:              hcart3 - 1/2 Inch Cartridge 3
    Robot Type/Number:    TLD / 16
    Max Fragment Size:    512000 MB

    vmoprcmd -d -h  XXY
     56 IBM.ULTRIUM-TD3.504   No       XXY                          
     57 IBM.ULTRIUM-TD3.505   No       -                                        
     58 IBM.ULTRIUM-TD3.506   No       -                                        
     59 IBM.ULTRIUM-TD3.507   No       -                                        
     60 IBM.ULTRIUM-TD3.508   No       -                                        
     61 IBM.ULTRIUM-TD3.509   No       XXY                          
     62 IBM.ULTRIUM-TD3.510   No       -                                        
     63 IBM.ULTRIUM-TD3.511   No       -                                        
     64 IBM.ULTRIUM-TD3.512   No       -                                        
     65 IBM.ULTRIUM-TD3.513   No       -                                        
     66 IBM.ULTRIUM-TD3.514   No       -                                        
     67 IBM.ULTRIUM-TD3.515   No       -                                        
     68 IBM.ULTRIUM-TD3.516   No       -                                        
     69 IBM.ULTRIUM-TD3.517   No       XXY                          
     70 IBM.ULTRIUM-TD3.518   No       XXY                          
     71 IBM.ULTRIUM-TD3.519   No       -                                        
     72 IBM.ULTRIUM-TD3.520   No       -                                        
     73 IBM.ULTRIUM-TD3.521   No       -                                        
     74 IBM.ULTRIUM-TD3.522   No       -                                        
     75 IBM.ULTRIUM-TD3.523   No       XXY                          
     76 IBM.ULTRIUM-TD3.524   No       -                                        
     77 IBM.ULTRIUM-TD3.525   No       -                                        
     78 IBM.ULTRIUM-TD3.526   No       -                                        
     79 IBM.ULTRIUM-TD3.527   No       -                                        
     80 IBM.ULTRIUM-TD3.528   No       -                                        
     81 IBM.ULTRIUM-TD3.529   No       XXY                          
     82 IBM.ULTRIUM-TD3.530   No       XXY                          
     83 IBM.ULTRIUM-TD3.531   No       -                                        
     95 IBM.ULT3580-TD5.029   Yes      -
