cancel
Showing results for 
Search instead for 
Did you mean: 

Many Jobs delayed "granting resources" and "connecting"

Saul_Grysman
Level 2
We've recently upgraded from 6.5.3 to 6.5.4 and we've added an additional Media Server into the environment (all Solaris based).  We are experiencing significant delays in starting jobs - Unix, Windows, SQL and Oracle/RMAN - and writing to disk or tape.  The example below is typical: 12 minute delay in granting resources and 15 minute delay is connecting.  With the prior version, these times were in seconds.  We've restarted our entire environment to no avail.  Any help would be appreciated.

03/03/2010 11:14:51 - requesting resource nburobot
03/03/2010 11:14:51 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp

03/03/2010 11:14:51 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp

03/03/2010 11:27:27 - granted resource  nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp
03/03/2010 11:27:27 - granted resource  nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp
03/03/2010 11:27:27 - granted resource  T01171
03/03/2010 11:27:27 - granted resource  sl8500_drv20
03/03/2010 11:27:27 - granted resource  nbucore1_sl8500
03/03/2010 11:27:29 - estimated 0 kbytes needed
03/03/2010 11:27:30 - started process bpbrm (pid=4546)
03/03/2010 11:27:31 - connecting
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171
03/03/2010 11:43:30 - mounted T01171; mount time: 0:00:47
03/03/2010 11:43:31 - positioning T01171 to file 596
03/03/2010 11:44:53 - positioned T01171; position time: 0:01:22
03/03/2010 11:44:53 - begin writing
03/03/2010 11:45:14 - end writing; write time: 0:00:21
the requested operation was successfully completed (0)
1 ACCEPTED SOLUTION

Accepted Solutions

CRZ
Level 6
Employee Accredited Certified

You'd probably be better off applying 6.5.5, because you can get the nbrb EEB for 6.5.5 via a TechFile:

DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
 http://support.veritas.com/docs/346920

The 6.5.4 bundle is still only available through a support case.  (I think we're only at version 5, by the way - I haven't seen a version 6)

View solution in original post

29 REPLIES 29

babyd
Level 4
Partner Accredited Certified

Check the nbrb.conf file and nbrb log file for any errors;

Example format for parameters in nbrb.conf:
<install_path>\Veritas\NetBackup\var\global\nbrb.conf

By default all values are 0.

SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1


More info:
http://seer.entsupport.symantec.com/docs/300442.htm

Saul_Grysman
Level 2
The path for Unix is /usr/openv/var/global/nbrb.conf which is what we're using.

We did not have the conf file or the log directory.  I added both & will monitor, and will update this thread.

Thanks for the feedback.  If anyone else might have any suggestions, please don't hesitate.

Saul_Grysman
Level 2
Jobs are still queuing.  They get hung getting resources granted (see output below) and they get hung after connecting:

From original post above:
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171

The job below was queued from 16:37 until 16:43 when it listed an "info" message.

03/03/2010 16:37:14 - requesting resource nburobot
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:45 - Info nbrb (pid=21875) Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:47 - granted resource  nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:43:47 - granted resource  nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc

The nbrb log had no reference to the job until 16:43
03/03/10 16:43:44.555 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-577611 MPXGroupRequest maxmpx=1
    SubRequest=0 provider=DriveOperationProvider resourcename=nburobot userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=nburobot mediaPool=DB retentionLevel=2 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=webdev15mtc shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc CountedResourceRequest { max=99 }
Request=2 provider=NamedResourceProvider resourcename=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc CountedResourceRequest { max=1 }

03/03/10 16:43:45.577 [CountedProvider::allocate] cannot allocate, resource is busy nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc (current_ref_count=1, max_ref_count=1)
03/03/10 16:43:45.577 [Error] V-118-171 request ID {EF1BA70C-1DD1-11B2-BFFC-0003BA21D3A5} failed with status 1 (Resource is currently in use); releasing 1 allocated resources
03/03/10 16:43:45.578 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005034 retryClass="RB Defined Retry" retryReason=24 retryText="LOG 1267652625 4 nbrb 21875 Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc" retryResource=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc)
03/03/10 16:43:45.578 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 2 requests, evalTime=106 msec, numSucceeded=0, numQueued=1, numPended=0, numFailed=1, numMultipleCopyJobs=0, numMdsCacheMisses=0, restartedEvalCycle=0, numDriveReleasesBeforeEval=0, numDrivesAvailStart=7, numDrivesAvailEnd=7, requestQueueSize=1
03/03/10 16:43:45.749 [ResBroker_i::doEvaluation] CONF: releaseTimeSeconds = 180, respectRequestPriority = false, doIntermittentUnloads = true, maxHighPriorityQueueSize = 100

Anything else I can provide, please advise.

babyd
Level 4
Partner Accredited Certified
Can you enable bpbrm,bpdbm logs ?
Change the verbose level to 5 and rerun the backup.

Saul_Grysman
Level 2
I was just told by Symantec that this is "a well known problem" and that they have engineering binaries to fix it.  I'll get them, install and try to report back.

(I suppose that it isn't as well known as they think.  Wish I knew about this before.)

David_McMullin
Level 6
can you let us know the ETRACK numbers so we can check them out?

bcblake
Level 4
Partner
tried everything in:

http://seer.entsupport.symantec.com/docs/297322.htm

Just moved EMM (/usr/openv/db/data) to it's own LUN (Clariion CX-3) this week, hoping the problem would be fixed but it's gotten worse.

Cleared all resource requests with nbrbutil -resetAll, started a few jobs which all went active fine, but as we add more jobs, the time to get a resource gets longer and longer... it's absolutely ridiculous. Jobs now waiting over 25 minutes just to get a MAXJOBS resource... this is getting very frustrating.

Solaris 10 master (which was just updated to Update 7), T5220 64 procs and 64G of RAM and nbrb makes this whole thing come to a crawl.

Current nbrb.conf:

RESPECT_REQUEST_PRIORITY = 1
SECONDS_FOR_EVAL_LOOP_RELEASE = 270
DO_INTERMITTENT_UNLOADS = 1

I'm going to let it bake for a while and see how it behaves but it's not looking promising.

Please let us know if you have an Etracks and if they have EEBs for this.

Thanks.

cgoliver
Level 5
It's been four of five months of this behavior and no solution from Symantec.

Originally, the problem started when we added two new tape drives. Upgrading from 6.5.3.1 to 6.5.4 didn't resolve this specific problem.

Partial analysis of the problem suggests that a communication delay is occuring between the client, media server, master. One month ago we made changes for some clients, media servers to ensure all connection went through vnetd, after we found that callback were being done on high random ports, which is a no-no for our firewall.


Sincerely,
Chris O.

bcblake
Level 4
Partner

All-

Thanks to one of the advanced TSEs (Jim Harris, highly recommended!), my case 411-519-814, yielded an EEB for nbrb for 6.5.4.

Mention any of these eTracks and you should be able to get it:

 

iPeng_Ops.book

ET1596861, ET1702148, ET1634241, ET1600913, ET1785238, ET1860650, ET1873458, ET1870125, ET1867225, ET1886274, ET1915680, ET1977194

One of these (ET1977194) was actually happening to me where bpduplicates would exit with a 96 error and nbrb would report 2005000 (EMM_ERROR_MediaUnavailable). So that saves me an extra call in and case to Symantec.

I am planning to apply this tomorrow and will report back status.

Brian

 

 

iPeng_Ops.book

bcblake
Level 4
Partner

A couple of other kernel parms that I ended up using that seem to help as well on the master (Solaris 10):

/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 8192

/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 8192

/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 15000

/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 3600000

(I put these in /lib/svc/method/net-init to automatically set on boot time, and manually set them while I had NetBackup down)

Definitely disable tcp_fusion (see http://seer.entsupport.symantec.com/docs/306694.htm).

And setup project for kernel parameters (see http://seer.entsupport.symantec.com/docs/307610.htm on how to set them up):

project.max-msg-ids 256

project.max-sem-ids 1024

project.max-shm-ids 256

I didn't bother with shm-max since 8GB (1/4 of system memory) is fine for my master, as it doesn't handle client backups.

Modified /usr/openv/var/global/emm.conf to contain:

NUM_DB_CONNECTIONS=51

NUM_DB_BROWSE_CONNECTIONS=50

NUM_ORB_THREADS=61

 

And /usr/openv/var/global/server.conf (adding the -gn to be 6 more than NUM_DB_CONNECTIONS), and also up'ed the cache to 75M to start (500M high water mark):

 -n NB_<master>

   -x tcpip(LocalOnly=YES;ServerPort=13785)  -gp 4096 -ct+ -gd DBA -gk DBA -gl DBA -ti 0 -c 75M -ch 500M

 -cl 75M -gn 57 -zl -os 1M  -o /usr/openv/db//log/server.log

 -ud


Hope those can help some of you!


 

cgoliver
Level 5
Can you post the name of the EEB? Back on 6.5.3.1, they had me install an eeb for nbrb. Didn't help this problem, but it did addresss a different issue.

I've changed some of our tcp params, but we are running Solaris 9 so no application profiles for kernel params :(....

MILB65
Level 4

By the way, there were no ebb's that we put into place.

Turn up the logging for nbemm
vxlogcfg -a -p NB -o 111 -s DebugLevel-6 -s DiagnosticLevel=6

Then monitor, wait till the long delay then run:
vxlogview -i 111 --display D,T,p,t,c,w,x -t 02:00:00 > <somewhere>

then run
grep 'gethostbyname failed for` <somewhere> |grep dns_hosts_equal |awk '{print $11}' |sort | uniq > <someplace new>

so now for the explanation.

nbrb is probably getting hung up in a "learning mode" for a server that it can not resolve.  however, while it learns, it is hanging up the backups of the servers that it does not have to learn about.

Omar_Villa
Level 6
Employee

Symantec is on NBRB EEB version 5 after 6.5.4, be sure you get version 6, because they still have a lot of issues with the resource handeling.

bcblake
Level 4
Partner

Not sure if I got version 5 of nbrb EEB (would be nice to know how to tell which version!). but it was for NBU 6.5.4.

cgoliver- I had same problems in 6.5.3.1, they told me there were fixes in 6.5.4... I continued to run into problems (one of which was the one from the ET I mentioned above).

I applied the EEB I just got (no idea which version) and it seems to be running ok for now.

It still takes a heckuva lot more time than I would think when there are only a couple hundred jobs in queue, and looking at the evaluation performance

Looking at evaluation cycle performance, its looking a little better (evaluated 400-500 requests in the 3 minutes we have set for SECONDS_FOR_EVAL_LOOP_RELEASE), whereas it used to be in the 30-40 number of requests.

Also wondered if it has anything to do with how often the catalog backup runs, which cleans out the Sybase transaction log. So make sure you get good hot catalog backups often!

And as for Solaris 9, use the appropriate shared memory, semaphore and message queues kernel parameters (see the NBU 6.5 Perf Tuning Guide, they should be set in /etc/system I believe for Solaris 9), or check with your Solaris admin or google it.

CRZ
Level 6
Employee Accredited Certified

You'd probably be better off applying 6.5.5, because you can get the nbrb EEB for 6.5.5 via a TechFile:

DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
 http://support.veritas.com/docs/346920

The 6.5.4 bundle is still only available through a support case.  (I think we're only at version 5, by the way - I haven't seen a version 6)

cgoliver
Level 5

We usually stay back at least one revision.

CRZ
Level 6
Employee Accredited Certified

The Etrack for the 6.5.4 EEB is 1747025.  (Omar will correct me if I'm wrong, I hope!)  Mention it to the engineer handling your case.

Really, though, by not installing 6.5.5 you're missing out on a lot of other fixes.  (As a representative of support, I am obligated to add this to every post.  Also, it's totally true!)

Omar_Villa
Level 6
Employee

      Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.

Have anyone of you see this behavior after the EEB was deploy?


Please let me know.
Regards

Omar_Villa
Level 6
Employee

      Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.

Have anyone of you see this behavior after the EEB was deploy?


Please let me know.
Regards