
Jobs in queued state: Waiting in NetBackup scheduler work queue on server :

Hello Everybody,

 

We are seeing this for all the jobs. I have checked other threads related to the same problem, but none of them helped. Nothing is moving in the environment; almost every backup fails with status 196 after sitting in the queued state.

 

We opened a severity one case with support; they are asking us to increase the memory, and we have ordered it. But I see no signs that the server is running out of memory or anything like that. There is always nearly 3 GB of physical memory available.

 

Below is the output from one job.

8/22/2014 9:32:39 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:34:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:36:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:38:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:40:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:42:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:44:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server
8/22/2014 9:46:41 AM - Info nbjm(pid=12846) Waiting in NetBackup scheduler work queue on server

 

Secondly, I tried this command as given in another thread just to check the scheduler queue, but it was no use.

nbpemreq -jobs  screen all
Maximum output size exceeded, see log

We have already tried restarting NetBackup and rebooting the master server, but this thing creeps up every time.

Everything is so slow now. If we trigger a job using bpbackup, it takes a good 20 minutes to show up in the Activity Monitor. Tape drives are free, but NBU is still not able to assign them to the waiting jobs, which just sit waiting for drives.

 

I am not sure which logs to provide. Let me know which logs and what other info is needed to look into this.

 

Environment info:

NBU 7.5.0.5 running on Solaris 10

20+ media servers sharing the same Data Domain (DD) and tape library

SLP Duplication is used

 

 

23 Replies


What are the RAM/processor specs of the master/media servers? It also sounds like something could be hung. Have you tried releasing all allocations with nbrbutil -resetAll? How is disk space on the master? TECH176213 talks about low disk space causing this issue. Collect the nbjm logs to troubleshoot.

If you would not mind, please DM me the case number you have open. I would like to check in with the TSE assigned.


Personally, I would ask for the mds, nbpem, jm, rb and emm logs at Debug/Diagnostic levels 6 and 6, in raw format. These should cover a restart and be left running until the system slows and the problem is seen.

In fact, nbrb should only be at debug level 4 (hope I've remembered that right), as anything higher could cause issues itself.

So, 111, 116, 117, 118 + 143

143 logs into 111, so only 4 to collect.

vxlogcfg -a -p 51216 -o 111 -s DebugLevel=6 -s DiagnosticLevel=6

vxlogcfg -a -p 51216 -o 116 -s DebugLevel=6 -s DiagnosticLevel=6

vxlogcfg -a -p 51216 -o 117 -s DebugLevel=6 -s DiagnosticLevel=6

vxlogcfg -a -p 51216 -o 118 -s DebugLevel=4 -s DiagnosticLevel=6

Use same commands but DebugLevel=1 to set back to defaults.
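The four commands (and their reset forms) can be generated in one loop. A minimal sketch; the emit_vxlogcfg helper is illustrative, not an NBU tool, and it only prints the command lines so you can review them before running them:

```shell
# Print the vxlogcfg commands for originators 111, 116, 117 and 118.
# nbrb (118) is capped at DebugLevel 4 as noted above; pass "reset" to
# emit the DebugLevel=1 forms that restore the defaults.
emit_vxlogcfg() {
  mode=$1
  for oid in 111 116 117 118; do
    dbg=6
    [ "$oid" = 118 ] && dbg=4
    [ "$mode" = reset ] && dbg=1
    echo "vxlogcfg -a -p 51216 -o $oid -s DebugLevel=$dbg -s DiagnosticLevel=6"
  done
}

emit_vxlogcfg set    # prints the four set-level commands shown above
```

Pipe the output to sh once you are happy with it, e.g. `emit_vxlogcfg set | sh`.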

The TSE should then run these through the IRMAL script to get details on what's going on.

Numerous possible causes, but the above may give more clues, and if nothing else it will give insight into other potential performance issues.

Is there any nbrb/ nbemm tuning applied ?

 


Terry,

I tried resetAll when I restarted NBU earlier; it didn't help.

1.) Here is the CPU info:

# psrinfo -pv
The physical processor has 16 virtual processors (0-15)
UltraSPARC-T1 (chipid 0, clock 1000 MHz)

/root
# kstat cpu_info|grep core_id|sort -u|wc -l
4

2.) Memory: we have 8 GB physical, and around 10 GB of swap.

3.) /usr/openv, which holds everything related to NBU, is only 46% full.

/dev/md/dsk/d80        492G   222G   265G    46%    /usr/openv

So that rules out the technote.

I am sorry, but I didn't get the "DM me" part about the case number. Did you mean sending the case number to you in a private message?

 



Hello mph99,

I will collect the logs over a restart of NBU services and upload.

There was no tuning applied, but we were using MSEO. When we started getting this issue we unconfigured the MSEO devices and uninstalled MSEO properly, thinking MSEO was the culprit. But that was not the case.

--

Regards


Hello mph99,

This is Gaurav; I work with tollboy.

As you suggested, we restarted the NBU services on the master server. As of now we are not seeing many queued jobs, but as time passes the number of queued jobs will increase.

We also ran "nbrbutil -resetAll" and cancelled all running jobs.

I will send you the required logs once the problem starts again.

We are also noticing that for a few processes the accumulated CPU time is increasing very quickly (nbpem, bpjobd, nbsl, nbjm, nbrb, nbemm, dbsrv).

Below is a sample of the output.

 

============

root@ # bpps | egrep -i 'nbpem|bpjobd|nbsl|nbjm|nbrb|nbemm|dbsrv';date
    root 26458     1   2 05:13:33 ?         602:58 /usr/openv/db//bin/NB_dbsrv @/usr/openv/var/global/server.conf @/usr/openv/var/
    root 27832     1   0 05:13:49 ?          13:14 /usr/openv/netbackup/bin/nbrb
    root 28308     1   0 05:13:56 ?          23:27 /usr/openv/netbackup/bin/nbjm
    root 28820 28503   0 05:14:03 ?           1:04 /usr/openv/netbackup/bin/nbproxy dblib nbpem
    root 22612 28503   0 05:35:21 ?           0:04 /usr/openv/netbackup/bin/nbproxy dblib nbpem_cleanup
    root 28817 28308   0 05:14:03 ?           0:12 /usr/openv/netbackup/bin/nbproxy dblib nbjm
    root 27683     1   0 05:13:48 ?          43:19 /usr/openv/netbackup/bin/nbemm
    root 29534     1   0 05:14:15 ?          37:02 /usr/openv/netbackup/bin/nbsl
    root 28312 28260   0 05:13:56 ?          16:34 /usr/openv/netbackup/bin/bpjobd
    root 28503     1   0 05:13:58 ?          15:45 /usr/openv/netbackup/bin/nbpem
    root 22834 28503   0 06:33:32 ?           0:19 /usr/openv/netbackup/bin/nbproxy dblib nbpem_email
Sat Aug 23 07:09:31 EDT 2014

/root
root@ # bpps | egrep -i 'nbpem|bpjobd|nbsl|nbjm|nbrb|nbemm|dbsrv';date
    root 26458     1   2 05:13:33 ?         611:04 /usr/openv/db//bin/NB_dbsrv @/usr/openv/var/global/server.conf @/usr/openv/var/
    root 27832     1   0 05:13:49 ?          13:17 /usr/openv/netbackup/bin/nbrb
    root 28308     1   0 05:13:56 ?          23:42 /usr/openv/netbackup/bin/nbjm
    root 28820 28503   0 05:14:03 ?           1:04 /usr/openv/netbackup/bin/nbproxy dblib nbpem
    root 22612 28503   0 05:35:21 ?           0:04 /usr/openv/netbackup/bin/nbproxy dblib nbpem_cleanup
    root 28817 28308   0 05:14:03 ?           0:13 /usr/openv/netbackup/bin/nbproxy dblib nbjm
    root 27683     1   0 05:13:48 ?          43:44 /usr/openv/netbackup/bin/nbemm
    root 29534     1   0 05:14:15 ?          37:10 /usr/openv/netbackup/bin/nbsl
    root 28312 28260   0 05:13:56 ?          16:41 /usr/openv/netbackup/bin/bpjobd
    root 28503     1   0 05:13:58 ?          15:48 /usr/openv/netbackup/bin/nbpem
    root 22834 28503   0 06:33:32 ?           0:19 /usr/openv/netbackup/bin/nbproxy dblib nbpem_email
Sat Aug 23 07:10:54 EDT 2014

================

Attaching the bpps -x output as well.

Please suggest what the issue could be.

 

Regards,

 

Gaurav

 



The message "Waiting in NetBackup scheduler work queue on server" can actually be a problem in several areas.

A backup job waits in the work queue from the time it is submitted to nbjm until it obtains resources and the image is created.

Examples:

nbjm --> nbrb 

  *** requests can wait here if nbrb is busy and can't pick up requests quickly enough.  Sometimes tuning nbrb can help.  Can you run nbrbutil -listSettings and post the output?

 

nbrb --> requests and updates to nbemm are slow to return.

  *** This is MOST LIKELY where the problem you're encountering is.

   --- There are several issues with both Sybase and NetBackup not performing well for various reasons, including problems with Sybase itself and slow queries to the database. YOU SHOULD APPLY 7.6.0.1, then 7.6.0.3, as soon as it can be scheduled.

   **** AS ALWAYS, make sure to do a hot catalog backup right before the upgrade, and also make a copy of /usr/openv/db/data as the last step before the upgrade occurs, while the processes are down. This would make recovery a lot easier if anything went awry. If you're running on ZFS, take a ZFS snapshot while the processes are down, immediately before the upgrade.

  REASON: A new version of Sybase is used in 7.6, which performs much better under heavy load. Additionally, queries to the database have been optimized to eliminate any calls which are lengthy to return. There is no equivalent binary set available that has all the same changes in it.

ADDITIONAL considerations:  

  --  Poorly performing disk can also be a reason you would see this. Consider running iostat -xtn 5 60 to an output file. Review the average service time (asvc_t) column, ignoring the first sample as it reports activity since boot. If service times are over 20 msec, performance degrades.

  --  Resources that are in use, or settings such as maximum jobs per policy and maximum jobs per client, can exhibit this behavior too. The messages leading up to the 'waiting' messages would give more details.

  -- Check for stale files in the /usr/openv/netbackup/db/media/drives directory on the media servers that have tape drives associated with them.

  How do you know if they're stale? Check the date of the file: if it is really old (i.e., more than a week), it is almost certainly stale. If unsure, cat the file; there is a line that starts with "PID". Run ps -elf | grep <the PID>. If the PID is still associated with a bptm process, DO NOT DELETE IT. If still unsure, shut down the NBU processes on that media server and delete any leftover files.

  -- Servers marked as offline (communication issues can cause this), or tape drive/robotic problems.

    Check the messages file; sometimes it will show easy clues, such as qlaxxx disappearing from the fabric, robotic device <x> marked down, etc.

  On Solaris, the iostat -en output is sometimes useful, as it shows the number of soft, hard, and transport errors since boot.
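The 20 msec service-time check above can be done mechanically. A minimal sketch; the slow_disks helper and the sample data are illustrative (on a live system, feed it the saved `iostat -xtn 5 60` output instead). It assumes the `iostat -xn` column layout where asvc_t is column 8 and the device name is last:

```shell
# Print device name and asvc_t for any device whose average service time
# exceeds 20 ms. NR > 1 skips the header row.
slow_disks() {
  awk 'NR > 1 && $8 + 0 > 20 { print $NF, $8 }'
}

# Demo against sample data shaped like "iostat -xn" output.
slow_disks <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   10.0    5.0  100.0   50.0  0.0  0.5    0.1   35.2   0  40 md80
    2.0    1.0   20.0   10.0  0.0  0.1    0.0    4.3   0   5 md81
EOF
# -> md80 35.2
```

Remember to drop the first interval of a real capture, since it reports averages since boot.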

 

nbjm -->nbjm

*** If nbjm is slow to pick up allocation information from nbrb, delays can occur here.

 

nbjm --> bpcompatd --> bpdbm --> nbdb

*** The above are the processes involved in image creation. Note again that there is a call to Sybase (nbdb). These steps are used to create the backup image.

All of the above are areas to consider. You did not paste in the messages leading up to the 'waiting' messages; sometimes those can provide clues.

 

The above are the most common reasons.   Problems in this area are a bit painful to diagnose as some in-depth knowledge of the application processes is required. 

I really would suggest you apply 7.6.0.3; this will most likely be the resolution to the issue.

My apologies for the novel. I feel I should put "The End" here. :)

 

Additional information about resource allocation is available at:
http://www.symantec.com/docs/TECH137761
and
 
Resource allocation training for customers and partners, Module 1.
http://www.brainshark.com/symantec/vu?pi=zGuzB8ntZz3BSfz0&
 
Resource allocation training for customers and partners, Module 2.
http://www.brainshark.com/symantec/vu?pi=zGQzOgFNNz3BSfz0&intk=609121994
 
Resource allocation training for customers and partners, Module 3.
http://www.brainshark.com/symantec/vu?pi=zHizHaZuoz3BSfz0&
 
 
Resource allocation training for customers and partners, Module 4.
http://www.brainshark.com/symantec/vu?pi=zEqz430k6z3BSfz0&

 

** The training is a few years old; it's still accurate.


Oh boy...

I just noticed the following from earlier in the thread.  This is NOT GOOD:

1.) Here is the CPU info:

# psrinfo -pv
The physical processor has 16 virtual processors (0-15)
UltraSPARC-T1 (chipid 0, clock 1000 MHz)

 

We have not recommended UltraSPARC T-series chipsets for several releases now. The 7.1 release notes actually have a blurb regarding them:

7.1 Release notes on page 82:

Sybase ASA performance is poor with an UltraSparc-T series processor. The Sybase ASA database does not perform well when Solaris is used with the UltraSparc-T series processor. Thus, Symantec recommends that you do not run your EMM server on this type of hardware.

Additional comments:

2.) Memory we have 8 GB physical, near around  10 GB of swap on it.

This is not a lot of memory. Swap is a moot point, I believe; it's not adequate for the processing speeds NBU requires, as it tends to be hundreds of times slower than physical memory. If you're using swap at any time, it will cause performance issues, especially given the chipset you have.

 

Another good technote for you to review is:  "Key performance considerations for NetBackup 7.5 master servers" Article URL http://www.symantec.com/docs/TECH202840

 


PS - I'd like a message with the case number in it too.

 


Hi Deb/mph99

Attaching the nbjm/rb/emm and pem logs to this post. We restarted the NBU services and collected the logs once the issue was occurring.

We also did a reboot yesterday, but it did not solve the issue.

 

Below is the NBU tuning that has been done in the /etc/system file:

========

set msgsys:msginfo_msgmap=512
set msgsys:msginfo_msgmax=8192
set msgsys:msginfo_msgmnb=65536
set msgsys:msginfo_msgmni=256
*set msgsys:msginfo_msgssz=32
set msgsys:msginfo_msgtql=512
set msgsys:msginfo_msgseg=8192
* Semaphores
*set semsys:seminfo_semmap=64
set semsys:seminfo_semmni=1024
*set semsys:seminfo_semmns=1024
*set semsys:seminfo_semmnu=1024
set semsys:seminfo_semmsl=300
set semsys:seminfo_semopm=32
*set semsys:seddminfo_semume=64
* Shared memory
set shmsys:shminfo_shmmax=7516192768
*set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=256
*set shmsys:shminfo_shmseg=100
* rlim File Descriptors
set rlim_fd_cur=32768

 

=========

Yesterday after the reboot we noticed the same messages again, but at that time we had enough memory and only 7 jobs were queued with 30 active.

Below is the memory status once the issue started.

Memory: 8064M real, 4648M free, 2352M swap in use, 8667M swap free

 

Could you please suggest what could be causing the resource crunch?

 

Regards

Gaurav


Hi Deb,

We have been running NBU on the same hardware configuration for a long time (we upgraded to 7.5.x a year back) and we were not having any issues. We have been facing this issue only since last month, when we changed some NBU parameters (TCP keepalive value and client read timeout) while troubleshooting another issue.

 

Also, how much memory is required to run NBU smoothly?

We are running the below configuration / backup data sizes:

========

Catalog backup: approx. 175 GB daily

/usr/openv: total size 492 GB (43% used)

Physical memory: 8 GB

Duplication: via SLP (12 LTO-6 drives total)

=======

I noticed that yesterday after the reboot we had 4.6 GB of physical memory available, but we still received the same message.

Below was the memory status yesterday. At that time there were only 7 queued jobs and 30 active jobs.

Memory: 8064M real, 4648M free, 2352M swap in use, 8667M swap free

 

8/24/2014 11:07:12 AM - Info nbjm(pid=7540) Waiting in NetBackup scheduler work queue on server bhradcebs    

8/24/2014 11:09:12 AM - Info nbjm(pid=7540) Waiting in NetBackup scheduler work queue on server bhradcebs    

8/24/2014 11:11:12 AM - Info nbjm(pid=7540) Waiting in NetBackup scheduler work queue on server bhradcebs    

 

Regards,

Gaurav


Hello,

 

This is the case number 07060189 which we have opened with support.


Thanks for the logs and for the case ID! I'll review each today as time permits.

 


FYI - I managed to look at things for a bit today. I found the data uploaded to our evidence server and added it to my review list.

There is a combination of issues here.

 

1.  Max jobs per policy is being met so we can't allow more jobs to go active for the specified policy.

2.  Max jobs per client is met, so again, we can't allow more jobs to go active for the specified client.

3.  Duplication jobs are running for long periods and they are causing contention (as one duplication is using a media required by another...)

4.  Scsi errors reported in the messages file.

 

Of course, it's never an easy thing... :(

 

The resource broker itself is running just fine. Cycles are normally about 3 1/2 minutes, and hundreds of requests are typically being reviewed.

Example:

08/23/14 20:09:39.894 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [No context] 1 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 400 requests, evalTime=274790 msec, numSucceeded=1, numQueued=399, numPended=0, numFailed=0, numMultipleCopyJobs=6, numMdsCacheMisses=7, restartedEvalCycle=1, numDriveReleasesBeforeEval=6, numDrivesAvailStart=6, numDrivesAvailEnd=0, requestQueueSize=597

nbrb was interrupted at 3 minutes to look at the waiting jobs:

08/23/14 20:12:56.108 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [No context] 1 [ResBroker_i::doEvaluation] Eval cycle interrupted by timer
 

The cycle completed about 3 1/2 minutes later:

08/23/14 20:16:23.137 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [No context] 1 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 541 requests, evalTime=390371 msec, numSucceeded=3, numQueued=538, numPended=0, numFailed=0, numMultipleCopyJobs=13, numMdsCacheMisses=15, restartedEvalCycle=2, numDriveReleasesBeforeEval=3, numDrivesAvailStart=6, numDrivesAvailEnd=0, requestQueueSize=595
 

So about 7 minutes to evaluate 541 requests.  
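Cycle health like this can be tracked over time by distilling the <EVALPERF> records down to a few fields. A sketch; the evalperf_summary helper is illustrative (on a live system, pipe the raw nbrb log through it rather than the embedded sample line):

```shell
# Reduce each <EVALPERF> line to its request count, evaluation time and
# success count; non-matching lines are silently dropped (-n with /p).
evalperf_summary() {
  sed -n 's/.*Evaluated \([0-9]*\) requests, evalTime=\([0-9]*\) msec, numSucceeded=\([0-9]*\).*/requests=\1 evalTime_ms=\2 succeeded=\3/p'
}

# Demo against a line shaped like the log excerpt above.
evalperf_summary <<'EOF'
08/23/14 20:16:23.137 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 541 requests, evalTime=390371 msec, numSucceeded=3, numQueued=538, numPended=0
EOF
# -> requests=541 evalTime_ms=390371 succeeded=3
```

Rising evalTime with a flat succeeded count is the signature of the slow cycles described above.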

 

For #1, check the attached file named policy-limits; it has the policy names and the max jobs per policy. Consider increasing the max jobs per policy for the policies that are hitting the limit.

Example error:

08/23/14 19:44:54.123 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [jobid=3074876] 1 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005034 retryClass="RB Defined Retry" retryReason=24 retryText="LOG 1408837494 4 nbrb 27832 Limit has been reached for the logical resource bhradcebs.NBU_POLICY.MAXJOBS.FSPROD_11_DB" retryResource=bhradcebs.NBU_POLICY.MAXJOBS.FSPROD_11_DB)
 

 

For #2, the master server's Host Properties > Global Attributes tab has a maximum jobs per client setting. Currently you have yours set to 10:

# /usr/openv/netbackup/bin/admincmd/bpconfig -U

Admin Mail Address:           
Job Retry Delay:              10 minutes
Max Simultaneous Jobs/Client: 10

---cut out the rest--

Consider increasing this number, especially since you also limit the number of jobs in the policy in many cases.

Example nbrb message:
08/23/14 19:45:39.397 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [jobid=3074921] 1 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005034 retryClass="RB Defined Retry" retryReason=24 retryText="LOG 1408837539 4 nbrb 27832 Limit has been reached for the logical resource bhradcebs.NBU_CLIENT.MAXJOBS.[client-name]" retryResource=bhradcebs.NBU_CLIENT.MAXJOBS.[client-name]" )
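The "Limit has been reached" messages for both the policy and client limits can be tallied to see which resources are the worst offenders. A sketch; the limit_report helper is illustrative, and the demo feeds it sample lines modeled on the messages above rather than a real nbrb log:

```shell
# Count occurrences of each limited logical resource in nbrb log text
# read from stdin, most frequent first.
limit_report() {
  grep -o 'Limit has been reached for the logical resource [^"]*' |
    awk '{ print $NF }' | sort | uniq -c | sort -rn
}

# Demo; a real run would read the raw nbrb log instead of this here-doc.
limit_report <<'EOF'
LOG 1408837494 4 nbrb 27832 Limit has been reached for the logical resource bhradcebs.NBU_POLICY.MAXJOBS.FSPROD_11_DB" ...
LOG 1408837539 4 nbrb 27832 Limit has been reached for the logical resource bhradcebs.NBU_CLIENT.MAXJOBS.adciampsp1" ...
LOG 1408837599 4 nbrb 27832 Limit has been reached for the logical resource bhradcebs.NBU_CLIENT.MAXJOBS.adciampsp1" ...
EOF
```

The counts show at a glance whether one policy or client limit dominates the queued jobs.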
 

 

For #3:

Example message:

08/23/14 19:23:56.128 [Debug] NB 51216 nbrb 118 PID:27832 TID:11 File ID:118 [jobid=3071905] 2 [RbActionHandler::evaluate] MDS defined retry : (RetryInfo_Record: reason=1 (MEDIA IS IN USE)  mediaServerName=adcdupp1 robotNum=0 robotType=8 mediaTypeName=NetBackup HCART3 driveTypeName=NetBackup HCART3 mediaId= mediaKey=0 driveName= driveKey=0 poolName=SPECTRA stuName=adcdupp1-hcart3-robot-tld-0 timeout=0 scanHostName= retryInfoString=MDS_RETRY 1 adcdupp1 0 8 *NULL* *NULL* SPECTRA adcdupp1-hcart3-robot-tld-0 *NULL* 0 *NULL* *NULL* queuedReasonDisplayString=adcdupp1-hcart3-robot-tld-0)
 

*** Media-in-use message above. You might want to check for stuck duplication jobs (jobs that have been running for days but not doing anything). Cancel any that are suspect.

 

 

For #4.

The messages file is showing messages that indicate there are a number of errors with the fiber connections:

Aug 14 06:58:31 bhradcebs svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1
Aug 14 07:06:15 bhradcebs scsi: [ID 365881 kern.info] /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/st@w21110090a500526a,0 (st32):
Aug 14 07:06:15 bhradcebs  <IBM     ULTRIUM-TD6     >
Aug 14 09:14:28 bhradcebs svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1
Aug 14 09:39:48 bhradcebs scsi: [ID 365881 kern.info] /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/st@w21120090a500526a,0 (st30):
Aug 14 09:39:48 bhradcebs  <IBM     ULTRIUM-TD6     >
 

and

Aug 13 07:30:08 bhradcebs avrd[1608]: [ID 825999 daemon.notice] Reservation Conflict status from SPECTRA.T950.025 (device 5)
Aug 13 07:45:58 bhradcebs tldd[21889]: [ID 824863 daemon.error] TLD(0) [21889] timed out after waiting 841 seconds for ready, drive 6
Aug 13 09:26:02 bhradcebs ltid[1160]: [ID 747366 daemon.error] Operator/EMM server has DOWN'ed drive SPECTRA.T950.025 (device 5)
Aug 13 09:26:39 bhradcebs avrd[1608]: [ID 825999 daemon.notice] Reservation Conflict status from SPECTRA.T950.025 (device 5)
Aug 13 11:26:10 bhradcebs scsi: [ID 365881 kern.info] /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/st@w21130090a500526a,0 (st41):
 

Contact the hardware vendor; they should be able to help with this. The NBU processes are only reporting the errors received from the OS.

*** Note that I DO NOT have the messages file from the same time as the log file.

 

Additionally, it looks like the nxge cards are going down a lot. It doesn't look like you use them much, so I'm not sure how relevant these messages are:

Aug 13 06:13:07 bhradcebs nxge: [ID 339653 kern.notice] NOTICE: nxge1: xcvr addr:0x0c - link is down
Aug 13 06:13:08 bhradcebs nxge: [ID 339653 kern.notice] NOTICE: nxge2: xcvr addr:0x0b - link is down
 

 

I'll be forwarding these notes to the TSE working this case.  If you have questions - please let me know.

 


Thanks a lot, Deb, for heading this up.

* We are getting these messages before the max jobs per client/policy limits are reached.

* For the SCSI errors: there was a drive conflict issue and some drives were in "PEND-TLD" state; after fixing that we are still getting the same messages.

* For the link-down (nxge) messages: I have seen the same message long back as well, when we were not having this issue.

I am not sure what troubleshooting we need to do to fix this issue.

Could you please suggest further which logs are needed to troubleshoot?

Duplication happens via SLP. Since most jobs are waiting for resources, a backlog of images needing duplication built up. With that in mind, we tried deactivating the images that had not yet been duplicated (we thought those might be eating resources), but that didn't help much either (there are still 3000+ backup images in the "IN_PROGRESS" state waiting to be duplicated).

Once the server was rebooted there was no duplication running, only 7 queued jobs and 30 active jobs, and 70% of physical memory was free.

Could you please guide me on which direction to troubleshoot further?

Thank you very much for your valuable suggestions.

Regards,

Gaurav


#1 and #2: It is not because of this. Let me explain with an example: right now the jobs for the policy IAM_LNX_PROD are in the queued state. There are 8 in total; none of them is active, and all of them show the following messages in their details.

8/26/2014 7:00:06 AM - requesting resource bhradcebs.NBU_CLIENT.MAXJOBS.adciampsp1
8/26/2014 7:00:06 AM - requesting resource bhradcebs.NBU_POLICY.MAXJOBS.IAM_LNX_PROD

Now not even a single backup is running for the client adciampsp1, but the jobs are still waiting for that resource. To add to this, we have had these settings from the start; this could be suspected as the culprit if it were an issue with one or two policies, but it seems to be an issue with every policy.
 
#3: Duplications are done by SLP, so they keep triggering again and again.
#4: We have fixed those SCSI errors; two Solaris 9, NBU 6.5 media servers were causing the tape conflicts. We have removed the tapes from those two media servers, and since then there have been no conflicts between the media servers for the drives.

I suspect that when jobs complete, they are not releasing their resources. For example: after we restarted NBU, a backup engaged the resources bhradcebs.NBU_CLIENT.MAXJOBS.adciampsp1 and bhradcebs.NBU_POLICY.MAXJOBS.IAM_LNX_PROD, but when the job completed it did not release them. The next time a backup runs, it keeps waiting for resources that were never released by the finished job.
I am not sure if I have explained myself well; please let me know if this could be the case.

 

 

Regards


Hi,

  First item: I mentioned stale files in the db/media/drives directories on the media servers. From my previous notes:

  -- Check for stale files in the /usr/openv/netbackup/db/media/drives directory on the media servers that have tape drives associated with them.

  How do you know if they're stale? Check the date of the file: if it is really old (i.e., more than a week), it is almost certainly stale. If unsure, cat the file; there is a line that starts with "PID". Run ps -elf | grep <the PID>. If the PID is still associated with a bptm process, DO NOT DELETE IT. If still unsure, shut down the NBU processes on that media server and delete any leftover files.

 

I reviewed the nbrb log files for stale files, and your media servers do have a lot of them. Please check all of the servers for these files and remove any that are not valid.

Example from the master server nbsu:

/usr/openv/netbackup/db/media/drives:
total 12
-rw-------   1 root     other        276 Feb 28 03:16 drive_IBM.03592E05.002
-rw-------   1 root     other        277 Mar  6 14:16 drive_IBM.03592E05.010
-rw-------   1 root     other        277 Mar  6 14:05 drive_IBM.03592E05.012
-rw-------   1 root     other        277 Mar  6 14:45 drive_IBM.03592E05.017
-rw-------   1 root     other        262 Aug 14  2014 drive_SPECTRA.T950.020
-rw-------   1 root     other        262 Aug  4 03:11 drive_SPECTRA.T950.026
 

All of the files are stale except the one named drive_SPECTRA.T950.020. These files will prevent reallocation of drives.
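The manual check described earlier (cat the file, find the PID line, see whether that PID is still a bptm process) can be sketched as a script. Assumptions not from the post: the PID line looks like "PID: <n>" (inspect a real drive_* file first), and ps supports -p / -o comm= (true on Solaris 10 and Linux). The demo uses a throwaway file rather than a live /usr/openv/netbackup/db/media/drives directory:

```shell
# Report whether a drive_* file still belongs to a live bptm process.
check_drive_file() {
  f=$1
  # Pull the number off the first line whose first field is PID.
  pid=$(awk -F'[ =:]+' 'toupper($1) == "PID" { print $2; exit }' "$f")
  if [ -n "$pid" ] && ps -p "$pid" -o comm= 2>/dev/null | grep -q bptm; then
    echo "$f: in use by bptm pid $pid - DO NOT DELETE"
  else
    echo "$f: likely stale"
  fi
}

# Demo with a temp file holding a PID that should not exist.
tmp=$(mktemp)
echo "PID: 999999" > "$tmp"
check_drive_file "$tmp"   # reports the file as likely stale
rm -f "$tmp"
```

Run it against each drive_* file on the media servers; only delete files it flags, and only after double-checking the date as described above.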

I reviewed the log file information from the other day; the specified client/policy combo ran OK. I reviewed the policy information for that combination; they run to a disk pool. Was the disk pool online at the time?

 

If you grep for "reason" in the raw nbrb log files, do you see any occurrences of the disk pool going down?

 

If you'd like, run nbcplogs for the timeframe when the job was supposed to run, and I'll review specifically what occurred for that particular job.

Syntax (based on the assumption the job started at 7AM per the policy):

from /usr/openv/netbackup/bin/support:

nbcplogs -s 08/26-06:50 -e 08/26-10:00 --ftp 07060189

If you do not have direct FTP access, then instead of --ftp 07060189 use a directory to copy the log files to, such as /usr/openv/log-copy or wherever you'd like.

 

Deb

 

 


Hello,

(I was wondering if you had read my last update https://www-secure.symantec.com/connect/forums/jobs-queued-state-waiting-netbackup-scheduler-work-queue-server#comment-10432311)

I have removed those stale files, as you mentioned, from all the media servers and the master; they were from the old IBM library we had. No, we have not seen any occurrence of the disk pool going down (it is a Data Domain). I am still collecting the nbcplogs logs and will upload them once complete.

In our environment most backups go to disk (DD) first and are then duplicated to tape using SLPs.

 


Thank you for making sure I read your comment. I should have explained why this is so important: if anything in those files has the same /dev/rmt device number, or anything else that matches something in the database, nbrb will never reallocate the drive. This frequently causes problems with future allocations. The typical behavior I see is that the resources are allocated the first time after a recycle, then never allocated again.

The nbrb process triggers a bptm report-drives function (-rptdrv) every 10 minutes that reports the contents of those files to the resource broker. As stated, if anything matches something currently configured, drive allocations suffer.

With that said, that doesn't explain the behavior for the disk-based storage units. I do see indications in the log files that there are some communication-type problems occurring, and some other issues (like shared memory errors from some of the media servers, system call failures, etc.). Again, those don't explain the slow resource allocations for disk-based storage units in general, although communication errors to specific clients can cause delays overall.

I eagerly await the nbcplogs data. It helps so much to have an isolated example (policy/client) to review. If you couldn't use the --ftp flag, please make sure the nbsu run from the nbcplogs is included in the information you uploaded. If I remember correctly, it doesn't copy to the directory specified (go figure).

I will open a backline task on this case to continue my investigation, as I think this thread is getting a bit long. We can continue this conversation via regular email and update this thread when we figure out what went awry.

Deb