02-10-2012 07:48 AM
... the jobs simply don't do anything.....
Hello All,
Windows 2008 R2 servers (2 clustered master servers, 9 media servers)
NetBackup 7.1.0.3 on all servers.
We now have jobs that do nothing for hours, even days; they never end.
They also never run again: whether hourly or daily, a job does not run again until the previous one ends, and since it never ends.... we don't have regular backups.
These jobs cannot be deleted. The only option (that I know of) is to stop all NetBackup services and delete them from the DB by hand, like this:
• Shutdown NetBackup Master AND Media Server processes
• Verify no active NetBackup processes on the Master server
• Verify no active NetBackup processes on all Media servers
• Run bpjobd -r <jobid> for each ghost jobid.
Will this ever be resolved? Or does anyone know another, less disruptive way to remove these “never ending” jobs?
Thank you
02-10-2012 08:13 AM
You can try this:
http://www.symantec.com/business/support/index?page=content&id=TECH35484
It will help.
Let us know if it does not work.
02-10-2012 08:22 AM
I prefer this one as it always works:
http://www.symantec.com/docs/TECH43177
Although you only need to delete the try and ffile files that relate to the hung job IDs.
There was also this one but it says fixed in 7.1.0.2:
http://www.symantec.com/docs/TECH146990
The question is: why is this happening to you in the first place?
You need to deal with these as per my first tech note above, otherwise some policies may not be running.
02-10-2012 12:12 PM
Hello Amaan,
Thank you for your reply.
When I execute the cancel command, I get the following:
C:\Users\Administrator>bpdbjobs -cancel 1118359
Canceling 0 jobs
It looks like the job has ended but the console monitor does not know....
Also, it looks like the same policy is able to run even though another of its jobs is still in running state… (I don't understand the reason for this…)
I'll do more tests on Monday.
02-10-2012 02:19 PM
OK - follow the first link I provided above (http://www.symantec.com/docs/TECH43177) to get rid of your orphaned jobs from the console - this always works and will resolve those for you.
If you find that you have jobs that have completed but the parent jobs are hung then it may be as a result of other things.
It may be a port blockage (running out of ports so parent status never gets updated) - you can help this in two ways so ensure that you do the following on the Master Server:
1. Add the following registry key (needs a reboot to take effect):
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
DWORD – TcpTimedWaitDelay - Decimal Value of 30
2. Run this command from an Administrative command prompt (no reboot required, takes effect immediately and is persistent):
netsh int ipv4 set dynamicport tcp start=10000 num=50000
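As a rough sanity check on the numbers in that command, the sketch below works out which ports a given start/num pair covers and rebuilds the same netsh line. The limits used (start of at least 1025, at least 255 ports, last port no higher than 65535) are my understanding of what netsh accepts and should be treated as assumptions; the function names are hypothetical.

```python
from typing import Tuple

# Assumed netsh limits (not taken from this thread): verify before relying on them.
MIN_START = 1025
MIN_COUNT = 255
MAX_PORT = 65535


def dynamic_port_range(start: int, num: int) -> Tuple[int, int]:
    """Return (first_port, last_port) covered by start/num, or raise ValueError."""
    if start < MIN_START:
        raise ValueError(f"start must be at least {MIN_START}")
    if num < MIN_COUNT:
        raise ValueError(f"num must be at least {MIN_COUNT}")
    last = start + num - 1
    if last > MAX_PORT:
        raise ValueError(f"range ends at {last}, above {MAX_PORT}")
    return start, last


def netsh_command(start: int, num: int) -> str:
    """Build the command string from the post after validating the range."""
    dynamic_port_range(start, num)
    return f"netsh int ipv4 set dynamicport tcp start={start} num={num}"


if __name__ == "__main__":
    first, last = dynamic_port_range(10000, 50000)
    print(f"dynamic ports would span {first}-{last}")
    print(netsh_command(10000, 50000))
```

With start=10000 and num=50000 the range runs from port 10000 to port 59999, which stays under the 65535 ceiling.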
Next, it may be that the server runs out of desktop heap, so also check the following:
The desktop heap is increased by editing the "Windows" value under
HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems\
Part of this value reads similar to: Windows SharedSection=1024,20480,1024
The last of the three figures needs changing to increase the desktop heap so that it now reads:
Windows SharedSection=1024,20480,2048
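Because the full "Windows" string is long and easy to mistype, here is a minimal sketch of the edit as a text operation: it changes only the third SharedSection figure and leaves everything else in the string alone. It does not touch the registry; the function name is hypothetical and the input is assumed to contain a three-figure SharedSection entry like the one shown above.

```python
import re

# Matches "SharedSection=a,b," (group 1) followed by the third figure (group 2).
_SHARED_SECTION = re.compile(r"(SharedSection=\d+,\d+,)(\d+)")


def bump_desktop_heap(windows_value: str, new_third: int) -> str:
    """Return windows_value with the third SharedSection figure replaced."""
    match = _SHARED_SECTION.search(windows_value)
    if match is None:
        raise ValueError("no three-figure SharedSection entry found")
    return (
        windows_value[: match.start(2)]
        + str(new_third)
        + windows_value[match.end(2):]
    )


if __name__ == "__main__":
    before = "Windows SharedSection=1024,20480,1024"
    print(bump_desktop_heap(before, 2048))
```

Copy the existing value out of the registry first and keep it somewhere safe, so the original figures can be restored if needed.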
This solved my own issue with hung parent jobs: https://www-secure.symantec.com/connect/forums/hung-parent-jobs
Next it may be a PagedPool issue so also tune that:
To tune your paging use the following registry keys (need a reboot) - create them if not already there:
HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\
DWORD - PoolUsageMaximum - Decimal value of 40
DWORD - PagedPoolSize Hex value of FFFFFFFF (this is 8 x F)
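If you would rather import these two values than type them into regedit, the sketch below generates the text of a .reg file for them. It only builds a string and does not modify the registry; the helper names are hypothetical, and you should review the output before importing it on a production master server.

```python
# Key path from the post, written with the full hive name as .reg files expect.
KEY = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control"
    r"\Session Manager\Memory Management"
)


def reg_dword_line(name: str, value: int) -> str:
    """Format one DWORD entry in .reg syntax (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return f'"{name}"=dword:{value:08x}'


def build_reg_file() -> str:
    """Return .reg text setting the two values suggested in the post."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        reg_dword_line("PoolUsageMaximum", 40),       # decimal 40
        reg_dword_line("PagedPoolSize", 0xFFFFFFFF),  # hex FFFFFFFF
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file())
```

Note that decimal 40 appears as 00000028 in the file, since .reg DWORD entries are written in hex.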
All of these will help your system anyway so worth doing
Hope this helps
02-14-2012 03:47 AM
Thank you for your post Mark.
I'll try to get this solution into production today. I'll give you feedback when possible.
02-28-2012 06:46 AM
Sorry for the delay on this report...
The problem still occurs after applying Mark's suggestions.
02-28-2012 07:27 AM
So are you saying that you removed all of the Status 50 jobs by following my link but more have appeared?
If so, then you have a process failing on your system.
Check the Application and System Event logs for application failures / popups etc. so that we can pin down what is going wrong.
02-28-2012 07:46 AM
02-28-2012 08:25 AM
I have no interaction with these jobs. They start normally when their window opens: 7PM, 8PM, 11PM, 3AM, etc .... I do not restart them, and I do not cancel them...
By the time I notice these jobs they have already been "running" for 3 or 4 days.... This is when I stop NetBackup and delete them from the database.
I have 15,000 jobs running daily, it’s impossible for me to keep them all under "surveillance"...
02-28-2012 08:39 AM
If you have that many jobs running per day then it could quite possibly be an application or memory issue
If the majority of them are parent jobs then try the desktop heap setting, which has helped me in the past.
How much RAM does your server have and what is your desktop heap setting?
02-28-2012 09:25 AM
This server has 12GB memory.
Is this the section for DesktopHeap?
"... Windows SharedSection=1024,20480,2048 ..."
Thank you
02-28-2012 03:12 PM
Have you typed that correctly?
the middle figure 20480 looks wrong!
03-05-2012 04:39 AM
I did a copy+paste... this configuration was done based on your previous post....
03-05-2012 05:18 AM
What were the figures originally?
03-05-2012 06:41 AM
Ok - that is fine then - and you have increased it but are still having errors
Did you do the other changes for the PagedPool?
As a strategy you need to do the following (I know it is not easy to find the time but it is worth it)
1. Apply the PagedPool setting I gave you earlier
2. Clean up the orphaned jobs as per the tech note I gave (needs NBU downtime)
3. When a failure occurs, check the Application and System event logs for anything happening - either processes dying / crashing or memory / system warnings
If you wish then upload your Application and System Event logs from the Master Server (in evtx format) and we can take a look to see what we can spot
03-05-2012 06:51 AM
It was like this: Windows SharedSection=1024,20480,1024 (but, this could already be wrongly changed.....)
03-05-2012 06:57 AM
Mark, I did all changes in your post...
I'm not looking daily at system logs, but I'll take a look...
03-05-2012 11:27 AM
Are you using storage lifecycle policies?
03-06-2012 03:01 AM
Hello snapshot4,
no we are not.