12-16-2011 04:13 AM
Hi
For a change I have an issue that I need some help with.
NetBackup 7.0.1 (can't upgrade without major change control)
Master is clustered on Windows 2003 R2 x64 Enterprise
Multiple Media Servers on same NBU and O/S level.
All use Advanced Disk with SLPs and duplicate to LTO4 libraries
The system has been working perfectly for months - 80+TB on a weekly full, incrementals every day, 7000+ jobs per day.
Everything runs really well and all jobs and duplications keep nicely up to date.
The majority of jobs use multiple streams.
During the last week some (65) parent jobs have been left as running even though their child jobs have all completed successfully.
This is on a range of media servers and not just on one - 7 different ones last night.
Have done a bpdown, a process cleanup, and cleaned up all media servers so nothing has been left hung.
Also done a cluster failover.
But this is still happening each day.
These parent jobs will not cancel.
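For anyone wanting to pick out the affected parents from a job listing, the check described above (parent still running, every child finished) can be sketched as below. This is only an illustration: the `Job` record, its field names and the "Active"/"Done" state strings are hypothetical and are not the NetBackup job database format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job record; not the real NetBackup job database layout."""
    job_id: int
    parent_id: Optional[int]  # None for a parent (top-level) job
    state: str                # assumed values: "Active" or "Done"


def stuck_parents(jobs: List[Job]) -> List[int]:
    """Return IDs of parents still Active although all their children are Done."""
    # Group child jobs under their parent's ID.
    children: Dict[int, List[Job]] = {}
    for job in jobs:
        if job.parent_id is not None:
            children.setdefault(job.parent_id, []).append(job)

    stuck = []
    for job in jobs:
        kids = children.get(job.job_id, [])
        is_parent = job.parent_id is None and len(kids) > 0
        if is_parent and job.state == "Active" and all(k.state == "Done" for k in kids):
            stuck.append(job.job_id)
    return stuck


# Made-up sample data for illustration only.
sample = [
    Job(100, None, "Active"), Job(101, 100, "Done"), Job(102, 100, "Done"),
    Job(200, None, "Active"), Job(201, 200, "Active"),
    Job(300, None, "Done"), Job(301, 300, "Done"),
]
print(stuck_parents(sample))
```

Only a parent that is still active with no active children is reported; a parent whose streams are genuinely still running is left alone.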
I am thinking along the lines of the pempersist file (which will have followed the cluster failover) but just wanted some ideas on where to look first on this one
Thanks for any help
01-19-2012 08:39 AM
All resolved now
Everything is updated to 7.1.0.3 and the desktop heap increased to 2048 - the resolution was either the desktop heap alone or both changes together, but not 7.1.0.3 on its own.
For others' information, especially those who think their Master Server is a high enough specification to cope with anything - please compare with the one in this case:
Clustered Dedicated Master
Windows 2003 R2 Enterprise ED 64 bit
32GB RAM
Dual Quad Core Opteron 2.4GHz Processors
1TB drive for catalogs etc. etc.
Should have coped!!
The desktop heap is increased by editing the "Windows" value under the key
HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems\
Part of this value reads similar to: Windows SharedSection=1024,20480,1024
The last of the three figures needed changing to increase the desktop heap so that it now reads:
Windows SharedSection=1024,20480,2048
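If you would rather script the string change than edit it by hand, here is a minimal sketch of just the text manipulation. It assumes the value contains a three-figure SharedSection entry as shown above; the function name `set_last_shared_section` is made up for this example, and the code does not read or write the registry itself, so you would still paste the result in (or write it back with your own tooling) after taking a backup.

```python
import re

# Matches "SharedSection=" followed by exactly three comma-separated figures (KB).
_SHARED_SECTION = re.compile(r"(SharedSection=)(\d+),(\d+),(\d+)")


def set_last_shared_section(windows_value: str, new_kb: int) -> str:
    """Return windows_value with the third SharedSection figure set to new_kb."""
    if new_kb <= 0:
        raise ValueError("heap size must be a positive number of KB")

    def _swap(match):
        # Keep the first two figures untouched, replace only the last one.
        return f"{match.group(1)}{match.group(2)},{match.group(3)},{new_kb}"

    updated, count = _SHARED_SECTION.subn(_swap, windows_value, count=1)
    if count == 0:
        raise ValueError("no three-figure SharedSection entry found")
    return updated


# The fragment quoted in this post, before the change.
before = "Windows SharedSection=1024,20480,1024"
after = set_last_shared_section(before, 2048)
print(after)
```

The function refuses to guess if the SharedSection entry is missing or has a different number of figures, which seemed safer than silently producing a malformed value.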
Hope this helps someone in the future.
12-16-2011 03:15 PM
Oh well - no offers!
I have been digging and am looking at antivirus causing this - will update when I have confirmed.
12-17-2011 08:54 PM
Well, difficult to say really. If it has been working and then all of a sudden the problems came, then one would have to suspect that there has been a change to the system.
From your description it doesn't sound like something I have seen before really. You already mentioned pempersist; that is a viable option I think. You don't mention that the jobs are "Waiting for retry", so it is possibly not that known case either.
You could be right about antivirus, and perhaps a firewall update? As the master is on Windows 2003, did anyone apply any Microsoft hotfixes about a week ago? Any changes to NIC configuration?
Does the admin log reveal anything about system calls?
I guess you could do the Windows way; reboot of all NBU servers... Hard to believe, but a reboot has helped me so many times for NBU. EV is even worse.
/A
12-18-2011 02:16 AM
If there are SQL backup jobs, then this may be related: http://www.symantec.com/docs/TECH65687
But if it's just filesystem, I can't think of why except some changes in the environment, especially on the network side. A TCP registry change may be needed to see if it's related to the timeout (for the child to connect back to the parent), but I have forgotten which one it is.
pempersist is also a good point, but that would involve restarting the master server.
12-19-2011 08:05 AM
Thanks All
No SQL backups, everything is straight file system.
We are working with the AV team to get everything excluded correctly and I have found many processes that have not been excluded since the change to 64 bit servers.
This should go through in the next 2 or 3 days (change control!!) so I will update this after the exclusions have been applied.
12-23-2011 03:24 AM
I have been working on this and have had all AV exclusions set, which seemed to improve things a little.
I then found that cmd.exe and nbdelete.exe processes were crashing (application pop ups).
So I searched again and what do you know .... a tech note came out yesterday for just this fault (well, not in its title, but exactly what is actually happening) and it is fixed in 7.1.0.3:
http://www.symantec.com/docs/TECH177597
Trouble is we are on 7.0.1.
Case logged to see if we can get the 7.0.1 version of the fix - will update if we do.
01-06-2012 04:48 AM
Just an update on this one - which looks to be resolved.
There was no 7.0.1 EEB available so the system has been upgraded to 7.1.0.3.
The errors still persisted, although there were fewer of them, and we now got a Desktop Heap Exhaustion message in the event logs.
Yesterday I increased the desktop heap to 2048 from 1024 and we had no hung jobs overnight, so I will monitor this to see how it goes over the weekend - but it is looking good.