05-09-2014 02:25 PM
Hello Experts!
I have recently been facing an issue where thousands of DB jobs for my SQL DB Windows clients started failing with error 54 (child) / 2 (parent).
I tried some basic troubleshooting: limiting the number of jobs that run per client to 10, extending the overall backup window,
limiting the number of jobs per STU to 10, and then limiting the jobs overall to use only 4 and then 3 STUs.
Environment -
1 Master : HP-UX 11.31 NBU 7505 [Client connect timeout - 300, Client read timeout - 3600, Media server connect timeout - 90]
6 Media : 3 x HP-UX 11.31 NBU 7505 + 3 x Red Hat 6.2 NBU 7505 [Client connect timeout - 300, Client read timeout - 3600, Media server connect timeout - 90]
Multiple Clients : Win 2003 R2 and Win 2008 R2 X64
The failed jobs do run on the second attempt and complete, but the idea is to get them working on the first attempt.
Maybe I am missing something basic here, but I am not sure.
Any help will be greatly appreciated.
05-12-2014 12:19 AM
You forgot to mention the NBU version on the SQL clients.
Is there a firewall anywhere in the picture?
Have a look at this TN and see if anything may be relevant:
http://www.symantec.com/docs/TECH138071
05-13-2014 12:58 PM
Thanks for the reply Marianne. The agent versions are 7.5.
There is no firewall in that particular environment setup.
05-20-2014 11:22 PM
The CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT DWORD registry keys have helped me with a lot of timeout issues. The default of 300 (seconds) is often insufficient for database backups, and especially for restores.
Regards
Michael
05-21-2014 10:28 AM
The Timeout is currently set to 1800 on the client and 3600 on the media servers.
05-21-2014 11:38 AM
Windows 2008 has the firewall turned on. Have you tried adding rules on the Windows server to allow communication? This was an issue I ran into.
05-21-2014 12:25 PM
Confirmed with our Windows team that an exception rule for the NetBackup ports already exists. While troubleshooting, I found that the failures happen randomly on Saturdays between 6 AM and 10 PM, which is when most of the jobs actually run. A failed job completes if it is rerun later on Sunday or Monday. I am coming to the conclusion that a resource crunch might be causing the timeouts. I also observed that the parent job is initiated and fails, after which the child comes into the queue and then dies.
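One way to check whether the failures really cluster in that Saturday window is to tally them by weekday and hour. The following is only a rough sketch: it assumes the end times of the status 54 jobs have already been exported to a list of timestamp strings, and both the timestamp format and the sample values are hypothetical, not taken from this environment.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def failures_by_slot(timestamps, fmt="%m/%d/%Y %H:%M:%S"):
    """Count failures per (weekday name, hour of day) slot.

    `fmt` is an assumed export format; adjust it to whatever the
    job report actually produces.
    """
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, fmt)
        counts[(DAY_NAMES[dt.weekday()], dt.hour)] += 1
    return counts

# Hypothetical sample data for illustration only.
sample = [
    "05/17/2014 06:15:00",
    "05/17/2014 06:40:12",
    "05/17/2014 21:05:33",
    "05/18/2014 03:00:00",
]

# Print the busiest failure slots first.
for (day, hour), n in failures_by_slot(sample).most_common():
    print(f"{day} {hour:02d}:00  {n} failure(s)")
```

If most of the counts land in a handful of Saturday hours, that would support the resource-crunch theory over a per-client configuration problem.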
05-21-2014 12:47 PM
Is there any way to test by running on a different day, or by moving some of the other jobs to a different time? I fire off my fulls all weekend, from Friday night until Sunday morning, to help with speed.
05-21-2014 01:12 PM
I tried, but the problem is that I have ~5000 SQL policies along with other ones, and I have to accommodate them somewhere during the weekend. Wherever I tried to move them, they failed during their first runs. I later got them running in sub-groups somehow.
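The sub-group approach can be sketched as spreading the policies evenly across the backup window instead of starting them all together. This is only an illustration: the policy names, the group size of 250, and the `stagger` helper are hypothetical, and the resulting start times would still have to be applied to the policy schedules by hand or by script.

```python
from datetime import datetime, timedelta

def stagger(policies, window_start, window_hours, group_size):
    """Split policies into groups and space their start times
    evenly across the backup window.

    Returns a list of (start_time, [policy names]) tuples.
    """
    groups = [policies[i:i + group_size]
              for i in range(0, len(policies), group_size)]
    if not groups:
        return []
    # Interval between consecutive group start times.
    step = timedelta(hours=window_hours) / len(groups)
    return [(window_start + i * step, group)
            for i, group in enumerate(groups)]

# Hypothetical policy names; ~5000 policies in a 6 AM to 10 PM window.
policies = [f"sql_policy_{n}" for n in range(5000)]
plan = stagger(policies, datetime(2014, 5, 17, 6, 0),
               window_hours=16, group_size=250)

for start, group in plan[:3]:
    print(start.strftime("%a %H:%M"), len(group), "policies")
```

A smaller group size gives more, shorter-spaced start times. The right value depends on how many concurrent jobs the media servers and STUs can actually sustain.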
07-21-2014 04:02 AM
2-month-old unresolved post....
It seems as if this environment has outgrown the infrastructure... more backups than servers and network can handle?
Have a look at this Best Practice session at Vision this year: