SQL Jobs are failing with timeout errors

onepranav · ‎05-09-2014

Hello Experts!

I have recently been facing an issue where thousands of DB jobs for my SQL DB Windows clients started failing with error 54 (child) / 2 (parent).

I tried some basic troubleshooting: limiting the number of jobs that run per client to 10, extending the overall backup window, limiting the number of jobs per STU to 10, and then limiting the jobs overall to use only 4 and then 3 STUs.

Environment -

1 Master: HP-UX 11.31, NBU 7505 [Client connect timeout: 300, Client read timeout: 3600, Media server connect timeout: 90]

6 Media: 3 x HP-UX 11.31, NBU 7505 + 3 x Red Hat 6.2, NBU 7505 [Client connect timeout: 300, Client read timeout: 3600, Media server connect timeout: 90]

Multiple Clients: Win 2003 R2 and Win 2008 R2 x64

The failed jobs do run on the second attempt and complete, but the idea is to get them working on the first attempt.

Maybe I am missing something basic here, but I am not sure.

Any help will be greatly appreciated.

Marianne · ‎05-12-2014

You forgot to mention the NBU version on the SQL clients.

Is there a firewall anywhere in the picture?

Have a look at this TN and see if anything may be relevant:
http://www.symantec.com/docs/TECH138071

Handy NetBackup Links

onepranav · ‎05-13-2014

Thanks for the reply Marianne. The agent versions are 7.5.

There is no firewall in that particular environment setup.

Michael_G_Ander · ‎05-20-2014

The CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT DWORD registry keys have helped me with a lot of timeout issues. The default of 300 (seconds) is often insufficient for database backups, and especially for restores.

Regards

Michael

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

onepranav · ‎05-21-2014

The Timeout is currently set to 1800 on the client and 3600 on the media servers.

NathanNieman · ‎05-21-2014

Windows 2008 has the firewall turned on. Have you tried adding rules on the Windows server to allow communication? This was an issue I ran into.

onepranav · ‎05-21-2014

Confirmed with our Windows team that an exception rule for the NetBackup ports already exists. While troubleshooting, I found that the failures happen randomly on Saturdays between 6 AM and 10 PM, which is when most of the jobs actually run. A job completes if it is rerun later on Sunday or Monday. I am coming to the conclusion that a resource crunch might be causing the timeouts. I also observed that the parent job gets initiated and fails, after which the child comes into the queue and then dies.
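One rough way to check the resource-crunch idea is to count failures per weekday and hour and see whether they cluster in the busy Saturday window. The snippet below is only a sketch: the sample records are made up for illustration, and in practice the (start time, status) pairs would have to come from an export of the job history.

```python
from collections import Counter
from datetime import datetime

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

# Hypothetical sample records: (job start time, exit status).
# Real data would come from an export of the job history.
SAMPLE_JOBS = [
    ("2014-05-17 06:15:00", 54),
    ("2014-05-17 06:40:00", 0),
    ("2014-05-17 13:05:00", 54),
    ("2014-05-17 13:30:00", 2),
    ("2014-05-18 09:00:00", 0),
]

def failures_by_slot(jobs):
    """Count non-zero exit statuses per (weekday name, hour) slot."""
    counts = Counter()
    for started, status in jobs:
        if status != 0:
            ts = datetime.strptime(started, "%Y-%m-%d %H:%M:%S")
            counts[(DAYS[ts.weekday()], ts.hour)] += 1
    return counts

if __name__ == "__main__":
    for (day, hour), n in sorted(failures_by_slot(SAMPLE_JOBS).items()):
        print(f"{day} {hour:02d}:00  failures={n}")
```

If the counts rise and fall with the number of jobs active in each hour, that would support the load theory over a firewall or configuration cause.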

NathanNieman · ‎05-21-2014

Is there any way to test by running on a different day, or by moving some of the other jobs to a different time? I know I fire off my fulls all weekend, starting Friday night and running until Sunday morning, to help with speed.

onepranav · ‎05-21-2014

I tried, but the problem is that I have ~5000 SQL policies along with other ones, and I have to accommodate them somewhere during the weekend. Wherever I tried to move them, they failed during their first runs. I later got them running in sub-groups somehow.
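To illustrate the sub-group idea, here is a minimal sketch of spreading policies evenly over start slots with a cap per slot. Everything in it is an assumption for illustration: the policy names, the slot labels, and the cap of 150 are made up, and it only produces a plan; it does not touch any NetBackup schedule.

```python
def stagger(policies, slots, max_per_slot):
    """Assign policies to start slots round-robin, respecting a per-slot cap."""
    if len(policies) > len(slots) * max_per_slot:
        raise ValueError("not enough slot capacity for all policies")
    plan = {slot: [] for slot in slots}
    for i, policy in enumerate(policies):
        # Round-robin keeps the slots within one policy of each other in size.
        plan[slots[i % len(slots)]].append(policy)
    return plan

if __name__ == "__main__":
    # Hypothetical inputs: 5000 policy names and 36 hourly start slots.
    policies = [f"sql_policy_{n:04d}" for n in range(5000)]
    slots = [f"slot-{h:02d}" for h in range(36)]
    plan = stagger(policies, slots, max_per_slot=150)
    busiest = max(len(group) for group in plan.values())
    print(f"{len(slots)} slots, busiest slot starts {busiest} policies")
```

The cap makes the capacity problem explicit: if the policies do not fit into the available slots at a load the servers can sustain, the function raises an error instead of silently overloading a slot.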

Marianne · ‎07-21-2014

2-month-old unresolved post....

It seems as if this environment has outgrown the infrastructure... more backups than servers and network can handle?

Have a look at this Best Practice session at Vision this year:

NetBackup 7.6 Best Practices: Optimizing Performance
