Solved: Need urgent solution: Backup failed with 24 - Page 2

2013 · ‎06-25-2014

Hello Team,

NBU Master Server: 7.5.0.7

Client: 7.5.0.7 / Client Name: test.123

Policy :policy.xxx

I have gone through all the Tech Notes and found that 2 mount points are causing the issue. I have raised logging to 5, created the touch file bpbkar_path_tr and run the backup. It fails with status 24.

Can someone please check the logs and suggest what needs to be excluded? So far there is no exclude list. I have attached the bpbkar logs and it seems there is some issue. Please can someone check...

2013 · ‎06-27-2014

Media Server:

Client connect timeout :9600
Client read Timeout :9600

-------------------------------------------------------------------------------------

Client :

Client read timeout :9600

Marianne · ‎06-27-2014

The above settings prove that the issue is not with any NBU timeout settings.

You need to work with OS and Firewall admins to find out where this 15 minute timeout is happening.

As per Martin's excellent post:

"NBU is the casualty in this issue, not the cause."

PS: Those timeouts are way too big and can cause backups to appear to hang for almost 3 hours before the NBU timeout kicks in.

There should be no reason for timeouts larger than 1800.
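
For reference, the arithmetic behind "almost 3 hours" can be checked in a shell. This is only a small sketch; secs_to_hm is a helper name made up for the illustration, not an NBU command.

```shell
# Convert a timeout in seconds to hours and minutes.
# secs_to_hm is a hypothetical helper, not part of NetBackup.
secs_to_hm() {
    printf '%dh%02dm\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 ))
}

secs_to_hm 9600   # the configured timeouts: prints 2h40m
secs_to_hm 1800   # the suggested upper limit: prints 0h30m
```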

Handy NetBackup Links

mph999 · ‎06-27-2014

The problem is that these are tuning settings: what works for one environment may not work for another. That is why we can't really provide a document, because it is unique to your environment.

The last 'case' I had that was along these lines, these are the settings that were found to work:

net.ipv4.tcp_max_orphans= 400000

net.core.netdev_max_backlog= 250000
net.ipv4.tcp_keepalive_time= 900
net.ipv4.tcp_keepalive_intvl= 30
net.ipv4.tcp_max_syn_backlog= 16384
net.ipv4.tcp_synack_retries= 1

However, the system these were applied to was 'probably' totally different from your environment, so you can see why I am reluctant to start dishing out values, and recommend that you work with the network / OS admins.
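
If you do experiment with values like these, it makes sense to record the current ones first so they can be put back. A rough sketch, assuming a Linux box with /proc/sys available; the helper names and the file name /root/sysctl.before are made up for illustration, and the keys are simply the ones listed above, not a recommendation.

```shell
# Keys from the list above. The values used elsewhere are NOT recommendations.
KEYS="net.ipv4.tcp_max_orphans net.core.netdev_max_backlog
net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_synack_retries"

# Hypothetical helper: map a sysctl key to its file under /proc/sys
# (dots become slashes).
sysctl_proc_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Hypothetical helper: print the current values in "key = value" form.
save_sysctl() {
    for key in $KEYS; do
        path=$(sysctl_proc_path "$key")
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" "$(cat "$path")"
        fi
    done
}

# Usage (changes need root):
#   save_sysctl > /root/sysctl.before            # note the old values
#   sysctl -w net.ipv4.tcp_keepalive_time=900    # try one change
#   sysctl -p /root/sysctl.before                # put the old values back
```

Changes made with sysctl -w do not survive a reboot unless they are also written to the sysctl configuration, which is useful while testing.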

2013 · ‎06-27-2014

Strange thing is, I initiated a Differential backup which completed successfully after 4 attempts, but the Full backup fails after 15 min...

Attempt 1 took 15 min

Attempt 2 took 15 min

Attempt 3 took 15 min

Attempt 4 took 1 hr and the backup completed.

mph999 · ‎06-27-2014

Smaller amount of data means less network traffic and less load on the network, I suspect.

2013 · ‎06-29-2014

Can you please suggest which parameters need to be checked on the OS and networking side? The customer says other systems are working with the same configuration, so now I need to be very specific about what needs to be checked... The system is a Linux box.

Marianne · ‎06-30-2014

We cannot say as the error is not caused by NBU.

Your customer needs to understand:

NBU is the casualty. Not the cause.

Your customer will have to investigate to see 'what is different' on this Linux box.

A couple of years ago we had a similar situation where backups kept on failing for certain Linux clients - especially over weekends when large, full backups were running.

Troubleshooting from the NBU point of view did not reveal anything, and we got no co-operation from the OS team.
We simply VPN'ed in over a weekend and resumed the backup every time we saw status 24 (with checkpoint restart we got a bit further each time...)

At some point, these problematic Linux clients were replaced with new hardware and, obviously, a newer OS version.
The status 24s magically disappeared.

2013 · ‎06-30-2014

OK, thanks for the update. I checked with the customer and, according to them, nothing changed on the OS side. Anyway, thanks all for the help.

mph999 · ‎07-01-2014

Do you have a support agreement with Symantec?

If so, log a call and ask to run AppCritical between the media server and the client, and then between the client and the media server (it only goes in one direction, hence why it is run twice).

If not, there is a free alternative to AppCritical. Hopefully someone on here will know what it is called, as for the life of me I can't remember.

The problem with this is: how many possible causes would you like?

Faulty hardware (including cables)

Drivers or firmware on any of the hardware involved (e.g. NIC, switch)

Faulty ports on switch

Firewalls / Routers

OS settings (we have discussed some, did you try the ones I posted up ?)

** If you change them, make a note of what they were. If the new settings don't fix or improve things, put the old values back, or else we could start introducing more faults that cause the same symptoms, which will be very, very hard to sort out. **

I've even seen an error where the network card wouldn't send a particular type of data, all was ok until it hit a certain file, and just wouldn't send it ...think it was a .tar file, can't remember the fix though I think it was hardware related.

When the issue happens again, at the 'exact' time of failure, run netstat -a, save the output to a file and attach it to this thread.
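
One way to grab it quickly is sketched below. It is only a sketch: snap_netstat is a made-up helper name, and /tmp as the default directory and the file naming are arbitrary choices.

```shell
# Save "netstat -a" output to a time-stamped file so it can be attached later.
# snap_netstat is a hypothetical helper; the first argument is the output
# directory (defaults to /tmp).
snap_netstat() {
    out="${1:-/tmp}/netstat_$(uname -n)_$(date +%Y%m%d_%H%M%S).txt"
    netstat -a > "$out" 2>&1
    echo "$out"
}

# Run on the client (and ideally on the media server) as soon as the job fails:
#   snap_netstat
```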

Did this client ever work, and if so, when did it stop working?

If so, what was changed on the day it stopped working? It's 99.9% certain that something was, and this could be the key to finding the cause. Your customer is going to have to try and remember, because I really don't believe nothing changed.

If none of the above leads to anything, then I can only think to get a tcpdump of the interface whilst the backup is running, until it fails, and then look at it in Wireshark (free). For that, you will really need to find a network type person who is used to looking at tcpdumps. From that, it should be possible to see why it fails, or at the very least narrow it down.
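
A capture along these lines could be a starting point. It is a sketch under assumptions: the interface name, the media server host name and the output path are placeholders, capture_cmd is a made-up helper that only prints the command for review, and the header-only snap length and the ring of 10 files of roughly 100 MB each are arbitrary choices to stop a full backup from filling the disk.

```shell
# Build a tcpdump command line for a rolling capture of the traffic between
# this client and one peer. capture_cmd is a hypothetical helper: it only
# prints the command, so it can be checked before being run as root.
capture_cmd() {
    iface=$1; peer=$2; out=$3
    # -s 128 : keep only the first 128 bytes of each packet (headers)
    # -C 100 : start a new file after about 100 million bytes
    # -W 10  : keep at most 10 files, overwriting the oldest
    echo "tcpdump -i $iface -s 128 -C 100 -W 10 -w $out host $peer"
}

capture_cmd eth2 mediaserver.example /tmp/nbu_status24.pcap
```

Stop the capture as soon as the job fails so the ring still contains the packets from around the failure.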

Marianne · ‎07-02-2014

I was smiling when Nicolai shared this TN today:

Overview of NetBackup performance testing
http://www.symantec.com/docs/TECH147296

mph999 · ‎07-02-2014

Not seen that one before. Useful, though I would disagree 100% that NBU tuning can cause a status 24, because I've never seen it. I once made a determined effort to 'force' a status 24 by the 'mis-use' of buffer settings and couldn't do it; all I could get was various degrees of poor performance, but no failure.

2013: as a matter of interest, on a backup that fails, but before it actually fails, approximately how fast is it going?

RonCaplinger · ‎07-02-2014

Another source of these problems could be the TCP "ring buffers" and TCP offloading. We tried using some Cisco UCS blades as media servers last year and had to switch back to physical hardware when we had intermittent status 24s and 2074s. A couple of the steps we tried were to increase the TCP ring buffers to their max value, 4096, and to disable all TCP offloading.

Here's a link describing how the ring buffers in Linux work:

http://www.linuxjournal.com/content/queueing-linux-network-stack
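
For anyone wanting to see what is currently offloaded before turning anything off, something like the following could be used. It is a sketch: eth2 is a placeholder interface, offloads_on is a made-up helper name, and which offloads matter will depend on the NIC and driver.

```shell
# List the offload features that "ethtool -k" reports as on.
# offloads_on is a hypothetical helper; it reads ethtool output on stdin.
offloads_on() {
    awk -F': ' '$2 ~ /^on/ { gsub(/^[ \t]+/, "", $1); print $1 }'
}

# Usage (eth2 is a placeholder):
#   ethtool -k eth2 | offloads_on
# Turning off some common ones (as root, not persistent across a reboot):
#   ethtool -K eth2 tso off gso off gro off
```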

2013 · ‎07-03-2014

mph999: Before the backup fails, the speed is very good, but then it gets stuck at some point and within 15 min it throws error 24. I have re-installed the client binaries, thinking the binaries might be corrupt, but no luck.

RonCaplinger: I will check with the Linux guy about the TCP "ring buffers" and TCP offloading.

Thanks for your efforts.I will update you soon on this.

mph999 · ‎07-03-2014

Example of how to increase 'ring buffer'

ethtool -G eth2 rx 4096
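
To see the current and maximum values first (and to have something to put back), a rough sketch follows. ring_rx is a made-up helper name and eth2 is just the interface from the example; the maximum is whatever the NIC reports under "Pre-set maximums", so 4096 may not apply to every card.

```shell
# Print the pre-set maximum and the current RX ring size from the output of
# "ethtool -g". ring_rx is a hypothetical helper that reads stdin.
ring_rx() {
    awk '/^Pre-set maximums:/          { sec = "max" }
         /^Current hardware settings:/ { sec = "current" }
         /^RX:/                        { print sec, $2 }'
}

# Usage (as root for the -G lines):
#   ethtool -g eth2 | ring_rx     # note the "current" value before changing
#   ethtool -G eth2 rx 4096       # raise the RX ring
#   ethtool -G eth2 rx <old>      # put the noted value back if it does not help
```

A value set with ethtool -G is lost at reboot unless it is also added to the interface configuration, and on some drivers the change briefly resets the link, so it may be worth doing outside the backup window.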

2013 · ‎07-03-2014

As it's a production server, I just need to know whether it is safe to run the command below on Linux. I also need to know what the default value of the ring buffer is.

ethtool -G eth2 rx 4096

Is 4096 the highest value? If the issue is not resolved, can we put the default value back?

Also, do I need to disable TCP offloading? Is it safe, as it's a prod box...

Please suggest...

2013 · ‎07-03-2014

Thanks all... Finally the issue is fixed... ethtool -G eth2 rx 4096 ... saved my life...
