
NetBackup failing with status code 24 on Linux clients

We are using NetBackup.

 

Backups ran successfully for a long time, but for the past two days they have been failing with status code 24.

 

The error detail shows: "socket write failed".

 

Can anyone suggest what to do and how to resolve this issue?

1 Solution

Accepted Solutions

Socket write failed could be anything from the back end of the server to the client, or OS TCP settings.

The one thing it won't be is NBU.

If it was working, find out what has changed. This could be settings in the OS, settings on network switches, or a fault. Have any patches been applied to the media server?

There are virtually no network settings in NBU, because we only make use of what we are given. Yes, network issues will make NBU fail, but NBU is only the casualty, not the cause.

There are some timeouts, e.g. client connect, but these should really be used to troubleshoot (e.g. does it work if the value is increased?) and not treated as a fix, as they only mask the problem.

You need to consider whether any clients work when the backup fails. Does it fail at different points? Under network load? Try running a single job: does that work, or perhaps run for longer?
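The timeout diagnostic above can be sketched as a quick check of bp.conf. This is a hedged sketch: it assumes a Linux host with NetBackup installed at the default path, and raising a timeout found here is a diagnostic step, not a fix.

```shell
# Hedged sketch: list any explicit timeout overrides in bp.conf.
# Assumption: default Linux install path; entries such as CLIENT_CONNECT_TIMEOUT
# would appear here only if someone has overridden the defaults.
BPCONF=/usr/openv/netbackup/bp.conf
grep -i 'TIMEOUT' "$BPCONF" 2>/dev/null || echo "no timeout overrides set"
```

If no overrides are listed, the compiled-in defaults apply, which makes "what changed?" more likely an OS or network question than an NBU one.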

 

 


13 Replies

I have seen this with old versions of Linux and big filesystems on the Linux machines.

We could never find the reason, but the problem disappeared when the clients were migrated to new hardware with newer version of Linux. 

The issue is most of the time on the client side. Try a reboot; that will refresh NIC drivers, etc.
Look for most recent OS updates on clients.
If problem is with backups being done by a particular media server, follow the same steps as suggested above.

mph999 has listed some reasons for status 24:

NetBackup Status Code 24 - Possible Parameters to Check 


Start with basic connectivity tests using ping and nslookup on all machines, bpclntcmd on clients, and bptestbpcd from the master and media server(s).
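A minimal sketch of those checks is below. CLIENT is a hypothetical name and the NetBackup paths assume a default Linux install, so each NBU call is guarded and simply skipped where the binaries are absent.

```shell
# Hypothetical client name; substitute a real failing client.
CLIENT=linuxclient01
NBU=/usr/openv/netbackup/bin

# Forward lookup through NSS (the same resolver path the OS uses):
getent hosts "$CLIENT" || echo "no forward lookup for $CLIENT"

# On the client: check how NetBackup itself resolves the name:
[ -x "$NBU/bpclntcmd" ] && "$NBU/bpclntcmd" -hn "$CLIENT" || true

# From the master/media server: NBU-level connection test to the client:
[ -x "$NBU/admincmd/bptestbpcd" ] && "$NBU/admincmd/bptestbpcd" -client "$CLIENT" || true

echo "connectivity checks attempted"
```

Run the lookups on every host involved (client, media, master) and in both directions; a mismatch between forward and reverse resolution is a classic cause of connection-level failures.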

Have had similar issues where the solution was to restart the client, including the PBX service. In those cases the culprits were security patches/antivirus software upgrades.

The standard questions: Have you checked 1) what has changed, 2) the manual, 3) whether there are any tech notes or VOX posts regarding the issue?

Here we have 20 clients failing across 3 master servers. Some filesystems are backing up and some are failing, and this scenario occurs on individual hosts.

 

Help would be highly appreciated.

You need to concentrate on the failing clients.
Check NBU version and OS versions and then search for network-related issues specific to these versions. (As per Martin's post - hardly ever NBU).

Check if OS patches are up to date. 
Check KeepAlive settings.
Check OS resources while backups are running.
If problem is seen with large filesystems, see what happens if you try and list all files/folders with 'ls -lR'.
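The KeepAlive check above can be sketched as reading the kernel tunables straight from /proc. This assumes a Linux client; these are standard kernel settings, not NBU parameters.

```shell
# Read the current TCP keepalive tunables on a Linux host.
# Defaults are typically 7200 / 75 / 9; a long tcp_keepalive_time can let a
# quiet control connection be dropped by intermediate devices mid-backup.
for t in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
  printf '%s = %s\n' "$t" "$(cat /proc/sys/net/ipv4/$t)"
done
```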

Try to reboot the clients as per my previous suggestion.

As per my post above - we saw this issue at a specific customer with old 32-bit Linux clients and large filesystems.
All we could do was to ensure checkpoints were enabled in policies and resume the backup each time it failed.
The problem disappeared when client servers were replaced with 64-bit machines and newer OS (was RHEL 5.x at the time).

What is different about the failing filesystems? Are they large? Do they have lots of files?

 

We don't have large filesystems, but we do have a large number of files. In my career in backup operations I have never faced this type of error. We have checked the network speed between the clients and the switch, and between the master and the switch, and we are good there. Our servers run RHEL 5 and 6.

As it is a large network we don't have access to everything, but we are trying the most that we can do.

We are unable to trace the issue; anyone who can help us with commands will be most appreciated.

As we have a large number of financial applications running on the servers, they cannot be rebooted at short notice.

Thanks for the support; kindly advise.

Thanks in advance. :)

You need network and OS admins to assist you with troubleshooting. NBU is reporting the issue, not causing it. All I can suggest is that you keep resuming the failed backups if checkpoints are enabled in the policies.

Pick one client and troubleshoot that. Use a test backup if necessary and run it multiple times to get the pattern of failure, e.g. always at the beginning, at the end, or just at random points.

Commands: netstat is a start.
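That netstat starting point can be sketched as watching TCP retransmission/error counters while a test backup runs. netstat -s is the traditional tool; the /proc line below is the raw source and works even where netstat is not installed.

```shell
# Snapshot TCP-level statistics; compare counts before and after a failing job.
# A sharply rising retransmission count points at the network path, not NBU.
command -v netstat >/dev/null && netstat -s | grep -i retrans || true
awk '/^Tcp:/ { print }' /proc/net/snmp
```

Take one snapshot before the test backup and one after; the delta in the retransmit counters is what matters, not the absolute numbers.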

Where in the Activity Monitor details does it fail? I will guess between bptm on the media server and bpbkar on the client. Once we know this we can say which NBU logs are involved, but please note that the logs are unlikely to show anything more than we already know; however, we should check just to confirm.

There aren't really any NetBackup commands that will help. If name resolution is correct and bptestbpcd -client <client name> works (which only tests basic connectivity), then that's about it from the NBU side.

Are there any firewalls? These can cause all sorts of problems.

What is common about the affected masters and clients? Are they on the same networks, or are they totally separate, even using different switches? When a working environment suddenly breaks across masters and clients, it's quite possibly environmental.

There are some network-related settings in Linux that can cause status 24s. I'll have to dig them out; I came across them when an appliance upgrade caused 24s, and I think Engineering came up with some recommended values. I guess we could try those; if nothing else we would then know that the media servers have good settings. I am presuming the media servers are Linux, is that correct?

 

 

 

I think I found the TN that Martin was referring to:

http://www.symantec.com/docs/TECH224031 

Ignore the reference to Appliances - have a look at tcp_timestamps that is mentioned in the doc.

Check all problematic Linux clients (and media server if that is Linux).
Ensure this line exists in /etc/sysctl.conf :
net.ipv4.tcp_timestamps = 1
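You can read the running value directly; note that an entry in /etc/sysctl.conf only takes effect after `sysctl -p` (or a reboot).

```shell
# Current value of the tunable (1 = TCP timestamps enabled, the value the
# technote recommends; 0 = disabled).
cat /proc/sys/net/ipv4/tcp_timestamps
```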

Oops, forgot about this ...

These were the settings I was thinking of ...

net.ipv4.tcp_max_orphans = 400000
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 1
 

Also increase the ring-buffer RX setting (eth2 in this example):

ethtool -G eth2 rx 4096

 

It would be wise to make a note of the existing values before changing anything, and if there is no improvement, I would be inclined to set them back to what they were.
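Recording the current values first can be sketched like this (Linux assumed; the sysctl keys are read via their /proc paths):

```shell
# Save the current values of the tunables above to a file so they can be
# restored if the change brings no improvement.
KEYS="net.ipv4.tcp_max_orphans net.core.netdev_max_backlog
net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_synack_retries"
for k in $KEYS; do
  # Convert the sysctl key to its /proc path, e.g. net.ipv4.x -> net/ipv4/x
  printf '%s = %s\n' "$k" "$(cat "/proc/sys/$(echo "$k" | tr . /)")"
done > sysctl-before.txt
cat sysctl-before.txt
```

The resulting file is in sysctl.conf format, so restoring is just feeding those lines back to sysctl. The current ring-buffer settings can likewise be captured with `ethtool -g eth2` before running the `-G` change.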

Got a solution for this issue.

 

Working with the network team, we checked the network speed from the switch to the client and to the master.

 

We also reduced the load on the master server; I feel this was one of the main causes.

 

Thanks to everyone who helped me in solving the problem.

mph999 mentioned switch settings on 6 March.

I will mark his post as solution.

Best if you mark the Solution yourself, but I have asked you via PM so many times....