Solved: Codes 13 and 24 from Solaris server

Wile_E__Coyote · ‎06-14-2010

We have about 400 clients in our backup environment. Out of the 400, this one lone Solaris 10 client keeps intermittently failing with status codes 13 and 24. It fails more often than it succeeds, and fails most often on full backups. It is the only client currently failing with 13 and/or 24 error codes.

Symantec Support has been provided with all the logs from the master, media, and client, and their only suggestion is to increase some timeout setting on the media servers.

Obviously, all that did was cause the job to take longer to fail.

Support has nothing further to offer us, except to increase the timeout even more.

jim_dalton · ‎06-17-2010

Mister Coyote,
You need to do some trial and error and rule things in or out. It won't take too long to figure out.
First, narrow down the backup to reduce the data being backed up. Start with minimal mount points or dirs and gradually increase.
It could be a data-related issue... I've had a client with hundreds of thousands of directories, and this upsets NetBackup.
Use the Unix find command to explore... if find takes ages to run, then NetBackup will be similar.
If you can run test backups successfully on subsets of data, look into your data.
Try this: run bpbkar manually on the client only and throw the data into /dev/null, backing up the same data as in the policy. If this works, then it's most likely the network or the master server.
If it fails, it's likely to be data related.
Try this: try copying (let's say with scp) from client to server, i.e. get rid of NetBackup altogether. If this fails, it's probably a network issue. I've seen error 13 for precisely this problem.
Soon you'll find you've ruled out a component and you'll be well on your way to isolating the problem.

Jim
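The first two checks Jim describes can be sketched roughly as below. This is only an illustration, not a vendor procedure: it assumes a POSIX-style shell such as ksh or bash, the function names are made up for this note, and the bpbkar flags are the ones commonly quoted for a "backup to /dev/null" test, so verify them against your NetBackup version before relying on them.

```shell
# Count directories and files under a path. Run it under `time` to see
# whether a plain tree walk is already slow on this data.
survey_tree() {
    dir=$1
    ndirs=$(find "$dir" -type d 2>/dev/null | wc -l | tr -d ' ')
    nfiles=$(find "$dir" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo "$dir dirs=$ndirs files=$nfiles"
}

# Read the same data with bpbkar and discard it, so that no network or
# media server is involved. The binary path and the flags are assumptions;
# override BPBKAR if your install lives elsewhere.
null_backup() {
    dir=$1
    bpbkar=${BPBKAR:-/usr/openv/netbackup/bin/bpbkar}
    if [ ! -x "$bpbkar" ]; then
        echo "bpbkar not found at $bpbkar; skipping" >&2
        return 2
    fi
    "$bpbkar" -nocont -nofileinfo -nokeepalives "$dir" \
        > /dev/null 2> /tmp/bpbkar_null.err
}

# Example (hypothetical path):
#   time survey_tree /export/home
#   null_backup /export/home; echo "bpbkar exit: $?"
```

If the null backup is clean but the real job still fails, that points away from the data and toward the network or the servers, which is the split Jim is suggesting.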


Marianne · ‎06-14-2010

Please provide the following info:
NetBackup version on master, media and client?
Is the client behind a firewall?
Does any data actually get transferred or does the job fail without any data written?
Is the client on the same or a different subnet as other (working) clients?
Check tracert between media server and problematic client. Compare with working clients.
Confirm that all connections are full duplex - media server, client, switch.

Test data transfer between client and media server - test with a data file of at least 1 GB.
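One way to run that transfer test outside NetBackup is sketched below. The helper names and the host name in the usage comments are placeholders, not anything from this thread, and the sketch assumes a ksh or bash style shell with dd and cksum available.

```shell
# Create a test file of SIZE_MB megabytes filled with zeros.
# Usage: make_test_file PATH SIZE_MB
make_test_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Print "checksum:bytes" for a file so both ends of a copy can be compared.
file_sum() {
    cksum < "$1" | awk '{ print $1 ":" $2 }'
}

# Example run on the client ("mediaserver" is a hypothetical host name):
#   make_test_file /var/tmp/nbu_test.dat 1024
#   file_sum /var/tmp/nbu_test.dat
#   time scp /var/tmp/nbu_test.dat mediaserver:/var/tmp/
# Then run file_sum on the copy on the media server and compare.
```

A zero-filled file compresses very well, so leave compression off in the copy tool, or the timing will look better than the link really is.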

Ask the support engineer to escalate call to a higher level...

Handy NetBackup Links

Wile_E__Coyote · ‎06-14-2010

Client, master and media are all verified at Netbackup 6.5.5.

All hosts involved are on the same subnet. No firewalls. Names and IPs resolve correctly both forward and reverse.

We've checked duplex settings 100 times. Everything looks right.

Support has closed the case.

rjrumfelt · ‎06-14-2010

on the client from one of the failed backups?

Wile_E__Coyote · ‎06-14-2010

There are no errors in the bpcd logs on the client.

The only errors are in the bpbkar logs:

18:02:43.417 [21755] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
18:02:43.437 [21755] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
18:02:43.437 [21755] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
18:02:43.437 [21755] <2> bpbkar Exit: INF - Close of stdout complete
18:02:43.437 [21755] <4> bpbkar Exit: INF - setenv FINISHED=0
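For anyone sifting through a large bpbkar log for lines like these, a small filter along the following lines may help. It is only a sketch: the function names are invented here, it assumes each log entry sits on a single line, and it relies on nothing more than what the excerpt shows, namely that the ERR entries carry the <16> tag and that the final status appears as "EXIT STATUS nn".

```shell
# Print only the entries tagged <16>, which are the ERR lines in the
# excerpt above. Accepts one or more log files.
bpbkar_errors() {
    grep '<16>' "$@"
}

# Print the numeric status from "INF - EXIT STATUS nn: ..." lines.
bpbkar_exit_status() {
    sed -n 's/.*INF - EXIT STATUS \([0-9][0-9]*\):.*/\1/p' "$@"
}

# Example (the log directory is the usual legacy location, but treat the
# path and file name as assumptions for your install):
#   bpbkar_errors /usr/openv/netbackup/logs/bpbkar/log.061410
#   bpbkar_exit_status /usr/openv/netbackup/logs/bpbkar/log.061410
```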

Marianne · ‎06-14-2010

Just repeating a couple of questions:
NetBackup version on master, media and client?
Does any data actually get transferred or does the job fail without any data written?
Test data transfer between client and media server - test with a data file of at least 1 GB. (Use ftp)

Also check patch level of Solaris on the client - 'uname -a' will confirm kernel patch level.

Two TechNotes:
http://seer.entsupport.symantec.com/docs/336452.htm
http://seer.entsupport.symantec.com/docs/271200.htm


Wile_E__Coyote · ‎06-14-2010

To reiterate:

Client, master and media servers are all at 6.5.5.
The backup sometimes completes successfully (so yes, data gets transferred).
There is nothing wrong with basic data transfer between the client and the media servers. We have done the "large file transfer" test six ways from Sunday in trying to diagnose this problem.

SunOS dev2 5.10 Generic_139555-08 sun4u sparc SUNW,Sun-Fire-V210

Symantec support does not think it is a patch problem.

I have checked the settings listed in doc 271200, and they are all correct.
$ sudo rsh uxrd602 ndd -get /dev/tcp tcp_ip_abort_interval
480000
$ sudo rsh uxrd602 ndd -get /dev/tcp tcp_rexmit_interval_initial
3000
$ sudo rsh uxrd602 ndd -get /dev/tcp tcp_rexmit_interval_min
400
$ sudo rsh uxrd602 ndd -get /dev/tcp tcp_rexmit_interval_max
60000

I have checked the settings listed in doc 336452 and they are all correct too:
$ sudo ndd -get /dev/tcp tcp_recv_hiwat
2621440
$ sudo ndd -get /dev/tcp tcp_xmit_hiwat
2621440
$ sudo ndd -get /dev/tcp tcp_max_buf
1073741824
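Checking these one at a time gets tedious across several hosts, so a loop like the following could do the comparison in one pass. It is a sketch under assumptions: the baseline numbers are simply the values reported in this post (taken to match the two TechNotes, as stated above), the function names are invented, and it needs a ksh or bash style shell plus enough privilege to run ndd.

```shell
# Report whether one parameter matches its expected value.
# Usage: compare_param NAME EXPECTED ACTUAL
compare_param() {
    if [ "$2" = "$3" ]; then
        echo "OK   $1=$3"
    else
        echo "DIFF $1=$3 (expected $2)"
        return 1
    fi
}

# Read each TCP tunable with ndd and compare it with the baseline below.
# Returns non-zero if any parameter differs.
check_tcp_params() {
    status=0
    while read name expected; do
        [ -z "$name" ] && continue
        actual=$(ndd -get /dev/tcp "$name" 2>/dev/null)
        compare_param "$name" "$expected" "$actual" || status=1
    done <<EOF
tcp_ip_abort_interval 480000
tcp_rexmit_interval_initial 3000
tcp_rexmit_interval_min 400
tcp_rexmit_interval_max 60000
tcp_recv_hiwat 2621440
tcp_xmit_hiwat 2621440
tcp_max_buf 1073741824
EOF
    return $status
}

# Example: run as root (or via sudo) on the client:
#   check_tcp_params
```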

Marianne · ‎06-14-2010

Apologies - missed the version info.
Thanks for supplying rest of info...

Have a look at this TechNote:
http://seer.entsupport.symantec.com/docs/350695.htm


Wile_E__Coyote · ‎06-14-2010

Thanks Marianne, but unfortunately I don't see how that one applies to my situation.

We're not using the nxge driver on the client, and the servers work fine for the other 399 clients.

Codes 13 and 24 from Solaris server