Re: status code 41 network connection timed out

bumj1 · ‎08-25-2008

Hi All,

I need your help on this one. I am running NBU 6.5.1.

I have one Linux client that fails with status code 41 (network connection timed out) whenever a full backup runs on it. The failure occurs after approximately 200 GB of data has been backed up. Cumulative backups run successfully.

Any clue as to why this is only failing on Full backups and only after 200+ GB has been backed up?

Thanks!

Omar_Villa · ‎08-25-2008

It could be a timeout, or a problem with one of your media: a job can fail after running for some time over a bad tape or a bad drive. Check for any disconnects between your client and media server, and also try sending big packets to the media server. This can help:

ping -s -i <nic> -p bpcd <media server> 65000

This will send a 65 KB packet through your backup NIC, connecting with the bpcd port. If you see some disconnects, then you have a network quality problem, which can result in a 41 error.
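For a Linux client like the one in this thread, the large-packet test above can be sketched roughly as follows. This assumes the iputils `ping` shipped with most Linux distributions (`-c` count, `-s` payload size, `-I` interface); the helper names `ping_loss_pct` and `big_ping` are made up for illustration. Note that this only exercises the network path with ICMP and does not talk to bpcd itself.

```shell
# ping_loss_pct: read ping output on stdin and print the packet-loss
# percentage from the summary line (for example "0" or "12.5").
ping_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# big_ping HOST [SIZE] [COUNT] [INTERFACE]  (hypothetical helper)
# Sends COUNT large ICMP echo requests to HOST and prints the loss percentage.
big_ping() {
    host=$1
    size=${2:-65000}   # payload size in bytes, large enough to force fragmentation
    count=${3:-20}
    nic=${4:-}         # optional: name of the backup interface
    if [ -n "$nic" ]; then
        ping -c "$count" -s "$size" -I "$nic" "$host"
    else
        ping -c "$count" -s "$size" "$host"
    fi | ping_loss_pct
}
```

Something like `big_ping <media server> 65000 50 <nic>` printing anything other than 0 would point toward the network quality problem described above.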

let us know.

regards

bumj1 · ‎08-26-2008

OK, our client on this server was at 5.0, so we upgraded to 6.5 and a similar problem occurred. The backup job on one particular directory errored out with error 41 just like before. I was actually watching it and was able to resume the job through the console. The backup ended up completing with status 0. I'm not sure if the job would have resumed itself or not.

The parent job errored out with 200 (scheduler found no backups due to run). So, I believe that I have backed up the data I need, but the client didn't communicate back to the server to report a successful status.

The backup seems to be erroring out when it is very close to completion (data size wise). I am still looking into this; if anyone has any tips, I would appreciate it.

Nathan_Kippen · ‎08-26-2008

Is it always erroring out around the same time? Increase your verbosity with bpbkar (I think that's the one) and see what file(s) the backup is getting caught up on.

bumj1 · ‎08-28-2008

So, the bpbkar log tells me where this backup is stopping, but it doesn't tell me why. Interestingly enough, the directory that it continually stops at is empty.

The job does resume by itself and finishes with status 0, but it is strange that this continues to occur.

Next steps are to exclude that directory and see if the backup continues or errors out on the directory before this.

dami · ‎08-28-2008

If this is a consistently occurring error (which usually means that the server has timed out waiting for data from the client), you may be able to resolve it by increasing your "Client Read Timeout" value (Host Properties->Master Server->Timeouts). The default is 300s (5 mins) and this is not always enough. Details may show in a verbose 5 bpbkar log, and you can also enable what's called a bpbkar trace. First I would try with a larger timeout, say 900s.
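To check what timeout is actually in effect on a UNIX host, a minimal sketch is shown here. It assumes the setting is stored as a `CLIENT_READ_TIMEOUT = <seconds>` line in `/usr/openv/netbackup/bp.conf`; the helper names `bpconf_get` and `client_read_timeout` are hypothetical, and the fallback of 300 is simply the default quoted above.

```shell
# bpconf_get KEY [FILE]: print the value of the first "KEY = value" line
# in a bp.conf-style file; returns non-zero if KEY is not present.
bpconf_get() {
    awk -F'=' -v want="$1" '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
            if (key == want) { print val; found = 1; exit }
        }
        END { exit found ? 0 : 1 }
    ' "${2:-/usr/openv/netbackup/bp.conf}"
}

# client_read_timeout [FILE]: print the configured timeout in seconds,
# or 300 (the default mentioned in this thread) if no entry exists.
client_read_timeout() {
    bpconf_get CLIENT_READ_TIMEOUT "$@" 2>/dev/null || echo 300
}
```

Running `client_read_timeout` on each host involved in the job is a quick way to confirm that a change made through the console really landed where expected.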

bumj1 · ‎08-28-2008

I did previously increase the timeout to 30 minutes.

dami · ‎08-28-2008

It will be interesting to see what happens if you exclude (or physically move if you can) the offending directory. Will the backup complete or stop further down? I once saw something like this years ago (NBU 3.4), which I think was caused by a bad NIC config setting (duplex? not sure). I doubt that the same applies here.

dami · ‎08-28-2008

Back then we used this to further debug bpbkar (I have no idea if these work above 5.x):

IV. Other files that can be created to get advanced logging features:
The following touch files can be used to get additional logging from the NetBackup daemons. These require that VERBOSE=<level> be added to the bp.conf on the server and the appropriate log directories exist under /usr/openv/netbackup/bin/logs.

/usr/openv/netbackup/bpbkar_path_tr - Log additional entries when bpbkar selects files to backup.

From: http://seer.entsupport.symantec.com/docs/243778.htm
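As a rough sketch of the steps in that note, the setup could be scripted as below. The function name `enable_bpbkar_path_trace` is hypothetical, both directories are passed as arguments rather than hard-coded, and the per-process `bpbkar` log subdirectory is an assumption; as noted above, it is not confirmed that the touch file still works on releases after 5.x.

```shell
# enable_bpbkar_path_trace NBU_DIR LOG_DIR  (hypothetical helper)
#   NBU_DIR: NetBackup directory that holds bp.conf and the touch file
#   LOG_DIR: directory under which the bpbkar log directory should exist
enable_bpbkar_path_trace() {
    nbu_dir=$1
    log_dir=$2
    # bpbkar only writes a log if its log directory already exists
    mkdir -p "$log_dir/bpbkar" || return 1
    # the touch file itself: its presence is what enables the extra entries
    touch "$nbu_dir/bpbkar_path_tr" || return 1
    # the note also requires VERBOSE=<level> in bp.conf; only warn, do not edit
    if ! grep -q '^ *VERBOSE' "$nbu_dir/bp.conf" 2>/dev/null; then
        echo "note: no VERBOSE entry found in $nbu_dir/bp.conf" >&2
    fi
}
```

A call would look like `enable_bpbkar_path_trace /usr/openv/netbackup <log directory>`, passing whichever log directory your installation actually uses. Remove the touch file again once the problem is found, since the extra logging can grow quickly on a 200 GB backup.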

bumj1 · ‎08-28-2008

Yes, we are going to exclude that directory, and upgrade the client to 6.5.1 (it is at 6.5 currently) just to be equal with our master, and if the logs still tell me nothing or if the backup just stops further up the chain I may open a call with support just for giggles.

Andy_Welburn · ‎08-28-2008

@bumj1 wrote:
I may open a call with support just for giggles.

You may even find that it's not the empty directory it's getting stuck on - it could be the last element of the previous one that it's having issues with.

We had a similar issue some time ago (unfortunately can't remember the specifics anymore) where a backup appeared to be 'stuck' at a certain file/directory so we excluded it. It then 'stuck' on the next one in the chain & so on. It wasn't until we excluded the one prior to it 'sticking' that the backup completed. Really can't remember what the final issue was altho' Linux client & sparse files rings a bell - but that may have been another story!!!

***Edit***

Have been 'reliably' informed that the issue was with the 'wtmp' file, which on older Linux systems (certainly ours at the time) was a ~1 TB sparse file. We deleted this & logged back on to recreate it as a 'normal' file.

Message Edited by Andy Welburn on 08-28-2008 04:32 PM
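To check whether a file such as 'wtmp' is sparse before excluding directories one by one, a small sketch like the following may help. It assumes GNU `stat` (as found on Linux), and `is_sparse` is a made-up helper name: it compares the file's apparent size with the space actually allocated on disk.

```shell
# is_sparse FILE: succeed (exit 0) if FILE occupies less disk space than
# its apparent size, which is the signature of a sparse file.
# Returns 2 if FILE cannot be examined.
is_sparse() {
    # %s = apparent size in bytes, %b = allocated blocks, %B = bytes per block
    info=$(stat -c '%s %b %B' -- "$1" 2>/dev/null) || return 2
    set -- $info
    [ $(( $2 * $3 )) -lt "$1" ]
}
```

For example, `is_sparse /var/log/wtmp && echo "sparse"` flags the kind of file described above. Compressing filesystems can also report fewer allocated bytes than the apparent size, so treat a positive result as a hint rather than proof.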
