08-25-2008 06:16 AM
Hi All,
I need your help on this one. I am running NBU 6.5.1.
I have one Linux client that fails with status code 41 (network connection timed out) whenever a full backup runs on it. The failure comes after approximately 200 GB of data has been backed up. Cumulative backups run successfully.
Any clue as to why this is only failing on Full backups and only after 200+ GB has been backed up?
Thanks!
08-25-2008 01:07 PM
It can be a timeout, or a problem with one of your media: the job can fail after running for some time once it hits a bad tape or a bad drive. Check for any disconnects between your client and media server, and also try sending big packets to the media server. This can help:
ping -s -i <nic> -p bpcd <media server> 65000
This will send a 65 KB packet through your backup NIC to the bpcd port. If you see disconnects, then you have a network quality problem, which can result in a 41 error.
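The syntax above looks like Solaris ping. On a Linux client a rough equivalent would use the iputils flags `-c` (count), `-s` (payload size) and `-I` (interface). As far as I know Linux ping has no port option, so this only exercises the link with large ICMP packets, not bpcd itself. A minimal sketch follows; the interface name `eth1` and the host name `mediasrv` are placeholders, and the helper `loss_pct` is just an illustrative name:

```sh
#!/bin/sh
# Extract the packet-loss percentage from the summary line of ping output.
# Reads ping output on stdin and prints only the number (e.g. "2").
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example run (assumed names: interface eth1, media server "mediasrv"):
#   ping -c 100 -s 65000 -I eth1 mediasrv | loss_pct
```

Anything other than 0 over a few hundred packets would point at the network rather than at NetBackup.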
let us know.
regards
08-26-2008 03:22 AM
OK, the client on this server was at 5.0, so we upgraded to 6.5, and a similar problem occurred. The backup job on one particular directory errored out with error 41 just like before. I was actually watching it and was able to resume the job through the console. The backup ended up completing with status 0. I'm not sure if the job would have resumed itself or not.
The parent job errored out with 200 (scheduler found no backups due to run). So, I believe that I have backed up the data I need, but the client didn't communicate back to the server to report a successful status.
The backup seems to be erroring out when it is very close to completion (in terms of data size). I am still looking into this, so if anyone has any tips I would appreciate it.
08-26-2008 04:57 AM
08-28-2008 07:34 AM
So, the bpbkar log tells me where this backup is stopping, but it doesn't tell me why. Interestingly enough, the directory that it keeps stopping at is empty.
The job does resume by itself and finishes with Status 0, but it is strange that this continues to occur.
Next steps are to exclude that directory and see if the backup continues or errors out on the directory before this.
08-28-2008 07:49 AM
If this is a consistently occurring error (which usually means that the server has timed out waiting for data from the client), you may be able to resolve it by increasing your "Client Read Timeout" value (Host Properties->Master Server->Timeouts). The default is 300s (5 mins) and this is not always enough. Details may show in a verbose 5 bpbkar log, and you can also enable what's called a bpbkar trace. First I would try with a larger timeout, say 900s.
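If you prefer the command line to the GUI, the same setting is normally stored on a UNIX server as a `CLIENT_READ_TIMEOUT = <seconds>` line in `/usr/openv/netbackup/bp.conf`. Here is a small sketch of a helper that sets or replaces such a line. The function name `set_bpconf` is made up for illustration, and I am assuming your bp.conf is in the default location, so check the file on your own server and back it up first:

```sh
#!/bin/sh
# Set "KEY = VALUE" in a bp.conf-style file, replacing any existing entry.
# Usage: set_bpconf FILE KEY VALUE
set_bpconf() {
    file=$1
    key=$2
    value=$3
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # Keep every line except an existing entry for this key.
    grep -v "^[[:space:]]*${key}[[:space:]]*=" "$file" > "$tmp" || true
    # Append the new setting.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back in place so ownership and permissions are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (as root, after backing up bp.conf):
#   set_bpconf /usr/openv/netbackup/bp.conf CLIENT_READ_TIMEOUT 900
```

Afterwards, confirm the new value in Host Properties before rerunning the full backup.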
08-28-2008 07:54 AM
08-28-2008 08:10 AM
08-28-2008 08:14 AM
Back then we used this to further debug bpbkar (I have no idea if these work above 5.x):
IV. Other files that can be created to get advanced logging features:
The following touch files can be used to get additional logging from the NetBackup daemons. These require that VERBOSE=<level> be added to the bp.conf on the server and the appropriate log directories exist under /usr/openv/netbackup/bin/logs.
08-28-2008 08:15 AM
08-28-2008 08:27 AM
@bumj1 wrote:
I may open a call with support just for giggles.
You may even find that it's not the empty directory it's getting stuck on - it could be the last element of the previous one that it's having issues with.
We had a similar issue some time ago (unfortunately can't remember the specifics anymore) where a backup appeared to be 'stuck' at a certain file/directory, so we excluded it. It then 'stuck' on the next one in the chain, and so on. It wasn't until we excluded the one prior to where it was 'sticking' that the backup completed. Really can't remember what the final issue was, although 'Linux client & sparse files' rings a bell, but that may have been another story!!!
***Edit***
Have been 'reliably' informed that the issue was with the 'wtmp' file, which on older Linux systems (certainly ours at the time) was a ~1 TB sparse file. We deleted it and logged back on to recreate it as a 'normal' file.
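If you want to check your own client for the same thing, one way is to compare each file's apparent size with the disk space actually allocated to it. This is only a sketch, assuming GNU find (for `-printf` with `%s` and `%b`) and awk on the Linux client. The function name `find_sparse` and the "less than half allocated" threshold are arbitrary choices of mine:

```sh
#!/bin/sh
# List files whose allocated space is less than half their apparent size.
# Usage: find_sparse DIR [MIN_BYTES]   (MIN_BYTES defaults to 1 MiB)
# Output: apparent_bytes <TAB> allocated_bytes <TAB> path
find_sparse() {
    dir=$1
    min_bytes=${2:-1048576}
    # %s = apparent size in bytes, %b = allocated 512-byte blocks (GNU find)
    find "$dir" -xdev -type f -printf '%s\t%b\t%p\n' 2>/dev/null |
        awk -F '\t' -v min="$min_bytes" \
            '$1 >= min && $2 * 512 < $1 / 2 { printf "%d\t%d\t%s\n", $1, $2 * 512, $3 }'
}

# Example:
#   find_sparse /var/log
```

A file that shows up with a huge first column and a tiny second one, just before the point where the job stalls, would be a good candidate for the exclude list.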