Here is my problem: we have a NetBackup server acting as master/media server and about twenty clients running Windows and Linux.
Recently, the backups of two servers with an MS-Windows backup policy have started to hang, blocking the rest of the backups until the backup window closes (ERR: 196).
I thought it might be a problem with the file shown in the task dialog box (current file), but every time I add the folder where it hung previously, it hangs in another folder.
For now, the two backup policies are stopped, but it is not a solution.
Can anyone help here please?
Thanks in advance; I am available to provide any further information.
Regards and respect.
This may help:
But really we would need to see the client-side logs for bpcd, bpinetd, vnetd and bpbkar. Do you know how to enable logging and collect these logs? Do you know how to enable detailed logging for bpbkar? If not, see if these help:
I would check for resource exhaustion on these clients, such as 100% CPU, memory, swapping, and the VSS subsystem, as this is often the culprit on Windows. I always start with vssadmin list writers in an administrative prompt.
Also check the application/system event logs on the clients for warnings/errors.
The usual question: has anything changed in the infrastructure around the time this problem started?
Hope this helps
If 'Allow multiple data streams' is not enabled in Policy Attributes, please do so.
This will give you an indication of filesystem/volume where backup is hanging.
Verbose logging of bpbkar (level 3 is usually sufficient) will tell if backup is hanging on a specific file or folder.
Please change attributes when a full backup is due, otherwise incrementals may run as full.
@SDO, attached are the logs you requested and a screenshot of the performance of the NetBackup client.
About your second comment, where should I create this file named bpbkar_path_tr?
@Marianne, it was indeed disabled, and I have now enabled it. Where should I look for the indication of the filesystem/volume where the backup is hanging?
As for the logging, I raised it to level 5, hoping that might give me some clue about the issue, but that is just for troubleshooting purposes.
Thanks in advance.
Windows: ?:\Program Files\Veritas\NetBackup\bpbkar_path_tr
(tip: make sure that Windows is not hiding file extensions, so that you can confirm the file has no extension at all, i.e. it is not 'bpbkar_path_tr.txt')
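As a rough illustration of the tip above, here is a minimal Python sketch that creates the empty touch file and reports any look-alike files with an extension. The function name and the idea of passing the NetBackup directory as an argument are my own for illustration; the drive letter is deliberately left for you to supply.

```python
from pathlib import Path

TOUCH_FILE_NAME = "bpbkar_path_tr"  # file name as given in the post above


def create_touch_file(netbackup_dir):
    """Create an empty bpbkar_path_tr (no extension) inside netbackup_dir.

    netbackup_dir is whatever your NetBackup install directory is
    (hypothetical parameter; the post gives it as
    ?:\\Program Files\\Veritas\\NetBackup).
    Returns the path plus the names of any look-alikes such as
    bpbkar_path_tr.txt that an editor may have created by mistake.
    """
    directory = Path(netbackup_dir)
    path = directory / TOUCH_FILE_NAME
    path.touch(exist_ok=True)  # empty file; left untouched if already present
    strays = sorted(p.name for p in directory.glob(TOUCH_FILE_NAME + ".*"))
    return path, strays
```

If the returned list of strays is not empty, a copy with an extension exists and is probably not the file that was intended.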
With 'Allow Multiple Data Streams' enabled, a separate backup job will be created for each drive letter and for System_State/Shadow_Copy_Components.
The assumption is that if there is a problem for a specific volume/stream, only that job will hang and the rest will complete successfully.
The Overview tab in Activity Monitor will tell you which drive letter each job is for.
I hope someone here on Connect has the time and patience to look at a level 5 log...
Level 5 logs are simply too big for me to even try.
From the bpbkar log:
08:58:05.163: [7768.5924] <4> dos_backup::tfs_include: INF - folder (System Files) has been created recently (since 26/06/2015 03:21:39). It will be backed up in full.
...then about eleven minutes later:
09:09:30.865: [7768.4600] <16> dtcp_write: TCP - failure: send socket (472) (TCP 10054: Connection reset by peer)
09:09:30.865: [7768.4600] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
09:09:30.865: [7768.4600] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14
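Spotting silent periods like this by eye is tedious in a large bpbkar log. The following is a minimal Python sketch that lists gaps between consecutive timestamped lines. It assumes every relevant line starts with an HH:MM:SS.mmm: timestamp, as in the excerpts quoted here, and that the log does not cross midnight; the function names are mine.

```python
import re

# Matches the leading timestamp seen in the quoted bpbkar lines, e.g. "09:09:30.865: "
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3}): ")


def to_seconds(line):
    """Return seconds since midnight for a timestamped line, else None."""
    match = TIMESTAMP.match(line)
    if not match:
        return None
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def find_gaps(lines, min_gap_seconds=60.0):
    """Return (gap_seconds, line_before, line_after) for each silent period.

    Lines without a timestamp are skipped. Midnight rollover is not handled.
    """
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        current = to_seconds(line)
        if current is None:
            continue
        if prev_time is not None and current - prev_time >= min_gap_seconds:
            gaps.append((current - prev_time, prev_line.rstrip(), line.rstrip()))
        prev_time, prev_line = current, line
    return gaps
```

To use it, open the client log and pass the file object, for example `find_gaps(open(log_path), min_gap_seconds=300)`, where `log_path` is wherever your bpbkar log lives. The line just before each gap is the last thing bpbkar logged before going quiet.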
And the job was using:
08:48:30.096: [7768.5924] <2> WinMain: DAT - lpCmdLine = '-r 1209600 -ru root -dt 1052811 -to 0 -clnt srvprod-rds-1 -class srvprod-rds-1 -sched Incremental -st CINC -bpstart_to 300 -bpend_to 300 -read_to 300 -blks_per_buffer 128 -tir -tir_plus -use_otm -fso -b srvprod-rds-1_1436342091 -kl 28 -WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0 -ct 13 -use_ofb '
...which has a client timeout of 300 seconds, i.e. five minutes.
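If you want to check which timeouts a job actually ran with, without reading the whole lpCmdLine by eye, something like this Python sketch could pull them out. It only assumes that each timeout switch is followed by a whole number of seconds, as in the line quoted above; the function name is mine, and I am assuming that -read_to is the switch that reflects the client read timeout.

```python
# Timeout switches that appear in the lpCmdLine quoted above
TIMEOUT_SWITCHES = ("-bpstart_to", "-bpend_to", "-read_to")


def parse_timeouts(cmdline):
    """Return {switch_name: seconds} for the timeout switches in a bpbkar lpCmdLine."""
    tokens = cmdline.split()
    found = {}
    # Pair each token with the one that follows it
    for switch, value in zip(tokens, tokens[1:]):
        if switch in TIMEOUT_SWITCHES and value.isdigit():
            found[switch.lstrip("-")] = int(value)
    return found
```

Running it against the lpCmdLine from before and after a timeout change on the media server is a quick way to confirm that the new value really reached the client.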
Status 14 in more detail here:
It looks as though the media server may have closed its listening port, and so the client failed to write (status 14) to the TCP socket.
The first thing I would try is increasing the client read timeout on the media server, from say 300 seconds to 600, then 900, then 1200, then 1500, then 1800 - and test each time. If you still have problems with a media server client read timeout of 1800 seconds (i.e. 30 minutes) then we'll have to think again, and dig deeper.
Another thing to check is the state of VSS:
> vssadmin list writers
> vssadmin list providers
> vssadmin list volumes
> vssadmin list shadowstorage
> vssadmin list shadows
Do any of these commands hang, or report/show errors?
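If you would rather not sit and watch each command, a small Python sketch like the one below runs them one at a time with a time limit and reports which ones hang or exit with an error. The 120-second limit and the function names are arbitrary choices of mine, and it needs to be run from an administrative prompt on the client, just like the commands themselves.

```python
import subprocess

# The five checks listed above
VSS_CHECKS = [
    ["vssadmin", "list", "writers"],
    ["vssadmin", "list", "providers"],
    ["vssadmin", "list", "volumes"],
    ["vssadmin", "list", "shadowstorage"],
    ["vssadmin", "list", "shadows"],
]


def run_check(cmd, timeout_seconds=120):
    """Run one command; return (status, output).

    status is 'ok', 'error' (non-zero exit code), 'hung' (time limit hit)
    or 'missing' (command not found on this machine).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "hung", ""
    except FileNotFoundError:
        return "missing", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout + proc.stderr


def run_all_checks(timeout_seconds=120):
    """Run every VSS check in order and print a one-line status for each."""
    for cmd in VSS_CHECKS:
        status, _output = run_check(cmd, timeout_seconds)
        print(" ".join(cmd), "->", status)
```

An 'ok' status only means the command returned cleanly. You still need to read the output of the writers check to see whether any writer reports a problem.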
I noticed that the backup job was using WOFB (Windows Open File Backup) settings of:
-WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0
...so it would appear that the backup client is using VSP and not VSS, and will abort on error rather than continue...
...however you might be better off configuring the 'client attributes' of the backup client using:
> bpclient -add -client myclient.name.com -WOFB_enabled 1 -WOFB_FIM 1 -WOFB_usage 0 -WOFB_error 1
...if the '-add' fails then try '-update'...
...see page 82 of the NetBackup 7.1 Commands Reference Guide for details re these switches:
I would like to thank you all for the information you're providing.
I followed your instructions: I set the logging level to 3 as requested by Marianne and enabled "Allow multiple data streams".
Now, when I launch the backup, it hangs in the first stream (2.png and 2.txt in the attached archive file). I have added the updated log files requested in the first comments.
The screen shots and logs don't appear to tie up - the screen shots show jobs starting at 11:32, but the logs have nothing beyond 11:26. And the job numbers from the screen shots do not match the job numbers in the logs.
In the bpbkar log, we see again:
09:03:22.415: [5296.4560] <4> tar_backup_tfi::UpdateExcludeListWithVHD: INF - UpdateExludeListWithVHD begin
...then 25 minutes later:
09:28:20.228: [5296.3664] <16> dtcp_write: TCP - failure: send socket (480) (TCP 10054: Connection reset by peer)
09:28:20.228: [5296.3664] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
09:28:20.228: [5296.3664] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14
The bpbkar log doesn't appear to be very detailed. Are you sure that you enabled a high level of logging on the client?
Maybe try the test again. Clear the client logs, run the jobs again, wait 45 minutes, and then capture the client logs. We don't really need screen shots, but if you do send them, make sure they match the jobs in the logs. It would also be very useful to have the media server bpbrm and bptm logs.