
Backup jobs hang in just two clients

Lotfi_BOUCHERIT
Level 5

Hello everyone,

Here is my problem: we have a NetBackup server acting as master/media server and about twenty clients running Windows and Linux.

Recently, the backups of two servers with an MS-Windows backup policy have started to hang, blocking the rest of the backups until the backup window closes (status 196).

I thought the problem might be the file shown in the job details (current file), but every time I add the folder where the previous run stopped, the job hangs in another folder.

For now, the two backup policies are stopped, but that is not a solution.

Can anyone help, please?

Thanks in advance; I am available to provide any further information.

Regards and respect.

 

NB:

  1. Master server:
       - OS: Windows Server 2003
       - NetBackup version: 7.1
  2. Clients:
       - OS: Windows Server 2008 (both of them)
       - NetBackup client version: 7.1
       - Backup policy: MS-Windows
       - Backup directive: ALL_LOCAL_DRIVES

 

12 REPLIES

sdo
Moderator
Partner    VIP    Certified

This may help:

https://support.symantec.com/en_US/article.TECH213267.html

.

But really we would need to see the client-side logs for bpcd, bpinetd, vnetd and bpbkar. Do you know how to enable logging and collect these logs? Do you know how to enable detailed logging for bpbkar? If not, see if this helps:

https://www-secure.symantec.com/connect/forums/how-enable-bpbkar-log

 

Michael_G_Ander
Level 6
Certified

I would check for resource exhaustion on these clients (100% CPU, memory, swapping) and check the VSS subsystem, as this is often the culprit on Windows. I always start with vssadmin list writers in an administrative prompt.

Also check the application/system event logs on the clients for warnings/errors.

The usual question: has anything changed in the infrastructure around the time this problem started?

Hope this helps

 

 

 

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

sdo
Moderator
Partner    VIP    Certified

Re enabling path trace in bpbkar - to see where it hangs as it walks a file system:

https://support.symantec.com/en_US/article.TECH31513.html

 

Marianne
Moderator
Partner    VIP    Accredited Certified

If 'Allow multiple data streams' is not enabled in Policy Attributes, please do so.
This will give you an indication of filesystem/volume where backup is hanging.

Verbose logging of bpbkar (level 3 is usually sufficient) will tell if backup is hanging on a specific file or folder.

Please change attributes when a full backup is due, otherwise incrementals may run as full.

Lotfi_BOUCHERIT
Level 5

@SDO, attached are the logs you requested and a screenshot of the NetBackup client's performance.

About your second comment: where should I create the file named bpbkar_path_tr?

@Marianne, it was indeed disabled, so I enabled it. Now, where should I look for the indication of the filesystem/volume where the backup is hanging?

As for the logging, I raised it to level 5, hoping it might give me some clue about the issue, but that is only for troubleshooting purposes.

Thanks in advance.

Regards

sdo
Moderator
Partner    VIP    Certified

Unix:          /usr/openv/netbackup/bpbkar_path_tr

Windows:  ?:\Program Files\Veritas\NetBackup\bpbkar_path_tr

(Tip: make sure Windows is not hiding file extensions, so that you can confirm the file has no extension at all, e.g. no hidden '.txt'.)
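
One way to avoid the hidden-extension trap is to create the touch file from a script rather than from a text editor. The following is only a minimal sketch: the function name `create_touch_file` is hypothetical, and the NetBackup directory has to be passed in by the caller because the install drive differs per client.

```python
from pathlib import Path


def create_touch_file(netbackup_dir: str, name: str = "bpbkar_path_tr") -> Path:
    """Create an empty touch file with no extension in the given directory.

    netbackup_dir is supplied by the caller (an assumption of this sketch);
    it should be the NetBackup directory given in the post above.
    """
    target = Path(netbackup_dir) / name
    # Per the tip above, the file must not carry an extension such as '.txt'.
    if target.suffix:
        raise ValueError(f"touch file must not have an extension: {target.name}")
    target.touch(exist_ok=True)  # creates a zero-byte file if it is missing
    return target
```

On a client installed on C: (an assumption), this would be called as `create_touch_file(r"C:\Program Files\Veritas\NetBackup")`.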

Marianne
Moderator
Partner    VIP    Accredited Certified

With 'Allow Multiple Data Streams' enabled, a separate backup job will be created for each drive letter and for System_State/Shadow_Copy_Components.
The assumption is that if there is a problem for a specific volume/stream, only that job will hang and the rest will complete successfully. 
The Overview tab in Activity Monitor will tell you which drive letter each job is for.

I hope someone here on Connect has the time and patience to look at a level 5 log....
Level 5 logs are simply too big for me to even try.

sdo
Moderator
Partner    VIP    Certified

From the bpbkar log:

08:58:05.163: [7768.5924] <4> dos_backup::tfs_include: INF - folder (System Files) has been created recently (since 26/06/2015 03:21:39).  It will be backed up in full.

...then about eleven minutes later:

09:09:30.865: [7768.4600] <16> dtcp_write: TCP - failure: send socket (472) (TCP 10054: Connection reset by peer)
09:09:30.865: [7768.4600] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
09:09:30.865: [7768.4600] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14
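
For longer logs, the silent period before the failure can be located mechanically instead of by eye. This is a rough sketch that assumes every entry follows the `HH:MM:SS.mmm: [pid.tid] <severity> message` layout seen in the excerpt above; the names `parse_entries` and `largest_gap` are made up for illustration, and a log that crosses midnight is not handled.

```python
import re
from datetime import datetime

# Matches lines such as: 09:09:30.865: [7768.4600] <16> dtcp_write: TCP - failure: ...
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}): \[(\d+)\.(\d+)\] <(\d+)> (.*)$")


def parse_entries(lines):
    """Return (timestamp, severity, message) for every line that matches the layout."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            entries.append((ts, int(m.group(4)), m.group(5)))
    return entries


def largest_gap(entries):
    """Return (gap, entry_before, entry_after) for the longest silence, or None."""
    best = None
    for prev, cur in zip(entries, entries[1:]):
        gap = cur[0] - prev[0]
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best
```

Feeding the client's bpbkar log file through `largest_gap(parse_entries(open(path)))` returns the last entry written before the stall and the first one after it, which is usually where to start reading.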

.

And the job was using:

08:48:30.096: [7768.5924] <2> WinMain: DAT - lpCmdLine = '-r 1209600 -ru root -dt 1052811 -to 0 -clnt srvprod-rds-1 -class srvprod-rds-1 -sched Incremental -st CINC -bpstart_to 300 -bpend_to 300 -read_to 300 -blks_per_buffer 128 -tir -tir_plus -use_otm -fso -b srvprod-rds-1_1436342091 -kl 28 -WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0 -ct 13 -use_ofb '

...which has a client read timeout (-read_to) of 300 seconds, i.e. five minutes.
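
To compare these settings between jobs or clients, the `lpCmdLine` string can be split into a dictionary. The sketch below is not a NetBackup tool: `parse_bpbkar_args` is a hypothetical helper, and it assumes that a flag's value never starts with `-`, which holds for the line above.

```python
import shlex


def parse_bpbkar_args(cmdline: str) -> dict:
    """Map each -flag to its value, or to True when the flag takes no value."""
    tokens = shlex.split(cmdline)
    args = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Assumption: a token that does not start with '-' is this flag's value.
            if nxt is not None and not nxt.startswith("-"):
                args[tok.lstrip("-")] = nxt
                i += 2
                continue
            args[tok.lstrip("-")] = True
        i += 1
    return args
```

With the command line from the log, `parse_bpbkar_args(...)["read_to"]` gives `"300"`, and the `WOFB_*` keys show the open file backup settings in effect for that job.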

.

Status 14 in more detail here:

https://support.symantec.com/en_US/article.HOWTO103944.html

.

It looks as though the media server may have closed its listening port, and so the client failed to write (status 14) to the TCP socket.

The first thing I would try is increasing the client read timeout on the media server, from say 300 seconds to 600, then 900, then 1200, then 1500, then 1800 - and test each time.  If you still have problems with a media server client read timeout of 1800 seconds (i.e. 30 minutes) then we'll have to think again, and dig deeper.

sdo
Moderator
Partner    VIP    Certified

Another thing to check is the state of VSS:

> vssadmin list writers

> vssadmin list providers

> vssadmin list volumes

> vssadmin list shadowstorage

> vssadmin list shadows

Do any of these commands hang, or report/show errors?
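
If it is easier to script this than to watch five commands by hand, the checks can be run with a timeout so that a hang is reported rather than waited out. This is only a sketch: `run_vss_checks` is a hypothetical name, the 120 second timeout is an arbitrary choice, and it must be run from an elevated prompt on the Windows client. An "ok" result only means the command returned exit code 0, so the writer states in the actual output still need to be read.

```python
import subprocess

# The five "vssadmin list ..." subcommands listed above.
VSS_CHECKS = ["writers", "providers", "volumes", "shadowstorage", "shadows"]


def run_vss_checks(timeout=120, runner=subprocess.run):
    """Run each check; record 'ok', 'exit N' on a non-zero exit, or 'hung' on timeout.

    runner is injectable only so that the function can be exercised without vssadmin.
    """
    results = {}
    for what in VSS_CHECKS:
        cmd = ["vssadmin", "list", what]
        try:
            proc = runner(cmd, capture_output=True, text=True, timeout=timeout)
            results[what] = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            results[what] = "hung"
    return results
```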

.

I noticed that the backup job was using WOFB (Windows Open File Backup) settings of:

-WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0

...so it would appear that the backup client is using VSP and not VSS, and will abort on error rather than continue...

...however you might be better off configuring the 'client attributes' of the backup client using:

> bpclient -add -client myclient.name.com -WOFB_enabled 1 -WOFB_FIM 1 -WOFB_usage 0 -WOFB_error 1

...if the '-add' fails then try '-update'...
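
The add-then-update fallback can also be scripted when several clients need the same change. The sketch below reuses exactly the switches from the command above; the function name `set_wofb` is hypothetical, and it assumes that bpclient is on the PATH of the master server and signals failure with a non-zero exit code.

```python
import subprocess


def set_wofb(client, runner=subprocess.run):
    """Try 'bpclient -add' first and fall back to '-update' if that fails.

    Returns the action that succeeded. Assumes a non-zero exit code means failure.
    """
    settings = ["-WOFB_enabled", "1", "-WOFB_FIM", "1",
                "-WOFB_usage", "0", "-WOFB_error", "1"]
    for action in ("-add", "-update"):
        cmd = ["bpclient", action, "-client", client] + settings
        if runner(cmd).returncode == 0:
            return action
    raise RuntimeError(f"bpclient -add and -update both failed for {client}")
```

If bpclient is not on the PATH, `subprocess.run` raises `FileNotFoundError`; in that case the full path to the command would have to be used instead.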

...see page 82 of the NetBackup 7.1 Commands Reference Guide for details re these switches:

https://support.symantec.com/en_US/article.DOC3684.html

Lotfi_BOUCHERIT
Level 5

Hello,

I would like to thank you all for the information you are providing.

I followed your instructions: I set the logging level to 3, as requested by Marianne, and enabled "Allow multiple data streams".

Now, when I launch the backup, it hangs in the first stream (2.png and 2.txt in the attached archive). I have also added updated versions of the log files requested in the first replies.

Thanks.

 

sdo
Moderator
Partner    VIP    Certified

The screen shots and logs don't appear to tie up - the screen shots show jobs starting at 11:32, but the logs have nothing beyond 11:26.  And the job numbers from the screen shots do not match the job numbers in the logs.

In the bpbkar log, we see again:

09:03:22.415: [5296.4560] <4> tar_backup_tfi::UpdateExcludeListWithVHD: INF - UpdateExludeListWithVHD begin

...then 25 minutes later:

09:28:20.228: [5296.3664] <16> dtcp_write: TCP - failure: send socket (480) (TCP 10054: Connection reset by peer)
09:28:20.228: [5296.3664] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
09:28:20.228: [5296.3664] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14

The bpbkar log doesn't appear to be very detailed.  Are you sure that you enabled a high level of logging on the client?

.

Maybe try a test again.  Clear the client logs, and capture screen shots to match the jobs, and then try the jobs again.  Wait 45 minutes, then capture client logs.  We don't really need screen shots.  It would also be very useful to have the media server bpbrm and bptm logs.

Lotfi_BOUCHERIT
Level 5

Hello,

Attached to this message are my logs, taken after 1 hour 40 minutes of a stuck backup job.

Also, please take a look at this screen capture, which shows how I enabled logging on my client:

logging.JPG