
rookie11 · 03-10-2012

Hi guys

I'm facing a problem in my environment. One of our client servers (OS: Windows 2003) has 2 disks, C:\ and E:\. The backup of C:\ goes fine, but NBU backs up about 100000 KB (approx. 34000 files) of E:\ and then stays in a hung state for about 2-3 days. New scheduled backups do not start. We have to cancel the hung backup job and start a new one, but the same problem comes up again with E:\.

A few of my observations (from the bpbkar log):

3:50:18.390 AM: [1100.2888] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 205312 1100 0 0 34404 0 1 0 1 105119744 0 1 512 61339786 1 39 /E/das/crisworm/1997/02/14/PIC0LLTC.TIF

4:30:13.437 AM: [1100.2888] <16> tar_tfi::processException:

An Exception of type [SocketWriteException] has occured at:

Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321

Local Address: [0.0.0.0]:0

Remote Address: [0.0.0.0]:0

OS Error: 10053 (An established connection was aborted by the software in your host machine.) Expected bytes: 32768

4:30:13.437 AM: [1100.2888] <2> tar_base::V_vTarMsgW: FTL - socket write failed

4:30:13.437 AM: [1100.2888] <4> ov_log::OVLoop: INF - Cycling log file

4:30:13.437 AM: [1100.2888] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\VERITAS\NetBackup\logs\BPBKAR\030712.LOG

12:58:53.672 AM: [836.4168] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 199168 836 0 0 34241 0 0 0 1 101974016 0 1 512 35515759 1 39 /E/das/crisworm/1996/07/15/PIC67BRE.TIF

1:04:10.605 AM: [836.4168] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 199680 836 0 0 34249 0 0 0 1 102236160 0 1 512 36918912 1 39 /E/das/crisworm/1996/07/25/PIC68GSJ.TIF

1:43:16.653 AM: [836.4168] <16> tar_tfi::processException:

An Exception of type [SocketWriteException] has occured at:

Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321

Local Address: [0.0.0.0]:0

Remote Address: [0.0.0.0]:0

OS Error: 10053 (An established connection was aborted by the software in your host machine.) Expected bytes: 32768

1:43:16.653 AM: [836.4168] <2> tar_base::V_vTarMsgW: FTL - socket write failed

1:43:16.653 AM: [836.4168] <4> ov_log::OVLoop: INF - Cycling log file

1:43:16.653 AM: [836.4168] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\VERITAS\NetBackup\logs\BPBKAR\030812.LOG
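In both excerpts the socket write fails roughly 40 minutes after the last checkpoint message. A minimal Python sketch that measures that gap from a bpbkar log, assuming the "h:mm:ss.fff AM/PM" timestamp format shown above and no midnight rollover between the two lines. The function names are hypothetical helpers, not part of NetBackup.

```python
import re
from datetime import datetime

# Leading "h:mm:ss.fff AM/PM:" timestamp, as in the bpbkar excerpts above.
TS_RE = re.compile(r"^(\d{1,2}:\d{2}:\d{2}\.\d{3} [AP]M):")


def parse_ts(line):
    """Return the line's time of day as a datetime (dummy date), or None."""
    m = TS_RE.match(line.strip())
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%I:%M:%S.%f %p")


def stalls_before_failure(lines):
    """For every 'socket write failed' line, return the seconds since the most
    recent checkpoint message. Assumes no midnight rollover between the two."""
    last_checkpoint = None
    gaps = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # blank lines and multi-line exception details
        if "checkpoint message" in line:
            last_checkpoint = ts
        elif "socket write failed" in line and last_checkpoint is not None:
            gaps.append((ts - last_checkpoint).total_seconds())
            last_checkpoint = None  # start fresh for the next job's log
    return gaps


# Abbreviated copies of the timestamped lines quoted above (both days).
SAMPLE = """\
3:50:18.390 AM: [1100.2888] <4> ...: INF - checkpoint message: CPR - ...
4:30:13.437 AM: [1100.2888] <2> tar_base::V_vTarMsgW: FTL - socket write failed
12:58:53.672 AM: [836.4168] <4> ...: INF - checkpoint message: CPR - ...
1:04:10.605 AM: [836.4168] <4> ...: INF - checkpoint message: CPR - ...
1:43:16.653 AM: [836.4168] <2> tar_base::V_vTarMsgW: FTL - socket write failed
"""

for gap in stalls_before_failure(SAMPLE.splitlines()):
    print(f"socket write failed {gap / 60:.1f} min after the last checkpoint")
```

On these lines it reports gaps of about 39.9 and 39.1 minutes. That only shows when data stopped flowing relative to the failure; it does not by itself say which side stopped sending.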

Taqadus_Rehman · 03-10-2012

Can you paste the bpbkar logs for this backup job? Or, better, attach the bpbkar logs here.

Michael_G_Ander · 03-14-2012

In my experience, 10053 is often related to timeouts; creating or increasing the registry keys CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT might help.

I would also run chkdsk on E: to see if there is any indication of problems with the file system.

Try backing up different areas of E: separately, to see if the problem is related to a specific folder or file.

Regards

Michael

The standard questions: Have you checked 1) what has changed, 2) the manual, 3) whether there are any tech notes or VOX posts regarding the issue?
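Backing up different areas of E: separately is easier with a sensible split. A minimal Python sketch, assuming Python is available on the client, that totals file counts and sizes per top-level folder so the backup selection could be divided into a few similar-sized parts. size_by_top_folder is a hypothetical helper, not a NetBackup tool.

```python
import os
from pathlib import Path


def size_by_top_folder(root):
    """Return {top-level folder: (file_count, total_bytes)} for everything under
    root. Files sitting directly in root are reported under '.'."""
    root = Path(root)
    totals = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        key = rel.parts[0] if rel.parts else "."
        count, size = totals.get(key, (0, 0))
        for name in filenames:
            try:
                size += os.path.getsize(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # vanished or unreadable entry: skip it, keep walking
        totals[key] = (count, size)
    return totals


# Example (E:\das is the parent folder seen in the checkpoint paths above):
# for folder, (count, size) in sorted(size_by_top_folder(r"E:\das").items()):
#     print(f"{folder:<20} {count:>8} files {size / 2**20:>10.1f} MiB")
```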

revarooo · 03-14-2012

Also, you could try letting bpbkar run through the E: drive without sending the data over to the media server, to see if it also aborts. If it does, it's not a CLIENT_READ_TIMEOUT issue or the network.

"C:\Program Files\VERITAS\NetBackup\bin\bpbkar32" -nocont E:\ > NUL 2> C:\temp.txt

Ensure you have bpbkar logging enabled on the client (the netbackup\logs\bpbkar\ directory exists) and the client logging level increased to maximum.
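A rough, independent cross-check of the same idea: read every file under a folder end to end, with no network involved, and list anything that cannot be opened or read. This is a sketch assuming Python is available on the client; it exercises only ordinary file reads, not bpbkar's own processing, so a clean result does not rule out bpbkar-specific problems. find_unreadable_files is a hypothetical name.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every file under root end to end and return (path, error) pairs
    for files or directories that could not be opened or read."""
    failures = []

    def on_walk_error(err):
        # os.walk calls this when a directory itself cannot be listed
        failures.append((err.filename, str(err)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass  # discard the data, like bpbkar32 -nocont > NUL
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


# Example: for path, err in find_unreadable_files("E:\\"): print(path, err)
```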

Amarnath_Sathis · 03-22-2012

Hi Rookie,

Check whether there is enough free disk space on the drive where the log files are configured.

If not, please free up space on that drive or add more disk space.

If the backup is unable to update the log file, the backup goes into a hung state.

We faced the same issue in our environment.
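A minimal Python sketch of that check, using the log folder path that appears in the bpbkar log above. The 1 GiB threshold is an arbitrary assumption for illustration, not a NetBackup requirement, and log_drive_has_space is a hypothetical helper.

```python
import shutil


def log_drive_has_space(log_dir, min_free_bytes=1 * 1024**3):
    """Return (ok, free_bytes) for the volume that holds log_dir.
    The default 1 GiB threshold is an assumption, not a vendor figure."""
    free = shutil.disk_usage(log_dir).free
    return free >= min_free_bytes, free


# Example on a client:
# ok, free = log_drive_has_space(r"C:\Program Files\VERITAS\NetBackup\logs")
# print("OK" if ok else "LOW", f"{free / 2**30:.1f} GiB free")
```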

rookie11 · 04-16-2012

Hi guys

I tried "C:\Program Files\VERITAS\NetBackup\bin\bpbkar32" -nocont E:\ > NUL 2> C:\temp.txt on the client, and it works completely fine.

Disk space and timeout options are already checked.

This problem is not present on just 1 client but on around 15 clients in my IT environment.

Please suggest what other options I can check.

revarooo · 04-16-2012

Rookie,

If it is affecting 15 clients, then this is definitely either a network or a timeout issue.

I would first check that the network is healthy.

Also, check what CLIENT_READ_TIMEOUT is set to on the media server.
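One rough, vendor-neutral way to sanity-check a TCP path is to push a known amount of data through a single connection and time it. The Python sketch below is not one of the Symantec tools mentioned in this thread; it runs both ends on one machine, so it only demonstrates the method. A real test would assume the receiving side (_drain) runs on the media server and the sender on the client. The 32768-byte write size is chosen to match "Expected bytes: 32768" in the exception above; all names are hypothetical.

```python
import socket
import threading
import time

CHUNK = 32768  # same size as "Expected bytes: 32768" in the failed write above


def _drain(server_sock, received):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _addr = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            received.append(len(data))


def measure_throughput(total_bytes, host="127.0.0.1"):
    """Push total_bytes through one TCP connection in CHUNK-sized writes and
    return (bytes_received, megabytes_per_second)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, 0))  # port 0: let the OS pick a free port
        srv.listen(1)
        received = []
        receiver = threading.Thread(target=_drain, args=(srv, received))
        receiver.start()
        payload = bytes(CHUNK)
        start = time.perf_counter()
        with socket.create_connection(srv.getsockname()) as cli:
            sent = 0
            while sent < total_bytes:
                n = min(CHUNK, total_bytes - sent)
                cli.sendall(payload[:n])
                sent += n
        receiver.join()  # returns once the receiver has read everything
        elapsed = time.perf_counter() - start
    got = sum(received)
    return got, got / elapsed / 1e6


# Example: print(measure_throughput(200 * 2**20))
```

A sustained transfer that stalls, or that fails with a connection-aborted error, would point at the network path rather than NetBackup.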

rookie11 · 04-16-2012

CLIENT_READ_TIMEOUT is set to 500 on all 4 media servers.

To check the network, is there specific software that Symantec recommends, or any NetBackup command, Data Domain command (it's my storage unit) or OS command (the media servers are Windows servers)?

revarooo · 04-16-2012

Rookie, there are some tools we can use for checking the network, but you will need to raise a case for this.

CLIENT_READ_TIMEOUT is on the low side. I would increase it to 1200 seconds.

I'm not sure it will help, as you say the clients stay hung for 2-3 days. I think setting up bpbkar trace logging may help. Increase logging to maximum on the client and the media server. Ensure the bpbkar and bpfis log directories are in place on the client under netbackup\logs\.

Create an empty file (with no extension) called bpbkar_path_tr in the parent NetBackup directory.

On the media server, ensure logging is enabled for bpbrm and bptm.

Run a backup; when it starts hanging, take a look at the bpbkar and bpfis logs.
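The directory and touch-file steps above can be scripted. A minimal Python sketch, assuming the default install folder seen in the bpbkar log path earlier in the thread; it only creates the log directories and the empty bpbkar_path_tr file, while raising the logging level itself is still done in the NetBackup host properties. enable_debug_logging is a hypothetical helper.

```python
from pathlib import Path

CLIENT_LOG_DIRS = ("bpbkar", "bpfis")  # client side, per the steps above
MEDIA_LOG_DIRS = ("bpbrm", "bptm")     # media server side


def enable_debug_logging(netbackup_dir, log_dirs, trace_file=None):
    """Create <netbackup_dir>/logs/<name> for each name and, optionally, an
    empty file (no extension) directly under netbackup_dir.
    Returns the paths that now exist."""
    base = Path(netbackup_dir)
    created = []
    for name in log_dirs:
        d = base / "logs" / name
        d.mkdir(parents=True, exist_ok=True)
        created.append(d)
    if trace_file:
        f = base / trace_file
        f.touch(exist_ok=True)  # empty touch file, no extension
        created.append(f)
    return created


# On the client:
# enable_debug_logging(r"C:\Program Files\VERITAS\NetBackup",
#                      CLIENT_LOG_DIRS, "bpbkar_path_tr")
# On each media server:
# enable_debug_logging(r"C:\Program Files\VERITAS\NetBackup", MEDIA_LOG_DIRS)
```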

Marianne · 04-17-2012

Ok, let's summarize:

Your opening post made it look like ONE client has a problem with the E: drive only.

Now we see that about 15 clients have this problem.

What is the common factor? One media server? All media servers?

Is the same NIC on the media server(s) used to receive data from clients as well as send data to DD?

Have you checked/verified that latest drivers and firmware settings have been applied to the NIC?

Have you obtained network/NIC settings for DD to ensure optimum performance?

Have you tried to monitor incoming data on DD itself while backup is running?

Some time ago I saw that a specific Broadcom NIC model had a problem with high I/O. I found the information by Googling the NIC model number. The latest drivers/NIC solved the problem.


Mark_Solutions · 04-17-2012

When I first read your post, I felt that this was caused by corrupt files, but if it affects 15 clients that seems less likely, unless the data gets copied to all of them and so the corruption is spread about. However, the log also mentions checkpoints, so it is as if the checkpoint interval and the client connect/read timeouts are clashing. What is the checkpoint interval set to in the policies? Also, as covered by the earlier questions, what is the common factor here (media server, policy, etc.)?

Jibs · 04-18-2012

Not sure this is the correct place to post my problem.

Recently I installed Backup Exec 2012 on a Windows 2008 R2 64-bit server. I have Symantec Vault server 8.02. When I schedule a full backup, it works only once. The next time the same job runs, it gets stuck after the EVmonitir backup (approximately 8 MB), and after that the backup stops responding, even if I keep it running for 24 hours. If I reboot my Vault server the backup works, but the next backup has the same problem. No errors are reported.

I have logged a call with Symantec tech support, but no luck, and a very poor response from them.

I have changed the storage from tape to disk, but no luck. Any ideas? I checked sgmonitor but don't know where to look.

rookie11 · 04-25-2012

Hi guys

On some clients I have set CLIENT_CONNECT_TIMEOUT = 3600

CLIENT_READ_TIMEOUT = 3600

The backups that go into a hung state show:

Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times <-- this is the same for almost all clients that go into a hung state.

mph999 · 04-25-2012

Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times

This shows that the bptm process was delayed about 28,000 times waiting for data to be sent from the client. It is only an indication, and without knowing what the similar line in the bpbkar log shows it is virtually useless: if bpbkar was delayed more times, that is the more important value; if it was delayed fewer times, the bptm value is more important.

Also, 28351 looks like a big value, but that depends on how big the backup is: for a small backup it is a big value, while for a large backup it is less relevant.

So, it indicates that there is a delay, where the clients are not sending data to the media server when they should be; how much of a factor this is depends.
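The comparison described above can be sketched in Python. The bptm wording is copied from the job details in this thread; the bpbkar wording the regex is also meant to match ("bpbkar waited N times for empty buffer, delayed M times") is an assumption, so check it against a real bpbkar log first. The interpretation of a larger bpbkar count (client waiting on the media server to free buffers) is likewise an assumption. The sketch ignores the backup-size caveat, and the function names are hypothetical.

```python
import re

# "waited ... N times ... delayed M times". The bptm wording is from the job
# details above; the bpbkar wording this should also match is an assumption.
WAIT_RE = re.compile(r"waited\D*?(\d+) times.*?delayed (\d+) times")


def parse_waits(line):
    """Return (waited, delayed) counts from a bptm or bpbkar summary line."""
    m = WAIT_RE.search(line)
    if m is None:
        raise ValueError(f"no wait/delay counts in: {line!r}")
    return int(m.group(1)), int(m.group(2))


def dominant_side(bptm_line, bpbkar_line):
    """Rule of thumb from the post above: whichever process was delayed more
    often is the value that matters. Returns 'bptm', 'bpbkar' or None (tie)."""
    _, bptm_delayed = parse_waits(bptm_line)
    _, bpbkar_delayed = parse_waits(bpbkar_line)
    if bptm_delayed > bpbkar_delayed:
        return "bptm"    # media server kept waiting for data from the client
    if bpbkar_delayed > bptm_delayed:
        return "bpbkar"  # assumed: client kept waiting on the media server
    return None


BPTM_LINE = "Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times"
print(parse_waits(BPTM_LINE))  # (9408, 28351)
```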

Regarding this:

To check the network, is there specific software that Symantec recommends, or any NetBackup command, Data Domain command (it's my storage unit) or OS command (the media servers are Windows servers)?

No, not really. The network is not the responsibility of Symantec (sorry). It (the network) is on the same level as the operating system in terms of 'support'. However, there is a tool called 'Camel' (no idea why) that can give some performance figures, and AppCritical, which can be very useful. You will need to log a call with Symantec to use these.

Martin