
Backup in hung state

rookie11
Moderator
   VIP   

Hi guys

I am facing a problem in my environment. One client server (OS: Windows 2003) has two disks, C:\ and E:\. The backup of C:\ runs fine, but NBU backs up about 100000 KB (approx. 34000 files) of E:\ and then stays in a hung state for about 2 to 3 days. New scheduled backups do not start. We have to cancel the hung backup job and start a new one, but the same problem comes up again with E:\.

A few of my observations:

 

3:50:18.390 AM: [1100.2888] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 205312 1100 0 0 34404 0 1 0 1 105119744 0 1 512 61339786 1 39 /E/das/crisworm/1997/02/14/PIC0LLTC.TIF

4:30:13.437 AM: [1100.2888] <16> tar_tfi::processException:

An Exception of type [SocketWriteException] has occured at:

  Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321

  Local Address: [0.0.0.0]:0

  Remote Address: [0.0.0.0]:0

  OS Error: 10053 (An established connection was aborted by the software in your host machine.)  Expected bytes: 32768

 

4:30:13.437 AM: [1100.2888] <2> tar_base::V_vTarMsgW: FTL - socket write failed

4:30:13.437 AM: [1100.2888] <4> ov_log::OVLoop: INF - Cycling log file

4:30:13.437 AM: [1100.2888] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\VERITAS\NetBackup\logs\BPBKAR\030712.LOG

 

 

12:58:53.672 AM: [836.4168] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 199168 836 0 0 34241 0 0 0 1 101974016 0 1 512 35515759 1 39 /E/das/crisworm/1996/07/15/PIC67BRE.TIF

1:04:10.605 AM: [836.4168] <4> tar_backup_tfi::backup_send_chkp_data_state: INF - checkpoint message: CPR - 199680 836 0 0 34249 0 0 0 1 102236160 0 1 512 36918912 1 39 /E/das/crisworm/1996/07/25/PIC68GSJ.TIF

1:43:16.653 AM: [836.4168] <16> tar_tfi::processException:

An Exception of type [SocketWriteException] has occured at:

  Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321

  Local Address: [0.0.0.0]:0

  Remote Address: [0.0.0.0]:0

  OS Error: 10053 (An established connection was aborted by the software in your host machine.)  Expected bytes: 32768

 

1:43:16.653 AM: [836.4168] <2> tar_base::V_vTarMsgW: FTL - socket write failed

1:43:16.653 AM: [836.4168] <4> ov_log::OVLoop: INF - Cycling log file

1:43:16.653 AM: [836.4168] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\VERITAS\NetBackup\logs\BPBKAR\030812.LOG
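
One pattern worth checking in the two excerpts above is how long after the last checkpoint message the SocketWriteException appears. Here is a minimal Python sketch, assuming the bpbkar time format shown above (h:mm:ss.mmm AM/PM, no date, both times on the same day). The helper name `gap_seconds` is made up for illustration and is not part of NetBackup.

```python
from datetime import datetime

def gap_seconds(earlier, later):
    """Seconds between two bpbkar-style timestamps on the same day."""
    fmt = "%I:%M:%S.%f %p"  # e.g. "3:50:18.390 AM"
    t0 = datetime.strptime(earlier, fmt)
    t1 = datetime.strptime(later, fmt)
    return (t1 - t0).total_seconds()

# (last checkpoint, SocketWriteException) pairs copied from the excerpts above
pairs = [
    ("3:50:18.390 AM", "4:30:13.437 AM"),
    ("1:04:10.605 AM", "1:43:16.653 AM"),
]
gaps = [gap_seconds(a, b) for a, b in pairs]

for (a, b), g in zip(pairs, gaps):
    print(f"{a} -> {b}: {g / 60:.1f} minutes")
```

In both excerpts the socket write fails a little under 40 minutes after the last checkpoint shown. If no log lines were left out in between, that interval may be worth comparing against any timeout configured between the client and the media server.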

 

18 REPLIES

Taqadus_Rehman
Level 6

Can you paste the bpbkar logs for this backup job? Or better, attach them here.

Michael_G_Ander
Level 6
Certified

In my experience, 10053 is often related to timeouts; creating or increasing the registry keys CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT might help.

I would also run chkdsk on E: to see if there is any indication of problems with the file system.

Try backing up different areas of E: to see if the problem is related to a specific folder or file.

Regards

Michael

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

revarooo
Level 6
Employee

You could also try letting bpbkar run through the E: drive without sending the data over to the media server, to see if it also aborts. If it does, the problem is not CLIENT_READ_TIMEOUT or the network.

"C:\Program Files\VERITAS\NetBackup\bin\bpbkar32" -nocont E:\ > NUL 2> C:\temp.txt

 

Ensure you have netbackup\logs\bpbkar\ logging enabled and logging increased on the client to maximum.

 

Amarnath_Sathis
Level 5

Hi Rookie,

Check whether you have enough disk space for the drive in which you have configured the log file.

If not please free space in the drive or add some additional disk space.

If the backup is unable to update the log file, the backup goes into a hung state.

We faced the same issue in our environment.

rookie11
Moderator
   VIP   

Hi guys

I tried "C:\Program Files\VERITAS\NetBackup\bin\bpbkar32" -nocont E:\ > NUL 2> C:\temp.txt on the client, and it works completely fine.

Disk space and timeout options have already been checked.

This problem is not present on just one client, but on around 15 clients in my IT environment.

Please suggest what other options I can check.

revarooo
Level 6
Employee

Rookie, 

If it is affecting 15 clients, then this is definitely either a network or a timeout issue.

I would first check the network is good.

Also on the media server check what the CLIENT_READ_TIMEOUT is set to.

 

rookie11
Moderator
   VIP   

CLIENT_READ_TIMEOUT is set to 500 on all 4 media servers.

To check the network: is there specific software that Symantec recommends, or any NetBackup-based command, Data Domain command (it is my storage unit), or OS-based command (the media servers are Windows servers)?

revarooo
Level 6
Employee

Rookie, there are some tools we can use for checking the network, but you will need to raise a case for this.

CLIENT_READ_TIMEOUT is on the low side. I would increase that to 1200. 

Not sure if it will help, as you say the clients stay hung for 2-3 days. I think setting up bpbkar trace logging may help. Increase logging to maximum on the client and media server. Ensure the bpbkar and bpfis log directories are in place on the client under netbackup\logs\

Create an empty file (with no extension) called bpbkar_path_tr in the parent netbackup directory.

On the media server, ensure logging is enabled for bpbrm and bptm.

Run a backup and, when it starts hanging, take a look at the bpbkar and bpfis logs.
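
The client-side part of the setup above (log directories plus the empty bpbkar_path_tr file) can be sketched in Python. This is only an illustration: the function name is hypothetical, and the install path in the usage comment is taken from the log excerpt earlier in the thread and may differ on your client. It does not change the logging level, which still has to be raised separately.

```python
from pathlib import Path

def prepare_client_logging(netbackup_dir):
    """Create logs\\bpbkar, logs\\bpfis and an empty bpbkar_path_tr file.

    netbackup_dir is the parent NetBackup directory on the client
    (an assumption: adjust it to your actual install location).
    """
    root = Path(netbackup_dir)
    created = []
    for name in ("bpbkar", "bpfis"):
        log_dir = root / "logs" / name
        log_dir.mkdir(parents=True, exist_ok=True)  # no error if it exists
        created.append(log_dir)
    touch_file = root / "bpbkar_path_tr"  # empty file, no extension
    touch_file.touch(exist_ok=True)
    created.append(touch_file)
    return created

# Example call on a client (path assumed from the log excerpt in this thread):
# prepare_client_logging(r"C:\Program Files\VERITAS\NetBackup")
```

Running it a second time is harmless, since existing directories and an existing touch file are left as they are.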

Marianne
Level 6
Partner    VIP    Accredited Certified

Ok, let's summarize:

Your opening post made it look like ONE client has a problem with E-drive only.

Now we see that about 15 clients have this problem.

What is the common factor? One media server? All media servers?

Is the same NIC on the media server(s) used to receive data from clients as well as send data to DD?

Have you checked/verified that latest drivers and firmware settings have been applied to the NIC?

Have you obtained network/NIC settings for DD to ensure optimum performance?

Have you tried to monitor incoming data on DD itself while backup is running?

 

Some time ago I saw that a specific Broadcom NIC model had a problem with high I/O. I found the information by Googling the NIC model number. Latest drivers/NIC solved the problem.

 

Mark_Solutions
Level 6
Partner Accredited Certified
When I first read your post, I felt that this was caused by corrupt files, but if it affects 15 clients that seems less likely, unless the data gets copied to all of them and so the corruption is spread about.

However, the log also mentions checkpoints, so it is as if the checkpoint interval and the client connect/read timeouts are clashing. What is the checkpoint interval set to in the policies?

Also, as covered by the earlier questions, what is the common factor here? Media server, policy, etc.

Jibs
Level 3

Not sure this is the correct place to post my problem.

I recently installed Backup Exec 2012 on a Windows 2008 R2 64-bit server, and I have a Symantec Vault server 8.02. When I schedule a full backup, it works only once. The next time the same job runs, it gets stuck after the EVmonitir backup (approximately 8 MB); after that the backup does not respond, even if I keep it running for 24 hours. If I reboot my Vault server the backup works, but the next backup has the same problem. No error is reported.

I have logged a call with Symantec tech support, but no luck, and the response from them has been very poor.

I have changed the storage from tape to disk, but no luck. Any ideas? I checked sgmonitor but don't know where to look.

 

rookie11
Moderator
   VIP   

Hi guys

On some clients I have set:

CLIENT_CONNECT_TIMEOUT = 3600
CLIENT_READ_TIMEOUT = 3600

A backup that goes into a hung state shows:
Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times    <-- this is the same for almost all clients that go into a hung state.

mph999
Level 6
Employee Accredited

Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times

This shows that the bptm process was delayed about 28000 times waiting to be sent data from the client. It is only an indication and, without knowing what the similar line in bpbkar shows, is virtually useless: if bpbkar is delayed more times, then that would be the more important value; if it is delayed fewer times, then the bptm value is more important.

Also, 28351 looks like a big value, but that depends on how big the backup is. If the backup is small, then yes, this is a big value; if the backup is large, it is less relevant.

So, it indicates that there is a delay, where the clients are not sending data to the media server when they should be. How much of a factor this is depends.
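
To compare these counters across several hung jobs, the two numbers can be pulled out of the job-detail text with a few lines of Python. This is a rough sketch: the regular expression and the function name are my own and not a NetBackup tool, and the sample lines are the two bptm lines quoted in this thread.

```python
import re

# Matches lines such as:
# Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times
BUFFER_LINE = re.compile(
    r"bptm\(pid=(\d+)\) waited for full buffer (\d+) times, delayed (\d+) times",
    re.IGNORECASE,
)

def parse_buffer_waits(text):
    """Return one dict per matching bptm line found in text."""
    results = []
    for m in BUFFER_LINE.finditer(text):
        pid, waited, delayed = (int(g) for g in m.groups())
        results.append({"pid": pid, "waited": waited, "delayed": delayed})
    return results

sample = """
Info bptm(pid=3300) waited for full buffer 9408 times, delayed 28351 times
info bptm(pid=7024) waited for full buffer 7313 times, delayed 20782 times
"""

rows = parse_buffer_waits(sample)
for row in rows:
    ratio = row["delayed"] / row["waited"]
    print(row, f"delays per wait: {ratio:.2f}")
```

The delays-per-wait ratio is only a convenience for lining jobs up side by side. On its own it says little until it is set against the matching bpbkar figure and the size of the backup.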

Regarding this:

to check network; is there specific software which symantec recommends or any command netbackup based or data domain command[ it my storage unit] or OS based command [media are window servers]

No, not really. The network is not the responsibility of Symantec (sorry). It (the network) is on the same level as the operating system in terms of 'support'. However, there is a tool called 'Camel' (no idea why) that can give some performance figures, and AppCritical, which can be very useful. You will need to log a call with Symantec to use these.

Martin

Marianne
Level 6
Partner    VIP    Accredited Certified

bptm tells us that the backup was not really hung, just d-o-g slow.

Have you disabled TCP Chimney on all the W2003 clients?

I have seen how disabling it dramatically increased backup performance.

http://www.symantec.com/docs/TECH60844

Network connectivity tuning to avoid network read/write failures and increase performance

 

 

rookie11
Moderator
   VIP   

 

Got this on another hung backup job:
info bptm(pid=7024) waited for full buffer 7313 times, delayed 20782 times    
4/25/2012 8:15:41 PM - Error bpbrm(pid=3824) could not write KEEPALIVE to COMM_SOCK

Marianne
Level 6
Partner    VIP    Accredited Certified

Have a look at the TechNote above.

TCP Chimney causes various 'horror' problems on W2003 - slow throughput, network errors, etc...

 

rookie11
Moderator
   VIP   

You forgot the TechNote, Marianne.

Marianne
Level 6
Partner    VIP    Accredited Certified

I did not - it's two posts back: https://www-secure.symantec.com/connect/forums/backup-hung-state#comment-7045481

 

http://www.symantec.com/docs/TECH60844