
Status 14

arctics2
Level 5
Using a NetBackup 5230 appliance on 2.5.3
 
I seem to be getting status 14 errors on a few of our backups.  The backups run for a few hours, then fail at various times.
 
12/27/2013 12:06:28 - Error bptm (pid=27347) image copy failed: error 2060017: system call failed
12/27/2013 12:06:32 - Critical bptm (pid=27347) sts_close_handle failed: 2060017 system call failed
12/27/2013 12:06:32 - Info wpbk11ho (pid=27347) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 308099982 KB, CR sent: 263117 KB, CR sent over FC: 0 KB, dedup: 99.9%
12/27/2013 12:06:32 - Critical bptm (pid=27347) sts_close_server failed: error 2060005 object is busy, cannot be closed
12/27/2013 12:06:34 - Info bptm (pid=27347) EXITING with status 14 <----------
12/27/2013 12:06:35 - Info bpbkar (pid=3284) done. status: 14: file write failed
 
In most cases the log looks like the above; however, in some cases I see the following.
 
12/26/2013 21:02:33 - Error bptm (pid=11612) image copy failed: error 2060002: memory allocation
12/26/2013 21:02:38 - Critical bptm (pid=11612) sts_close_handle failed: 2060002 memory allocation
12/26/2013 21:02:38 - Info wpbk11ho (pid=11612) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 352620800 KB, CR sent: 147996 KB, CR sent over FC: 0 KB, dedup: 100.0%
12/26/2013 21:02:38 - Critical bptm (pid=11612) sts_close_server failed: error 2060005 object is busy, cannot be closed
12/26/2013 21:02:40 - Info bptm (pid=11612) EXITING with status 14 <----------
12/26/2013 21:02:42 - Info bpbkar (pid=8532) done. status: 14: file write failed

Or,

12/23/2013 11:35:58 - Error bptm (pid=19685) image copy failed: error 2060017: system call failed
12/23/2013 11:36:01 - Critical bptm (pid=19685) sts_close_handle failed: 2060011 offset invalid for object or context
12/23/2013 11:36:01 - Info wpbk11ho (pid=19685) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 375044440 KB, CR sent: 381747 KB, CR sent over FC: 0 KB, dedup: 99.9%
12/23/2013 11:36:01 - Critical bptm (pid=19685) sts_close_server failed: error 2060005 object is busy, cannot be closed
12/23/2013 11:36:16 - Info bptm (pid=19685) EXITING with status 14 <----------
12/23/2013 11:36:18 - Info bpbkar (pid=9796) done. status: 14: file write failed


Mark_Solutions
Level 6
Partner Accredited Certified

It is possible that the pools are bouncing UP and DOWN - take a look at the All Log Entries report to see if that is the case

If so, then it is worth creating the following file on all media servers and the Master Server (in the equivalent path on Windows servers):

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

Put a value of 800 in the file

If you have other files starting with DPS_PROXY in the location then rename them to .old

Once the files are in place on all servers, restart the NetBackup services on each of them to register the change
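
For reference, a minimal sketch of those steps on a UNIX media server - the install path is the standard NetBackup default, and on an appliance you may prefer to restart services from the CLISH, so treat this as a sketch rather than the exact procedure:

cd /usr/openv/netbackup/db/config
echo 800 > DPS_PROXYDEFAULTRECVTMO
# rename any other DPS_PROXY* touch files so only the new one is read
for f in DPS_PROXY*; do
    [ -e "$f" ] && [ "$f" != "DPS_PROXYDEFAULTRECVTMO" ] && mv "$f" "$f.old"
done
# restart NetBackup to register the change (do this outside backup windows)
/usr/openv/netbackup/bin/bp.kill_all
/usr/openv/netbackup/bin/bp.start_all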

See if that helps

arctics2
Level 5

Thanks for the response and suggestion.  

I see no indication in the All Log Entries report that the pools are bouncing up and down.

All I see when the backup is attempted is the following, which is repeated ...

Sun Dec 29 17:58:04 EST 2013 Master client   5274   image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed

 

arctics2
Level 5

On further investigation, I'm noticing the "image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed" error on almost all client backups.  

I read a few other posts indicating that client-side dedupe problems might cause this, so I disabled it, and the errors seem to have disappeared. However, this isn't ideal, as client-side dedupe should work fine - so what could be the cause?

 

arctics2
Level 5

Mark: I read the following thread, which appears to have some similarities to my issue. Would the solution you suggested there be applicable here?

https://www-secure.symantec.com/connect/forums/stsclosehandle-failed-2060019-error-occurred-network-...

 

Mark_Solutions
Level 6
Partner Accredited Certified

The "object is busy" implies your system is under load and struggling - it shouldn't actually cause a failure until it hits 500 retries, but getting to 67 is not good.

So it is worth doing those things, but there are other things that have a real effect on the system - worth checking through these (see the sketch after this list):

1. A system runs best when no more than 80% full - if possible keep it at no more than 75% full

2. A lag in queue processing will cause issues - run crcontrol --queueinfo to see how big your queue is - if it looks large (the result is in bytes) then keep running crcontrol --processqueue until it has reduced substantially - this helps keep performance at its best

3. If you use Accelerator then tune /disk/etc/puredisk/contentrouter.cfg, changing the WorkerThreads value from 64 to 128 (it doesn't hurt to do this anyway - it needs a service restart or a reboot to register the change, but don't reboot until you have reduced your queue size if it is large)

4. Make sure you haven't had any disk failures - a disk failure will reduce performance by 20%

Check through all of these to optimise your performance
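
A rough sketch of steps 2 and 3 from the appliance shell - the crcontrol path below is the usual MSDP location, and the exact WorkerThreads key format should be verified in your own contentrouter.cfg, so treat both as assumptions:

cd /usr/openv/pdde/pdcr/bin
./crcontrol --queueinfo          # queue size is reported in bytes
./crcontrol --processqueue       # kick off queue processing
./crcontrol --queueinfo          # re-check; repeat until the size drops substantially

vi /disk/etc/puredisk/contentrouter.cfg    # change WorkerThreads=64 to WorkerThreads=128
# then restart services (or reboot, once the queue is small) to register the change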

arctics2
Level 5

Thanks for the response Mark.

1.  The system is running at about 25% full at the moment, as it hasn't been in production very long.

2.  Not sure what would be considered a large queue, but on one appliance, the queue is at about 8GB and the other is about 4GB.

3.  Pretty well all our backups utilize Accelerator.  I applied the suggested change, so hopefully this helps.

4.  Hardware looks good.

Thanks

arctics2
Level 5

I applied the above changes and checked the DPS_PROXYDEFAULTRECVTMO file, which already contained 800. However, TECH156490 suggests it should be increased to 3600, so I did that. The "image copy is not ready" errors on clients with client-side dedupe enabled seem to have decreased quite a lot, but they still continue. They aren't causing failures so far, but I'm not sure whether I should still be concerned.
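
For anyone following along, the bump to 3600 is just the same touch file with a new value - a sketch assuming the UNIX path from earlier, followed by the same service restart:

echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO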

arctics2
Level 5

So this issue seems to have returned on one of our clients.  It fails with a status 14.

There are a number of errors in the logs stating

image copy is not ready, retry attempt: 0 of 500 object is busy, cannot be closed

However, the retry count never seems to exceed 15 of 500.

The bpbkar log on the client indicates the following error repeated a number of times:

10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes
 
Until eventually, it gets to:
 
10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
10:50:31.082 PM: [4756.2880] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14
 
10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 97 bytes
10:50:31.128 PM: [4756.4596] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
10:50:31.144 PM: [4756.4596] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 71 bytes
10:50:31.144 PM: [4756.4596] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 40: network connection broken
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 48 bytes
10:50:31.175 PM: [4756.4596] <4> dos_backup::fscp_fini: INF - backup status:<6>

I have another client at the same site which has no issues at all. Also, the failure seems to occur only when backing up the E:\ drive of the client; all other drives seem OK.

Mark_Solutions
Level 6
Partner Accredited Certified

We need to see the lines before what you have posted - a connection close is a timeout, so what is the gap between the previous lines and the point where it fails? That time difference is the timeout you need to address

This is not actually the same error - it is a timeout matter, not a performance issue
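
(As a worked example, with hypothetical timestamps: if bpbkar logs its last successful write at 22:20:05 and the connection abort at 22:25:10, the gap is roughly 305 seconds - right at the common 300-second CLIENT_READ_TIMEOUT default, which would make that the timeout to raise.)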

arctics2
Level 5
 
10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new directory:<E:\Data\Oper_Supp\datapath>, forcing full backup
10:50:24.576 PM: [4756.4596] <2> dos_backup::fscp_change_detection(): DBG - file changed: incremental backup, always backup folders: <E:\Data\Oper_Supp\datapath>
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes
10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new file:<E:\Data\Oper_Supp\datapath>, forcing backup
10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 245 bytes
These are the few entries in bpbkar just prior to when the failures start. I'm not sure if this is enough.

Mark_Solutions
Level 6
Partner Accredited Certified

Still not enough - please show the full text from the detail section of the job itself - if it is a lot, copy it to a text file and add it as an attachment - thanks

arctics2
Level 5

Please see job details attached.

Mark_Solutions
Level 6
Partner Accredited Certified

Now there is yet another error!

02/27/2014 22:49:52 - Critical bptm (pid=22761) sts_close_handle failed: 2060002 memory allocation
So it starts at 21:01 and does updates every minute or so until near the end, where there is a gap of almost 5 minutes (22:45 to 22:49); but since it survived a 6-minute gap just prior to that (22:20 to 22:26), I am doubting a timeout - it looks more like a coping issue

Did you up the worker threads as I suggested previously, so that it can cope with the Accelerator backups?

Are you also doing any VMware backups to the appliance and, if so, do you have multiple paths mapped via fibre?

It just seems to run out of memory - perhaps due to the lack of threads, perhaps due to other factors such as VMware backups, which will make the appliance run out of memory if you have multiple paths mapped to it - or maybe you are just pushing the appliance too hard?

Performance also drops as it gets fuller, as I mentioned previously, and when the process queue is very large

Let me know the answers to the above and also the output of

./crcontrol --queueinfo

 

arctics2
Level 5
total queue size : 899966456
creation date of oldest tlog : Fri Feb 28 12:35:35 2014
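
(For scale, since --queueinfo reports bytes: 899,966,456 bytes is roughly 0.84 GiB.)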
 
Yes, worker threads are currently set to 128.
 
No vmware backups. 
 
Running at about 35% capacity on the dedupe pool.
 
I'm not sure this is a coping issue, as the error occurs even when this is the only backup running. Also, none of the other clients seem to be exhibiting the same issues.

Mark_Solutions
Level 6
Partner Accredited Certified

That queue is very large - process it until it comes right down to see if that helps

./crcontrol --processqueue

See if it is still running with

./crcontrol --processqueueinfo
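
A hedged sketch of that loop from the appliance shell (same assumed crcontrol path as before; the exact --processqueueinfo output varies by version, so check yours):

cd /usr/openv/pdde/pdcr/bin
./crcontrol --processqueue          # start processing the transaction-log queue
./crcontrol --processqueueinfo      # shows whether processing is still running
./crcontrol --queueinfo             # watch the byte count fall; re-run --processqueue as needed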

arctics2
Level 5

Will do.  

While I'm waiting for it to come down, though, I'm still confused as to why this would be an issue on this particular drive of this particular client. If it were an appliance-side issue, shouldn't all jobs and all clients be affected?

 

arctics2
Level 5

Hi Mark,

I've been doing as you suggested repeatedly since my last post, and the queue seems to go up and down, but as of right now, it's much higher:

total queue size : 1173007915
creation date of oldest tlog : Fri Mar 14 12:11:20 2014
 
Is this normal??
 
Also, in the meantime, since it's only this one client having issues, is there anything else I can check?

Mark_Solutions
Level 6
Partner Accredited Certified

Are you doing client-side dedupe for that client?

Perhaps it is the client itself running short of memory rather than the appliance.

As with any server, it is worth rebooting the appliance occasionally to keep it clean and lean, and to flush any leftover or orphaned processes out of memory