Solved: Status 14

arctics2 · ‎12-27-2013

Using Netbackup 5230 appliance on 2.5.3

I seem to be getting status 14 errors on a few of our backups. The backups run for a few hours, then fail at various times.

12/27/2013 12:06:28 - Error bptm (pid=27347) image copy failed: error 2060017: system call failed

12/27/2013 12:06:32 - Critical bptm (pid=27347) sts_close_handle failed: 2060017 system call failed

12/27/2013 12:06:32 - Info wpbk11ho (pid=27347) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 308099982 KB, CR sent: 263117 KB, CR sent over FC: 0 KB, dedup: 99.9%

12/27/2013 12:06:32 - Critical bptm (pid=27347) sts_close_server failed: error 2060005 object is busy, cannot be closed

12/27/2013 12:06:34 - Info bptm (pid=27347) EXITING with status 14 <----------

12/27/2013 12:06:35 - Info bpbkar (pid=3284) done. status: 14: file write failed

In most cases, the log shows as above, however, in some cases, I see as below.

12/26/2013 21:02:33 - Error bptm (pid=11612) image copy failed: error 2060002: memory allocation

12/26/2013 21:02:38 - Critical bptm (pid=11612) sts_close_handle failed: 2060002 memory allocation

12/26/2013 21:02:38 - Info wpbk11ho (pid=11612) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 352620800 KB, CR sent: 147996 KB, CR sent over FC: 0 KB, dedup: 100.0%

12/26/2013 21:02:38 - Critical bptm (pid=11612) sts_close_server failed: error 2060005 object is busy, cannot be closed

12/26/2013 21:02:40 - Info bptm (pid=11612) EXITING with status 14 <----------

12/26/2013 21:02:42 - Info bpbkar (pid=8532) done. status: 14: file write failed

Or,

12/23/2013 11:35:58 - Error bptm (pid=19685) image copy failed: error 2060017: system call failed

12/23/2013 11:36:01 - Critical bptm (pid=19685) sts_close_handle failed: 2060011 offset invalid for object or context

12/23/2013 11:36:01 - Info wpbk11ho (pid=19685) StorageServer=PureDisk:master; Report=PDDO Stats for (master:( scanned: 375044440 KB, CR sent: 381747 KB, CR sent over FC: 0 KB, dedup: 99.9%

12/23/2013 11:36:01 - Critical bptm (pid=19685) sts_close_server failed: error 2060005 object is busy, cannot be closed

12/23/2013 11:36:16 - Info bptm (pid=19685) EXITING with status 14 <----------

12/23/2013 11:36:18 - Info bpbkar (pid=9796) done. status: 14: file write failed

Mark_Solutions · ‎01-02-2014

The object busy impies your system is under load and struggling - shouldn't actually cause a failure until it hits 500 but getting to 67 is not good.

So it is worth doing those things but other things that have a real effect on the system - worth just checking through these things:

1. A system runs best when no more than 80% full - if possible keep it at no more than 75% full

2. A lag in queue processing will cause issues - run crcontrol --queueinfo to see how big your queue is - if it looks large (the result is in bytes) then keep running --processqueue until it has reduced substantially - this helps keep performance at its best

3. If you use accelerator then tune the /disk/etc/puredisk/contentrouter.cfg - changing the WorkerThreads value from 64 to 128 (doesnt hurt to do this anyway - needs a service re-start or a reboot to register this change - but dont reboot until you have reduced your queue size if it is large)

4. Make sure that haven't had any disk failures - a disk failure will reduce performance by 20%

Check through all of these to optimise your performance

View solution in original post

Mark_Solutions · ‎12-30-2013

It is possible that the pools are bouncing UP and DOWN - take a look at the All Log Entries report to see if that is the case

If so then it is worth using the following file on the all media servers and the Master Server (in the equivalent path for Windows Servers):

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

Put a value of 800 in the file

If you have other files starting with DPS_PROXY in the location then rename them to .old

Once they are on all servers then re-start the netbackup services on them all to register these file changes

See if that helps

arctics2 · ‎12-30-2013

Thanks for the response and suggestion.

I see no indication in the All Log Entries report that the pools are bouncing up and down.

All I see when the backup is attempted is the following, which is repeated ...

Sun Dec 29 17:58:04 EST 2013

Master

client

5274

image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed

arctics2 · ‎12-30-2013

On further investigation, I'm noticing the "image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed" error on almost all client backups.

I read a few other posts indicating problems with client side dedupe that might cause this, so I disabled it, and the errors seem to disappear. However, this isn't ideal, as client side dedupe should work fine, so what could be the cause?

arctics2 · ‎12-30-2013

Mark: I read the following thread which appears to have some similarities to my issue. Would the solution you suggested there be applicable here?

https://www-secure.symantec.com/connect/forums/stsclosehandle-failed-2060019-error-occurred-network-...

Mark_Solutions · ‎01-02-2014

The object busy impies your system is under load and struggling - shouldn't actually cause a failure until it hits 500 but getting to 67 is not good.

So it is worth doing those things but other things that have a real effect on the system - worth just checking through these things:

1. A system runs best when no more than 80% full - if possible keep it at no more than 75% full

2. A lag in queue processing will cause issues - run crcontrol --queueinfo to see how big your queue is - if it looks large (the result is in bytes) then keep running --processqueue until it has reduced substantially - this helps keep performance at its best

3. If you use accelerator then tune the /disk/etc/puredisk/contentrouter.cfg - changing the WorkerThreads value from 64 to 128 (doesnt hurt to do this anyway - needs a service re-start or a reboot to register this change - but dont reboot until you have reduced your queue size if it is large)

4. Make sure that haven't had any disk failures - a disk failure will reduce performance by 20%

Check through all of these to optimise your performance

arctics2 · ‎01-02-2014

Thanks for the response Mark.

1. The system is running at about 25% full at the moment, as it hasn't been in production very long.

2. Not sure what would be considered a large queue, but on one appliance, the queue is at about 8GB and the other is about 4GB.

3. Pretty well all our backups utilize Accelerator. I applied the suggested change, so hopefully this helps.

4. Hardware looks good.

Thanks

arctics2 · ‎01-03-2014

I applied the above changes, and checked the DPS_PROXYDEFAULTRECVTMO file, which already contained 800 in it. But TECH156490 suggests that it should be increased to 3600, so I did that. The image copy is not ready errors on clients with client side dedupe enabled seem to have decreased quite a lot, but still continue. They aren't causing failures so far, but not sure if I should still be concerned about it.

arctics2 · ‎02-28-2014

So this issue seems to have returned on one of our clients. It fails with a status 14.

There are a number of errors in the logs stating

image copy is not ready, retry attempt: 0 of 500 object is busy, cannot be closed

However, the number of attempts don't seem to exceed 15 of 500.

The bpbkar log on the client indicates the following error repeated a number of times:

10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes

Until eventually, it gets to:

10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: attempted to send 1 bytes

10:50:31.082 PM: [4756.2880] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14

10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 97 bytes

10:50:31.128 PM: [4756.4596] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)

10:50:31.144 PM: [4756.4596] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...

10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 71 bytes

10:50:31.144 PM: [4756.4596] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 40: network connection broken

10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 48 bytes

10:50:31.175 PM: [4756.4596] <4> dos_backup::fscp_fini: INF - backup status:<6>

I have another client at the same site which has no issues at all. Also, the failure seems to occur only when backing up the E:\ drive of the client, all other drives seem ok.

Mark_Solutions · ‎02-28-2014

We need to see the lines before what you have posted - a connection close is a timeout so what is the gap between the previous lines and when it fails - that time diference is the timeout that you need to address

This is not actually the same error - it is a timeout matter not a performance issue

arctics2 · ‎02-28-2014

10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new directory:<E:\Data\Oper_Supp\datapath>, forcing full backup

10:50:24.576 PM: [4756.4596] <2> dos_backup::fscp_change_detection(): DBG - file changed: incremental backup, always backup folders: <E:\Data\Oper_Supp\datapath>

10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes

10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new file:<E:\Data\Oper_Supp\datapath>, forcing backup

10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)

10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 245 bytes

These are the few entries in bpbkar just prior to when the failures start. Not sure if this is enough.

Mark_Solutions · ‎02-28-2014

Still not enough - please show the full text from the detail section of job itself - if it is a lot copy it to a text file and add it as an attachment - thanks

arctics2 · ‎02-28-2014

Please see job details attached.

Mark_Solutions · ‎02-28-2014

Now there is yet another error !

02/27/2014 22:49:52 - Critical bptm (pid=22761) sts_close_handle failed: 2060002 memory allocation
So it starts at 21:01 and does updates every minute or so until near the end when it has a gap of almost 5 minutes(22:45 to 22:49) but does have a gap of 6 minutes just prior to that (22:20 to 22:26) so I am doubting a timeout - more of a coping issue

Did you up the worker threads as i suggested previously so that it can cope with the accelerator backups?

Are you also doing any VMware backups to the appliance and if so do you have multiple paths mapped via the fibre?

It just seems to run out of memory - perhaps due to the lack of threads, perhaps due to other factors such as the VMware backups which will make the appliance run out of memory if you have multiple paths mapped to it for VMware backups - or maybe you are just pushing the Appliance too hard?

Performance also drops as it get fuller as i mentioned previously and when the process queue is very large

Let me know the answers to the above and also the output of

./crcontrol --queueinfo

arctics2 · ‎02-28-2014

total queue size : 899966456

creation date of oldest tlog : Fri Feb 28 12:35:35 2014

Yes, worker threads are currently set to 128.

No vmware backups.

Running at about 35% capacity on the dedupe pool.

I'm not sure if this would be a coping issue as the error occurs even if this is the only backup running. Also, all other clients don't seem to be exhibiting the same issues.

Mark_Solutions · ‎02-28-2014

That queue is very large - process it until it comes right down to see if that helps

./crcontrol --processqueue

See if it is still running with

./crcontrol --processqueueinfo

arctics2 · ‎03-03-2014

Will do.

While I'm waiting for it to come down though, I'm still confused why this would be an issue on this particular drive of this particular client. If it was an appliance side issue, shouldn't all jobs and all clients be affected?

arctics2 · ‎03-14-2014

Hi Mark,

I've been doing as you suggested repeatedly since my last post, and the queue seems to go up and down, but as of right now, it's much higher:

total queue size : 1173007915

creation date of oldest tlog : Fri Mar 14 12:11:20 2014

Is this normal??

Also, in the meantime, since it's only this one client having issues, is there anything else I can check?

Mark_Solutions · ‎03-17-2014

Are you doing client side de-dupe for that client?

Perhaps it is the client itself running short of memory rather than the appliance.

As with any server it is worth re-booting the appliance occasionally as well to keep it clean and lean and any left over / orphaned processes flushed out of memory

VOX

Status 14