12-27-2013 12:09 PM
12-30-2013 05:51 AM
It is possible that the pools are bouncing UP and DOWN - take a look at the All Log Entries report to see if that is the case
If so, then it is worth creating the following file on all the media servers and the Master Server (in the equivalent path on Windows servers):
/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
Put a value of 800 in the file
If you have other files starting with DPS_PROXY in the location then rename them to .old
Once the files are in place on all servers, restart the NetBackup services on each of them to register the change
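For example, on a UNIX server the steps might look like this (a sketch only; Windows uses the equivalent path under the install directory, and you can substitute your usual way of restarting the services):
# Create the timeout file with a value of 800
echo 800 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
# Rename any other DPS_PROXY* files out of the way
cd /usr/openv/netbackup/db/config
for f in DPS_PROXY*; do [ "$f" = "DPS_PROXYDEFAULTRECVTMO" ] || mv "$f" "$f.old"; done
# Restart the NetBackup services to register the change
/usr/openv/netbackup/bin/bp.kill_all
/usr/openv/netbackup/bin/bp.start_all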
See if that helps
12-30-2013 11:23 AM
Thanks for the response and suggestion.
I see no indication in the All Log Entries report that the pools are bouncing up and down.
All I see when the backup is attempted is the following, which is repeated ...
Sun Dec 29 17:58:04 EST 2013 | Master | client | 5274 | image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed |
12-30-2013 01:40 PM
On further investigation, I'm noticing the "image copy is not ready, retry attempt: 67 of 500 object is busy, cannot be closed" error on almost all client backups.
I read a few other posts indicating problems with client side dedupe that might cause this, so I disabled it, and the errors seem to disappear. However, this isn't ideal, as client side dedupe should work fine, so what could be the cause?
12-30-2013 02:38 PM
Mark: I read the following thread which appears to have some similarities to my issue. Would the solution you suggested there be applicable here?
01-02-2014 01:44 AM
The "object is busy" message implies your system is under load and struggling - it shouldn't actually cause a failure until the retry count hits 500, but getting to 67 is not good.
So those changes are worth making, but there are other things that have a real effect on the system - it is worth checking through these:
1. A system runs best when no more than 80% full - if possible, keep it at no more than 75% full
2. A lag in queue processing will cause issues - run crcontrol --queueinfo to see how big your queue is - if it looks large (the result is in bytes) then keep running --processqueue until it has reduced substantially - this helps keep performance at its best (see the example commands after this list)
3. If you use Accelerator then tune /disk/etc/puredisk/contentrouter.cfg by changing the WorkerThreads value from 64 to 128 (it doesn't hurt to do this anyway - it needs a service restart or a reboot to register the change, but don't reboot until you have reduced your queue size if it is large)
4. Make sure that you haven't had any disk failures - a disk failure will reduce performance by 20%
Check through all of these to optimise your performance
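A minimal sketch of those checks from the command line, assuming crcontrol is in the usual MSDP path (/usr/openv/pdde/pdcr/bin) - adjust if your install differs:
# See how big the transaction queue is (the result is in bytes)
/usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
# If it is large, kick off queue processing - repeat until the queue has reduced substantially
/usr/openv/pdde/pdcr/bin/crcontrol --processqueue
# For the Accelerator tuning, find the WorkerThreads line in contentrouter.cfg,
# change WorkerThreads=64 to WorkerThreads=128, then restart the services
grep -n WorkerThreads /disk/etc/puredisk/contentrouter.cfg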
01-02-2014 12:00 PM
Thanks for the response Mark.
1. The system is running at about 25% full at the moment, as it hasn't been in production very long.
2. Not sure what would be considered a large queue, but on one appliance, the queue is at about 8GB and the other is about 4GB.
3. Pretty well all our backups utilize Accelerator. I applied the suggested change, so hopefully this helps.
4. Hardware looks good.
Thanks
01-03-2014 01:02 PM
I applied the above changes and checked the DPS_PROXYDEFAULTRECVTMO file, which already contained 800. However, TECH156490 suggests it should be increased to 3600, so I did that. The "image copy is not ready" errors on clients with client side dedupe enabled have decreased quite a lot, but still continue. They aren't causing failures so far, but I'm not sure whether I should still be concerned.
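For reference, the change itself is just replacing the value in the file and restarting the services - on UNIX something like:
echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO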
02-28-2014 10:37 AM
So this issue seems to have returned on one of our clients. It fails with a status 14.
There are a number of errors in the logs stating
image copy is not ready, retry attempt: 0 of 500 object is busy, cannot be closed
However, the number of attempts doesn't seem to exceed 15 of 500.
The bpbkar log on the client indicates the following error repeated a number of times:
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes
10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.082 PM: [4756.2880] <16> dtcp_write: TCP - failure: attempted to send 1 bytes
10:50:31.082 PM: [4756.2880] <16> tar_base::keepaliveThread: INF - keepalive thread abnormal exit :14
10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.128 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 97 bytes
10:50:31.128 PM: [4756.4596] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
10:50:31.144 PM: [4756.4596] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 71 bytes
10:50:31.144 PM: [4756.4596] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 40: network connection broken
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:31.144 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 48 bytes
10:50:31.175 PM: [4756.4596] <4> dos_backup::fscp_fini: INF - backup status:<6>
I have another client at the same site which has no issues at all. Also, the failure seems to occur only when backing up the E:\ drive of the client, all other drives seem ok.
02-28-2014 11:10 AM
We need to see the lines before what you have posted - a connection close is a timeout, so what is the gap between the previous lines and the point where it fails? That time difference is the timeout that you need to address.
This is not actually the same error - it is a timeout matter, not a performance issue.
02-28-2014 11:16 AM
10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new directory:<E:\Data\Oper_Supp\datapath>, forcing full backup
10:50:24.576 PM: [4756.4596] <2> dos_backup::fscp_change_detection(): DBG - file changed: incremental backup, always backup folders: <E:\Data\Oper_Supp\datapath>
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.576 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 211 bytes
10:50:24.576 PM: [4756.4596] <4> dos_backup::tfs_scannext: INF - detected renamed/new file:<E:\Data\Oper_Supp\datapath>, forcing backup
10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: send socket (528) (TCP 10053: Software caused connection abort)
10:50:24.639 PM: [4756.4596] <16> dtcp_write: TCP - failure: attempted to send 245 bytes
02-28-2014 11:27 AM
Still not enough - please show the full text from the detail section of the job itself - if it is a lot, copy it to a text file and add it as an attachment - thanks
02-28-2014 01:28 PM
Please see job details attached.
02-28-2014 01:47 PM
Now there is yet another error!
02/27/2014 22:49:52 - Critical bptm (pid=22761) sts_close_handle failed: 2060002 memory allocation
So it starts at 21:01 and does updates every minute or so until near the end, where there is a gap of almost 5 minutes (22:45 to 22:49), but there is also a gap of 6 minutes just prior to that (22:20 to 22:26), so I am doubting a timeout - it looks more like a coping issue.
Did you up the worker threads as I suggested previously so that it can cope with the Accelerator backups?
Are you also doing any VMware backups to the appliance, and if so, do you have multiple paths mapped via the fibre?
It just seems to run out of memory - perhaps due to the lack of threads, perhaps due to other factors such as the VMware backups, which will make the appliance run out of memory if you have multiple paths mapped to it for VMware backups - or maybe you are just pushing the appliance too hard?
Performance also drops as it gets fuller, as I mentioned previously, and when the process queue is very large
Let me know the answers to the above and also the output of
./crcontrol --queueinfo
02-28-2014 01:56 PM
02-28-2014 02:34 PM
That queue is very large - process it until it comes right down to see if that helps
./crcontrol --processqueue
See if it is still running with
./crcontrol --processqueueinfo
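If you want to watch the queue drain, something like this works (assuming the usual MSDP path for crcontrol; the 60-second interval is just an example):
cd /usr/openv/pdde/pdcr/bin
# Kick off queue processing
./crcontrol --processqueue
# Re-check every minute whether processing is still running and how big the queue is
watch -n 60 './crcontrol --processqueueinfo; ./crcontrol --queueinfo'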
03-03-2014 07:35 AM
Will do.
While I'm waiting for it to come down, though, I'm still confused about why this would be an issue on this particular drive of this particular client. If it were an appliance-side issue, shouldn't all jobs and all clients be affected?
03-14-2014 11:16 AM
Hi Mark,
I've been doing as you suggested repeatedly since my last post, and the queue seems to go up and down, but as of right now, it's much higher:
03-17-2014 03:43 AM
Are you doing client side de-dupe for that client?
Perhaps it is the client itself running short of memory rather than the appliance.
As with any server, it is worth rebooting the appliance occasionally to keep it clean and lean and to flush any leftover or orphaned processes out of memory.