Server is 184.108.40.206 on Solaris 10, client is 220.127.116.11 on RedHat. Backup size is over 5TB so we have opted to use Accelerator as most of the data doesn't change.
We are seeing a status 13, which to me means I need to use a larger Client_Read_Timeout, but I hate to change that on the media server in question as it then affects the timeout on all clients that go through that media server. I've another media server with a larger CRT, but if I change media servers then I believe the first backup will NOT be an Accelerator backup and I'll have to start all over from scratch.
Activity Job Details:
15/08/2014 5:55:47 PM - begin writing
15/08/2014 5:58:14 PM - Info bpbkar(pid=23233) 50035 entries sent to bpdbm
15/08/2014 5:59:06 PM - Info bpbkar(pid=23233) 100101 entries sent to bpdbm
15/08/2014 6:03:00 PM - Info bpbkar(pid=23233) 150151 entries sent to bpdbm
15/08/2014 6:03:06 PM - Info bpbkar(pid=23233) 200256 entries sent to bpdbm
15/08/2014 6:11:33 PM - Error bpbrm(pid=25800) socket read failed: errno = 62 - Timer expired
15/08/2014 6:21:53 PM - Error bptm(pid=25842) media manager terminated by parent process
15/08/2014 6:22:34 PM - Info bpbkar(pid=23233) done. status: 13: file read failed
15/08/2014 6:22:34 PM - end writing; write time: 00:26:47
file read failed(13)
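To see how long the job was silent before bpbrm gave up, the timestamps in the job details above can be compared directly. This is only a minimal Python sketch, assuming the day/month/year, 12-hour format shown in this log; the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as it appears in the job details above
# (day/month/year, 12-hour clock with AM/PM).
FMT = "%d/%m/%Y %I:%M:%S %p"

def seconds_between(earlier: str, later: str) -> int:
    """Return the whole seconds elapsed between two job-detail timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())

# Last progress message from bpbkar vs. the socket read failure from bpbrm.
last_progress = "15/08/2014 6:03:06 PM"
read_failed = "15/08/2014 6:11:33 PM"

gap = seconds_between(last_progress, read_failed)
print(gap)  # 507
```

For this job the gap is 507 seconds (8 minutes 27 seconds). The last logged message is not necessarily the last data bpbrm received, so treat this as a rough indication of the quiet period rather than an exact measurement.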
Not finding any other posts regarding this issue... am I the only one?!?!
Whether or not it sends all of the data will depend on a few things, but a different Media Server doesn't mean it has to. The question is whether or not it's the same Storage Server.
The track log is stored on the Client in the following directory format: master/storage_server/client/policy/backup_stream
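A small Python sketch of why the Storage Server matters and the Media Server does not: the media server is not one of the path components. The root directory and every host, policy, and stream name below are hypothetical placeholders, not values from this environment.

```python
from pathlib import PurePosixPath

# Hypothetical root; only the layout below it is described in this thread.
TRACK_ROOT = PurePosixPath("/path/to/track")

def track_log_dir(master, storage_server, client, policy, backup_stream,
                  root=TRACK_ROOT):
    """Build the track log directory from the components listed above."""
    return root / master / storage_server / client / policy / backup_stream

# Same storage server reached through two different media servers:
# the media server is not part of the path, so the directory is identical.
via_media_a = track_log_dir("master1", "dd-storage", "client1", "big_fs", "stream1")
via_media_b = track_log_dir("master1", "dd-storage", "client1", "big_fs", "stream1")

# A different storage server gives a different directory, so no existing
# track log would be found there.
other_storage = track_log_dir("master1", "other-storage", "client1", "big_fs", "stream1")

print(via_media_a == via_media_b)    # True
print(via_media_a == other_storage)  # False
```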
By chance, do Incrementals work fine and only Full backups fail with Status 13?
It is not my impression that altering CLIENT_READ_TIMEOUT has a huge negative impact.
As a default value we set CLIENT_READ_TIMEOUT to 1800 (½ hour).
Changing the destination storage unit causes the next Accelerator backup to be a traditional full. This is by design, to ensure integrity.
Accelerator is fairly new and I have seen quite a few mysterious problems with it. I would first recommend matching the patch level of the client to the master/media server as there have been many bugfixes in the 18.104.22.168 client package. There are more bug fixes in 7.6.X so I recommend making preparations to upgrade the entire environment.
You can increase client read timeout for a single client using the client host properties without changing the media server default.
I have seen conflicts between Accelerator and Client side dedupe. Try turning off client side dedupe and manually running a new job from the modified policy.
Finally, once you have tried all the aforementioned items you may still have to run a forced rescan to get accelerator working again. Here is information explaining why:
Scott_123 - yes, the incrs run just fine, it's only my fulls that are erroring with the status 13s.
Nicolai - my concern with the higher CRT is that now a failing job takes 30mins to die rather than 5mins to die... this has made a big difference to jobs that run on that windows media server that we use for our non-prod windows backups... I don't understand why this can't be a per client setting on the media server...
INT_RND - yes, we are looking to move the entire environment to 7.6... That would be something we do in September. I've updated the CRT on the client... but I've seen with other testing (DB2 backups) that changing the CRT on the client doesn't always work. We aren't using client side dedup as we use Data Domain... I can pretty easily upgrade the client to 22.214.171.124 so I might do that and then set up the policy to run a couple of fulls during the week this week to see if that makes any difference.
Please note that when I rerun the failed job manually during the day... the job usually works...
Sorry, I mentioned that I could update the client to 126.96.36.199... I'd updated one of the non-prod machines that I'm seeing this issue with last week... the upgrade has pushed the client to be current with the Master/Media servers, however, I still had the status 13 with that client on the weekend.
The client read timeout setting is very important on large databases. Sometimes you can see the value in the detailed status. If it is not changing when you modify the client host properties, it is possible that there is an override in the database script. Please take a quick look at the script on that client and see if there is anything about timeout values in there. The default is 300. The admin guide says you need a minimum value of 900 for database clients, but I have heard of admins who skip straight to 1800 or 3600. It really depends on your environment.
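If you want to check what a host is configured with, here is a minimal Python sketch that pulls CLIENT_READ_TIMEOUT out of bp.conf-style text. It assumes `KEY = value` lines, as used in bp.conf on UNIX hosts (typically /usr/openv/netbackup/bp.conf); the sample content and the function name are made up for illustration.

```python
DEFAULT_CLIENT_READ_TIMEOUT = 300  # default value quoted in this thread

def client_read_timeout(conf_text: str) -> int:
    """Return CLIENT_READ_TIMEOUT from bp.conf-style text, or the default if absent."""
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CLIENT_READ_TIMEOUT":
            return int(value.strip())
    return DEFAULT_CLIENT_READ_TIMEOUT

# Made-up sample content, not taken from a real host.
sample = "SERVER = master1\nCLIENT_NAME = client1\nCLIENT_READ_TIMEOUT = 1800\n"

print(client_read_timeout(sample))               # 1800
print(client_read_timeout("SERVER = master1\n"))  # 300
```

This only shows the configured value on one host. The value a job actually runs with can differ, so compare it with what the job logs report.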
We could see in the user_ops log that the CRT was 300 even though the client was set to 1800. Moved the db2 policy to back up on the media server with the higher CRT and it got the CRT from the media server, which was displayed in the user_ops log, and the backups just worked.
Still think this should be a per client setting rather than a per media server setting.
If Incrementals are successful and only Fulls are failing, sounds like it's having problems with rolling up the backup to create the new Full after it's moved the data, not actually moving the data.
Adding another Full schedule with the Accelerator forced rescan option enabled and manually running it will force the Client to re-walk the file system and rebuild the Track Log. Your backup should work and subsequent Fulls and Incrementals will also be successful. This will not resend all of your data on a file system backup (unlike a forced rescan with a VMware backup using Accelerator). It will walk through the entire file system and recalculate checksums as it rebuilds the Track Log, so it will run longer and will potentially consume a little more CPU at the Client.
Once your Master/Media and Client have been upgraded to 188.8.131.52 (I believe 184.108.40.206 as well), it's supposed to detect times when it has problems with the Track Log and automatically run a forced rescan behind the scenes (can be seen in Detailed Status of the Job), but ONLY if a schedule with the forced rescan option enabled does not exist in the Policy.
If you're using TIR w/ Move Detection on the Accelerator policy, disable Move Detection and then run a backup with forced rescan as mentioned above. TIR w/ Move Detection has a bug that causes the rollup of the Full backup to fail, which causes the above.
I have conflicting information on this topic. I agree that it should be a per client setting. I found an article that confirms your experience that changing the client setting doesn't work. This same article also suggests some performance tuning that you can try that will make the server more responsive and less likely to timeout. Although performance tuning is important and has many benefits you will likely have to increase the timeout values still. Do some testing. Maybe you can get away with a value of 900 where it will work with the database job without adversely affecting your environment.
"Note: Changing the CLIENT_READ_TIMEOUT on the client host will not have any effect on this situation."
(This TECHNOTE also matches your error exactly)
It appears that the machine has not finished marching through the backup when it dies on the 13. It is a backup of over 5TB and can fail at 36GB, or 4.7TB... or whatever. If it had finished "sending" all the data, then I'd say that was true.
No TIR so that shouldn't be an issue.
Really seeming like I need to up the CRT... just don't like that as a solution.