NetBackup 5200 Duplication Jobs Fail with Error 191, then keep re-spawning and failing
Hi,
We are having problems with duplication jobs on our NetBackup 5200 appliance. We use storage lifecycle policies to back up to the 5200 as the primary copy and then duplicate off to tape. This has worked fine for months, but recently the duplication jobs have started failing with error 191. They start OK and copy a few GB of data, but then fail before completing (the primary copy backup jobs to the appliance complete OK). Looking more closely at the job log reveals the following:
Critical bpdm(pid=14456) sts_read_image failed: error 2060017 system call failed
Critical bpdm(pid=14456) image read failed: error 2060017: system call failed
Error bpdm(pid=14456) cannot read image from disk, Invalid argument
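Side note in case it helps anyone seeing the same errors: since the failure is on the read of the source image, one way to test whether a particular image is readable at all, outside the SLP, is to verify or duplicate it manually from the master server. The backup ID and destination storage unit below are placeholders rather than values from our environment:
bpimagelist -backupid <backup_id> -L
bpverify -backupid <backup_id>
bpduplicate -backupid <backup_id> -dstunit <tape_storage_unit>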
I originally thought this was some kind of corruption on the appliance, but the strange thing is that we can still back up to the appliance OK, and can even restore from some of the images that are failing to duplicate. I've adjusted the PoolUsageMaximum and PagedPoolSize parameters as suggested on some sites, without success.
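For reference, the PoolUsageMaximum and PagedPoolSize settings those sites refer to are Windows memory-management registry values on the media server. This is just where they live; I'm not suggesting specific values here, as the recommendations vary between articles:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
    PoolUsageMaximum  (REG_DWORD)
    PagedPoolSize     (REG_DWORD)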
Another problem is that when the duplication jobs fail they re-spawn, so we end up with numerous duplication jobs hogging our tape drives, which has a knock-on effect on normal backups to tape. We have temporarily suspended duplication to ease the pressure, but it isn't a long-term fix.
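For anyone needing to do the same, SLP processing can be deactivated and later reactivated per lifecycle with nbstlutil on the master server; the lifecycle name below is a placeholder for the actual SLP name:
nbstlutil inactive -lifecycle <SLP_name>
nbstlutil active -lifecycle <SLP_name>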
Our environment comprises: NBU Master Server 7.1.0.4 (Win2003 R2), NBU Media Server 7.1.0.4 (Win2003 R2), NBU 5200 Appliance (2.0.2).
I've got a ticket open with Symantec about this and they are currently investigating, but I just wondered if anyone else in the community had seen this before. Any help would be much appreciated.
Thanks
This was resolved with help from a Symantec engineer. It turned out to be corruption after all. The engineer provided a tool called "recoverCR" which was run on the appliance; this identified a large number of corrupt backup images.
The corruption was fixed by disabling cache mode on the appliance and re-running a full backup of all clients. The premise was that the corruption could be traced back to some shared data that duplication depended on for a number of subsequent backups, so with cache mode disabled the full backup overwrote the corrupted source data. Re-running the recoverCR tool after the full backup showed that the level of corruption had been reduced to a minuscule level: just one backup image, which we ended up deleting.
Cache mode was enabled again and everything worked fine after that.
For info, cache mode is controlled by the "CACHE_DISABLED" entry in the pd.conf file on the appliance: 1 indicates caching is disabled, 0 indicates caching is enabled.
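For completeness, the change itself is just that one line in pd.conf, which is the OST plug-in configuration file (typically under /usr/openv/lib/ost-plugins/pd.conf, though treat the exact path as an assumption on my part since it can vary by version):
CACHE_DISABLED = 1     (while the full backups were re-run)
CACHE_DISABLED = 0     (set back once they completed)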

