I originally posted the following in July and never got a response. It was then incorrectly moved to the unmoderated Backup Exec for NetWare forum, where it disappeared into the abyss, so I am re-posting it here, where it belongs!
----
I have read just about every thread on this subject on these forums. Although it is somewhat comforting that mine is not the only site suffering from this issue, it is very disconcerting that no post actually states that the issue has been resolved, and I have not managed to resolve it myself by trying the numerous suggestions posted. Here is the background to my issue...
About a month ago the Backup Exec job service started to drop out, and I started getting the above error when running the nightly scheduled backup. Up until then all was well, and nothing has really changed except that I am backing up more and more data and the backup is taking longer. The backup job runs and actually verifies successfully. However, a few minutes after the backup completes - at around the time when the tape should auto eject - the job service drops out. I have set it to restart after a minute, as suggested in another thread. The service restarts OK, but the Alerts show "The job was canceled by user recovery" and then, soon after, "The drive hardware is offline!". Ultimately, although the backup and verify appear to complete, the errors appear in the log and the job status reports the job as "Canceled". Oh, the tape also does not auto eject.
As I already stated, I have tried just about everything suggested here. I am running version 10.1 with all the patches installed, but I have also replicated the problem using version 10.1 without patches, version 10 and version 9.1. All my drivers and firmware for the HP Ultrium 448 tape drive are up to date. Yes, I tried both the latest Symantec drivers and the native HP drivers, with the same failure. And get this... Out of desperation I even re-formatted the hard disk on the server, reinstalled Windows 2003 Server and did a clean install of the version 10d software. I then recreated all the media sets and backup jobs and guess what... I STILL GET THE ERROR! Surely, by doing this I have ruled out all issues to do with corrupt catalogs and such. Now for the weird thing... I have noticed that the backup runs and reports "Successful" when I only back up some of the files. That is, I am still backing up data from the 10 or so different volumes, just less of it... Then it runs OK. Oh... yesterday I threw another GB of memory into the server just for good measure... it still failed.
I am out of ideas about where to go with this next. I have read threads going back to September last year reporting this problem... Surely, if this is inherent in the product, that is enough time for Symantec to address the issue...
I strongly believe that the problem is linked to the actual elapsed time of the backups. I have recently tried running all my backups without the verify process at the end. Turning off verify has shaved around an hour and a half off my backups. They used to take 7 hours 30 minutes or so and now complete in just over 6 hours. Interestingly, since I turned off verify ALL the backups for the last 8 days have completed successfully. As soon as I turn verify back on, it extends the total backup time to over 7.5 hours and the Job Engine fails before the backup completes. It is almost as if the Job Engine cannot run past the 7 hour 30 minute mark without falling over!
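For anyone who wants to test the same theory against their own job history, here is a rough Python sketch. It is not anything from Backup Exec itself: the function names are my own invention and the two sample runs are only approximations of the figures above (roughly 6 hours OK without verify, roughly 7.5 hours failed with verify). It simply checks whether every successful run was shorter than every failed one, which is what you would expect if a time limit were the culprit.

```python
from datetime import timedelta
from typing import List, Optional, Tuple

Run = Tuple[timedelta, bool]  # (elapsed time, job reported successful?)


def bracket_threshold(runs: List[Run]) -> Tuple[Optional[timedelta], Optional[timedelta]]:
    """Return (longest successful run, shortest failed run)."""
    ok = [elapsed for elapsed, success in runs if success]
    bad = [elapsed for elapsed, success in runs if not success]
    return (max(ok) if ok else None, min(bad) if bad else None)


def consistent_with_time_limit(runs: List[Run]) -> bool:
    """True if every successful run is shorter than every failed run."""
    longest_ok, shortest_bad = bracket_threshold(runs)
    return longest_ok is None or shortest_bad is None or longest_ok < shortest_bad


# Illustrative values only, approximating the durations described above.
sample_runs = [
    (timedelta(hours=6, minutes=5), True),    # verify off
    (timedelta(hours=7, minutes=35), False),  # verify on
]
print(bracket_threshold(sample_runs))
print(consistent_with_time_limit(sample_runs))
```

If a long run ever succeeds, or a short one fails, the time-limit theory is contradicted and the cause lies elsewhere.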
I suppose I could split my one big backup job into a few smaller ones. This may keep the engine running. I did see a thread a while ago that talked about doing this for restore jobs, in situations where large restores caused the job service to fall over.
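To work out how the volumes might be divided, something like the following Python sketch could help. It is only a planning aid under my own assumptions: the volume names and per-volume hours are hypothetical placeholders (I have not measured each volume separately), and the 4 hour cap is an arbitrary figure chosen to stay well under the 7.5 hour mark. It uses a simple first-fit approach, placing the largest volumes first.

```python
from typing import Dict, List


def split_into_jobs(volume_hours: Dict[str, float], max_hours: float) -> List[List[str]]:
    """Group volumes into jobs so that no job's estimated time exceeds max_hours."""
    jobs: List[List[str]] = []
    totals: List[float] = []
    # Place the longest-running volumes first (first-fit decreasing).
    for name, hours in sorted(volume_hours.items(), key=lambda kv: kv[1], reverse=True):
        if hours > max_hours:
            raise ValueError(f"{name} alone needs {hours}h, more than the {max_hours}h cap")
        for i, total in enumerate(totals):
            if total + hours <= max_hours:
                jobs[i].append(name)
                totals[i] += hours
                break
        else:
            # No existing job has room, so start a new one.
            jobs.append([name])
            totals.append(hours)
    return jobs


# Hypothetical estimates, in hours, for four made-up volumes.
estimates = {"VOL_A": 2.0, "VOL_B": 1.5, "VOL_C": 1.0, "VOL_D": 3.0}
print(split_into_jobs(estimates, max_hours=4.0))
# prints [['VOL_D', 'VOL_C'], ['VOL_A', 'VOL_B']]
```

Each resulting group would then become its own scheduled job. Whether the Job Engine behaves any better with several shorter jobs is exactly what remains to be seen.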
At the end of the day, all of this is a workaround for the core of the problem... a flaky Backup Exec Job Engine service.