A single Backup Exec Media Server (BE v10.1d, SP3) installed on a Windows 2003 Standard Server. No tape drive, just D2D to an attached SATA RAID array. Also attached are three 1 TB external FireWire 800 drives, which are used for "impromptu" on-the-fly backups as well as for testing purposes.
There are a total of nine (9) Windows servers backed up by this single standalone media server. Seven (7) are Windows 2003 Standard servers; the other two are Windows 2000 Standard servers (soon to be upgraded to Windows 2003 Standard). Three of the 2003 servers also run SQL, and one is an Exchange 2003 Standard server. The SQL and Exchange agents are installed on the respective systems, and all (including the media server) are on the same LAN.
This setup was first brought online in June of 2006, replacing an ARCServe and Exabyte 430 setup that never worked reliably.
My backup jobs are just as "black and white": typical incremental backups during the week followed by a complete full backup over the weekend. A copy of the full backup is rotated to an offsite facility, so I have a simple (basic) GFS rotation.
A recent problem that haunted me a couple of months ago (and I never did figure out the cause or the fix, as it "magically" fixed itself) was that scheduled (and manual) backup jobs went into a QUEUED state. Inventory jobs, however, didn't seem to be affected.
The first time this happened, I was testing a backup and restore job and was having problems with the restore in that it wouldn't complete. Searching the forums and the Veritas support site introduced me to BEUtility and suggested that either the catalogs were corrupt, there was a problem with the database, or the backup media was bad.
I eventually traced the problem down to the database and catalogs (I performed a repair and rebuild of the database and purged the catalogs).
One thing I noticed when the restores failed was that the Admin Console on the BE media server would crash whenever I switched to the Job Monitor tab. Running the Admin Console from my laptop, however, did not crash, so I was pretty sure the problem was related to the BE media server.
Another thing to mention: on the BE media server I would get an error dialog saying that BEENGINE.EXE had failed. I would get three or four of these in a row, yet if I checked the services, everything showed as running (and I still could not launch the Admin Console on the BE media server).
Re-installing the Admin Console on the BE media server did not solve the problem. I did find a few helpful suggestions in the forum; all seemed to point to a problem with MSIE v7.0.
Because I could still connect via the Admin Console from my laptop, I felt I was OK for now, but I knew I would have to figure out the cause and fix it.
About a week later, it seemed to fix itself. No more error message and I could now launch the Admin Console on the BE media server.
Life was good and I didn't question it further...
Fast forward to today, approximately two months later.
One of my users reported that about a month's worth of e-mails had suddenly gone missing. He suspects it was his Blackberry. In any event, he asked if I could restore the missing e-mails, or at least the past week's.
"Sure, I'll see what I can pull up," I replied. Seeing as it was a Monday when he reported this, I knew I had at least a week's worth on my nearline storage, plus another two to three weeks' worth of daily incrementals I could look at before making a request to my offsite facility.
The catalog went smoothly, as did the restore, to the delight of my end user, who got back about three weeks' worth of e-mail (spam included...).
That evening, all of my scheduled backup jobs (incrementals) failed! And I noticed that the job service had stopped. Thinking that was the problem, I restarted the service and resubmitted the backup jobs. A couple of them seemed to run, but many failed with a communication error (like you get when a device is offline or unreachable), while others just sat in a QUEUED status.
Doing the basic things, from rebooting the server to stopping and starting the services, made no difference. Backup jobs either failed (causing the job service to stop) or went into limbo (QUEUED status). Some jobs would actually run and complete, but if I ran the exact same job again it might or might not complete - often it was the latter.
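For what it's worth, the stop/start cycle I kept repeating by hand can be scripted. This is just a sketch of those manual steps; the service display names below are assumptions based on a typical BE 10.x install, so verify yours in services.msc before relying on them:

```bat
@echo off
rem Sketch of the manual service bounce described above.
rem Service display names are assumptions -- verify in services.msc.

net stop "Backup Exec Job Engine"
net stop "Backup Exec Device & Media Service"
net stop "Backup Exec Server"

net start "Backup Exec Server"
net start "Backup Exec Device & Media Service"
net start "Backup Exec Job Engine"

rem Quick check that the engine process (BEENGINE.EXE) is actually up,
rem since in my case the services list looked fine even when it wasn't.
tasklist | find /i "beengine"
```

As noted, in my case this made no difference, but at least it makes the bounce repeatable.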
And I was once again seeing error messages that BEENGINE.EXE was failing/crashing, but this time I could still launch and use the Admin Console, which, as mentioned earlier, was not the case before.
To rule out hardware, I reconfigured a set of jobs to write to the external FireWire 800 drives, and they yielded the same results: completed, failed, or stuck in limbo.
Fairly confident that it wasn't hardware, I turned to the database. Using BEUtility, I did a database dump and made another config file. Repairing, rebuilding, and aging the database did nothing to correct the problem.
I went as far as reformatting the area on the array where the daily incrementals were stored and starting with clean/empty backup folder locations. Nada. No difference...
As a last resort, I dropped the database, launched the Admin Console to verify that everything was "gone," and then recovered the database and config file.
Now my backup jobs just went into a QUEUED status and sat there until I canceled them. For some reason, I felt I was making progress in solving the problem.
Searching through the forum and knowledge base, I found people with problems similar to mine. Some reported that the problem just fixed itself; others were related to hardware or something in their infrastructure; but many seemed to go unresolved or unanswered by the person reporting the problem.
Having said that, a common thread (fix) seemed to be that whenever a job went into a queued status, issuing an inventory job would get the "stuck" job working again, as if the inventory job jump-started the device.
This, unfortunately, did not work for me...
However, this did inspire me to try something. I intentionally caused an error and forced the backup job to fail as a result. Then I corrected the error, and the backup job worked every time after that!
The error I invoked was placing the backup device in an offline (disabled) state; not paused, though I did try that too.
With the device "not available" the job fails as it should.
My thought process at the time was to verify that the media device services were working properly, and this was the only way I could think of to test that.
I would have been utterly dumbfounded if the backup job had remained in a queued status with the device unavailable - that would have suggested a communication problem between the BE media server and the backup devices.
When I brought the device back online and the backup job ran, I was somewhat surprised, but not too much, as I assumed that if I reran the same job it would most likely fail. But to my surprise (again), it ran to completion.
Third time's a charm, and what do you know? It ran to completion...
I repeated this with my other backup jobs and they all ran. I repeated the jobs twice and on smaller jobs I ran them at least three times.
By 4:00 PM PST, I decided to stop testing and let the system run the same jobs at their scheduled time of 6:00 PM PST.
All of my scheduled backup jobs ran to completion so I'm feeling fairly confident that my problem has been solved though I can't say for sure.
Nor can I say for sure what the root cause was, although when I reflect back on the first time this happened, the common element was that I had done a restore to my Exchange server.
I've done restores since, and even before the first time this problem came around, but those restores were just data files (Word documents, Excel spreadsheets, etc.) and not Exchange data.
Sorry for the length of this message. Hopefully this will be of value to someone else who may be having similar (or the same) problems.
Message Edited by Robert Kuhn on 04-11-2007 10:27 PM
Wow you weren't kidding about the novel! :) I read through it expecting a problem at the end, thinking "I really hope I can help someone who has made this much effort"
Interesting post though, thanks for letting us know the story. Especially interesting solution of deliberately causing the backup to fail... kind of like slapping BEWS across the face with a wet flannel! :)
Message Edited by Keith Langmead on 04-12-2007 06:53 AM
It's been a week, and so far so good. I'll see how it goes over the next few days before I attempt a test restore back to my Exchange server and see if the problems come back. And if they do, whether my "fix" will work.