Forum Discussion

Tom_Mucha
Level 4
13 years ago

2010 R3, dual tape drives, 1 job sits queued for 2.5 hours

Hi Everyone,

I have a BE2010 R3 (latest patches care of LiveUpdate as of today) install that I'm having some trouble with.  The setup had a single MSL2024 on it with 1 LTO4 drive and we replaced it with a MSL4048 dual LTO5 library.  Added the proper Library Expansion license.  I have a policy that has the device listed as the MSL4048 (not the individual drives), and I create jobs off of that using 2 selection lists, so I end up with 2 jobs for the MSL4048 - which should pick and choose the tape drive.  I've used this setup in the past without issue.

After creating the new jobs, at this particular client, I'm seeing job #1 start off immediately, but job #2 has a status of queued with a current operation of none.  The state is active with the little pause icon.  No alerts asking for media, nothing about no devices being available, nothing in the event logs.  I let the job go last night, and after about 2.5 hours the job started (note that job #1 was still running for many hours after job #2 started).  When I look at the backup log, this is what I see:

 

Job started: Wednesday, November 07, 2012 at 4:48:54 PM
Job type: Backup
Job Log: BEX_DC01_01734.xml
 
Drive and media mount requested: 11/7/2012 7:12:13 PM
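
To put a number on the delay, the two timestamps in the log excerpt above can be subtracted. This is only a rough sketch: it assumes the job log text keeps exactly the two date formats shown here, and the `mount_delay` helper and `JOB_LOG` string are hypothetical names made up for illustration, not anything that ships with Backup Exec.

```python
from datetime import datetime, timedelta

# Log excerpt copied from the post above (hypothetical variable name).
JOB_LOG = """\
Job started: Wednesday, November 07, 2012 at 4:48:54 PM
Job type: Backup
Job Log: BEX_DC01_01734.xml
Drive and media mount requested: 11/7/2012 7:12:13 PM
"""


def mount_delay(log_text: str) -> timedelta:
    """Return the gap between job start and the first mount request."""
    fields = {}
    for line in log_text.splitlines():
        # Split on the first ": " only, so times like 4:48:54 stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    # Assumed formats: these match the two lines shown in the excerpt.
    started = datetime.strptime(
        fields["Job started"], "%A, %B %d, %Y at %I:%M:%S %p"
    )
    mounted = datetime.strptime(
        fields["Drive and media mount requested"], "%m/%d/%Y %I:%M:%S %p"
    )
    return mounted - started


print(mount_delay(JOB_LOG))
```

For this excerpt the gap works out to roughly 2 hours 23 minutes between the job starting and the drive and media mount being requested, which is the window where the job sits queued.
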
I've attempted to use SGMON and watch what's going on, but nothing stands out to me.  I've set SGMON to do media and device logging and have attempted to also watch the "Job engine, raws, agent browser" but it's just too much to go through while the other job is running like a champ.  Tried many reboots, even recreated my selection list.  The old library was backing everything up with 1 selection list and did not have this delay in starting the job.
 
I do not have pre-scan enabled in BE since that was my first thought.  Job #2 has some large servers in it like the file & print server along with 2 sharepoint servers and some smaller SQL servers.  I've tried to hunt down if there is some pre-scan going, but I can't find anything.
 
I just kicked off another job - job #2 started after 2h25m. I had SGMON going, but wasn't logging to file and just caught the part where it started loading media.  Tomorrow night I will attempt to leave logging on unless someone has a better idea.  Another thing I noticed is that Job #1 was in an "updating catalog" status after the GRT backup of exchange, and sometime after the updating catalog status went away, it seemed like job #2 started, but I wasn't watching it.  Corrupt catalogs?  Can't imagine that changing by adding a new device.
 
I've opened a case with Sym support, but since I won't be at the customer site, other than remote access from the office, working with them will be slow.  If anyone has any suggestions, please let me know, thank you very much!
 
Tom Mucha
 
 
  • Hi Tom,

     

    Does the second job start up normally if you just run it manually while Job 1 is idle?

    Have you removed the LEO license and readded it? What happens if you target each job to a specific tape drive?

    Thanks!

  • Hi Craig - great questions!  I did try some of these before but forgot to post them.

    If I run job#2 while job#1 is not running - it still sits queued (I have never waited the 2.5 hours to make sure it starts in this case, just give it 15-30min).  

    I have not re-added the LEO license, I will give that a shot today!

    If I change the policy to target a specific drive, I still see the delay (again, never waited the full 2.5 hours, just give it 15-30min).

    I thought the policy could have been the issue, so I ran two manual jobs using the 2 selection lists, I had the same issue.  I recreated the selection lists from scratch - they were not reused.  I'll go recreate the selection list as well, but I still have a feeling something is doing a pre-scan on this selection list. I'll run a full debug log with SGMON at some point and see if I can catch something.

     

    Thanks!

  • After you changed the RL, did you reconfigure it in BE? If not, disable and then delete the Tape Drives and Medium Changer from BE. Make sure that both Tape Drives are using Symantec Drivers and the Medium Changer is using Microsoft Drivers. Restart the BE services; it should re-detect the Robotic Library.

  • OK, another thing to try is to disable tape drive 1, and then run job 1 and see if it goes through. If it still stalls, check for any alerts and then if nothing shows up there, then download the drive manufacturer's utilities and then run them against the hardware to rule that out as the problem.

    Thanks!

  • Thanks Jaydeep - library was added properly - I think it has something to do with my selection list - I'll be replying to another post.

  • I thought it was a hardware issue as well, but when running manual test jobs using a few folders backed up to each drive, it runs fine.

    When I used SGMON yesterday, I remember seeing lots of stuff about the sharepoint servers.  I thought they could be normal communication messages, but I removed all sharepoint resources from the backup.  I ran the jobs at that point - they both started up immediately! 

    I then re-added my sharepoint farm, and then the servers that are part of that farm, and it looks like things are running fine.  When I created my first selection list, I only selected sharepoint at the server level rather than at the farm level.  I'm no sharepoint expert, but I remember the order of selections used to be an issue.

    Upon closer inspection the customer was NOT backing up sharepoint properly in their old job since the farm resources were not selected at ANY level.

    I'll let the jobs run tonight since I don't want to slow anything down during the day and we'll see what happens!

    Tom