Forum Discussion

muinfotech
Level 2
13 years ago

Concurrent B2D jobs BE2010 R3

Hi everyone!

I've got a bit of an annoying stumper going on.

We are backing up physical Windows (2000, 2003, and 2008), Linux (SuSE SLES 10, 11, and OES2), Solaris (10), and NetWare (6.5 SP8) boxes with local storage, as well as Windows and Linux VMWare virtual machines. The Windows VMWare backups are all grouped together in one backup job; physical Windows servers are in another job, and Solaris servers are in a third. Linux and NetWare servers get an individual backup job per server. (By "job" I mean policy, selection list, and template - the jobs are created automatically by the policies.) This configuration results in 15 jobs trying to run at the same time each night. We did this because we have so much data to back up, and the BE OS agents tend to be so slow (~100-500 MB/min), that backups take a very long time and run well into business hours. So we had hoped to give the backup system tons of throughput and run all of the jobs simultaneously. But alas, things don't like to go as planned.
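
To put numbers on it, here's some back-of-the-envelope math in Python. The 100-500 MB/min agent rates are what we actually see; the per-job data size is just a made-up round number for illustration.

```python
# Back-of-the-envelope math for the backup window. The agent rates come
# from what we observe; the per-job size is an assumed placeholder.

def hours(data_gb, rate_mb_per_min):
    """Wall-clock hours to move data_gb at rate_mb_per_min."""
    return data_gb * 1024 / rate_mb_per_min / 60

per_job_gb = 500                       # assumed size of a single job
for rate in (100, 500):                # observed agent throughput range
    one = hours(per_job_gb, rate)
    print(f"{rate} MB/min: one job {one:5.1f} h, "
          f"15 jobs back-to-back {15 * one:6.1f} h")
# Running the 15 jobs concurrently keeps the total window close to the
# single-job figure -- provided the back end can actually keep up.
```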

First, some details about our environment.

BackupExec is set up with two B2D folders, one non-GRT and one GRT. Both folders are subfolders on a 47TB mapped iSCSI LUN. That LUN points back to a dedicated host running SLES 11 SP1 as an iSCSI target, attached to a 20-disk RAID 6E array (Adaptec 52445 with BBWC). The LUN is connected to the BackupExec server using the Windows iSCSI Initiator and is formatted NTFS on a GPT partition table. Each B2D folder is set up to allow 16 concurrent operations and is configured for buffered reads and writes per TECH61036.

The iSCSI host has a 4GB network connection (four 1GB links bonded with LACP to a Cisco switch configured for jumbo frames). The backup server has two 2GB network connections - one to the campus server VLAN and one to the iSCSI VLAN. The iSCSI VLAN is completely segregated from the rest of the network, and only iSCSI traffic passes over it.

The Windows VMs get backed up using the VMWare backup agent and go to the GRT folder, while the others are backed up using the native OS BackupExec agents and go to the non-GRT folder.

We're running BE 2010 R3 on Windows 2008 R2 x64 Standard with all patches applied per LiveUpdate. The server is an HP DL360 G5 with dual Xeon 5150s (4 cores) and 4GB of memory. CPU utilization stays around 20%, and there is around 1.5GB of memory free.

Things generally work OK for us except for one annoying problem: only six jobs run at once (the others sit there as "queued"). But it's weirder than that.

It's not that six jobs run from start to finish; rather, it's six concurrent operations. A job will run until it hits its first "loading media" (or whatever its next operation is) and then go to "queued," at which point job 7 starts. This continues, cycling through each of the active jobs, so eventually every job shows some status, but only six are ever "not queued" at once. (Screenshot attached.)
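
If it helps to picture the pattern, here's a toy model in Python - purely an illustration of the symptom we see in the job monitor, not a claim about how BE schedules anything internally:

```python
from collections import deque

# Toy model: 15 submitted jobs, a hard cap of 6 "active" slots, and jobs
# that drop back to "queued" at every operation boundary (load media,
# backup, verify, ...). Mimics the symptom only, not BE internals.
ACTIVE_SLOTS = 6
jobs = deque({"name": f"job{i:02d}", "ops_left": 3} for i in range(1, 16))

step = 0
while jobs:
    # Pull up to 6 jobs off the queue, run one operation each, and
    # requeue whatever still has work left.
    active = [jobs.popleft() for _ in range(min(ACTIVE_SLOTS, len(jobs)))]
    step += 1
    print(f"pass {step}: active = {[j['name'] for j in active]}")
    for j in active:
        j["ops_left"] -= 1
        if j["ops_left"] > 0:
            jobs.append(j)    # back to "queued" until a slot frees up
# Every job eventually shows some progress, but no more than 6 are ever
# active at once -- which is exactly what the screenshot shows.
```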

I also tried creating a second B2D folder on the device and adding it to the non-GRT pool, but we're still stuck at just six concurrent operations.

I've noticed that when the queueing issue happens, we see 33808 errors in the Windows event log. Those were supposed to be resolved by configuring buffered reads and writes per TECH61036, but that doesn't seem to have made a difference.

I know it's not a throughput issue, because when the backups are running, the iSCSI connection only sees 300-600 Mbps of traffic, and if I copy a large amount of data from the BackupExec server itself onto the iSCSI-mapped drive letter while the backups are running, the copy can max out the connection without slowing the backups down.
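
For reference, the copy test amounts to something like the quick script below (the target path is a placeholder for wherever the B2D volume is mounted - adjust before running):

```python
import os
import time

# Crude write-throughput check: stream ~4 GiB to the B2D volume and
# report the sustained rate. TARGET is an assumed placeholder path.
TARGET = r"E:\B2D\throughput_test.bin"
CHUNK = 8 * 1024 * 1024                  # 8 MiB per write
TOTAL = 4 * 1024 ** 3                    # 4 GiB test file

buf = os.urandom(CHUNK)
start = time.time()
with open(TARGET, "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())                 # make sure it really hit the LUN
elapsed = time.time() - start
os.remove(TARGET)

print(f"{written / 1024 ** 2 / elapsed:.0f} MiB/s "
      f"(~{written * 8 / 1e6 / elapsed:.0f} Mbps) over {elapsed:.0f} s")
```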

Can you think of any reason BE would only allow us to have six active jobs at the same time? I've been Googling and researching for almost two months now, to no avail. I tried to open a support request with Symantec, but they told me that even though the issue is with the BE server itself, I'd have to create three separate support tickets because we're using five different BE agents - one ticket for Windows and NetWare (tier 1), one for VMWare (tier 2), and one for Solaris and Linux (tier 4).

Thanks in advance!

8 Replies

  • When you define each B2D device, you can specify how many concurrent accesses it will allow.

    Verify that this was not set too low.

  • Hmm

    I've set the support flag to attract the attention of the Symantec Techs.

  • When you run a lot of concurrent jobs, overall they might be slower than if they were to run consecutively.  This is because of resource contention.

  • That does make sense; however, it shouldn't put the jobs in a queued state, should it? I'd expect them to run slower, not stop running altogether. I could see jobs queueing up if we ran more jobs than the B2D device's concurrency setting allows, but the device is set to allow 16 concurrent jobs and at most 6 ever run at once.

    Right now, we have 4 jobs running, and all are stuck in the "queued" state in verify; one has completed 703GB, one 17GB, one 14GB, and one hasn't started. The jobs have been running for 165 hours, 86 hours, 85 hours, and 86 hours respectively, and the byte counts aren't incrementing anymore.

  • You also have to check contention on the source. For example, if you are backing up the C: drive and another job wants to back up a directory on the C: drive, it won't be able to, because the C: drive is being held by the first job.

    I don't know your approach. For me, I would run all my jobs consecutively and make sure everything is OK before SLOWLY ramping them up to run concurrently. I wouldn't do a big bang and run everything at once - it makes any contention problems very difficult to troubleshoot.

  • Not a fix, but I've had much better luck with MPIO on four 1Gb links than with LACP. LACP should be used on the front-end NICs on your BE server, and MPIO on the back end. Each concurrent job opens a new TCP connection and thus chooses a new NIC within the group (assuming round-robin, which is preferred) - there's a rough sketch of the difference below.

    But being stuck at 6 running jobs - something is amiss that I've never seen before.
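
    To illustrate the LACP point: a bond hashes each flow onto a single member link, so one TCP connection never gets more than one 1Gb link no matter how many links are in the group. The hash below is just a stand-in, not any particular switch's algorithm:

    ```python
    # A LACP bond picks one member link per flow by hashing packet headers,
    # so any single TCP connection (e.g. one iSCSI session) is pinned to a
    # single 1Gb link. This hash is a stand-in for illustration only.
    LINKS = 4

    def lacp_link(src_ip, dst_ip, src_port, dst_port):
        """Pick a member link for a flow, as an L3/L4 hash policy would."""
        return hash((src_ip, dst_ip, src_port, dst_port)) % LINKS

    # Six backup streams = six TCP connections to the iSCSI target (3260).
    flows = [("10.0.0.10", "10.0.0.20", p, 3260) for p in range(49152, 49158)]
    for src, dst, sport, dport in flows:
        print(f"{src}:{sport} -> {dst}:{dport}  link {lacp_link(src, dst, sport, dport)}")

    # Flows can collide on the same link, and no single flow ever exceeds
    # 1Gb. MPIO round-robin splits I/O across all paths even for a single
    # session, which is why it usually fills the pipe better.
    ```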

  • After much "banging head on the desk" time, we were able to narrow it down to the RAID config. We knew there was a speed penalty for using RAID6, but didn't realize it was as extreme as it is. I've attached a disk benchmark that we ran on the server, and you can see how much of a difference it made.

    Both of these tests were performed on the exact same hardware. We started out with RAID6 because of the sub-par reliability of the disks we'd purchased, ran the initial speed test, and got those terrible results. Running iostat on the server showed incredibly high await times, and most of the CPU cycles on the server (a quad-core AMD Phenom with 8GB of memory) were being spent in iowait. We had the controller rebuild itself to RAID5 (which it was able to do on the fly with no data loss), and after it finished there was a dramatic performance gain, as you can see. The rough write-penalty math below lines up with what we observed.
    Both of these tests were performed on the exact same hardware. We started out at RAID6 because of the sub-par reliability of the disks we'd purchased, so did the initial speed test and found those terrible results. Running iostat on the server found incredibly high await times, and most of the cpu cycles of the server (a quad-core AMD Phenom with 8GB of memory) were being spent in iowait. Had the controller rebuild itself to RAID5 (which it was able to do on the fly with no data loss,) and after it was done, there was an incredibly dramatic performance gain, as you can see.