BackupExec causes horrendous fragmentation

Ross_Smith
Level 4
I am seeing incredible fragmentation on the drive we use to store our BackupToDisk files. I'm posting more details below, but if you are using B2D, could you please check the fragmentation level for me and let me know if you are also having this problem.

We have a 570Gb NAS device used as the main target for all of our backups. This device has been in use for approximately 6 months and is used almost exclusively for B2D backups.

After numerous problems with BackupExec, I happened to run a defrag analysis on this device, with the following results: http://www.averysilly.com/Defrag%20report.jpg.

Even after deleting 350Gb of B2D files, we are still getting 99% fragmentation reported with files scattered across the entire disk, and with a 1Gb b2d file in as many as 40,000 fragments.

I am still speaking with Veritas; we've no clear idea yet what caused this, but on at least one occasion I set BackupExec to allow concurrent operations to this B2D device, and we suspect this may be the cause of the fragmentation.

We are defragmenting this device today to see if it fixes our problems, but I would be very interested to know if anyone else is seeing similar effects on their B2D devices?

Ross

Subject line updated 20/6/05 by Ross Smith

Paul_Yip
Level 4
Diskeeper seems to be working for us. This is the first night we've run the backups after defragmenting, and they seem to be faster. The bad thing is we need to get the server enterprise version since we have over a terabyte and the standard edition only goes up to a certain disk size.

This is unacceptable; we shouldn't be relying on other software to make the software we purchased work correctly, especially if it's going to cost me another $1,000 just to make it run normally.

Peter_Ludwig
Level 6
Diskeeper was the best among the products I tested, but it also seemed to cause the BE engine to fail under high load (that is, when many backup jobs are running at the same time, most of them B2D).
So I removed all of these tools. My B2D is not that slow, so I am leaving it that way for now.
But the main reason was that I would also need that $1000 enterprise version, and I cannot justify that.

greetings Peter

Vin_Latus
Level 2
Does anyone know if Symantec/Veritas is even acknowledging this problem yet? Better yet, are they working toward a solution?

Thanks,
Vin

Paul_Yip
Level 4
> Diskeeper was the best among the products I tested, but
> it also seemed to cause the BE engine to fail under high
> load (that is, when many backup jobs are running at the
> same time, most of them B2D).
> So I removed all of these tools. My B2D is not that
> slow, so I am leaving it that way for now.
> But the main reason was that I would also need that
> $1000 enterprise version, and I cannot justify that.
>
> greetings Peter


After running our first weekend full backup (multiple jobs), it did not help at all. It started Friday and ended Monday night. That is ridiculous; backup to tape would have been finished by Saturday morning.

Ross_Smith_2
Level 4
Ok, had my first failed job today that could be caused by fragmentation, with a few interesting results.

I had a job fail with the error:
Storage device "xxxxx" reported an error on a request to read data from media.
Error reported:
The specified network name is no longer available.
V-79-57344-33992 - The backup storage device has failed.
V-79-57344-65072 - The connection to target system has been lost. Backup set canceled.

Checking the event log on the backup server revealed:
An error occurred while processing a B2D command.
Drive: ReadMTFData() ReadFile failed (\\rob-030\e$\Data\BackupExec\B2D001201.bkf). Error=64

The job that failed was a duplicate job, reading data from the NAS device to copy to tape. When this occurred, another job was busy backing up to the NAS device, so I know the NAS device hasn't failed and BackupExec is still connected to it.

** note: My NAS device is configured to only allow 1 concurrent operation. I have a new call to Veritas waiting to find out why BackupExec was allowing two operations **

I checked with Veritas Support; this kind of problem can be caused by a network or I/O timeout. Network diagnostics report under 17% utilisation even while the other job is still running, so Veritas confirmed we're looking at the file system as the cause.

Checking the file reveals that it has 21,507 fragments (using contig from sysinternals.com). The Veritas engineer confirmed that this could have been the cause of the crash and informed me that Veritas recommend running a defrag utility.
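
In case anyone wants to run the same check themselves, here's a minimal sketch of doing it from Python rather than by hand (assuming contig.exe from Sysinternals is somewhere on your PATH; the .bkf path is just the one from the event log above, so substitute your own):

    import subprocess

    # Illustrative path only - use one of your own .bkf files.
    bkf = r"\\rob-030\e$\Data\BackupExec\B2D001201.bkf"

    # "contig -a" analyses the file and reports its fragment count without
    # defragmenting it. Requires contig.exe from Sysinternals on the PATH;
    # the exact output wording varies between contig versions.
    subprocess.run(["contig", "-a", bkf], check=True)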

Interestingly, KB article 23744 explicitly states that fragmentation could cause jobs to fail...

Two comments:
- Firstly, that file *was not* listed as one of the most fragmented files in the Windows fragmentation report, yet it has more fragments than any of the files in that report. This could imply that the data in this thread *seriously* underestimates the extent of the fragmentation being caused here.
- Secondly, my feeling is that BackupExec is originally a tape-based system - tape drives don't deal with multiple simultaneous operations and don't expect delays, both of which should be expected with B2D and networks. Veritas apparently don't have any way to increase the timeout before BackupExec fails a job. That is a concern to me.

I'm still running my system here as a test and gathering as much evidence as I can about the problems fragmentation causes.

I'd be very interested to hear if anybody else has jobs failing because of timeouts when devices become fragmented.

Ross

Peter_Ludwig
Level 6
> I'd be very interested to hear if anybody else has jobs failing because of timeouts when devices become fragmented

I would not have thought so, but then I did not really believe this fragmentation issue at first either....

The German magazine c't has a test on defragmentation software in the current issue.
I will give their 'winner' O&O Defrag 8.0 a try....

(although I have to say that fragmentation does not cause troubles here as serious as those mentioned by other people)

greetings
Peter

Mike_Lee_2
Level 3
I'll post this to help you all out, or to get more information. We've been using this for about 4 months.

Iomega NAS 445r 1.6TB

http://img144.imageshack.us/my.php?image=frag7jw.gif


----------

Volume DATARAID
Volume size = 1,092 GB
Cluster size = 4 KB
Used space = 753 GB
Free space = 339 GB
Percent free space = 31 %

Volume fragmentation
Total fragmentation = 49 %
File fragmentation = 99 %
Free space fragmentation = 0 %

File fragmentation
Total files = 7,430
Average file size = 106 MB
Total fragmented files = 915
Total excess fragments = 327,081
Average fragments per file = 45.02

Pagefile fragmentation
Pagefile size = 2.01 GB
Total fragments = 1

Folder fragmentation
Total folders = 130
Fragmented folders = 16
Excess folder fragments = 321

Master File Table (MFT) fragmentation
Total MFT size = 17 MB
MFT record count = 9,695
Percent MFT in use = 55 %
Total MFT fragments = 3

--------------------------------------------------------------------------------
Fragments File Size Most fragmented files
13,337 1.81 GB \sqlbeast\sqlbeast_05-10-06.zip
12,116 1.81 GB \sqlbeast\sqlbeast_05-10-04.zip
11,968 1.81 GB \sqlbeast\sqlbeast_05-10-05.zip
8,898 1.76 GB \RECYCLER\S-1-5-21-1211826751-1692286541-581009308-500\Df1.zip
8,180 1.00 GB \b2d\TUESDAY\B2D005823.bkf
8,134 1.81 GB \sqlbeast\sqlbeast_05-10-03.zip
5,355 1.00 GB \b2d\THURSDAY\B2D006097.bkf
5,311 1.00 GB \b2d\TUESDAY\B2D005824.bkf
4,528 1.00 GB \b2d\TUESDAY\B2D005825.bkf
4,468 1.00 GB \b2d\FRIDAY\B2D006236.bkf
3,601 1.00 GB \b2d\THURSDAY\B2D006098.bkf
3,524 1.00 GB \b2d\MONDAY\B2D005661.bkf
3,494 1.00 GB \b2d\TUESDAY\B2D006386.bkf
3,329 1.00 GB \b2d\TUESDAY\B2D005948.bkf
2,992 1.00 GB \b2d\TUESDAY\B2D005949.bkf
2,955 1.00 GB \b2d\FRIDAY\B2D006253.bkf
2,664 1.00 GB \b2d\TUESDAY\B2D005941.bkf
2,323 1.00 GB \b2d\THURSDAY\B2D006112.bkf
2,212 1.00 GB \b2d\TUESDAY\B2D006387.bkf
1,988 1.00 GB \b2d\THURSDAY\B2D006111.bkf
1,969 1.00 GB \b2d\TUESDAY\B2D005927.bkf
1,641 1.00 GB \b2d\THURSDAY\B2D006096.bkf
1,553 1.00 GB \b2d\WEDNESDAY\B2D006413.bkf
1,375 1.00 GB \b2d\THURSDAY\B2D006127.bkf
1,238 1.00 GB \b2d\MONDAY\B2D005691.bkf
1,177 1.00 GB \b2d\MONDAY\B2D005822.bkf
1,174 1.00 GB \b2d\MONDAY\B2D005769.bkf
1,155 1.00 GB \b2d\THURSDAY\B2D006144.bkf
1,146 1.00 GB \b2d\TUESDAY\B2D005903.bkf
1,128 1.00 GB \b2d\TUESDAY\B2D005945.bkf

Peter_Ludwig
Level 6
I have now also abandoned O&O Defrag, as it probably also caused problems and could not keep up with the fragmentation anyway.
I will now leave the fragmentation alone and live with it (and hope that Veritas will take care of it one day).

greetings
Peter

Rob_Baumstark
Level 3
I also had horrible fragmentation problems until recently, with 2.5TB of storage dedicated to B2D. After playing with many things, I've now got it set to use 1GB max size B2D files (yes, back to the default; I've played with up to 10GB files) because the defragmenter has an easier time moving them around. I use PerfectDisk with their defrag-only option (it does NOT consolidate free space, it just tries to put files back into one piece) and I changed all the settings for new/old files so that it doesn't have any priority as to where on the disk things go - defrags take too long if they're moving around non-fragmented files. Defrags are scheduled to run every day. I have yet to have any jobs fail from running backups and defrag at the same time - normally defrag fails on a file or two if BackupExec had them in use, but they'll get defragged the next day.

As for the root cause of the problem, and the solution, it's quite obvious to me. Watching BackupExec 9.1 with Filemon, it continuously appends data to B2D files 32K at a time (just like appending blocks to a tape). Windows however has no idea how large the file is going to end up - the software keeps asking for 32K chunks, so the NTFS driver keeps finding 32K holes to put them in. The solution: BackupExec needs to pre-allocate the entire 1GB file at the start of the job, rewind to the beginning of the file, then start appending data. Once that's done, NTFS will attempt to allocate the 1GB block in a single piece. In theory, on a freshly formatted disk, every file should start at a 1GB boundary, and as they expire and are overwritten they should re-occupy the same grid space - your HD should look like a big brick wall with every brick being a 1GB B2D file. Let append times deal with empty space at the end of B2D files, or just waste a bit of HD, but let's get performance where it should be.
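
To make the idea concrete, here's a rough Python sketch of the two allocation patterns (this is an illustration of the behaviour, not BackupExec's actual code, and whether NTFS really hands back a contiguous run still depends on the free space available on the volume):

    ONE_GB = 1024 * 1024 * 1024
    CHUNK = 32 * 1024  # the 32K append size seen in Filemon

    # Pattern 1 - what BackupExec appears to do: grow the file 32K at a time.
    # NTFS never knows how big the file will end up, so each extension can be
    # placed in whatever small hole is handy, scattering the file everywhere.
    def append_in_chunks(path, total=ONE_GB):
        with open(path, "wb") as f:
            written = 0
            while written < total:
                f.write(b"\0" * CHUNK)
                written += CHUNK

    # Pattern 2 - the suggested fix: declare the final size up front, rewind,
    # then fill the file in. Knowing the full size, NTFS can try to allocate
    # the whole 1GB as one contiguous run.
    def preallocate_then_write(path, total=ONE_GB):
        with open(path, "wb") as f:
            f.truncate(total)   # reserve the full file size immediately
            f.seek(0)           # rewind to the beginning of the file
            written = 0
            while written < total:
                f.write(b"\0" * CHUNK)
                written += CHUNK

    # e.g. preallocate_then_write(r"D:\b2d\B2D000001.bkf")  - path is illustrative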

Lars_Lange
Level 3
Hi all,
Some news regarding this issue.
Veritas is aware of this problem at a higher level. I have learned that last week they had a developers' congress in Berlin, Germany, where this issue was a topic on the agenda for 3 hours. They also claimed that not all customers using B2D have this problem. So finally, we might get a solution to this issue. I hope for the best; otherwise I might order NetVault licenses.

Richard_Grint
Level 2
OK this is long but if you read it you may find help!

I may have a workaround that might help some of you. I will describe our set up and the performance before and after the change.

We run 17 Windows servers, with more just coming onstream (mixed use, e.g. database, WTS, file servers), backed up by a central backup server with a RAID array of 6 x 70Gb disks. Most of our network is Gb-capable but some of our servers have slow disks and 100Mb cards.

We used to back up to tape, but tape was not reliable (or fast for slow servers), we believe because the LTO-2 drops out of streaming mode and then has to rewind if the data rate falls below 1Gb per minute. I should say that all data rates here are for a compressed data stream.

So we moved to B2D primarily to increase tape reliability, and as a result we monitored the performance rather carefully because we want a 1-2Gb per minute compressed data rate to tape. We are using the disk array effectively as a speed-matching mechanism between slower servers and the tape.

If we back up 1 server at a time (which takes way too long) we don't get much fragmentation. If we run with parallelism set to 16 (i.e. the max) we get tons of fragmentation.

To give an idea of how much: a 1Gb backup file has 18,000 fragments on an empty file system, i.e. about 50kb per fragment. Whilst it is true that some file systems cope better with fragmentation than others (e.g. on Linux, XFS would probably cope better; ext3 possibly wouldn't), the primary cause is that BackupExec should either request larger units of storage (e.g. extents, like a database product) or preallocate the files (again like a database product).

The slowness is caused by the fragmentation defeating read-ahead mechanisms on the disks and RAID array, and by the duration of the seeks causing the tape buffers to empty, at which point the tape stalls and rewinds. Stalls increase the number of tape passes and eventually trash tapes (so there is potential for data loss; it's not just a performance issue).

I should have said: our backup volume is about 90Gb per night, roughly 200Gb uncompressed.

What we have done and there is some hope here!
================================================

The level of fragmentation is too large for most, if not all, defrag tools to resolve during the backup window. Also, fragmentation occurs even on an empty file system, so a during-the-day tidy-up doesn't help overnight too much. Running a defrag in the background kills the backup speed by causing the heads to seek, thrashing the disk. Also, most if not all defraggers use the Microsoft APIs, so the amount of data moved seems to be determined by the API (64kb?) rather than the defrag product itself. The basic problem with the defraggers is that we haven't found one fast enough to defrag the entire array during our overnight window, and the products are disk-focused and therefore defrag stuff that we have rolled off to tape and are about to overwrite in the media pool.

We found part of the answer at that great site www.sysinternals.com, produced by Mark Russinovich and Bryce Cogswell. The tool of relevance is CONTIG. The reason this tool is cool for this problem is that it is file-oriented and command-line driven, so it is easily scriptable. If you run contig across your B2D directory it will show the fragmentation level for each file. Because it is file-oriented, you can tell it to make an individual file contiguous.

There is one other trick here: set the backup file size to, say, 1Gb. Too small and file system opens will cause problems; too large and the work units for the script will be too large.

So our overnight workflow is:
a) Back up all servers to disk at max parallelism. Data rates are variable: lowest 215Mb per minute, max 1.3Gb per minute for the local system. Backups start at 23:00 and complete by 00:30-ish.
b) Our Python script runs, analyses the disk and defrags as much as it can during the overnight window.
c) After the Python script deadline, the roll-off to tape starts at 06:30 and finishes at 09:00. Typical data rate 1Gb to 1.6Gb per minute.

Do not allow these processes to overlap or the seeks will kill you!!!!!!!!!!!!!!!

OK, what does the script do:
1) Analyse the disk.
2) Rank the files by number of fragments.
3) Run contig file by file until the deadline or a minimum fragment count is reached.
4) Time each contig run and don't start the next contig if the deadline would be exceeded. N.B. the standard 1Gb file size and the ranking mean that each run of contig tends to be faster than the previous one, and the previous run is a good estimate of the run time for the next.


The script removes about 1 million fragments per night, starting with files of 18,000 fragments and terminating on time with only files of around 2,000 fragments left.
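
Here's a rough sketch of the sort of script I mean (not the exact script we run - the B2D path and deadline below are placeholders, and the parsing of contig's analyse output will probably need adjusting for your version of the tool):

    import glob
    import re
    import subprocess
    import time

    B2D_DIR = r"E:\b2d"                 # placeholder: your B2D directory
    DEADLINE = time.time() + 5 * 3600   # placeholder: stop about 5 hours from now
    MIN_FRAGMENTS = 2                   # leave nearly-contiguous files alone

    def fragment_count(path):
        # Ask "contig -a" (analyse only) how fragmented a file is. The output
        # wording differs between contig versions, so the regex may need tweaking.
        out = subprocess.run(["contig", "-a", path],
                             capture_output=True, text=True).stdout
        match = re.search(r"([\d,]+)\s*frag", out, re.IGNORECASE)
        return int(match.group(1).replace(",", "")) if match else 0

    # 1) and 2): analyse the disk and rank the files by number of fragments.
    files = glob.glob(B2D_DIR + r"\**\*.bkf", recursive=True)
    ranked = sorted(files, key=fragment_count, reverse=True)

    # 3) and 4): defrag file by file, worst first, timing each run and stopping
    # before the deadline would be blown. Because every file is about 1Gb, the
    # last run is a reasonable estimate of how long the next one will take.
    last_run = 0.0
    for path in ranked:
        if time.time() + last_run > DEADLINE:
            break
        if fragment_count(path) <= MIN_FRAGMENTS:
            break   # the list is ranked, so everything left is already tidy
        start = time.time()
        subprocess.run(["contig", path])   # make this one file contiguous
        last_run = time.time() - start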

This still isn't as good as Veritas pre-allocating the files (sometimes referred to by marketing departments as Virtual Tape Libraries ;) but it's a start.

We haven't been running long enough yet, so I have a suspicion that ultimately free space fragmentation will be a problem. If this is correct then we will probably buy PerfectDisk to do a free space consolidation over the weekend; I think it is the only defragger to guarantee a single-pass free space defrag.

Summary.
BackupExec has the potential to be a great product (if only it had the ability to preallocate files or control allocation extents). Mark Russinovich is brilliant; buy his book Windows Internals. Python is the best scripting tool I have used. All comments are unsolicited testimonials.

Hope this helps!!!!!!!!!!!!!!!!!!!!!!!!!

Brian_Nations
Not applicable
I'm using an Overland REO device and have Server 2003 installed on my backup server. The REO is 1 terabyte and is currently about 30% full. I've been getting major fragmentation since I first started backup-to-disk. I would post a screenshot, but I recently ran a defrag so it's nice and tidy for now. So far performance isn't affected, but we've only been using this for about 2 months.

Ross_Smith_2
Level 4
Hi guys,

Sorry I've been quiet for so long, but we've got some good figures from this forum now so I've approached Veritas again and have been speaking with a few of my contacts in the UK.

Veritas have arranged a conference call next Wednesday (hurricanes permitting...) between myself, one of their UK consultants, a Product Manager from France, and another Product Manager from Florida who has close links to engineering.

I think it's fair to say Veritas are now well aware of this issue. On Wednesday I'll be trying to find out exactly how serious they feel this is and what work (if any) they have done so far. I'll be asking them what their plans are for addressing this, and if I can I'll try to get an ETA out of them for a solution.

Bear in mind this is still very early days for Veritas, so don't expect miracles. While we're all aware that fragmentation exists, I've been running this thread for 4 months and only have 2-3 conclusive reports of slowdowns due to the fragmentation. Also, at this stage we don't know how fundamental this problem is. It may be that it can be fixed simply now that Veritas are aware of it, but it may also be central to the design of BackupExec, in which case any fix is going to take time.

I'll let you all know how the meeting goes.

Ross

Paul_Yip
Level 4
I hope something is fixed soon because my backups are running SO SLOW. My full backups are taking more than just the weekend and my dailies are running into the next morning. This is unacceptable.

Ross_Smith_2
Level 4
Paul, can you drop me an e-mail? I'd like to check what kind of speeds you're getting. I've hit a couple of other issues in the past that have affected speed far more than the fragmentation has.

e-mail is: myxiplx - you can guess this bit - hotmail.com

Ross

Ross_Smith
Level 4
One additional comment - Windows XP does not appear to suffer anywhere near as badly when it comes to fragmentation.

This doesn't help anyone running a NAS device and I still don't like the way BackupExec is allocating space for these files, but at least XP seems to handle it better.

We've now got a home grown NAS box running Windows XP and have just re-built our backup server and re-installed that with XP too. Our 'proper' NAS devices are now being used for our temporary archive store and our winzip backups.

Fragmentation wasn't the only motivation here - we don't have a dedicated network for backups and the server rebuild allowed us to put 2Tb of storage on the backup server itself. That alone has more than doubled our backup throughput and the reduced fragmentation is just a much appreciated bonus.

Ross

Ross_Smith
Level 4
Ok, a question for people that know more about NTFS and drives than me:

At the moment BackupExec requests file space from the OS in small chunks. We all know the end result of this is the fragmentation we see here.

I just realised that requesting space in this way is also going to have a performance penalty: if I'm right, you're asking the drive to constantly switch between writing data and updating the FAT (I know, I'm out of date with terminology, and this is NTFS not FAT32, bear with me...)

We're all hoping Veritas will update their drivers to request the entire file in one go - NTFS will allocate the space and they can then stream the data straight to it. I'm thinking this could give a fairly significant performance boost. Seek times are traditionally the slow part of drive performance. Asking the read/write head to constantly switch between the FAT and the data must be slower than just streaming data to or from the drive.

I'm no expert, and I've no idea how Veritas have optimised their drivers, but in theory you'd have thought reducing that head movement would have at least a 50% performance boost...

OK, just done a little background reading - hard disk benchmarks on the PC Pro site for one drive report transfer rates of 14Mb/sec for small files and 60Mb/sec for large files... that's a 4x boost comparing 100Mb file operations to 25kb operations... pretty similar to what we're hoping for here, really.

Some drives even dropped to as low as 2.4Mb/sec when writing small files...

Very, very curious about this now. Does anybody else have any comments on this?

Ross

Peter_Ludwig
Level 6
Ross, you wrote that Windows XP is better, but better than what?

And I see no reason why there shouldn't be a registry entry for setting the size of allocation in kB (or whatever).

People who are not happy with the default value can tune their system.

greetings
Peter

Ross_Smith
Level 4
Windows XP has far less fragmentation when used as a B2D device. After a reasonable period of use, the bkf files stored on the XP NAS box have a few hundred fragments each. Over an equivalent period you can expect the files on the Windows 2000 Server NAS boxes to have tens of thousands of fragments.

Ross

Ross_Smith
Level 4
Sorry Peter, there's no registry entry for this setting; it's fundamental to the way BackupExec writes data to disk devices.

Veritas are investigating to see if there is any way they can improve this, but it's too early to know yet what they will be able to do.

Ross