
NetBackup disk storage fragment size and rehydration speed

grlazy
Level 3
We have a NetBackup 7.1 installation with three media servers running MSDP pools. The master is an ESX virtual machine (Windows 2003, 4 processors, 8 GB of memory).

Two media servers are identical (HP ProLiant BL620c G7) with 4 TB of DS8300 storage (SVC-managed). One is the production server and the second is the one to which data is duplicated (via SLP). The OS is Windows 2008 R2 Standard with 196 GB of RAM; I have limited NetBackup's use to 32 GB (it couldn't start with more than 75%). Each of them has an IBM 3584 library with 4 LTO3 drives.

We also have one small standalone media server (Dell 2950) with 850 GB of internal storage (also an MSDP pool), running Windows 2008 R2 Standard with 16 GB of RAM. Its tape library is an HP 4048 with LTO4 drives.

All tape libraries have been tested and work at optimal speeds: the LTO3 drives usually write at 79 MB/sec and the LTO4 at almost 117 MB/sec. The source for my speed tests was usually the physical files of the MSDP pool on each server.

Maximum fragment size on all tape storage units is set to 15360, and the "Reduce maximum fragment size" setting on the disk storage units to 15000.

My issue is that while all traditional backups from disk storage to the local tape library work like a charm, when the time comes for "rehydration" (duplicating from the MSDP disk pool to tape) I can't get more than 42-47 MB/sec. I did manual rebasing and tried duplicating the rebased images to tape; nothing changed speed-wise. This issue appears only on the ProLiant servers. The Dell 2950, on the other hand, works like a charm: it rehydrates at 65-70 MB/sec with the very same NBU patch level (7.1.0.4) and the same settings.

My suspicions are on the DS8300 storage and the "Reduce maximum fragment size" setting. I read this article: http://www.symantec.com/business/support/index?page=content&id=HOWTO56051&actp=search&viewlocale=en_US&searchid=1338196697204 and lowered "Reduce maximum fragment size" from 15000 to 10240.

After that I ran a duplicate of some recently rebased images and saw about 25% more rehydration speed. Does this make any sense to you? Do I need to go lower with that "Reduce maximum fragment size" setting?
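For context, a quick back-of-the-envelope sketch of what those rates mean for a duplication window (the 2 TB pool size here is a hypothetical example, not from the post):

```python
# Back-of-the-envelope: how long a duplication window takes at a given
# rehydration rate. The 2 TB figure is a hypothetical example pool size.
def duplication_hours(data_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to move data_gb at rate_mb_per_sec."""
    seconds = (data_gb * 1024) / rate_mb_per_sec
    return seconds / 3600

# 2 TB duplicated at the observed 45 MB/sec vs the LTO3 native 79 MB/sec
slow = duplication_hours(2048, 45)   # ~12.9 hours
fast = duplication_hours(2048, 79)   # ~7.4 hours
print(f"at 45 MB/s: {slow:.1f} h, at 79 MB/s: {fast:.1f} h")
```

At these pool sizes the gap between 45 and 79 MB/sec is several hours per duplication window, which is why the rehydration rate matters so much here.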
1 ACCEPTED SOLUTION

Accepted Solutions

grlazy
Level 3

Update that may help:

 

I did a couple of things (or more) that may have helped.
I had a look at the NetBackup 7.1.0.4 maintenance pack technote. It contains the recommended minimum hardware specs; stay above those specs.
I upgraded the media/MSDP servers to much higher specs, but saw minimal difference in rehydration speed.
I still saw only around 45 MB/sec to one LTO3 drive (up from 27 MB/sec before). Still not good enough.

Yet garbage collection was now running in the blink of an eye, and that alone was a good sign.


I optimized the tape drive settings (buffer size, number of buffers, etc.). In a plain local file backup I could see optimal speed. I also did a lot of restores to be sure.


I checked the speed of my storage. I used the camel tool (it has a newer name now), copied data to another storage system (a separate storage system, not another volume on the same hardware), kept time, and did the math to work out the throughput. Before the upgrade I had lower throughput, but now I was above 120 MB/sec (Symantec says 150 MB/sec; on my lower-spec hardware I now see about 65 MB/sec rehydration with 117 MB/sec storage throughput, to be exact). Either way, your storage has to be faster than this; it won't work at lower speeds.
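The "copy, keep time, do the math" measurement above boils down to one division; a minimal sketch (the file size and elapsed time are made-up numbers):

```python
# Effective storage throughput: copy a large file between two storage
# systems, time it, and divide. The numbers below are illustrative only.
def throughput_mb_per_sec(bytes_copied: int, elapsed_sec: float) -> float:
    """MB/sec given total bytes copied and elapsed wall-clock seconds."""
    return bytes_copied / (1024 * 1024) / elapsed_sec

# e.g. a 50 GiB copy that took 7 minutes
mbps = throughput_mb_per_sec(50 * 1024**3, 7 * 60)
print(f"{mbps:.0f} MB/sec")  # 122 MB/sec
```

Copying between two separate storage systems (rather than between volumes on the same hardware) matters because it keeps the read and write sides from competing for the same spindles.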

Some time after that I changed the maximum fragment size of the dedupe pool's storage unit from 15360 to 10240. I think it was better, but I still wasn't satisfied.
We happened to have an IBM storage expert on premises, and he noticed that the fibre LTO drives (and the tape library) were in L-port mode (a kind of arbitrated loop) rather than N-port (point-to-point with the switch). That change alone added an extra 10-15 MB/sec of throughput to traditional backups.

I also experimented with the settings in contentrouter.cfg (msdp pool path\etc\puredisk). The settings you are allowed to change are listed in the 7.1.0.4 release notes. Two significant settings that can make or break your throughput are "ReadBufferSize" and "PrefetchThreadNum". Change one at a time and, while no rehydration jobs are running, duplicate some 30-50 GB from the MSDP pool to tape; that is enough to show you whether it helps.
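For reference, contentrouter.cfg is an INI-style file; a fragment might look roughly like this. The values shown are illustrative placeholders only, not recommendations; verify the exact names and sanctioned values against the 7.1.0.4 release notes before editing anything.

```ini
; <MSDP pool path>\etc\puredisk\contentrouter.cfg -- fragment only
; Illustrative values; change one setting at a time and re-test
; with a 30-50 GB duplication before touching the next one.
ReadBufferSize=65536
PrefetchThreadNum=2
```

A restart of the deduplication services is generally needed before such changes take effect, which is worth planning around given how long spoold can take to come back up on large pools.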

I also do a couple of things daily.
I run garbage collection on the pools and start compaction by hand.
I disable rehydration to the tape drives while backups are running. After the backups end I start them again (I have a cmd script in the Windows scheduler), but I use a tape storage unit (a copy of the single one I have) with only one drive, to be sure that no two jobs read from the pool at the same time.


8 REPLIES

thesanman
Level 6

I dream of 40-50 MB/sec! I have storage capable of 250 MB/sec but I'm lucky to get 20-25 GB/hour in rehydration speed, which makes rehydration almost unusable. I have to rehydrate to disk and then roll that image off to tape to avoid shoe-shining the tape drives. It doesn't seem to matter whether I run 1 or 4 rehydrations at the same time; they're all too slow.

Yes, I'm running v7.1.0.4, Windows 2008 R2.

Do you mean changing the "Fragment Size" value of the underlying disk pool storage unit? Mine is set to the default value of 51,200 MB. How did yours start so low?

Certainly, I can see your logic in reducing this number.  It's not something I've seen before but maybe you've hit on something here.

There's been an expanding appendix in recent release notes, "Rehydration Improvements", which deals with various tweaks, but it mostly says not to touch these without approval from Symantec Technical Support. The two that might be of interest are PreFetchThreadNum and ReadBufferSize, but I have yet to work through them given that, in my case, every time I stop and restart NetBackup it takes 1.5 hours to come back online (my MSDP units are approx 12 TB).

grlazy
Level 3

Thank you for the reply.

I had similar problems with some old media/MSDP servers.

What we found out was that the original hardware proposal/implementation was wrong.

The requirements stated that the CPU had to be a 54xx series or above, with a clock speed of 2.5 GHz or higher, but our servers were well below spec (older, slower CPUs).

As soon as we upgraded to IBM BladeCenter HS21 blades (2x E5430 CPUs) we started to see some 40-47 MB/sec, using storage that, with both the camel tool and Iometer, can deliver 130 MB/sec of throughput.

Before the upgrade we also had to wait some 30-40 minutes for spoold to come up.

I also ran quite a few tests with every parameter the 7.1.0.4 release notes propose, especially PreFetchThreadNum and ReadBufferSize. There was not much change in rehydration speed. I also played with the disk buffers, with no big gain.

I need to better understand how the external storage operates. I also need to check some settings on the QLogic card, especially MaximumSGList, which I think may relate to my bottleneck.

About the "Reduce maximum fragment size" setting:

The setting was initially configured by a certified service provider; 15000 was the Symantec-recommended value. After that I read about fragment size and its effect on backups and restores.

Because of our environment, which has 385 clients over 2 Mbit lines, we have no need for fast backups, so I could give away some backup speed for an increase in rehydration speed.
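The arithmetic behind that trade-off is worth spelling out: a 2 Mbit/sec line caps each client far below tape speed, so giving up backup-side speed costs little here.

```python
# A 2 Mbit/sec WAN line caps a client's backup throughput at roughly
# 0.25 MB/sec, two orders of magnitude below LTO3 native speed
# (~79 MB/sec), so trading backup speed for rehydration speed is cheap
# in this environment.
line_mbit = 2
client_mb_per_sec = line_mbit / 8           # 0.25 MB/sec per client
lto3_mb_per_sec = 79
print(client_mb_per_sec)                     # 0.25
print(lto3_mb_per_sec / client_mb_per_sec)   # the drive is 316x faster
```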

Still, the results are not clear.

I think I have to keep rebasing my clients' data.

thesanman
Level 6

I've been doing some testing in my QA NBU environment. Using the same v7.1.0.4 Linux client over Gigabit Ethernet to a v7.1.0.4 Windows 2008 R2 MSDP server, I changed the maximum fragment size on the disk pool storage unit down from 51,200 MB to 25,600, then 12,800, and finally 6,400. I ran each test twice over a period of 24 hours. I did not stop and restart NetBackup between tests.

Backup times were essentially unchanged, as were rehydration times to a local SAN-based staging volume.

Obviously in your environment you saw better performance; in mine I didn't.

The media server in this case is a little under-spec'd clock-speed-wise, but monitoring the server I didn't see a CPU bottleneck, nor an I/O one.

My understanding is that your MSDP server should rebase data in the background. Yes, you can force it, but I was told not to bother and to let the background processing sort it out. Over time the clients will get rebased, although my reading of the release notes seems to indicate it will not start until the next day.


Marianne
Level 6
Partner    VIP    Accredited Certified

Thanks for the feedback!

Suggestion:

Publish your findings in a blog or an article. 

grlazy
Level 3

My English is not that good. I will try...

Marianne
Level 6
Partner    VIP    Accredited Certified

Perfectly understandable! 

grlazy
Level 3

OK,

I gathered the story together and blogged it:

https://www-secure.symantec.com/connect/blogs/my-deduplication-story-nbu-7

 

Hope it helps someone.