my deduplication story with NBU7

grlazy · ‎11-02-2012

Well, here is my deduplication story with NBU 7...

Be gentle: I am not good at writing and not used to blogging either, but I had to blog this in case anyone has the same issues.

These are personal experiences and may not work for you. They may give you ideas, though.

The plan for the new setup was that the first of two media servers receives the data from 385 WAN clients and then sends it by optimized duplication to the second, "disaster" media server. Later on we will merge an old NBU 6.0 environment into the newer one.

The files were mostly user files (docs, PSTs, some images and SQL exports).

The initial hardware consisted of an NBU 7.0.1 master server (ESX VM, Windows 2003 64-bit, 2 cores, 4 GB memory, enough storage).

We had two media servers (they must have been IBM 3640 x64, 8 GB RAM) with MSDP pools on some 2 TB volumes (SATA disks, SVC managed, most likely striped across one or more arrays). The tape library was a partition of a fibre-attached TS3584 with two LTO3 drives.

Those were the initial specs which, with the proposed implementation, should have been able to rehydrate some 4 TB in a few hours.

The client side dedup process worked like a charm from day one.

The rehydration to either media server never worked right. We got about 10 MB/sec throughput at most.
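
To put that number in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: it assumes decimal units (1 TB = 1,000,000 MB) and a perfectly steady throughput, which real jobs never have, and the function name is made up for this example.

```python
def hours_to_move(terabytes, mb_per_sec):
    """Hours needed to move `terabytes` at a steady `mb_per_sec`.

    Assumes decimal units (1 TB = 1,000,000 MB) and constant throughput.
    """
    if mb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    seconds = terabytes * 1_000_000 / mb_per_sec
    return seconds / 3600


if __name__ == "__main__":
    # 4 TB is the amount we wanted to rehydrate; 10 MB/sec is what we got.
    print(f"{hours_to_move(4, 10):.1f} hours")  # about 111.1 hours
```

At 10 MB/sec the 4 TB would need more than four days, nowhere near "a few hours".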

We opened a case, and after many hours on the phone (or should I say WebEx) with second-level support and, in the late stages, with development (!!!), we (mainly support) found that we had been out of spec from the very start.

The media servers' CPUs couldn't sustain the stress of rehydration (above 80% almost all the time) and the storage was slow (less than 60 MB/sec).

At that point, with the hardware shown to be out of spec, management decided to go for a total upgrade of the main infrastructure.

After that, the master is an ESX virtual machine with Windows 2003, 4 cores and 8 GB of memory.

The two media servers are now identical (HP ProLiant BL620c G7) with 4 TB of DS8300 storage (SVC managed). One is the production server and the second is the one to which the data is duplicated (via SLP).
The OS is Windows 2008 R2 Standard with 196 GB RAM. I have limited NetBackup's use to 32 GB (it couldn't start with more than 75%).

The MSDP pool runs on the 4 TB storage and the MSDP DB on another volume on separate storage.
Each of them has an IBM 3584 with 2 LTO3 drives. Later on the drives were upgraded to LTO5.

At some point later we needed one small media server (Dell 2950) with 850 GB of internal storage (also holding MSDP pools) but standalone. It runs Windows 2008 R2 Standard with 16 GB RAM. The tape library is an HP 4048 with LTO4 drives.

All tape libraries have been tested and work at optimal speeds during traditional backups. The LTO3 drives in production usually write at 79 MB/sec and the LTO4 drives at almost 117 MB/sec. The source for my speed tests was usually the physical files of the MSDP pool (on each server).

Maximum fragment size on all tape libraries is set to 15360 and the "reduce maximum fragment size" setting on the disk storage to 15000.

My issue at this point was that while all traditional backups from the disk storage to the local tape library worked like a charm, when the time came for "rehydration" (duplication from the MSDP disk pool to tape) I couldn't see more than 42-47 MB/sec. Even that speed was achieved only after patching the main servers to 7.1.0.4.
I did manual rebasing numerous times and tried duplicating the rebased images to tape. Nothing more happened (speed-wise).
That issue appeared only on the ProLiant servers.

The Dell 2950, on the other hand, worked like a charm. It rehydrated at 65-70 MB/sec with the very same NBU patch level (7.1.0.4) and the same settings. And that with the INTERNAL SAS RAID 5 pool.

I optimized the 3584 tape drive settings again (buffer size, number of buffers, etc.). I tested the speeds and they were good for LTO3. But when rehydrating, the Dell was continuously giving some 65 MB/sec, while the HPs, with all that processing power and speedy storage, couldn't get past the 55 MB/sec mark at top speed.

Well, at this point the god of luck finally smiled on me.

First we managed to upgrade the LTO3 drives to LTO5, and at the same time the storage guys (mostly an IBM storage expert in our country) saw that the communication method from the FC switch to the 3584 was somehow wrong. The mode was set to L-port (a kind of loop) and should have been N-port (point to point).

And yes, we can now see 65 MB/sec on the IBM systems and on the LTO5 3584 drives.

I do not know whether the issue was the LTO generation or the communication mode.

The fact is that YOU HAVE TO BE SURE THAT THE TAPE DRIVES ARE RUNNING IN THE OPTIMAL COMM MODE AND AT OPTIMAL SPEED.

I have made many changes on the software side and it is a long story. But here is a short list:

I optimized the tape drive settings (buffers/number etc).

Checked the speed of my storage.

I used the camel tool (it has a newer name now) and also did a timed copy test to another storage system (a different storage system, not another volume on the same hardware). Then I did the math to find the storage speed. Before the upgrade I had lower throughput, but now I was above 120 MB/sec (well, Symantec says 150, but I can see 65 MB/sec rehydration throughput with 117 MB/sec of storage speed, to be exact, and that on my lower-spec hardware). Either way your storage has to be at least this fast; it won't work with lower speeds.
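
The "math" for the copy test is nothing more than size divided by time. A minimal sketch in Python, assuming decimal megabytes (1 MB = 1,000,000 bytes); the function name and the sample figures are made up for illustration and are not my real measurements:

```python
def copy_throughput_mb_per_sec(bytes_copied, elapsed_seconds):
    """Average throughput of a timed copy, in decimal MB/sec."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_copied / 1_000_000 / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical example: a 50 GB test file copied in 7 minutes.
    rate = copy_throughput_mb_per_sec(50 * 10**9, 7 * 60)
    print(f"{rate:.0f} MB/sec")  # 119 MB/sec
```

Use a file large enough that caching does not flatter the result, and copy to different hardware, as described above.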

I had fragment-related suspicions about the storage, and while reading I ran into an article about reducing the maximum fragment size. I have read that article:
http://www.symantec.com/business/support/index?pag...

and I lowered the "reduce maximum fragment size" setting from 15000 to 10240. My thinking was that with smaller fragments the rehydration, like the restore process, should be faster.

After that I ran a duplication of some recently rebased images and saw some 25% more rehydration speed.

I also experimented with the settings in contentrouter.cfg ({msdp pool path}\etc\puredisk). The settings you can change are noted in the 7.1.0.4 release notes. Two significant settings which can fix or destroy your throughput are "ReadBufferSize" and "PrefetchThreadNum". Change one at a time and, while no other rehydration jobs are running, do a 30-50 GB duplication from the MSDP pool to tape. That is enough to show you whether the change works.
I run garbage collection on the pools and start compaction by hand.

The commands are:

Master server:

1. bpimage -cleanup -allclients

2. nbdelete -allvolumes

At each MSDP pool server:

crcollect -v -m +1,+2

crcontrol --processqueue (two times).

After some time I run

crcontrol --compactstart

I had suspicions that compaction was either not running or keeping some ghost processes and not releasing space on the dedup pools.

I do reboot my NetBackup universe (core servers only) from time to time.

After that I can see some 10-15% more free space after running the compactstart command above.

I disable rehydration to the drives while backing up stuff to the pools. This leaves the pool quite unstressed by the demands of the rehydration process.

After the deduped backups end, I enable the queued rehydration jobs (I have a cmd script in the Windows scheduler). For this I use a separate tape storage (you know, a copy of the single one I have) with one drive, just to be sure that I won't have two or more jobs reading from the pool at the same time.

The stop command, run from the Windows scheduler on the master server:

nbstlutil inactive -destination {tape storage name}

And it starts again with:

nbstlutil active -destination {tape storage name}
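
My own script is a plain cmd file and is not shown here. As a rough sketch of the same idea in Python, assuming nbstlutil is on the PATH of the master server: the function names and the storage name "tape_stu_name" are hypothetical, and by default the script only prints the command instead of running it.

```python
import subprocess


def build_nbstlutil_command(action, destination):
    """Build the nbstlutil command line quoted above as an argument list."""
    if action not in ("active", "inactive"):
        raise ValueError("action must be 'active' or 'inactive'")
    return ["nbstlutil", action, "-destination", destination]


def toggle_rehydration(action, destination, dry_run=True):
    """Print the command (dry run) or actually execute it."""
    cmd = build_nbstlutil_command(action, destination)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if nbstlutil returns a non-zero exit code.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "tape_stu_name" is a placeholder; use your own tape storage name.
    toggle_rehydration("inactive", "tape_stu_name")
```

Scheduling one call with "inactive" before the backup window and one with "active" after it gives the same stop/start behaviour as the two commands above.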

That enable/disable target "feature" was used after I had solved my big issues with the rehydration speed. At that point I was sure about the amount I could rehydrate before the next backup session, and the dedup job queue was emptied after that....

One more note.

I made an SLP which backed up in the first step and then rehydrated to two drives at the same time. It seems that this has the same issue as the inline backup copy had (you know, the traditional backup which was targeted at two tape drives at the same time and had its throughput divided by the number of drives).
