Backup speed/throughput question (these are fun!)

Michael_Flegg
Level 4
Certified
G'day all,

We're having some fun trying to figure out where the bottleneck is in our backup environment, in particular backing up our NetApp NAS box.

Our master server is 5.1MP5 on 2K3 Ent Ed SP1 with 8GB RAM, dual 3.0GHz single-core CPUs (HP DL580 G2), with 2 x LSI Logic U320 SCSI HBAs attached to 6 LTO2 drives in a Quantum P4000 (2 drives per channel, plus one channel controlling the robot). There are another 2 media servers, but lower spec'd and with a total of 4 drives between the two of them, all inside the same P4000 unit (I have more on that config if you need it).

The master server is the server/client that we back up our NAS CIFS shares through; that is, in our policy the master server is what we use to connect to the shares, and the STU we use is also on the master server...

The STU is set up for 3 drives, 20 MPX and no max frag size. We are using HP LTO2 drivers at the OS level.

The policy in question is set up so that we have 14 streams (CIFS shares ranging from a few hundred GB to nearly 2TB), MPX 5 and no limit on jobs per policy; meaning we actually have 14 streams running at one time across 3 drives (5/5/4 split).

NUMBER_DATA_BUFFERS on the master server is set to 224, no SIZE_DATA_BUFFERS so default is 65536 and no NET_BUFFER_SZ meaning default 256Kb (or is it KB - I know it makes a difference!).

Having said that, under the client properties of the master server I have set "Communications buffer" to 257 kilobytes as suggested by Veritas (and a Veritas engineer).

Currently this little scenario results in backups taking anywhere from 20-odd hours (for the few-hundred-GB shares) to 3 days plus for the nearly-2TB one. That bytes! (pardon the pun).

Here is my question - I have looked at our bptm and bpbkar logs from the master server, and I'm getting a lot of what I think is a good result in the bptm log of:

"fill_buffer: socket is closed, waited for empty buffer 0 times, delayed 0 times, read 747618304 bytes"

BUT for bpbkar logs, I get a lot of this:

"<4> tar_backup::OVPC_EOFSharedMemory: INF - bpbkar waited 7727 times for empty buffer, delayed 11643 times" and worse!

I've done some reading, but I can't get my head around where the problem actually lies - with the client or the server? My interpretation (and I'm probably wrong) is that we can't get the data from the client (NAS) quickly enough to feed the tapes?
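In case it helps show what I'm looking at, here's a rough way of totalling those counters across a whole log (just a sketch in Python; the patterns assume the exact message formats quoted above, and the script/file names are only examples):

    # tally_waits.py - sum the wait/delay counters from a bptm or bpbkar log.
    # The regexes assume the message formats quoted above; adjust them if
    # your NetBackup version words the messages differently.
    import re
    import sys

    bptm_re = re.compile(r"waited for (?:empty|full) buffer (\d+) times, delayed (\d+) times")
    bpbkar_re = re.compile(r"waited (\d+) times for empty buffer, delayed (\d+) times")

    waits = delays = 0
    for line in open(sys.argv[1], errors="replace"):
        m = bptm_re.search(line) or bpbkar_re.search(line)
        if m:
            waits += int(m.group(1))
            delays += int(m.group(2))

    print("total waits:", waits, "total delays:", delays)

Running that against one day's bptm log and the matching bpbkar log (e.g. "python tally_waits.py 020307.log" - filename just an example) at least gives me totals to compare between the two sides.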

If someone could clarify for me that would be fantastic and MUCH appreciated. Any other thoughts/suggestions are truly welcomed also.

Thanks for listening!
Cheers
Mike
15 REPLIES

sdo
Moderator
Partner    VIP    Certified
Hi Michael,

A good place to start when analyzing the wait and delay figures is the performance and tuning guide by George Winter, here:
http://eval.veritas.com/downloads/van/4019.pdf

Regards,
Dave.

sdo
Moderator
Partner    VIP    Certified
Hi Michael,

I have previously posted some tuning comments in this thread too:
http://forums.veritas.com/discussions/thread.jspa?threadID=67607

Regards,
Dave.

Dennis_Strom
Level 6
A good tuning forum discussion that covers a lot:
http://forums.symantec.com/discussions/thread.jspa?threadID=67607&tstart=0

Just realized I posted the same link, so ditto to what David posted; very good thread.

Michael_Flegg
Level 4
Certified
Thanks fellas for the info, much appreciated.

I have now read most of those links and will use them to try to fine-tune our environment. I had also read a few others on this forum before posting my question. What I was hoping for was someone to comment on where they think the bottleneck might be occurring, either client side or master/media server side, based on the results of those logs.

I've read up on the examples etc, but I don't trust my own interpretation....! I think the problem is with the client, but I was hoping to get a second opinion?

Cheers
Fleggy

scored_well2
Level 4
Certified
Michael, what is your link between the NAS box and your media server? Your tape drives are waiting for data - as shown by the 0 wait times and counts in the bptm log. In other words, the bptm process has never had to wait for resources to become available.

I'm assuming here that your network-attached storage is indeed using the network; if so, then that is likely to be your bottleneck. A single Gbit NIC will only offer 360GBytes an hour at full usage. Expect to see around 320GBytes in the real world. Divide that between multiple jobs and all of a sudden your throughput will look very slow.
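Just to put rough numbers on that (a back-of-the-envelope sketch using the figures above, not a measurement of your environment):

    # Back-of-the-envelope per-stream throughput over one GbE link,
    # using the ~320 GB/hour real-world figure quoted above.
    gb_per_hour_per_nic = 320
    concurrent_streams = 14                    # the 14 streams in your policy

    per_stream = gb_per_hour_per_nic / concurrent_streams
    print(round(per_stream, 1), "GB/hour per stream")        # ~22.9

    hours_for_2tb = 2000 / per_stream                         # ~87.5 hours, i.e. about 3.5 days
    print(round(hours_for_2tb, 1), "hours for a ~2 TB share")

At roughly 23 GB/hour per stream, a 2 TB share works out to nearly 90 hours on its own, which is very much in the ballpark of the 3-day-plus backups you're describing.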

Rgds,

Simon.

T_N
Level 6
Hi Michael,

you can check the performance data in /usr/openv/netbackup/db/error/*log

Michael_Flegg
Level 4
Certified
Simon, the server has 2 x 1Gb NICs, set up as SLB (switch-assisted load balancing with fault tolerance) via HP Teaming s/w. I could have sworn we had three NICs...

So that gives us about 680GB an hour, and currently I have the policy set up for 14 streams with MPX 5 (I think I'm overdoing it with the streams at that MPX?), so should that - theoretically - work out to approx 48GB an hour per stream still available over the wire?

We plan on putting either 2 or 4 more 1Gb NICs into that box also. The h/w is pending.

Cheers
Fleggy

Dennis_Strom
Level 6
How much memory is in that box?

Michael_Flegg
Level 4
Certified
G'day Dennis, 8GB - peak usage (according to 'Task Manager') is around 5.5GB so far.

Michael_Flegg
Level 4
Certified
Regarding multistreaming and multiplexing, guys: all the literature I have read points to setting up streams per physical disk where possible, and using MPX for multiple clients.

How would you treat a NetApp NAS device? Would you base the 'physical' disk on volume or aggregate? i.e. would you set up multiple streams based on what CIFS shares are on what volume, or on the aggregate they reside on?

OR, would you go the opposite way and set up a policy with just one or a few streams and use a high MPX?

I know it may appear that I'm attacking too many things at once here, but I have someone looking into the networking side of things (we think it's not set up correctly between the NICs and the switch) and I'm looking at the NBU config side of things and trying to adopt best practices along the way.

In truth this all should have been done before the NAS was commissioned, but we weren't given enough (read any) time to test backing up a NAS. Now, as you can see, we are playing catch up....

scored_well2
Level 4
Certified
Michael,

One of the benefits of W2K3 and HP Teaming is that you can readily see the throughput on the NICs and any packet drops/losses. Take a look there and, if you see large numbers of dropped packets, you'll have a duplex/speed misconfiguration somewhere from the NICs outwards. Also, look at the % of networking used under Task Manager. You should be seeing 70%+ if the data's getting off the network properly.

I don't have much knowledge about NAS configuration, but for testing the physical disk / streaming question, create a test backup policy and storage unit. Have multiplexing set high so that you don't end up using a second drive. Gradually add one share at a time and see whether the usage of the tape increases. You should see no reduction in KB/sec on the first share to be backed up as other shares are added. If you do see a reduction, it looks like the disks are having to share I/O and so multiple streams shouldn't be used in future. This could also indicate a network bottleneck, though.

Make sure the NAS storage also has enough network bandwidth. You obviously have a good understanding of this but there's no point having 2GBit/sec going into your media server if the NAS box can only send data out at 100Mbit/s half-duplex, for example.

As an aside, the number of buffers you're using seems very high. Normally, NBU needs around the 16 - 50 range. Each one of those buffers will grab memory, even if they don't end up being used. Try to set your SIZE_DATA_BUFFERS to the same as the LTO2 tape buffer (256K?) and reduce the number of buffers you're using. Watch out in W2K3 SP1 though: the OS blocks buffer sizes greater than 64KB, but there's a hotfix available from MS - it requires changing the tape.sys file.
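If it helps, on a Windows media server those tuning values normally live as single-value files under the NetBackup config directory, something like this (paths assume a default install location - adjust for your installation):

    <install_path>\VERITAS\NetBackup\db\config\SIZE_DATA_BUFFERS      contains just: 262144
    <install_path>\VERITAS\NetBackup\db\config\NUMBER_DATA_BUFFERS    contains just: 16

Each file holds nothing but the number, and the new values are picked up when the next backup starts.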

Good luck!

Simon.

eric_lilleness
Level 4
I am having similar performance problems backing up NetApps via NFS mounts to a media server. I have some jobs that run at 300 KB/second - this is with large NetApp volumes made up entirely of small 1-2 KB files.

My best performance is about 22 MB/second on large Oracle dbf files. For a logical volume implemented on a dozen disks, I would expect over 100 MB/second if these were SAN disks.

It's my hunch that in both our cases the biggest bottleneck is spelled NAS.

FYI, all our stuff is UNIX and uses NFS rather than CIFS, so I cannot tell you what commands to run on the NetApp to troubleshoot CIFS. What I found from running nfsstat on the NetApp was that on the filesystems with the small files, about 50% of all I/O ops were metadata/file-attribute reads. Also, by default NetBackup writes a new access time to each file it backs up, so this can add some overhead - adding DO_NOT_RESET_ACCESS_TIME to your bp.conf on the servers will change that behavior.

I totally agree with the previous post about your number of buffers being way too high. I am not confident enough in my understanding of buffer usage to advise whether or not the number of buffers can compensate in any way for having them set too small (the size should be 256K for your drives and the number should be 16). I doubt that the number can make up for the size, so the referenced patch is needed to achieve best tape performance.

Good luck, Eric

Dennis_Strom
Level 6
Are you using NDMP? If you are not, then going to NDMP with NetApp will provide speed and restore simplicity. There have been a couple of threads on that here; I will try to dig them up.

Michael_Flegg
Level 4
Certified
Thanks Simon. One of my colleagues has been fine-tuning the network side of things for me (I'm a novice at that stuff). We did find some buffering errors, fixed them and did some further fine-tuning (increased Rx buffers for the master server). We saw a MAJOR improvement in performance after that! You mentioned 70%; we are currently seeing up to about 40%, so there should still be room for even more improvement.

I'm going to adopt the testing method you suggested also re: MPX/multistreaming. We'll see how that goes.

Um, it's not me that has the good understanding of the NAS; it's the colleagues that I rope into helping with this problem! We're pretty confident at this stage that the NAS is pushing out data at a good rate though.

Re: the buffers, I've been told in the past to have "number_data_buffers" as high as you can without running out of memory on the server. I concede that you may be right, but now that I'm seeing better numbers, I'm loath to change anything.

With size_data_buffers, I'm keen to keep them at the Windows-standard 64KB. We aren't using tape.sys as a driver; we are using HP's original (and STILL only) HP LTO2 driver - hplto.sys. I think we can attain what we need without changing the size of the data being written to tape. If it were UNIX, I would, but I just don't trust Windows!

Thanks Eric, hope you are able to solve your issues too!

Strom, no we aren't - yet!

We got our license activation key for it last week, so our next step is to configure everything to use NDMP. People might think I'm crazy spending all this time fixing a problem that could be solved by going straight to NDMP, but I want things to work as they should using this method first, then change over to NDMP backups in a controlled manner - not a panicked manner! And also, by doing this we are finding problems and then fixing them.

scored_well2
Level 4
Certified
Michael,

Glad you've seen the increase and I'm sure you'll improve it.

Just as info for the buffer situation, the relationship between the buffer size and the number of buffers is all about the amount of data you can send to the tape. So, for one 256KB buffer, you need four 64KB buffers to send the same amount of data. Using four times the number of buffers will move roughly the same amount of data to your tape drives but with far more overhead involved.
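To put rough numbers on the difference (a sketch only, using the values that have come up in this thread):

    # Compare the two buffer configurations discussed in this thread.
    KB = 1024

    current_memory   = 224 * 64 * KB    # 224 buffers of 64KB  -> 14 MB per buffer set
    suggested_memory = 16 * 256 * KB    # 16 buffers of 256KB  ->  4 MB per buffer set
    print(current_memory // (1024 * 1024), "MB vs", suggested_memory // (1024 * 1024), "MB")

    # Hand-offs needed to move 1 GB of data through the buffers:
    one_gb = 1024 * 1024 * KB
    print(one_gb // (64 * KB), "fills at 64KB vs", one_gb // (256 * KB), "fills at 256KB")

So the smaller buffers mean four times as many buffer fills (and the associated waits) for the same gigabyte of data, which is the overhead I'm referring to.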

By all means, use your current method if you have the resources to do so. Just bear in mind that it's not as efficient and, as your environment grows, it might have an impact at a later date.

Cheers,

Simon.