How to calculate SIZE_DATA_BUFFERS

rsm_gbg
Level 5

Hi folks,

I'm looking into getting better performance out of our backups.
I found a recent Community post that is quite interesting, as I have exactly this problem.

Netbackup Data Consumer waiting for full buffer, delayed 2186355 times
https://www-secure.symantec.com/connect/forums/netbackup-data-consumer-waiting-full-buffer-delayed-2186355-times

I read
http://www.symantec.com/business/support/index?page=content&id=TECH1724

I have played around with different values in SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS.
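For reference, this is how I've been setting them on the media server (the values below just match what my logs show; a new value only takes effect for backups started afterwards):

# the touch files live in the standard config directory
echo 131072 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
echo 32 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS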

Whatever values I set, there is no change in the delay line:
09/18/2013 11:31:40 - Info bptm (pid=6079) waited for full buffer 42749 times, delayed 154422 times

The size and number do change for each newly started backup when I change the values:
09/18/2013 10:36:46 - Info bptm (pid=6079) using 131072 data buffer size
09/18/2013 10:36:46 - Info bptm (pid=6079) using 32 data buffers

But it also depends on the network settings, the tape drive, probably the interface to the tape drive (SCSI/SAS/FC-AL), and the label on the tape.

Surely there must be a proper way of calculating the optimal/default/best values!?

Trial and error doesn't appeal to me that much.
Just getting rid of the waiting for full buffer should improve things a bit.

All Solaris 10 hosts; SCSI Sun StorageTek SL48 tape library with 2x LTO4 drives.

Cheers,

-Roland

39 REPLIES

mph999
Level 6
Employee Accredited

I'll have a think and come back to you. Either I'm missing something simple, or something a little odd is going on.

mph999
Level 6
Employee Accredited
In /usr/openv/netbackup/db/config, where you have the SIZE and NUMBER data buffer files, do you also have SIZE_DATA_BUFFERS_DISK and NUMBER_DATA_BUFFERS_DISK? If so, what are the values in them? (From your post above I can't tell if the buffer info is from the tape or the disk backup.)

The tests you have kindly performed so far show that the network seems OK and the client read speed is OK, yet tape is not far off 3x slower than disk, which certainly doesn't seem right.

I think at the moment it is important to keep away from multiplexed backups - they are way more 'complex' - and we want to see the true speed that non-MPX gives.

Unfortunately, for this sort of issue logs don't really help that much. The number of delays is really the only part we are interested in, but there is no reason logged; it is simply trial and error, try this, try that. (The reason is that the buffers sit on the edge between NBU and the outside world, so there is no way to log what happens before the buffers - it's outside NBU.)

I'll have a look through the internal DB during lunch tomorrow; perhaps a similar issue has been reported previously that was not fixed by one of the usual solutions.
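A quick way to check, assuming the standard install path (just a Bourne shell sketch):

cd /usr/openv/netbackup/db/config
# print whichever of the four touch files exist, with their values
for f in SIZE_DATA_BUFFERS NUMBER_DATA_BUFFERS SIZE_DATA_BUFFERS_DISK NUMBER_DATA_BUFFERS_DISK
do
  [ -f "$f" ] && echo "$f = `cat $f`"
done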

rsm_gbg
Level 5

Hi,

Thanks for your effort in helping me, really appreciated.
The Symantec forums are actually one of the few that are really helpful, with good expert advice.

As to your query, yes, these

10/07/2013 10:40:08 - Info bptm (pid=9136) using 262144 data buffer size
10/07/2013 10:40:08 - Info bptm (pid=9136) using 256 data buffers

10/07/2013 11:11:18 - Info bptm (pid=9136) waited for full buffer 33181 times, delayed 122425 times

are from the disk backup.

root:/usr/openv/netbackup/db/config# ls -l
total 7
-rw-------   1 root     root           4 Sep 20 09:00 NUMBER_DATA_BUFFERS
-rw-------   1 root     root           7 Sep 19 17:54 SIZE_DATA_BUFFERS
drwxr-xr-x   2 root     root           2 Oct  8 07:43 shm
root:/usr/openv/netbackup/db/config#

I have turned off multiplexing - that really bogged everything down.

Is there a native Solaris command like tar or cpio that we could use to see if native speed is normal?

Could writing to two tapes be the issue? As explained earlier, I do two copies.

Tomorrow we will have an engineer onsite conducting some tests on the drives.
Just to rule that out.

- Roland


rsm_gbg
Level 5

Hi,

One of today's backups is very, very slow, and this is what I see in the Solaris /var/adm/messages log:

Oct  9 08:24:13 pnms01 last message repeated 25 times
Oct  9 08:24:28 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP       MSL G3 Series    G.70:
Oct  9 08:30:58 pnms01 last message repeated 26 times
Oct  9 08:31:13 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP       MSL G3 Series    G.70:

Any clues as to what this is?

rsm_gbg
Level 5

Hi,

I got an IBM engineer out today to do some diag.

The drives are at the latest firmware (B63W).

I will upgrade the library firmware today to H.20 (from G.70).

I will do a backup test today with a single tape.

- Roland

mph999
Level 6
Employee Accredited

So, the numbers were from disk - OK. I think the best thing to do is log a call to get AppCritical run (network analysis); let's see if that shows anything - if nothing else, just for elimination. Post the case number up here so I can keep an eye on it.

I managed to skip the bit that said you do two copies. That will skew the results a bit: for multiple copies, the data in the memory buffer is sent to tape1 and then immediately sent to tape2, so effectively things slow down. The reason is that bptm is a single-threaded process, so it can't do two things at once.

Can you arrange a test backup that is not multi-copy and always uses exactly the same data (we don't want a moving target)? I suspect you are very careful with your tests, but let's just be sure going forward.

The difference in speed between the disk and tape backups can be confusing. I wonder if the following is happening. We have delays on waiting for full buffer - we know that, and it causes xx mins of delay per yy hours (this is bptm waiting for a full buffer, i.e. waiting for data from the client). I suspect that on the tape backup we are additionally getting 'waited for empty buffer xx times, delayed yy times' in the bpbkar log, and that in the case of a tape backup these delays are 'significant' - i.e. there is a delay in the memory buffer being emptied, a delay in the data getting from memory to the actual tape.

If so, we know from past experience that 128 or 256 buffers of size 262144 should work - so we can 'most likely' discount the buffer settings as causing the delay, which would leave the possibility of a tape drive fault / firmware level / driver at least contributing.

It's just an idea; I'm not saying this is the case or trying to pass blame, but we need to consider anything and everything until it is 'proved' otherwise (hence AppCritical etc.).

Kindest regards, Martin
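If you want to look for those delay lines yourself, the legacy debug logs are the place (this assumes the log directories have been created under /usr/openv/netbackup/logs; the date suffix below is just an example):

# on the media server
grep "waited for" /usr/openv/netbackup/logs/bptm/log.100913
# on the client
grep "waited for" /usr/openv/netbackup/logs/bpbkar/log.100913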

mph999
Level 6
Employee Accredited

I opened my eyes eventually.

Oct  9 08:24:28 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP       MSL G3 Series    G.70:

This is just an NBU function that makes some checks on the config of the library.


jim_dalton
Level 6

Roland

Tried a synthetic test? This uses NB to generate your data and stuff it down whichever pipe you tell it - very useful for ruling out disks and files, and you can churn out data very rapidly. NB loves large files: have a crack at synthetic tests locally and across the network, and you'll surely find out something interesting once you know the data pipe has huge capacity. I think it's only a Solaris policy option. Have a search for GEN_DATA; it comes with a bunch of other directives... how much data, how many files, how random... (see the sketch below).

Large? Gb large...

With this data on input direct to tape, you should see the drives hit their peak of 120MB/s. Take it from there.
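From memory it looks something like this in the policy's backup selections list (double-check the exact directive names against the docs - I'm going from memory, and the values are only examples):

GEN_DATA
GEN_KBSIZE=10485760
GEN_MAXFILES=10
GEN_PERCENT_RANDOM=50

That would be roughly 10GB across a few files, half random, so the drive's compression doesn't flatter the numbers.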

Jim

mph999
Level 6
Employee Accredited
Worth a go, Jim. It's good to test with large files, as many small files are a challenge for any backup software (you would usually use FlashBackup for those). However, a simple OS backup should be sufficient for testing - OK, you may not get the max speed (nice big DB files are good for that), but you should get 80+ MB/s.

rsm_gbg
Level 5

Some interesting new data.

I upgraded to the latest library firmware, but as expected it didn't change much.

The IBM engineer just ran a health check and said it was all OK...

I did a single-tape backup test, and that showed something interesting.
Backup to ONE tape (drive2) took 23 min!
Backup to the other drive (drive1, with drive2 downed) took 31 min, but the catalog backup (to disk) started in the middle of that backup.
Still a lot of delays though.
From this single-tape backup:
Drive 2   10/09/2013 13:34:20 - Info bptm (pid=6466) waited for full buffer 20244 times, delayed 48033 times
Drive 1   10/09/2013 14:10:57 - Info bptm (pid=7312) waited for full buffer 14993 times, delayed 40645 times

So I would actually be better off running the backup twice - once to local tape and once to remote.

I will try to start two backups, using the pools to get both running at the same time.

I will investigate how to run this synthetic test as well.

mph999
Level 6
Employee Accredited
That's looking better, much better... 40,000 delays is still quite a few though... could be worth getting AppCritical run via a support call.

rsm_gbg
Level 5

Ok, I'll call support.

rsm_gbg
Level 5

Case # 05267881 - AppCritical run has been created.

rsm_gbg
Level 5

The AppCritical run showed nothing - full speed on the 1Gb link.


I ran a test where I split the inline copy into two policies - one going to the local tape and one going to the remote tape.

NBU happily started both simultaneously, and it took just 40 min.

And this is the same OS backup, just running in parallel!
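For the record, I started the two policies together from the master like this (policy and schedule names here are just examples):

/usr/openv/netbackup/bin/bpbackup -i -p OS-local-tape -s Full -h pnms01 &
/usr/openv/netbackup/bin/bpbackup -i -p OS-remote-tape -s Full -h pnms01 &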

mph999
Level 6
Employee Accredited

Yes, that is two separate bptm processes working at the same time.

With multiple copies, you have one bptm process that has to do two things and therefore takes twice as long.

M

rsm_gbg
Level 5

I set up a test NetBackup server in the lab with the same settings, and I get the same high delays.

This is a 43-min OS backup of about 30GB:
10/15/2013 13:46:38 - Info bptm (pid=11094) waited for full buffer 37709 times, delayed 157410 times

mph999
Level 6
Employee Accredited
Sorry for the delay.

OK, on your test - was this using multiple copies? And was the server backing itself up, or a client across the network? I have a 7.5 test server (might be 7.5.0.5, not sure without checking) - Linux, with a VTL - on which I can run a test backup of a Windows server; let me see what I get.

On your test backup - about 40 mins / 157410 delays - I make the delays add up to 39 mins (each delay is roughly 15 ms, and 157410 x 0.015 s is about 39 minutes). 30GB = 30720MB; if the drive were writing at, say, 100MB/s, then a backup of that amount of data would take about 307 seconds, just over 5 mins.

I think it is worth concentrating on the Solaris server backing up itself. I recall from above you had a backup of the server that controls the robot: backup to the disk STU was around 20 mins, to tape was over an hour. Worth running a tar test to the drive - let's see what the drive can do outside NBU (see the sketch below).

This is certainly odd. As I mentioned before, a tape backup of the server to itself is as simple as it gets, and the number/size of buffers set to (128 or 256) / 262144 should give at the very least 'reasonable' performance.

I'll check the case notes and have another think. 'Waiting for full', as we know, is about getting the data to the buffers, and the buffers are 'positioned' on the edge of NBU, between NBU and the outside world (so there isn't much within NBU that can affect this). That said, the disk backup being quicker than tape on the same server clearly points at the tape part, but I would have expected to see 'waiting for empty buffer' as the cause:

waiting for empty - can mean something on the media server side, i.e. we have the data but cannot get it to tape quickly enough (the bucket (buffer) is full but there is only a tiny hole in it, so it doesn't empty out to the drive).

waiting for full - we're not getting the data quickly enough from the client (the bucket is only filling very slowly, and we have to wait until it is full before we can empty it).

Right this second, I'm not seeing why changing between a disk STU and tape causes a change in 'waiting for full buffers', as in both cases (tape and disk) this part happens before we really care about what type of storage we are writing to.
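Something along these lines would do for the tar test (the device path is an assumption - check ls /dev/rmt for the correct no-rewind device on your host, and use a scratch tape as this will overwrite it):

# rewind, then time a write of a decent chunk of real data (e.g. /usr) straight to the drive
mt -f /dev/rmt/0cbn rewind
time tar cf /dev/rmt/0cbn /usr
# MB written / elapsed seconds = raw drive speed outside NBU

I'd avoid /dev/zero as test data - LTO compresses it so well that the numbers come out meaninglessly high.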

rsm_gbg
Level 5

Hi,

I've been doing a lot of testing and trying different new ways of backing up our system.
One thing I think we need to do is change one of the drives; for some reason it seems to be really slow.
It's sort of intermittent. We will probably get the drive changed tomorrow, and we'll see what happens after that.

My new strategy is to use both drives and run backups multiplexed, with one copy only,
and then have Vault do a duplication onto the remote tapes.
Earlier we had a requirement from the customer to do a verify, and they agreed that the duplication would count as the same sort of verify.
Vault then does the catalog backup and ejects the remote tapes.

I'll keep you posted.

- Roland

mph999
Level 6
Employee Accredited
Hi Roland,

Really late here in the UK, so off to bed shortly.

If your strategy works and is acceptable, then that is excellent. However, I would like to get to the bottom of this because, put bluntly, a backup run to your drive - especially a backup of the media server the drive is attached to - should fly along.

A dodgy drive could slow things down: modern drives are very advanced and will re-write automatically if they have issues. Apart from the drop in speed, this is invisible to NBU (and other backup software), so it could potentially be an issue - but I would expect to see delays in waiting for empty buffer, not waiting for full. That said, I haven't seen the full logs (bptm from the media server and bpbkar from the client), so maybe there are some delays in waiting for empty - just that I haven't seen them yet.

If possible, I don't mind having a quick look on WebEx - time zones could be a bit of an issue, but if you let me know your availability we might be able to work something out.

M

rsm_gbg
Level 5

Hi,

Unfortunately WebEx is not possible on this system unless you get security clearance, which is not easy...

I can get you whatever logs you need, though, if you just tell me what you are looking for.

- Roland