09-17-2013 10:23 PM
Hi folks,
I'm looking into getting better performance on our backups.
I found this recent Community post that is quite interesting as I have exactly this problem.
Netbackup Data Consumer waiting for full buffer, delayed 2186355 times
https://www-secure.symantec.com/connect/forums/netbackup-data-consumer-waiting-full-buffer-delayed-2186355-times
I read
http://www.symantec.com/business/support/index?page=content&id=TECH1724
And have played around with different values in SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS.
Whatever numbers I change, there is no improvement in the
09/18/2013 11:31:40 - Info bptm (pid=6079) waited for full buffer 42749 times, delayed 154422 times
message. The size and number do change with each newly started backup when I change the values:
09/18/2013 10:36:46 - Info bptm (pid=6079) using 131072 data buffer size
09/18/2013 10:36:46 - Info bptm (pid=6079) using 32 data buffers
But it also depends on the network settings, the tape drive, probably the interface to the tape drive (SCSI/SAS/FC-AL), and the label on the tape.
There must surely be some way of calculating the optimal/default/best values!?
Trial and error doesn't appeal to me that much.
Just getting rid of the waiting for full buffer should improve things a bit.
All Solaris 10 hosts, SCSI Sun StorageTek SL48 Tape library with 2xLTO4
Cheers,
-Roland
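On the "how to calculate" question: the tuning guides size the shared memory that bptm uses as SIZE_DATA_BUFFERS × NUMBER_DATA_BUFFERS per drive, times the multiplexing factor. A quick sketch with hypothetical values (the formula is the point here, not these particular numbers):

```shell
# Estimate bptm shared-memory use (values are examples, not recommendations).
SIZE=262144     # SIZE_DATA_BUFFERS in bytes
NUMBER=256      # NUMBER_DATA_BUFFERS
DRIVES=2        # concurrently writing drives
MPX=1           # multiplexed streams per drive (1 = no multiplexing)
TOTAL=$(( SIZE * NUMBER * DRIVES * MPX ))
echo "approx. shared memory: $(( TOTAL / 1048576 )) MB"
```

With 256 buffers of 256 KB on two drives that comes to about 128 MB, which the media server must be able to supply as shared memory; it bounds how far the numbers can be raised, it does not pick the "best" value.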
10-07-2013 01:17 PM
I'll have a think and come back to you. Either I'm missing something simple, or something a little odd is going on.
10-07-2013 05:59 PM
Hi,
Thanks for your effort in helping me, really appreciated.
The Symantec forums are actually one of the few that are really helpful, with good expert advice.
As to your query, yes these
10/07/2013 10:40:08 - Info bptm (pid=9136) using 262144 data buffer size
10/07/2013 10:40:08 - Info bptm (pid=9136) using 256 data buffers
10/07/2013 11:11:18 - Info bptm (pid=9136) waited for full buffer 33181 times, delayed 122425 times
are from the disk backup.
root:/usr/openv/netbackup/db/config# ls -l
total 7
-rw------- 1 root root 4 Sep 20 09:00 NUMBER_DATA_BUFFERS
-rw------- 1 root root 7 Sep 19 17:54 SIZE_DATA_BUFFERS
drwxr-xr-x 2 root root 2 Oct 8 07:43 shm
root:/usr/openv/netbackup/db/config#
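For anyone following along, those two files are plain touch files whose entire content is the value; they could have been created along these lines (the path is the standard NetBackup location, and the values are this thread's examples, not recommendations):

```shell
# Create/overwrite the bptm buffer touch files; the new values apply to the
# next backup started, as the bptm "using ..." log lines confirm.
CONFIG_DIR=${CONFIG_DIR:-/usr/openv/netbackup/db/config}
mkdir -p "$CONFIG_DIR"
echo 262144 > "$CONFIG_DIR/SIZE_DATA_BUFFERS"   # must be a multiple of 1024
echo 256    > "$CONFIG_DIR/NUMBER_DATA_BUFFERS"
```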
I have turned off multiplexing; that was really bogging everything down.
Is there a native Solaris command like tar or cpio that we could use to see if native speed is normal?
Could writing to two tapes be the issue? As explained earlier I do 2 copies.
Tomorrow we will have an engineer onsite conducting some tests on the drives.
Just to rule that out.
- Roland
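On the native-speed question above: one common approach is to time a raw dd write straight to the drive, bypassing NetBackup entirely. A sketch, with the caveat that the device path is an assumption for this Solaris box, and TAPE defaults to /dev/null here so the arithmetic can be dry-run safely (writing to a real tape destroys its contents):

```shell
# Raw sequential-write test. On the media server, set TAPE=/dev/rmt/0cbn
# (or whichever no-rewind device maps to the drive) and load a scratch tape.
TAPE=${TAPE:-/dev/null}
BS=262144        # match SIZE_DATA_BUFFERS so block sizes are comparable
COUNT=4096       # 4096 x 256 KB = 1 GB total
time dd if=/dev/zero of="$TAPE" bs="$BS" count="$COUNT"
echo "wrote $(( BS * COUNT / 1048576 )) MB"
```

Since /dev/zero compresses perfectly, an LTO4 drive should sustain its full streaming rate on this input; if it does, the drive and SCSI path are probably fine and the bottleneck is upstream.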
10-08-2013 02:34 PM
Hi,
One of today's backups is very, very slow, and this is what I see in the Solaris /var/adm/messages log:
Oct 9 08:24:13 pnms01 last message repeated 25 times
Oct 9 08:24:28 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP MSL G3 Series G.70:
Oct 9 08:30:58 pnms01 last message repeated 26 times
Oct 9 08:31:13 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP MSL G3 Series G.70:
Any clues to what this is?
10-08-2013 03:48 PM
Hi,
I got an IBM engineer out today to do some diag.
Drives are the latest firmware. (B63W)
I will upgrade the library firmware today to H.20 (from G.70).
I will do a backup test today with a single tape.
- Roland
10-08-2013 11:16 PM
So, the numbers were from disk - OK. I think the best thing to do is log a call to get AppCritical run (network analysis); let's see if that shows anything - if nothing else, just for elimination. Post the case number up here so I can keep an eye on it.
I managed to skip the bit that said you do two copies - that will skew the results a bit. For multiple copies, the data in the memory buffer is sent to tape1 and then immediately sent to tape2, so effectively things slow down. The reason is that bptm is a single-threaded process, so it can't do two things at once.
Can you arrange a test backup that is not multi-copy and always uses exactly the same data (we don't want a moving target...)? I suspect you are very careful with your tests, but let's just be sure going forward.
The difference in speed between the disk and tape backups can be confusing. I wonder if the following is happening: we have delays on waiting for full buffer - we know that, and it causes xx mins of delay per yy hours (this is bptm waiting for a full buffer, i.e. waiting for data from the client). I suspect that on the tape backup we are additionally seeing, in the bpbkar log, "waited for empty buffer xx times, delayed yy times" - and that in the case of the tape backup these delays are significant. That would be a delay in the memory buffers being emptied, i.e. a delay getting the data from memory to the actual tape.
If so, we know from past experience that 128 or 256 buffers of size 262144 should work, so we can most likely discount those as causing the delay, which leaves the possibility that a tape drive fault / firmware level / driver is at least contributing.
It's just an idea; I'm not saying this is the case or trying to pass blame, but we need to consider anything and everything until it is proved otherwise (hence AppCritical etc.).
Kindest regards,
Martin
10-08-2013 11:20 PM
I opened my eyes eventually.
Oct 9 08:24:28 pnms01 tldcd[25080]: [ID 912152 daemon.notice] inquiry() function processing library HP MSL G3 Series G.70:
This is just an NBU function that makes some checks on the config of the library.
10-09-2013 08:54 AM
Roland
Tried a synthetic test? This uses NB to generate your data and stuffs it down whichever pipe you tell it - very useful for ruling out disks and files, and you can churn out data very rapidly. NB loves large files: have a crack at synthetic tests locally and across the network; you'll surely find out something interesting once you know the data pipe has huge capacity. I think it's only a Solaris policy option. Have a search for GEN_DATA; it comes with a bunch of other directives: how much data, how many files, how random...
Large? Gb large...
With this data on input, direct to tape, you should see the drives hit their peak of 120 MB/s. Take it from there.
Jim
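For reference, the GEN_DATA directives go into the policy's Backup Selections list in place of file paths. A sketch - the directive names below are from the NetBackup performance-tuning documentation, so verify them against your release before relying on this:

```
GEN_DATA
GEN_KBSIZE=1048576
GEN_MAXFILES=10
GEN_PERCENT_RANDOM=0
```

That would generate ten files of roughly 1 GB each; with GEN_PERCENT_RANDOM=0 the data is fully compressible, which is exactly the "huge capacity" input Jim describes for pushing the drives to their peak rate.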
10-09-2013 03:06 PM
Some interesting new data.
I upgraded to the latest library firmware, but it didn't do much, as expected.
The IBM engineer just ran a health check and said it was all OK...
I did a single tape backup test and that showed something interesting.
Backup to ONE tape (drive2) took 23min!
Backup to the other drive (drive1, with drive2 downed) took 31 min, but the catalog backup (to disk) started in the middle of that backup.
Still a lot of delays though.
From this single-tape backup:
Drive 2 10/09/2013 13:34:20 - Info bptm (pid=6466) waited for full buffer 20244 times, delayed 48033 times
Drive 1 10/09/2013 14:10:57 - Info bptm (pid=7312) waited for full buffer 14993 times, delayed 40645 times
So I would actually be better off running the backup twice: once to local tape and once to remote.
I will try to start 2 backups and using the pools to get both backups running at the same time.
I will investigate how to run this synthetic test as well.
10-09-2013 03:23 PM
Ok, I'll call support.
10-09-2013 03:53 PM
Case # 05267881 - AppCritical run has been created.
10-09-2013 10:01 PM
The AppCritical run showed nothing; full speed on the 1 Gb link.
I ran a test where I split the inline copy and had two policies, one going to the local tape and one going to the remote tape.
NBU happily started both simultaneously and it took just 40 min.
And this is the same OS backup, but running in parallel!
10-14-2013 02:41 AM
Yes, that is two separate bptm processes, working 'at the same time'.
With multiple copies, you have one bptm process that has to do 'two' things and therefore takes twice as long.
M
10-14-2013 09:29 PM
I set up a test netbackup in the lab with the same settings.
And I get the same high delays.
This is a 43min OS backup, about 30GB.
10/15/2013 13:46:38 - Info bptm (pid=11094) waited for full buffer 37709 times, delayed 157410 times
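Those counters can be turned into an estimated stall time. A throwaway sketch parsing the line above - the per-delay wait is commonly quoted as about 15 ms for bptm in the tuning guides, so treat that constant as an assumption and check it for your version:

```shell
# Pull the wait/delay counters out of a bptm-style log line and estimate
# the total stall. The sample line is the one from the lab test above.
LINE='10/15/2013 13:46:38 - Info bptm (pid=11094) waited for full buffer 37709 times, delayed 157410 times'
WAITS=$(echo "$LINE" | sed 's/.*waited for full buffer \([0-9]*\) times.*/\1/')
DELAYS=$(echo "$LINE" | sed 's/.*delayed \([0-9]*\) times.*/\1/')
# Assuming ~15 ms per delay:
echo "waits=$WAITS delays=$DELAYS est_stall=$(( DELAYS * 15 / 1000 ))s"
```

Under that 15 ms assumption, 157410 delays come to roughly 2361 seconds of waiting - most of the 43-minute window - which would be consistent with the client, not the drive, being the bottleneck.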
10-27-2013 05:24 PM
Hi,
I've been doing a lot of testing and changing to different new ways of backing up our system.
One thing I think we need to do is change one of the drives; for some reason it seems to be really slow.
It's sort of intermittent. We will probably get it changed tomorrow and we'll see what happens after that.
My new strategy is to use both drives and run backups multiplexed to one copy only.
And then have Vault to do a duplication onto remote tapes.
Earlier we had a requirement from the customer to do a verify, and they agreed that duplication would count as the same sort of verify.
Vault then does the catalog backup and ejects the remote tapes.
I'll keep you posted.
- Roland
10-27-2013 07:29 PM
Hi,
Unfortunately WebEx is not possible on this system unless you get security clearance, which is not easy...
I could get you the logs you need, though, if you just tell me what you're looking for.
- Roland