
How to calculate SIZE_DATA_BUFFERS

rsm_gbg
Level 5

Hi folks,

I'm looking into getting better performance on our backups.
I found this recent Community post quite interesting, as I have exactly the same problem:

Netbackup Data Consumer waiting for full buffer, delayed 2186355 times
https://www-secure.symantec.com/connect/forums/netbackup-data-consumer-waiting-full-buffer-delayed-2186355-times

I read
http://www.symantec.com/business/support/index?page=content&id=TECH1724

and have played around with different values for SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS.

There is no change in
09/18/2013 11:31:40 - Info bptm (pid=6079) waited for full buffer 42749 times, delayed 154422 times
whatever values I change.

The size and number do change for each newly started backup when I change the values:
09/18/2013 10:36:46 - Info bptm (pid=6079) using 131072 data buffer size
09/18/2013 10:36:46 - Info bptm (pid=6079) using 32 data buffers

But it probably also depends on the network settings, the tape drive, the interface to the tape drive (SCSI/SAS/FCAL) and the label on the tape.

There must be some way of calculating what the optimal/default/best value should be!?

Trial and error doesn't appeal to me that much.
Just getting rid of the waiting-for-full-buffer delays should improve things a bit.

All hosts are Solaris 10; SCSI-attached Sun StorageTek SL48 tape library with 2x LTO4 drives.

Cheers,

-Roland

39 REPLIES

mph999
Level 6
Employee Accredited

Nope, no way of calculating - just try and see.

Try a SIZE of 262144 and a NUMBER of 128.  If that is better, try 256 buffers.
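On the media server that is just a matter of creating the touch files, something like this (default install path assumed; the new values only take effect for newly started jobs):

# create/overwrite the buffer tuning touch files on the media server
echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
echo 128 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS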

Martin

Mark_Solutions
Level 6
Partner Accredited Certified

If you are testing this using tapes that have previously been written with a different block size, they can retain their original block size.

So each time you do a test, make sure the job does report the correct block size; otherwise label the tape (without the verify option) first and then try again - the label process writes the header with the new block size.
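If you want to do the relabel from the command line instead of the GUI, it should be something along these lines (the media ID and density here are only examples - check the bplabel syntax for your NetBackup version):

# relabel the tape so the header gets rewritten with the new block size (example media ID)
/usr/openv/netbackup/bin/admincmd/bplabel -m A00001 -d hcart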

If you are duplicating from disk to tape, also remember to tune your DISK buffers.

262144 is definitely the best size for LTO4, with the number of buffers starting at 32 and going up - how far all depends on how your media server copes with it.

As they are SCSI drives, make sure the HBA firmware etc. is up to date - and if possible also make sure the drives themselves are on the latest firmware.

rsm_gbg
Level 5

OK, trial and error it is then.

If I hit the sweet spot, will the
waited for full buffer 42749 times, delayed 154422 times
disappear?

I set the size to 256k and am now trying 128 buffers (I have tried 16, 32 & 64).

If the log shows the new numbers as in

09/18/2013 10:36:46 - Info bptm (pid=6079) using 131072 data buffer size
09/18/2013 10:36:46 - Info bptm (pid=6079) using 32 data buffers

Does that definitely mean I don't need to relabel the tape?

I do straight backups to tape over the network; each backup makes two copies, one local and one for remote storage.

That is using the schedule's "Multiple copies" option.

Cheers,

- Roland

Mark_Solutions
Level 6
Partner Accredited Certified

No - if the log shows the data buffer size as 131072 when you have set it to 262144, then you need to label the tape to get it to use the new size.

The number of buffers should also match, but that is not affected by the tape itself.

rsm_gbg
Level 5

Yes, of course - just a copy/paste typo.

mph999
Level 6
Employee Accredited

If I hit the sweet spot, will the
waited for full buffer 42749 times, delayed 154422 times
disappear?

 

It won't disappear, but should reduce.

Ideally you want 0 and 0, but that is not likely to happen.  It is the number of delays that is important; by default each delay is 15 ms.

If the total time caused by however many delays (let's say 1 min) is a very small % of the total job time, forget them; if it's a significant %, it needs looking at.

For example, on a job that runs for several hours, a few tens of thousands of delays is insignificant.  But the same number of delays on a 1 hr job would be very significant.
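As a rough worked example (the numbers here are made up; 15 ms per delay is the default):

# delays x 15 ms as a fraction of the elapsed job time (example figures)
echo "100000 3600" | awk '{ d = $1 * 0.015; printf "delay %.0f s = about %.0f%% of a %d s job\n", d, 100*d/$2, $2 }'
# -> delay 1500 s = about 42% of a 3600 s job - that would definitely need looking at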

M

 

rsm_gbg
Level 5

Hi,

Here is some more data; 256 buffers seem to reduce the backup time,
but the delays stay mostly the same.

It seems the delay is really significant: ~26 min per hour of backup.
These tweaks don't really improve the delays at all.

This is the log for the same backup on 3 different days:

log.091813:09:35:36.774 [4408] <2> set_job_details: Tfile (42165): LOG 1379460936 4 bptm 4408 using 262144 data buffer size
log.091813:09:35:36.775 [4408] <2> set_job_details: Tfile (42165): LOG 1379460936 4 bptm 4408 using 16 data buffers
log.091813:10:33:16.109 [4408] <2> set_job_details: Tfile (42165): LOG 1379464396 4 bptm 4408 waited for full buffer 27600 times, delayed 108182 times
Backup time 00:58:59
KB/sec 13200
Size 42 GB
Delay = 108182 x 15 ms = ~27 min

log.091913:10:02:11.279 [16451] <2> set_job_details: Tfile (42228): LOG 1379548931 4 bptm 16451 using 65536 data buffer size
log.091913:10:02:11.279 [16451] <2> set_job_details: Tfile (42228): LOG 1379548931 4 bptm 16451 using 64 data buffers
log.091913:11:02:00.981 [16451] <2> set_job_details: Tfile (42228): LOG 1379552520 4 bptm 16451 waited for full buffer 23516 times, delayed 99255 times
Backup time 01:00:27
KB/sec 12500
Size 42 GB
Delay = 99255 x 15 ms = ~25 min

log.092013:09:31:18.347 [27507] <2> set_job_details: Tfile (42291): LOG 1379633478 4 bptm 27507 using 262144 data buffer size
log.092013:09:31:18.347 [27507] <2> set_job_details: Tfile (42291): LOG 1379633478 4 bptm 27507 using 256 data buffers
log.092013:10:17:47.844 [27507] <2> set_job_details: Tfile (42291): LOG 1379636267 4 bptm 27507 waited for full buffer 26911 times, delayed 115749 times
Backup time 00:46:58
KB/sec 16800
Size 42 GB
Delay = 115749 x 15 ms = ~29 min

 

- Roland

Mark_Solutions
Level 6
Partner Accredited Certified

Is this just one job going to one drive - or do you have several jobs running to several drives at the same time on the same media server?

We now need to look at other possible bottlenecks in the system - where does the data come from (over the network or locally)?

If it comes over the network, what type of data is it and what is the network speed?

You look to be peaking at 16 MB/s, which is very poor for LTO4.

So it may be the hosts / network being used

One test would be to set multiplexing on the storage unit to 6, fire off 6 jobs at the same time and see what that gives you.

It just does not sound like you can feed the drive(s) fast enough with that single stream.
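If you want to kick those jobs off from the command line, something like this would start manual backups for 6 clients at once (policy, schedule and client names here are only examples - and remember the schedule's media multiplexing setting also has to allow it):

# start manual backups for 6 clients against the multiplexed storage unit (example names)
for c in client01 client02 client03 client04 client05 client06
do
  /usr/openv/netbackup/bin/bpbackup -i -p OS_Backups -s Full -h $c
done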

rsm_gbg
Level 5

Had to tend to a server crash for a couple of days....

We have 2 drives: one for local tapes (in the library) and one for remote tapes that get ejected every night with Vault.

So all backups go to 2 tapes/drives.

I tried to do multiplexing but it seems like the network is the bottleneck.

I get 15000 KB/sec on OS backups (many small files).
On big data files like RMAN backups I get 50000 KB/sec.

Backing up the media server's own OS runs at about 20000 KB/sec.

When multiplexing I get a combined ~20000 KB/sec - a bit faster, but still slow.

What could I expect from an LTO4 over SCSI?

I'll check the network now; I think my problem is there.

 

- Roland

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

The native (uncompressed) speed of Ultrium 4 (LTO-4) is 120 MB/s. A 256 kB buffer size and 64 buffers should be enough for that.

http://en.m.wikipedia.org/wiki/Linear_Tape-Open

Have you already tried to check bpbkar read performance for *exactly* the same data without transferring it through the network? If not, try it following this technote. The delay may be caused by the client rather than the network.

http://www.symantec.com/docs/TECH17541

rsm_gbg
Level 5

Hi,

I'm running Solaris and the bpbkar procedure in that technote is for Windows.
I tweaked it a bit for Solaris, but...

The doc says something like this should appear in the log:

tar_base::backup_finish: TAR - backup:                          15114 files
tar_base::backup_finish: TAR - backup:          file data:  995460990 bytes  13 gigabytes
tar_base::backup_finish: TAR - backup:         image data: 1033197568 bytes  13 gigabytes
tar_base::backup_finish: TAR - backup:       elapsed time:        649 secs     23099898 bps

I set VERBOSE = 5 in bp.conf but I don't get anything like this from:
./bpbkar -nocont /etc 1> /dev/null 2> /dev/null

This is the last bit of my log:

15:47:41.349 [4036] <4> bpbkar main: INF - Client completed sending data for backup
15:47:41.349 [4036] <2> bpbkar main: INF - Total Size:73328388
15:47:41.349 [4036] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info for prefix = hardlinks_ and older than 30 days
15:47:41.350 [4036] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info/root for prefix = hardlinks_ and older than 30 days
15:47:41.350 [4036] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops for prefix = jbp and older than 3 days
15:47:41.351 [4036] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops/nbjlogs for prefix = jbp and older than 3 days
15:47:41.351 [4036] <4> bpbkar Exit: INF - bpbkar exit normal
15:47:41.351 [4036] <4> bpbkar Exit: INF - EXIT STATUS 0: the requested operation was successfully completed
15:47:41.351 [4036] <2> bpbkar Exit: INF - Close of stdout complete
15:47:41.351 [4036] <4> bpbkar Exit: INF - setenv FINISHED=1
 

- Roland

rsm_gbg
Level 5

Hi,

I did this simple network test;
it seems pretty good to me.

root@nbu-mediaserver01:~# /usr/local/bin/iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[  4] local 10.0.0.1 port 5001 connected with 10.0.0.2 port 40078
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 9.2 sec  1.00 GBytes   936 Mbits/sec
root@nbu-mediaserver01:~#

root@solaris-client:/var/tmp# /usr/local/bin/iperf -n 1G -mc 10.0.0.1
------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.2 port 40078 connected with 10.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 9.2 sec  1.00 GBytes   938 Mbits/sec
[  3] MSS size 1460 bytes (MTU 1500 bytes, ethernet)
root@solaris-client:/var/tmp#
 

- Roland

rsm_gbg
Level 5

Hi,

Since I started multiplexing I can't find anything like this anymore:

waited for full buffer 26911 times, delayed 115749 times

There are an awful lot of jobs starting up though, so it's hard to find.
Should there still be such an entry somewhere?

- Roland

revarooo
Level 6
Employee

It should be in the bpbkar and bptm logs.

What did you set multiplexing to?

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

To measure read performance on Solaris, run bpbkar as in this technote. Don't mind if no output is displayed in the console - just check how long it takes with your real backup target.

http://www.symantec.com/docs/HOWTO56131
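On Solaris it boils down to timing a raw bpbkar read to /dev/null, roughly like this (flags as per the technote; the path is just an example):

# time a raw bpbkar read of the data - no network, no tape involved (example path)
time /usr/openv/netbackup/bin/bpbkar -nocont -dt 0 -nofileinfo -nokeepalives /data > /dev/null 2> /tmp/bpbkar_test.err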

BTW, the 'waited for full buffer' counter is logged in bptm once for each multiplexing session. Grep the bptm log after all the jobs in the multiplexed group finish.
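For example (default bptm log location assumed; one summary line appears per multiplexing session):

# pull the wait/delay summaries out of the bptm logs on the media server
grep "waited for full buffer" /usr/openv/netbackup/logs/bptm/log.*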

rsm_gbg
Level 5

I have set the multiplexing to 6.
I've done this in 1 policy with many clients, which are all OS backups.

I've run bpbkar on the same server that has the robot. A backup of the / filesystem takes 23 min. Running the normal backup to tape takes 1:17 hours.

I can't find any 'delayed xxx' entries for the multiplexed clients in the logs.

rsm_gbg
Level 5

Any more ideas guys?

mph999
Level 6
Employee Accredited
OK, from this: "I've run bpbkar on the same server that has the robot. A backup of the / filesystem takes 23 min. Running the normal backup to tape takes 1:17 hours."

If I understand correctly, the bpbkar test (-nocont) takes 23 mins but a real backup takes 1:17 hrs.

If I am correct: set up a basic disk STU and re-run the same backup to that instead of tape (on the same media server you used). Do you get a similar number of waits/delays?
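If you prefer the CLI to the Admin Console, a basic disk STU can be created with something along these lines (the label, path and host are only examples - verify the bpstuadd options on your version):

# create a basic disk storage unit on the media server (example label/path/host)
/usr/openv/netbackup/bin/admincmd/bpstuadd -label disk_test_stu -path /backup_staging -host nbu-mediaserver01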

rsm_gbg
Level 5

A test backup to a disk storage unit takes 31 min, with the normal tape backup also running.
The disk unit is on the same media server as the tape drives.

10/07/2013 10:40:08 - Info bptm (pid=9136) using 262144 data buffer size
10/07/2013 10:40:08 - Info bptm (pid=9136) using 256 data buffers

10/07/2013 11:11:18 - Info bptm (pid=9136) waited for full buffer 33181 times, delayed 122425 times

Loads of delays though.

Summary:

bpbkar to /dev/null = 23 min
disk unit = 31 min
tape = 1:17 hours

- Roland