Duplication, performance of "bpdm.exe"...

sdo
Moderator
Partner    VIP    Certified

Env: NetBackup v7.6.0.1 on RHEL 6.4.  Master/media server (single node).  Fairly new build, about 3 weeks old.

Issue: Duplication to tape (over FC) from MSDP disk pool (itself on SAN/FC storage) seems slow.

One small MSDP disk pool of c. 10 TB, which now holds several weeks of backups and is about 55% occupied.  Currently duplicating a set of full backups (circa 300 backup images, totalling about 7 TB of backup data) via two 'bpduplicate' commands.  The backups being duplicated are the first set of full backups ever ingested by this environment.

No other backups, replications or duplications are running.

No legacy log folders exist.  VxUL logging was at the usual default of Debug=1 and Diagnostic=6.

Nothing scary seen in /var/log/messages.

I'm seeing about 50 MB/s to each of two tape drives.

Also seeing the two "bpdm.exe" processes regularly hitting 99-100% of a CPU core each, for a total of 25% of machine CPU used.  RHEL reports the server has 8 CPU cores.

NetBackup KMS is configured, and the duplications are writing to an "ENCR_" volume pool for KMS.

NetBackup application daemons were restarted recently (a few days ago).

ulimit and semaphores checked and are ok.

The CPU(s) appear to be... two physical CPUs, each quad core, each core single-threaded.

[user@server ~]# cat /proc/cpuinfo | egrep "MHz|physical id|core id|^$"
cpu MHz         : 1200.000
physical id     : 0
core id         : 0

cpu MHz         : 1800.000
physical id     : 0
core id         : 1

cpu MHz         : 1200.000
physical id     : 0
core id         : 2

cpu MHz         : 1800.000
physical id     : 0
core id         : 3

cpu MHz         : 1200.000
physical id     : 1
core id         : 0

cpu MHz         : 1200.000
physical id     : 1
core id         : 1

cpu MHz         : 1800.000
physical id     : 1
core id         : 2

cpu MHz         : 1200.000
physical id     : 1
core id         : 3

(I think the CPUs must be 1.2 GHz with boost to 1.8 GHz because, for each CPU, two cores show at 1.2 GHz and two cores show at 1.8 GHz.)
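For what it's worth, the "cpu MHz" field in /proc/cpuinfo reports the current (power-managed) frequency rather than the rated speed, so the mixed readings may just be frequency scaling stepping cores up and down. A quick way to pin down what the tin actually is (a sketch; the cpufreq files only exist if the kernel exposes that interface):

# The rated speed is usually embedded in the model name string:
grep -m1 "model name" /proc/cpuinfo

# Hardware min/max frequencies (in kHz) for core 0, if cpufreq is exposed:
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq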

I wasn't involved in the environment sizing, design, build or implementation - so I've no idea (yet) what the server tin actually is.

I suspect that the fact that each "bpdm.exe" maxes out a CPU core is what limits each duplication to tape to around 50 MB/s.

I would have expected "bpdm.exe" to simply be a data mover, receiving data from spoold, and forwarding to bptm, and I wouldn't have expected to see each process consuming 100% of a CPU core.

Can anyone offer any reason as to why bpdm maxes out a CPU core?

Thanks in advance.


15 Replies

revarooo
Level 6
Employee
No idea. Have you tried enabling bpdm logging? For a dedupe duplication, 50 MB/s to tape is not terrible considering the rehydration overhead.

Nicolai
Moderator
Partner    VIP   

Try disabling hyper-threading. A single-threaded process will not benefit from hyper-threading; actually, you may end up with worse performance, because two single-threaded processes will fight for resources on the same core.

Besides that, it's the quality of the disk under the MSDP pool that decides what rehydration speed you will get.

sdo
Moderator
Partner    VIP    Certified

Hey thanks Nicolai.

1) What (in your experience) brings you to suggest that?

2) How do I determine whether hyper-threading is active?  Then I can go to the O/S admin team and see whether it's something that can easily be enabled/disabled, so that I can perform some testing either way.

 

 

sdo
Moderator
Partner    VIP    Certified

On Friday/Saturday, I noticed that 4 GB of swap space was consumed by the O/S, and that this server has 23 GB of RAM - but today, on the now-quiet system, zero swap space is in use.

IMO, a NetBackup master/media should not be making use of any swap space.

My rough calculations are:

    RHEL O/S                        6 GB
    NetBackup application           8 GB
    NetBackup buffers               2 GB   (although not configured at the moment)
    MSDP (1.5 GB per TB x 10 TB)   15 GB

...so, possibly, this system should have at least 31 GB of RAM - which might go some way to explaining why I saw 4 GB of swap space used during busy periods.

 

I don't know enough about the RHEL O/S to know whether recalling lots of data from swap consumes CPU, or where that CPU time/cost gets charged to - so I don't know whether the use of swap space was somehow causing CPU cycles to be consumed within the bpdm processes when I saw them hitting 100%.

 

Anyway, I grabbed the bptm data-consumer wait and delay times (from the activity monitor job log for the duplication jobs), and many hours were lost to bptm waiting for a full buffer.  Unfortunately, I didn't have logging enabled at the time, so I don't have the bpdm data-producer waits/delays for an empty buffer.  I did some further duplication testing (with logging enabled) on a small-ish (15 GB) backup image, and bpdm was spending some time waiting too - but during that duplication test (from MSDP to tape) I didn't see the bpdm process go above an average of 35% of a CPU core.
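For anyone wanting to pull the same counters, the wait/delay lines sit in the legacy bptm and bpdm logs; a rough sketch, assuming a default /usr/openv install and that the log folders have been created:

# bptm is the consumer when writing tape; bpdm is the producer reading MSDP:
grep "waited for full buffer" /usr/openv/netbackup/logs/bptm/log.*
grep "waited for empty buffer" /usr/openv/netbackup/logs/bpdm/log.*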

 

The LTO6 tape media (2.5 TB native) which did reach full both show as circa 3.5 TB occupied - so the tape drive heads are definitely able to apply SCSI T10 compression to the incoming data, which says to me that the tape drives are not receiving encrypted or compressed data to write.

I had previously enabled MSDP encryption at rest (in the pd.conf file), so I'm now wondering if maybe something about the mixture of NetBackup, RHEL, and the Intel CPU model/type inside the server tin is altogether not very good at decrypting the MSDP de-duped data blocks read from the MSDP pool.

Does anyone know where the 'encryption/decryption' of NetBackup MSDP is performed?  I mean, is encryption/decryption performed by:

a) software local to NetBackup MSDP, i.e. does Symantec use its own software routines?

b) or does NetBackup call an RHEL O/S library routine to perform encryption/decryption?

c) or does NetBackup (and/or the O/S library call) make use of extended hardware instructions within the Intel CPU itself?

d) or maybe it's option c) first, and if the required CPU hardware instructions for encryption/decryption are not present in the CPU model, then option b) or a) is used?

Which then makes me want to ask: how do I actually determine where NetBackup MSDP encryption/decryption is performed?

Thanks.

 

mph999
Level 6
Employee Accredited

As a really simple test: back up some data to a basic disk STU, then duplicate that - how fast is it?

sdo
Moderator
Partner    VIP    Certified
Yes - more testing is needed.  I just did a quick test of duplicating from MSDP to a DSU on the same 10 TB LUN in RHEL, and it achieved 154 MB/s.  I'll try to find time to do a wider variety of storage-unit-to-storage-unit duplication tests, with enough bpdm logging - and maybe some buffer-tuning tests too, after I've baselined the basics.

Nicolai
Moderator
Partner    VIP   

1: We had an in-house Linux admin look at our setup, and this was his advice.

2: You can see whether hyper-threading is enabled in the BIOS; you enable/disable it there as well.
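You can also get a hint from the OS without a reboot; a sketch for Linux:

# If "siblings" is double "cpu cores", hyper-threading is on;
# if they are equal, it is off (or the CPU doesn't support it):
egrep "siblings|cpu cores" /proc/cpuinfo | sort -u

# Or, where util-linux provides it:
lscpu | grep -i "thread(s) per core"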

Regarding the use of swap space, set vm.swappiness = 1 in /etc/sysctl.conf.

Explained here : https://lonesysadmin.net/2013/12/11/adjust-vm-swappiness-avoid-unneeded-disk-io/
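To apply it without waiting for a reboot, something like:

# Take effect immediately:
sysctl -w vm.swappiness=1
# And persist it across reboots:
echo "vm.swappiness = 1" >> /etc/sysctl.conf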

 

sdo
Moderator
Partner    VIP    Certified

Thanks for the swappiness tip, Nicolai.  Not changed it yet, but I've shared the info with the sys admin.

Thanks Martin.  I did various tests between DSU, MSDP and LTO6 - to and from - with around 175 GB of data, and have more than doubled MSDP-to-LTO6 duplication performance simply by setting SIZE_DATA_BUFFERS to 262144 and NUMBER_DATA_BUFFERS to 64.
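For reference, on a Unix/Linux media server those tunables are plain-text touch files; this is roughly what that change looks like (paths assume a default install, and new values only apply to jobs started afterwards):

echo "262144" > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
echo "64" > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS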

I'll have to wait for the next big batch of duplications to see whether it has any real world benefit.

The RHEL question, re hyper-threading interfering with a single-threaded process, is probably not one for here...

...so, I think my remaining NetBackup question is:

1) How do I determine where MSDP encryption/decryption actually takes place?  In CPU hardware, via an RHEL module, or in NetBackup software?

Thanks.

Haniwa
Level 4
Partner Accredited Certified

What do you find to be the best method to measure duplication speed, especially when dealing with a large batch of images?

Ken W
 

Nicolai
Moderator
Partner    VIP   

I would increase NUMBER_DATA_BUFFERS to 128 or 256.

MSDP encryption is documented in the deduplication guide

http://www.symantec.com/docs/DOC6466

Page 34.  As I read it, it's NetBackup that does the encryption/decryption, but NetBackup may very well use library calls to other modules in the OS.  The manual says: "Deduplication uses the Blowfish algorithm for encryption".

sdo
Moderator
Partner    VIP    Certified

Yes - I tried 128 and 256 for NUMBER_DATA_BUFFERS, for not much gain.  I'll see how it goes with 64 for this config, and then perhaps try 128 next month.

Re: where is encryption performed, I found this re Intel CPUs:

http://en.wikipedia.org/wiki/AES_instruction_set

...so it looks like certain Intel CPUs can accelerate AES encryption, but there is no mention of Blowfish.
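A quick sketch for checking which of these instruction sets a given box actually reports (flag names as exported by the Linux kernel; "aes" is AES-NI, and there is indeed no Blowfish-specific flag):

# List the crypto/vector flags for the first CPU:
egrep -m1 "^flags" /proc/cpuinfo | tr ' ' '\n' | egrep "aes|avx|sse4"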

And this paper describes the possibility of extending Intel MMX and SSE instructions to provide hardware assistance to other encryption algorithms:

http://www.dii.unisi.it/~giorgi/papers/Bartolini08c.pdf

And this short note lists how the AVX and AVX2 instruction sets of some Intel CPUs can be used by the Blowfish algorithm:

https://www1.cs.fau.de/avx.crypto

All way beyond me.

It would be really cool if there were a NetBackup command (maybe undocumented until now) which reported whether NetBackup will make use of hardware (in-CPU) acceleration for Blowfish encryption within MSDP, or whether it has to use a (probably slower) implementation fully in software.

Maybe Symantec could supply a standalone tool so that customers can check/report whether their server tin is appropriate for encryption? (for MSEO and/or MSDP even?)

And I still haven't found out why the "bpdm" processes that I saw were flat-lining at 100% of a CPU core.

Does anyone know of any commands I can run against the PID of a bpdm process to reveal what it's doing inside?

Thanks.

sdo
Moderator
Partner    VIP    Certified

Hi Ken,

The quickest would be to literally take the 'actual elapsed duration' run time and the KB from the activity monitor, drop these into Excel, and apply some simple cell formulae - but this can be a bit crude, and doesn't take into consideration tape mount delays, resource contention, etc.  It shouldn't require any scripting, though.

The second method might be to script it up: grab the contents of the activity monitor job log record (via bpdbjobs), extract it, parse it (use programming code to detect and handle resource delays), and spit what's required into a CSV file for Excel.  Extracting the 'job log' can be tricky if you've not done it before - but easy once you know how.
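A minimal sketch of that second method; the field positions in the -all_columns output vary between NetBackup releases, so treat the $10/$15 column numbers here as placeholders and verify them against your own output first:

# Aggregate throughput across all job records returned:
bpdbjobs -report -all_columns | awk -F, '
  { kb += $15; secs += $10 }   # $15 = kbytes, $10 = elapsed seconds (assumed!)
  END { if (secs > 0) printf "~%.1f MB/s aggregate\n", kb / 1024 / secs }'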

The third would be to parse the actual bpdm and bptm (and other) logs, and programmatically collate the stats from the 'sets' of related process IDs (PIDs), munging the data to produce truly accurate performance stats per media server, per policy, per policy type, per client, per client type, per day, etc... basically, write your own performance reporting tool.  Not easy, but definitely doable, as the NetBackup logs are nicely structured and fairly consistent.

As for testing, here are some thoughts:

In a NetBackup installation I usually like to create two 'sets' of test policies, which permanently remain configured within the list of policies but are usually de-activated, and which have policy names that clearly distinguish them as either 'functionality' test policies or 'performance' test policies, e.g. ZZZ_FUNC_TEST_MSDP_1, ZZZ_FUNC_TEST_MSDP_2, ...  ZZZ_PERF_TEST_TAPE_BEST, ZZZ_PERF_TEST_TAPE_WORST, ZZZ_PERF_TEST_TAPE_TYPICAL, etc.  The trick is to think about what it is you are trying to achieve and to name the policies such that they relate to each other.

For the functional test policies, I usually use GEN_DATA to generate small amounts of test data (around 1 GB) and use very short 1-day retentions on the schedules.  If your media servers are Windows, then the trick is to leave a folder called ?:\NBU-DO-NOT-DELETE on a drive on each media server, drop a 1 GB file in there, and specify that folder path in your functional test policies.

You could create some functional test policies configured such that they perhaps test all tape drive paths, or test cross-site connectivity.  If you have a site with two or more media servers, then you can also do something like have one test policy per media server, which uses a different media server as its client, e.g.

policy 1 - storage is MSDP on media server 1, but client is media server 3

policy 2 - storage is MSDP on media server 3, but client is media server 1

policy 3 - storage is MSDP on media server 3, but client is media server 2

And perhaps another bunch of SSO functional test policies, configured in such a way that all media servers test IO to all tape drives in all tape libraries.  If configured correctly, these should not consume more scratch media than there are drives (i.e. make sure all schedules in all test policies always use the same short 1-day retention).

This way I always have a set of functional test policies to hand ready to prove that NetBackup appears to be ok - simply by running them.  Ideally a complete run of all 'functional' test policies should take no longer than a few minutes to run.

For performance tests:

If the master, or master/media, or media server is Unix/Linux based - or if I have a decently powerful and well-connected (>= 4 Gb/s) Unix/Linux client - then for the performance test policies (again using a 1-day retention on the schedule) I like to use the GEN_DATA policy include directives to generate medium-sized amounts of data (around 20 GB).  I tend to have three sets of performance policies: best-case compression and de-dupe, worst-case compression and de-dupe, and then typical.  Specifying the best and worst case is easy - just use GEN_RANDOM=0 or =100, and GEN_DEDUPE=100 or 0, as in the sketch below.
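As a sketch, a worst-case (incompressible, non-deduplicating) backup selections list might look like this - directive names as above; the exact names and limits vary by release, so check the GEN_DATA technote for yours (the GEN_KBSIZE/GEN_MAXFILES lines are from memory):

GEN_DATA
GEN_KBSIZE=20971520
GEN_MAXFILES=2000
GEN_RANDOM=100
GEN_DEDUPE=0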

For simulated real world tests:

It's getting the 'typical' performance GEN_DATA settings for something useful that can be difficult.  At the end of the day, it's probably best to use real data if you can.

To try to generate data patterns that look like real data, it's a case of sitting down, looking at the number of files on a typical client, looking at how well it de-dupes or how well it compresses to tape, and trying to create a GEN_DATA profile which gets close to that.

HTH.

Nicolai
Moderator
Partner    VIP   

strace can do it - but you need kernel skills in decoding the output.

http://en.wikipedia.org/wiki/Strace
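A starting point (substitute the real bpdm PID; perf is optional but tells you whether the time is in the kernel or in userspace, e.g. decryption):

# Attach and summarise syscall counts/time; Ctrl-C to print the totals:
strace -c -p <bpdm_pid>

# If little time shows up in syscalls, the CPU is being burnt in
# userspace - perf will show the hot functions:
perf top -p <bpdm_pid>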

Haniwa
Level 4
Partner Accredited Certified

SDO,

Thank you for your detailed reply on duplication rates and testing.  Your willingness (and that of others) to share info is commendable.

In addition to the duplication rate methods you stated, I have used two other (rough) methods...

1. When tape drives are dedicated to "tape-out", look at the "Tape Drive Throughput" report in OpsCenter.  It will show an average KB/sec for each tape drive, and for all tape drives.  Use that number as the duplication rate.

2. Divide " Total MB" output of 'nbstlutil -report', by the elapsed time of all duplications  (start time of 1st job, finish time of last job). May have to 'diff' the nbstlutil output, before and after, if it is a running total.

 

Lastly, there is a Tech Note on MSDP Read Performance (you probably know about it, but for benefit of others):

http://www.symantec.com/business/support/index?page=content&id=HOWTO101014

KW

sdo
Moderator
Partner    VIP    Certified

I hadn't seen that tech note.  So thank you for that.

Personally, I'm not sure about the validity of using sequential IO to test the raw performance of the volumes upon which MSDP resides, as to my mind MSDP IO will be entirely random.

With v7.6.0.x, it appears to me that MSDP keeps itself busy re-organizing - all the time - whilst there is no IO from/to NetBackup.  So I suspect all of this background IO is entirely random.

Does anyone else have a view on whether normal backup data arriving at MSDP results in IO that is all sequential, mostly sequential, mostly random, or entirely random?
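If anyone wants to measure the random-read side directly, here's a sketch with fio (the /msdp directory is hypothetical - substitute your own pool mount point; --direct=1 bypasses the page cache so you test the disks, not RAM):

# 4 concurrent readers doing 256 KB random reads against the MSDP volume:
fio --name=msdp-randread --directory=/msdp --rw=randread \
    --bs=256k --size=4g --numjobs=4 --direct=1 --group_reporting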