NBU 7.7.3 - Slow catalog backup

Kaspar_Brygger
Level 3
Partner

Experiencing poor data transfer speed when running the catalog backup. The setup is as follows:

- VCS clustered master servers running NBU 7.7.3
- OS: RHEL 6.6
- Storage unit is a DataDomain 990
- NIC: 10 Gb/sec
- Catalog size: approx. 1.6 TB
- Catalog resides on SAN
- Observed data transfer speed in the area of 20-25 MB/sec

I've tested the storage unit using GEN_DATA and can see transfer rates of up to 200-300 MB/sec from the master servers.

Does anybody have an idea why the transfer is using only a small part of the bandwidth, and/or how to troubleshoot further?

/Kaspar

7 REPLIES

sdo
Moderator
Partner    VIP    Certified

The bulk of the catalog is the "images database", i.e. a collection of highly structured meta-data in file-system layout.

The folder tree is:

/usr/openv/netbackup/db/images

I would experiment with reading this folder tree using the "bpbkar to null" test.

See @Nicolai's tech tips here:

http://www.mass.dk/netbackup-quick-hints/netbackup-admins-test-tool/
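For reference, a timed run of that "bpbkar to null" test over the images tree could look like the sketch below (the flags are the usual read-only test options, and the stderr file name is just an example):

# Read the whole images tree, discard the data, and time the run
time /usr/openv/netbackup/bin/bpbkar -nocont -nofileinfo -nokeepalives /usr/openv/netbackup/db/images > /dev/null 2> /tmp/bpbkar_images.out

If that read alone is slow, the DataDomain and the network are out of the picture and the problem is on the read side of the master.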

Michael_G_Ander
Level 6
Certified

I would start by checking the read speed on the catalog disk; I think you can use bpbkar to /dev/null for this, just like on a client.
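A raw sequential read test against the catalog file system is another quick sanity check. A minimal sketch, assuming GNU dd; the file path below is only a placeholder for a large file on the catalog mount:

# Read 4 GB sequentially, bypassing the page cache, and let dd report the throughput
dd if=/catalog_mount/some_large_file of=/dev/null bs=1M count=4096 iflag=direct

If dd shows something close to the 20-25 MB/sec seen during the catalog backup, the bottleneck sits below NetBackup (LUN, fabric, array); if dd is fast, keep looking at the file level.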

Have you talked with your SAN administrator? Not all SAN disks (LUNs) are equal, and a lot of people think that backup disks don't need good performance.

 

The standard questions: Have you checked 1) what has changed, 2) the manual, and 3) whether there are any tech notes or VOX posts regarding the issue?

In my case I can only perform tests limited to the OS and NBU. As we're a larger corporation, segregation of duties prevents me from accessing other hardware in our infrastructure.

I have now tried to run the suggested:

/usr/openv/netbackup/bin/bpbkar  -nocont -nofileinfo -nokeepalives /var  > /dev/null 2> /tmp/file.out

... but what am I looking for? I can read the bpbkar log and see a lot of information. Does it specifically tell me the actual disk I/O, speed, or other useful information?
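One way to get a concrete MB/sec figure out of such a run is simply to time it and divide by the amount of data read. A rough sketch (the TEST_PATH variable and the awk arithmetic are only illustrative):

# Point TEST_PATH at whichever folder tree you are testing
TEST_PATH=/var
SZ_KB=$(du -sk "$TEST_PATH" | awk '{print $1}')
START=$(date +%s)
/usr/openv/netbackup/bin/bpbkar -nocont -nofileinfo -nokeepalives "$TEST_PATH" > /dev/null 2> /tmp/file.out
END=$(date +%s)
awk -v kb="$SZ_KB" -v sec="$((END - START))" 'BEGIN { printf "%.1f MB/sec\n", kb / 1024 / sec }'

Comparing that number with the 20-25 MB/sec seen during the real catalog backup shows whether the read side is the limiting factor.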

1) What has changed? Well, I actually don't know. Many people are working on the SAN and network, so the chance that something has changed is high.

2) The manual? It is @Nicolai who wrote the install manuals for our master/media servers, so I believe they are in order and according to Veritas recommendations.

3) Any tech notes? I've browsed some other tech notes to look for clues, but haven't come up with anything that explains my current situation.

Thanks for your time so far answering my questions :)

/Kaspar

Thiago_Ribeiro
Moderator
Partner    VIP    Accredited

Hi,

Was this backup working before?

Has anything changed in your environment?

Please post the bptm and bpbkar logs here.

One other thing: you are almost at the catalog size limit recommended by Veritas.

Recommendation for sizing the catalog - https://www.veritas.com/support/en_US/article.000076312
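To see how close you actually are before comparing against that guidance, a simple du over the catalog tree is usually enough (the paths are the standard NetBackup locations; adjust them if your installation differs):

# Total catalog size, and the images database on its own
du -sh /usr/openv/netbackup/db
du -sh /usr/openv/netbackup/db/images

The images tree normally dominates, which is why its read speed matters so much for catalog backup duration.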

 

Regards,

 

Thiago

Hi Thiago.

 

Yes, it performed much better a long time ago when the environment was fresh. I've almost forgotten how it was, but it was much better.

I cannot supply more information regarding changes made to the surrounding hardware. NBU was updated to 7.7.3 a couple of months ago; it was also performing poorly before that. I haven't had the time to follow up on it until today.

The size can be debated. Yes, Veritas has some sizing recommendations, but we run other environments with catalog sizes of over 4.5 TB and they run fine (including transfer speed to storage).

The bptm and bpbkar logs are attached as tgz files for your information :)

/Kaspar

sdo
Moderator
Partner    VIP    Certified

This command:

/usr/openv/netbackup/bin/bpbkar  -nocont -nofileinfo -nokeepalives /var  > /dev/null 2> /tmp/file.out

...only reads the /var folder tree. You haven't really tested anything, or at best you've tested the wrong area.

The largest part of any catalog backup is nearly always the image catalog.

So then, really your test should be this:

/usr/openv/netbackup/bin/bpbkar  -nocont -nofileinfo -nokeepalives /usr/openv/netbackup/db/images  > /dev/null 2> /tmp/file.out

...and if it is still slow, then the problem is certainly not with NetBackup. bpbkar will read the disk as fast as the disk can possibly keep up; for bpbkar you are only limited by the OS, the file system, and the CPU.

So, if during the test no single CPU core hits 100% (for bpbkar), then the problem definitely lies with the HBA, the SAN fabric, or the storage array.
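A simple way to watch for that while the test runs is per-core monitoring in a second session. A sketch, assuming the sysstat package is installed (the per-core view in top works just as well):

# Per-core utilisation, refreshed every 5 seconds, while bpbkar is running
mpstat -P ALL 5

# CPU usage of the bpbkar process itself
pidstat -p $(pgrep -n bpbkar) 5

If no core is pegged and the cores are mostly sitting in iowait, the time is being spent waiting on the storage path.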

The images database does not contain that many files, so if the read is slow, the problem is not the folder structure or the file count in the structure. But it could be due to file fragmentation. And Unix file-systems don't suffer from fragmentation, do they?! (Oh yes they do.)
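There are standard Linux tools for checking fragmentation before committing to anything. A sketch, assuming an ext4 catalog file system (xfs_db has an equivalent frag report for XFS); the path is the standard images location:

# Extent counts per image catalog file - heavily fragmented files show many extents
find /usr/openv/netbackup/db/images -type f -exec filefrag {} + | sort -t: -k2 -rn | head

# Fragmentation score for the whole tree (ext4 only, needs root)
e4defrag -c /usr/openv/netbackup/db/images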

"Yes, it performed much better a long time ago when the environment was fresh. I've almost forgotten how it was, but it was much better."

Try defragging the catalogue volume/filesystem.

Why: deterioration over time points to the same problem I had on a Windows master server, and I can't see why it would be any different on Linux. I suspect it's volume fragmentation. When I tested it, everybody thought the theory was rubbish: "VMware servers don't require defragging", "Our storage has no performance issues... you are wasting your time defragging". I defragged the catalogue volume down to <5% fragmentation and let it fragment for a month, and found that the catalogue volume fragmented by about 2% every day. I stopped the exercise when fragmentation reached about 80%, because catalogue backups were taking too long. At the start of the exercise backups took ~3.5 hrs, and by the end they were taking ~14 hrs. A plot of catalogue backup duration against percentage of fragmentation showed an almost perfect linear correlation.
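How the defrag itself is done depends on the file system under the catalogue volume. A sketch of the two common Linux cases (the device and mount point names are placeholders; both tools need root and are best run in a quiet window):

# ext4: online defragmentation of the images tree
e4defrag /usr/openv/netbackup/db/images

# XFS: check the fragmentation factor, then reorganise the file system
xfs_db -r -c frag /dev/mapper/catalog_lv
xfs_fsr /catalog_mountpoint

Re-running the catalogue backup (or the bpbkar-to-null test) afterwards gives a clean before/after comparison of the read speed.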