Buffer Delay in our new NetBackup Environment, is this normal?

jayarc_derama · 01-18-2023

Hi,

I'm running a new NBU (10.1.1) environment, and we have struggled to fine-tune NBU's slow performance. It has been 9 days since we started the implementation, and we finally have some performance improvement. Can this be improved further?

We previously used Micro Focus Data Protector 10 (formerly HPE Data Protector). With NBU, my autoloader stops and sits idle, then after a few seconds it writes again.

I have already approached the Veritas technical support team for assistance, but the solution they provided made my backups run slower, in both write speed and elapsed time.

Our environment:

* Both servers are HPE DL380 Gen10.
* The media library attached to the media server is an HPE StoreEver 1/8 G2 autoloader (LTO-8), rated at 300 MB/s native write speed to tape. SAS cable transfer rate: 6.2 GBps.
* The master server is on a separate Windows Server 2016 host with 64 GB RAM and 8 TB of free data storage.
* The media and client server is a single Windows Server 2016 host with 256 GB RAM and 28 TB of storage capacity.

Right now we are running full backups every night.

We modified the following settings in the Veritas\db\config directory (this is the best we can do right now):
* CHILD_DELAY = 5
* NUMBER_DATA_BUFFERS = 512
* PARENT_DELAY = 3
* SIZE_DATA_BUFFERS = 1048576
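For a rough sense of what these settings cost in memory: as I understand the tuning guide, shared memory for data buffers is estimated as NUMBER_DATA_BUFFERS × SIZE_DATA_BUFFERS × number of drives × multiplexing level. A minimal sketch of that arithmetic, assuming one drive with no multiplexing; the function name is hypothetical:

```python
def shared_memory_bytes(number_data_buffers: int,
                        size_data_buffers: int,
                        drives: int = 1,
                        multiplexing: int = 1) -> int:
    """Estimate shared memory used for NetBackup data buffers.

    Assumes the estimate: buffers x buffer size x drives x multiplexing level.
    """
    return number_data_buffers * size_data_buffers * drives * multiplexing


# Values from the settings list above; one LTO-8 drive, no multiplexing.
total = shared_memory_bytes(512, 1_048_576)
print(f"{total} bytes = {total / 2**20:.0f} MiB")  # 536870912 bytes = 512 MiB
```

At roughly 512 MiB per drive and stream, this is small next to the media server's 256 GB of RAM.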

We have around 4.2 TB of data to back up, and it now takes about 6.5 hours to complete.

Without the optimization, using all default values, it took around 21 hours.
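Converted to average throughput, the two run times compare as follows. A minimal sketch, assuming 4.2 TB means 4.2 × 10^12 bytes and that the whole elapsed time is spent moving data; the helper names are hypothetical:

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6


def avg_rate_mb_s(data_bytes: float, hours: float) -> float:
    """Average throughput in MB/s over the whole elapsed time."""
    return data_bytes / (hours * 3600) / MB


def hours_at_rate(data_bytes: float, rate_mb_s: float) -> float:
    """Elapsed hours needed to move data_bytes at a steady rate_mb_s."""
    return data_bytes / (rate_mb_s * MB) / 3600


data = 4.2 * TB
print(f"defaults (21 h): {avg_rate_mb_s(data, 21):.1f} MB/s")    # 55.6 MB/s
print(f"tuned (6.5 h):   {avg_rate_mb_s(data, 6.5):.1f} MB/s")   # 179.5 MB/s
print(f"at 300 MB/s:     {hours_at_rate(data, 300):.1f} h")      # 3.9 h
```

Even at the drive's full native rate, this data set would need close to 4 hours of continuous streaming.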

I read in another VOX article that delays and buffers are bad for autoloaders, since they are not designed for stop/write operations, which could damage both the tape and the drive.

I would appreciate a link to a buffer/delay tuning guide for my environment. The media server writes directly to the connected autoloader (HBA SAS cable).

I consulted an HPE server/autoloader engineer; according to him, there are no settings to configure on the autoloader's front control panel itself.

Everything is controlled and configured by software (i.e., the backup software, NBU), so improving performance and minimizing the delay comes down to fine-tuning NBU.

I'd appreciate any help and advice.

Thank you and regards,

Faysal · 01-18-2023

Hello Mr. Jay,

1. For the HPE server/autoloader, you can contact the vendor to ask whether these settings are suitable for the autoloader, i.e. whether increasing the buffers can damage it. If it is OK, you can continue; otherwise, use the Veritas default settings.

2. You can download the NetBackup™ Backup Planning and Performance Tuning Guide for further tuning:

https://www.veritas.com/content/support/en_US/doc/21414900-146141073-1

Nicolai · 01-18-2023

Ohhh - performance tuning. What a topic :)

You should know the NetBackup™ Backup Planning and Performance Tuning Guide

https://www.veritas.com/content/support/en_US/doc/21414900-146141073-0/v19526785-146141073

This section covers the "waited for empty buffer" and "waited for full buffer" counters:

https://www.veritas.com/content/support/en_US/doc/21414900-146141073-0/v19528173-146141073

You want as many "waited for empty buffer" counts as possible, meaning backup data is sitting in the buffers (NUMBER_DATA_BUFFERS & SIZE_DATA_BUFFERS), ready to be written to tape/disk.

A high "waited for full buffer" count is bad, because it means the buffers are empty and backup data is not arriving from the client fast enough, resulting in the tape drive doing start/stop operations (a.k.a. shoe-shining). You will never get zero "waited for full buffer", but monitoring the running average is recommended. You can't fix slow incoming data by tuning NUMBER_DATA_BUFFERS/SIZE_DATA_BUFFERS. Instead, you need to optimize the data path from the client to the NBU media servers. Possible culprits are antivirus scanners, network bottlenecks, or disk bottlenecks on the client.
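The relationship between the two counters can be illustrated with a toy producer/consumer model. This is a simplification and not how bptm actually schedules work: it assumes fixed integer rates per time step, counts one wait per step in which a side is blocked, and all names are hypothetical.

```python
def simulate(total_blocks: int, num_buffers: int,
             producer_rate: int, consumer_rate: int) -> tuple[int, int]:
    """Toy model of a fixed pool of data buffers between producer and consumer.

    Each time step the producer (client reading data) fills up to
    producer_rate buffers, then the consumer (drive writing data) drains up
    to consumer_rate. Returns (waited_for_full, waited_for_empty): the number
    of steps in which each side found nothing to work with.
    Rates and num_buffers must be >= 1 so every step makes progress.
    """
    filled = 0                # buffers holding data ready for the drive
    produced = consumed = 0
    waited_for_full = 0       # consumer found no full buffer
    waited_for_empty = 0      # producer found no empty buffer
    while consumed < total_blocks:
        for _ in range(producer_rate):
            if produced == total_blocks:
                break
            if filled == num_buffers:
                waited_for_empty += 1
                break
            filled += 1
            produced += 1
        for _ in range(consumer_rate):
            if consumed == total_blocks:
                break
            if filled == 0:
                waited_for_full += 1
                break
            filled -= 1
            consumed += 1
    return waited_for_full, waited_for_empty


# Slow client, fast drive: only "waited for full buffer" grows,
# and adding buffers does not change it.
print(simulate(1000, 8, producer_rate=1, consumer_rate=3))    # (999, 0)
print(simulate(1000, 512, producer_rate=1, consumer_rate=3))  # (999, 0)
# Fast client, slower drive: only "waited for empty buffer" grows.
print(simulate(1000, 8, producer_rate=3, consumer_rate=1))
```

In this model, extra buffers only help when the producer can outrun the drive at least some of the time; they cannot make a slow client send data faster.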

One way to test a client's minimum run time is to run bpbkar by hand, as outlined in this tech note:

https://www.veritas.com/support/en_US/article.100006447

It basically tells bpbkar to read the backup data and throw it away to /dev/null. So if it takes 6 hours for bpbkar to read all the data and throw it away, you will never be able to transfer the data to the NetBackup media servers faster unless the client itself is optimized.
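A similar read-and-discard measurement can be approximated outside NetBackup if you want a second opinion on the client's disks. A minimal sketch in Python, assuming you point it at a directory on the client; it is not bpbkar and does not reproduce its behavior (no NetBackup include/exclude lists, no snapshots), so treat the result only as a rough indication of raw read speed. The function name and chunk size are hypothetical choices:

```python
import os
import time

CHUNK = 1024 * 1024  # read in 1 MiB chunks (arbitrary choice)


def null_read(root: str) -> tuple[int, float]:
    """Read every file under root, discard the data, return (bytes, seconds)."""
    total = 0
    start = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(CHUNK):
                        total += len(chunk)  # count the bytes, then drop them
            except OSError:
                continue  # skip files that can't be opened (locked, no access)
    return total, time.monotonic() - start
```

For example, `nbytes, secs = null_read(r"D:\data")` followed by `nbytes / secs / 1e6` gives MB/s (the path is hypothetical). Running it twice on the same data can overstate speed, because the second pass may be served from the operating system's file cache.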

Best Regards

Nicolai

davidmoline · 01-18-2023

Hi @jayarc_derama

Further to the excellent advice already provided - can you change the backup to write multiple streams at once?

The backup looks like an MS-Windows policy with two volumes to back up. If both can run together and write to the same tape drive (increase the multiplexing on the STU), you should get better performance.
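As a back-of-envelope check, the number of multiplexed streams needed to keep one drive streaming is roughly the drive rate divided by the per-stream rate. A sketch under strong assumptions: every stream sustains the same rate, and running them together doesn't slow each one down (two volumes on the same client may share disks or controllers, in which case it will). The function name and example rates are hypothetical:

```python
import math


def streams_to_fill_drive(drive_mb_s: float, per_stream_mb_s: float) -> int:
    """Rough count of concurrent streams whose combined rate reaches drive_mb_s.

    Assumes each stream sustains per_stream_mb_s on its own and that
    running streams together does not slow any of them down.
    """
    if per_stream_mb_s <= 0:
        raise ValueError("per-stream rate must be positive")
    return math.ceil(drive_mb_s / per_stream_mb_s)


# Hypothetical: a 300 MB/s native drive fed by streams of about 150 MB/s each.
print(streams_to_fill_drive(300, 150))  # 2
```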

This should all be covered in the tuning guide already recommended.

Finally, you seem to have lots of disk available. Can you change the backup to write to disk first, then duplicate to tape? The backup should complete much faster, and the duplication to tape then doesn't affect client performance.

Cheers
David

jayarc_derama · 01-19-2023

Hi everyone, thank you for your feedback. I will read all the resources and links you have provided.

One thing I don't get: when I talked to the HPE engineer, he said our autoloader is rated for 300 MB/s native read/write to LTO-8 tape.

In our NBU environment, at first we struggled to even reach 100 MB/s, and changing the buffer size alone greatly improved the backup time, from 21 hours down to 12 hours.

That was before we applied these "magic settings" that improve NBU's backup performance; with them, it is now down from 12 hours to 6.5 hours. Hence fine-tuning is required, and we cannot go with NBU's out-of-the-box default settings.

We modified the following settings with these values:
* CHILD_DELAY = 5
* NUMBER_DATA_BUFFERS = 512
* PARENT_DELAY = 3
* SIZE_DATA_BUFFERS = 1048576

Questions:

1.) I only have 1 master server, connected via Gigabit LAN (copper) to the media/client server; media and client are on a single server box. So why is it buffering? Shouldn't it be constantly streaming data to the autoloader?

It makes no sense: why would HP design and sell hardware rated for 300 MB/s native write/read speeds to tape if the backup software cannot even get near those figures?

2.) I'm still fairly new to NBU, and I'm not sure about what davidmoline said about writing multiple streams. I think it would only work if the autoloader had 2 drives inside, right? My autoloader is a 1/8: 1 drive, 8 tape slots.

3.) Backup write speed is around 150 MB/sec or less. But when I restore, the tape runs at full speed, even exceeding its rated 300 MB/sec: my test restore reached 352 MB/sec. Why the difference? Is writing slow, but when the autoloader reads the tape and restores it back to disk, my server suddenly writes faster? I don't get it.

More reading ahead; I can't wait to become an NBU guru myself someday.

Nicolai · 01-19-2023

Hi @jayarc_derama

I think you got a couple of things wrong.

1: NetBackup collects incoming backup data into buffers that each make up a SCSI block; once a block of data is ready, it is sent to the tape drive. So it does stream. Disk drives also have buffers; they are just called cache.

Software can't run faster than the underlying hardware, and it can't make hardware run faster. If software doesn't run fast, look at the underlying infrastructure.

And you have sort of already told us what the problem is. 1 Gigabit is roughly 110 MB/sec. If you want to write at >300 MB/sec, you need at least a 10 Gigabit Ethernet connection.

2: Yes, it is good advice. Take a look at the disk staging storage unit (DSSU) in NetBackup.

https://www.veritas.com/support/en_US/article.100023197

The idea is that slow backup clients write their backup data to disk on the NetBackup master/media server instead of wasting precious tape drive time. Then, IF and only IF the disk on the NetBackup master/media server is capable of it, the tape drive writes the temporarily stored backup images to tape at full speed; in your case, 300 MB/sec.

3: You should be able to figure this out by now. Also remember that tape drives do compression, and the compression is built into the hardware.
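The arithmetic behind points 1 and 3 can be sketched as follows, assuming decimal units (1 MB = 10^6 bytes), no protocol overhead on the link (real TCP throughput is lower, hence the roughly 110 MB/sec above), and that the drive moves data on tape no faster than its native rate. The function names are hypothetical:

```python
def gbit_to_mb_s(gbit_per_s: float) -> float:
    """Raw link ceiling in MB/s (10**6 bytes/s), ignoring protocol overhead."""
    return gbit_per_s * 1000 / 8


def gbit_needed(target_mb_s: float) -> float:
    """Raw link speed in Gbit/s needed to carry target_mb_s, ignoring overhead."""
    return target_mb_s * 8 / 1000


def implied_min_compression(observed_mb_s: float, native_mb_s: float) -> float:
    """Smallest compression ratio consistent with an observed host-side rate,
    assuming the drive moves data on tape no faster than its native rate."""
    return observed_mb_s / native_mb_s


print(gbit_to_mb_s(1))      # 125.0  (1 GbE ceiling, before overhead)
print(gbit_needed(300))     # 2.4    (Gbit/s for 300 MB/s, before overhead)
print(f"{implied_min_compression(352, 300):.2f}:1")  # 1.17:1
```

So 300 MB/sec needs at least 2.4 Gbit/s of raw link capacity, beyond what 1 GbE offers, and a 352 MB/sec restore from a 300 MB/sec native drive implies the data was compressed at no less than about 1.17:1.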

Deb_Wilmot · 01-19-2023

I would suggest that you look at NET_BUFFER_SZ - a good article is here: https://www.veritas.com/support/en_US/article.100030830

Here is why I say that; from the job details you posted:
1. bptm (pid=16188) using 1048576 data buffer size
**This means SIZE_DATA_BUFFERS is currently 1048576. I'm just pointing this out because it is already large, at 1 mebibyte.

2.  bptm (pid=16188) waited for full buffer 219160 times, delayed 629651 times
**This means that the media server bptm process waited for a buffer to be full 219160 times, sometimes checking multiple times before a buffer was full (the multiple checks are why the delayed count is 629651). In other words, we are not receiving data from the client fast enough to keep the buffers full. With this said, I wouldn't recommend starting with the media server buffer sizes; instead, I would start with the network buffer size (NET_BUFFER_SZ).
**** It also appears that you may have network resiliency enabled based on the following in the job details:
nbjm (pid=16108) Resiliency information was persisted, job is resilient.
You may want to disable this as a test, as it can cause slow performance. See https://www.veritas.com/support/en_US/doc/18716246-157328516-0/v57758733-157328516 for details.

3. bpbkar32 (pid=11808) bpbkar waited 54283 times for empty buffer, delayed 132114 times
** What this tells us is that bpbkar waited 54283 times for a network buffer to empty, and had to check multiple times for an available network buffer. This indicates to me that the bottleneck is on the client side, and that there simply aren't enough, or large enough, network buffers. Hence, check out the network buffer size article I linked at the top of this reply. In addition, you may want to check the following to make sure the TCP send and receive space is configured correctly:
https://www.veritas.com/support/en_US/article.100016112
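If you want to track these counters from job to job, they can be pulled out of pasted job details. A minimal sketch, assuming the two message formats exactly as quoted above (other processes or NetBackup versions may word them differently); the function and pattern names are hypothetical:

```python
import re

# Patterns for the two message formats quoted in the job details above.
FULL_RE = re.compile(r"waited for full buffer (\d+) times, delayed (\d+) times")
EMPTY_RE = re.compile(r"waited (\d+) times for empty buffer, delayed (\d+) times")


def buffer_waits(job_details: str) -> dict[str, tuple[int, int]]:
    """Return {"full": (waited, delayed), "empty": (waited, delayed)} for
    whichever of the two messages appear in the pasted job details."""
    counts = {}
    for kind, pattern in (("full", FULL_RE), ("empty", EMPTY_RE)):
        match = pattern.search(job_details)
        if match:
            counts[kind] = (int(match.group(1)), int(match.group(2)))
    return counts


details = """\
bptm (pid=16188) waited for full buffer 219160 times, delayed 629651 times
bpbkar32 (pid=11808) bpbkar waited 54283 times for empty buffer, delayed 132114 times
"""

for kind, (waited, delayed) in buffer_waits(details).items():
    # "delayed" counts repeated checks, so delayed/waited is checks per wait.
    print(f"{kind}: waited {waited}, delayed {delayed}, "
          f"{delayed / waited:.2f} checks per wait")
```

Comparing the full-buffer count with the empty-buffer count across runs shows which side, client or drive, is more often left waiting after each tuning change.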

Good luck!
Deb

Nicolai · 01-20-2023

NET_BUFFER_SZ is deprecated. Please don't change the default value.

Best practices for NET_BUFFER_SZ and Buffer_size, why can't NetBackup change the TCP send/receive space

https://www.veritas.com/support/en_US/article.100016112

jayarc_derama · 01-21-2023

Thank you everyone for the response and information, I appreciate it.

Our setup does not rely on the network too much; this is what I requested when we were still meeting with Veritas Solutions.

The master server just commands the client/media server to start the backups.

All the heavy workload is done on the media server, where the autoloader is directly connected via an HBA SAS cable.

I spoke with Veritas regarding our environment, and they said that Veritas NBU can work with it. We just don't have a network fast enough to take full advantage of NetBackup yet.

So we will settle for backing up to tape for now. Maybe in a few years, if the company decides to upgrade our network infrastructure, we could have 10 Gig network speeds.

For now, optimizing NBU's performance via software settings seems to work.

VOX
