Forum Discussion

backup-botw
10 years ago
Solved

Buffer Size Data for My Site

I am sure this has been answered 100 times over, and I have read a bunch of different documents, forum posts, whitepapers, and anything else I could find. In my environment we appear to get very poor throughput during the first weekend of each month, when we are busiest. Based on another recommendation I have set up a duplication window so duplications run during the day and don't interfere with backups, and the replication SLPs run 24x7. What I am seeing in bptm is write speeds of less than 100 MB/sec when duplicating images to tape. I have 4 dedicated LTO 6 drives which should be capable of upwards of 160 MB/sec, so I am quite certain the environment simply is not tuned to accommodate our requirements.

I have found one website and a best practices document from Vision that give the settings below as a starting point.

  • SIZE_DATA_BUFFERS = 262144
  • SIZE_DATA_BUFFERS_DISK = 1048576
  • NUMBER_DATA_BUFFERS = 256
  • NUMBER_DATA_BUFFERS_DISK = 512
  • NET_BUFFER_SZ = 1048576
  • NET_BUFFER_SZ_REST = 1048576

Currently, nothing is tuned on the media servers that have the disk pools attached, and I am really not sure where to begin to ensure this is done correctly.
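From everything I have read, these are just plain-text touch files, so if I do apply the recommended values I believe it comes down to something like this (paths are the standard UNIX locations from the tuning guide, and note that the NET_BUFFER_SZ files live one directory above the others; please correct me if I have this wrong):

    CFG=/usr/openv/netbackup/db/config

    # Tape buffers read by bptm when the next job starts
    echo 262144  > $CFG/SIZE_DATA_BUFFERS
    echo 256     > $CFG/NUMBER_DATA_BUFFERS

    # Buffers for disk storage units
    echo 1048576 > $CFG/SIZE_DATA_BUFFERS_DISK
    echo 512     > $CFG/NUMBER_DATA_BUFFERS_DISK

    # Network buffers sit in the parent directory, not db/config
    echo 1048576 > /usr/openv/netbackup/NET_BUFFER_SZ
    echo 1048576 > /usr/openv/netbackup/NET_BUFFER_SZ_REST

If I understand the tuning guide correctly, new jobs pick these up without a restart, and SIZE_DATA_BUFFERS also changes the block size written to tape, so it only affects newly written media.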

I do see, in /usr/openv/netbackup/db/config on the 3 media servers attached to disk pools, the two files below with the following settings...

DPS_PROXYDEFAULTRECVTMO - 3600

DPS_PROXYDEFAULTSENDTMO - 3600

My master server, which is not connected to a disk pool, does have some settings configured, as follows...

/usr/openv/netbackup

NET_BUFFER_SZ - 262144

And...

/usr/openv/netbackup/db/config

NUMBER_DATA_BUFFERS - 64

NUMBER_DATA_BUFFERS_RESTORE - 64

SIZE_DATA_BUFFERS - 262144
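To baseline things before I change anything, I have been dumping whatever touch files already exist on each server with a quick loop like this (the file patterns are just my guess at what is relevant):

    for f in /usr/openv/netbackup/db/config/*DATA_BUFFERS* \
             /usr/openv/netbackup/db/config/DPS_PROXY* \
             /usr/openv/netbackup/NET_BUFFER_SZ*; do
        [ -f "$f" ] && printf '%s = %s\n' "$f" "$(cat "$f")"
    done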

All of these settings pre-date me. I have not set or adjusted any of them, and my predecessor was gone before I took this position, so I have no information on how these numbers were chosen. I do know that this site used to be a DR location that didn't do much; it is now the primary site, and the workload has quadrupled.

I have 4 LTO 6 drives attached to the media servers that hold the disk pools, and the drives are dedicated to duplication jobs so they don't interfere with backups. I plan to add a 5th in the near future once it has been zoned.

I have adjusted some of the SLP parameters, mainly focusing on the minimum and maximum duplication job sizes, to try to match the capacity of an LTO 6 tape.

slpparameters.png
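For what it's worth, the arithmetic I used (assuming roughly 2.5 TB native LTO 6 capacity and the ~160 MB/sec native rate) says a duplication job sized to one full tape runs for a bit over four hours:

    # 2.5 TB native capacity / 160 MB per second -> hours per tape
    awk 'BEGIN { printf "%.1f hours\n", 2500000 / 160 / 3600 }'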

My max concurrent jobs setting is currently at 55 for the 3 storage units that are attached to disk...

concjobs.png

But my max I/O streams setting is not enabled, so it is unlimited...

iostreams.png

I do have a memory issue on the master, which we are fixing as soon as the new memory we ordered arrives and I can install it. The media servers, however, have plenty of memory, so I don't see any OS performance issues on them at all. The memory is also the only thing preventing me from upgrading the site to 7.7.1; once it is in, I will get the upgrade done as well, which should resolve some of our issues.
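One thing I did work out for the memory planning: as I understand it, bptm takes its data buffers from shared memory, roughly SIZE_DATA_BUFFERS x NUMBER_DATA_BUFFERS per drive (times MPX when multiplexing). With the recommended values and my 4 dedicated drives (MPX of 1 is my assumption, since they only do duplications), that footprint is modest:

    # 262144 bytes x 256 buffers x 4 drives x MPX 1 -> MB
    awk 'BEGIN { printf "%.0f MB\n", 262144 * 256 * 4 * 1 / 1048576 }'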

I am pretty confident this is a performance tuning issue, because restarting NetBackup throughout the environment makes everything work much better. We also have zero issues throughout the month until we hit this monthly full weekend. When we checked write speeds to the LTO 6 drives yesterday, they were as low as 15 MB/sec, but after restarting everything this morning we are seeing 90 to 99 MB/sec. I still believe we should be faster, though, since these are LTO 6 drives.

I have also attached some bptm logs from 10-3 (when the monthly fulls started) and from 10-9, which is today. I can grab logs from the days in between if needed as well.
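If anyone wants a shortcut through those logs, the lines I have been focusing on are the buffer-wait summaries each job writes: "waited for full buffer" means the tape side was starved by the disk/MSDP side, and "waited for empty buffer" means the reverse. For example (log file names are assumed from the dates; adjust to your logging setup):

    grep "waited for full buffer"  /usr/openv/netbackup/logs/bptm/log.100315
    grep "waited for empty buffer" /usr/openv/netbackup/logs/bptm/log.100315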

  • @backup-botw - in another recent post, user deeps posted a link to this:

    NetBackup Media Server Deduplication (MSDP) in the Cloud

    https://www.veritas.com/support/en_US/article.000004584

...which contains some useful info about the requirements of MSDP media servers, and which is still relevant whether or not the MSDP media server sits in a cloud estate.

    How do the CPUs in your MSDP media servers match up to the minimum requirements?  (i.e. 8 cores of 2 GHz are better for MSDP than 2 cores of 8 GHz)

    Is your VNX array capable of delivering a sustained minimum of 3000 IOPS for MSDP?  The linked doc doesn't say whether this 3000 IOPS is purely for MSDP backup ingest, or whether it is enough to sustain bulk MSDP ingest and bulk MSDP re-hydration at the same time.  It seems unlikely to me that anyone would be using tape in the cloud, so I read 3000 IOPS as a minimum requirement for MSDP ingest only, and the addition of a re-hydration workload would therefore demand a higher level of sustained IOPS.  Then again, Symantec/Veritas could be playing a very conservative game by quoting quite a high minimum.
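    A quick way to check each MSDP host against that core-count minimum, assuming they are Linux boxes:

        lscpu | grep -E '^CPU\(s\)|MHz'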


    As a separate point, here's something that I've noticed when monitoring the disk underneath MSDP: the disk IO queue depth never gets very high.  It's as if the IO layer right at the bottom of MSDP is sensitive to response time, latency, and queue depth, and actively avoids swamping the disk driver/interface/HBA layer with lots of outstanding queued disk IO.  That means you can have a situation where the disks don't look overly strained and respond at what appears to be fairly nice levels, but because the VNX IO is actually not that responsive (from your graphs, around 10 ms rather than sub-3 ms), the CPU never gets swamped either; in all honesty it's spending most of its time waiting for the few disk IOs that have been queued to complete.  So the issue would appear not to be a disk issue, because you can't see oodles of disk IOs being delivered late or queueing up.  It's as if MSDP is actively trying to avoid tripping itself up.  I see MSDP make disks 100% busy, but the queue depth never gets very high; to me, the software is avoiding requesting too many in-flight disk IOs, with the visible net effect of not looking overstrained and not looking like a problem, but at the same time not doing very much, and yet so very capable of doing so much more if only the disk sub-system were more responsive (<3 ms).
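    If you want to see that pattern for yourself, watch the queue depth next to the latency and utilisation while a duplication runs.  With sysstat's iostat (the device name below is a placeholder for whatever backs your disk pool):

        # r/s + w/s  -> sustained IOPS (compare against the 3000 minimum)
        # await      -> per-IO latency in ms (MSDP really wants < 3 ms)
        # avgqu-sz   -> queue depth (stays low even when %util is ~100)
        iostat -dx sdb 5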


    Another thing to look for: are the media servers re-hydrating and duplicating/sending between themselves?  A quick way to check is to look for 'media-server-A -> media-server-B' in the 'Media Server' column of the Activity Monitor for the duplication jobs.  If you see '->', then the re-hydrated data is being sent, full fat, across the LAN from one media server (the MSDP source) to another (the tape writer).  This could slow things down horribly and could potentially swamp your LAN NICs and/or LAN switches, which are at the same time trying to move oodles of incoming backup data from clients.