
Buffer Size Data for My Site

backup-botw
Level 6

I am sure this has been answered 100 times over, and I have read a bunch of different documents, forum posts, whitepapers, and anything else I could find. In my environment we appear to be getting poor throughput during the first weekend of each month, when we are busiest. Based on another recommendation I have set up a duplication window to run during the day so the duplications don't interfere with backups, and the replication SLPs run 24x7. What I am seeing in bptm is less than 100MB/sec write speeds when duplicating images to tape. I have 4 dedicated LTO 6 drives which should be getting upwards of 160MB/sec, so I am quite certain that my environment is simply not tuned to accommodate our requirements.

I have found one website and a best practices document from Vision that show the settings below as a starting point.

  • SIZE_DATA_BUFFERS = 262144
  • SIZE_DATA_BUFFERS_DISK = 1048576
  • NUMBER_DATA_BUFFERS = 256
  • NUMBER_DATA_BUFFERS_DISK = 512
  • NET_BUFFER_SZ = 1048576
  • NET_BUFFER_SZ_REST = 1048576

Currently, on my media servers that have the disk pools attached, I have nothing tuned, and I am really not sure where to begin to make sure this is done correctly.
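If I understand the various docs correctly, these end up as plain-text touch files containing a single value - the data buffer ones under /usr/openv/netbackup/db/config and the NET_BUFFER_SZ ones under /usr/openv/netbackup - so on each media server it would be something like the sketch below (please correct me if I have this wrong):

  # data buffer touch files (per media server)
  cd /usr/openv/netbackup/db/config
  echo 262144  > SIZE_DATA_BUFFERS
  echo 1048576 > SIZE_DATA_BUFFERS_DISK
  echo 256     > NUMBER_DATA_BUFFERS
  echo 512     > NUMBER_DATA_BUFFERS_DISK

  # network buffer touch files
  cd /usr/openv/netbackup
  echo 1048576 > NET_BUFFER_SZ
  echo 1048576 > NET_BUFFER_SZ_REST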

I do see the below two files, with the following settings, in /usr/openv/netbackup/db/config on these 3 media servers attached to disk pools...

DPS_PROXYDEFAULTRECVTMO - 3600

DPS_PROXYDEFAULTSENDTMO - 3600

My master server, which is not connected to a disk pool, has some configuration set on it, as follows...

/usr/openv/netbackup

NET_BUFFER_SZ - 262144

And...

/usr/openv/netbackup/db/config

NUMBER_DATA_BUFFERS - 64

NUMBER_DATA_BUFFERS_RESTORE - 64

SIZE_DATA_BUFFERS - 262144

All of these current settings pre-date me. I have not set or adjusted any of them, and my predecessor was gone before I took this position, so I have no information on how these numbers were arrived at. I do know that this site used to be a DR location that didn't do much; it is now the primary site and the workload has quadrupled.

I have 4 LTO 6 drives attached to the media servers that are attached to the disk pools, and the drives are dedicated to duplication jobs so they don't interfere with backups. I plan to add a 5th in the near future once it has been zoned.

I have adjusted some of the SLP parameters, mainly focusing on the min and max size to try to match that of an LTO 6 tape.

slpparameters.png

My max concurrent jobs setting is currently at 55 for the 3 storage units that are attached to disk...

concjobs.png

But my max I/O streams setting is not enabled, so it is unlimited...

iostreams.png

I do have an issue with memory usage on the master, which we are fixing as soon as the new memory we ordered comes in and I can install it. The media servers, however, have a ton of memory, so I don't see any OS performance issues on them at all. The master's memory is also the only thing preventing me from upgrading the site to 7.7.1, so once that is done I will get the upgrade done as well, which should resolve some of our issues.

I am pretty confident that this is a lack-of-performance-tuning issue, because when we restart NetBackup throughout the environment everything starts working much better. We also have zero issues throughout the month until we get to this monthly full weekend. When we checked write speeds to the LTO 6 drives yesterday we saw them as low as 15MB/sec, but after restarting everything this morning we are seeing them in the 90 to 99MB/sec range. I still believe we should be faster given they are LTO 6 drives, though.

I have also attached some bptm logs from 10/3, when the monthly full backups started, and from 10/9, which is today. I can grab logs from the days in between if needed as well.
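For anyone skimming those logs, my understanding is that the lines to focus on are the buffer wait/delay summaries that bptm writes at the end of each job, so I have been grepping for something like this (a rough sketch only - the exact log file names will depend on how logging is configured):

  grep -i "waited for" /usr/openv/netbackup/logs/bptm/log.*

  # writer side: "waited for full buffer NNN times, delayed NNN times"
  # reader side: "waited for empty buffer NNN times, delayed NNN times"

If the tape-writing side is constantly waiting for full buffers, the drives are being starved by whatever is feeding them rather than by the tape path itself.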

1 ACCEPTED SOLUTION

sdo
Moderator
Partner    VIP    Certified

@backup-botw - in another recent post, user deeps posted a link to this:

NetBackup Media Server Deduplication (MSDP) in the Cloud

https://www.veritas.com/support/en_US/article.000004584

...which contains some useful information about the requirements of MSDP media servers, and which is clearly still relevant whether or not the MSDP media server is situated in a cloud estate.

How do the CPUs in your MSDP media servers match up to the minimum requirements?  (i.e. 8 cores of 2 GHz are better for MSDP than 2 cores of 8 GHz)
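(A quick way to sanity-check that on RHEL - a sketch only, compare the output against the minimums in the linked article:

  lscpu                                # sockets, cores per socket, CPU MHz
  grep -c ^processor /proc/cpuinfo     # total logical CPUs
)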

Is your VNX array capable of delivering a sustained minimum of 3000 IOPS for MSDP?  The linked doc doesn't delineate whether this 3000 IOPS is purely for MSDP backup ingest, or whether it is enough to sustain bulk MSDP ingest and bulk MSDP re-hydration at the same time.  To me it seems unlikely that anyone would be using tape in the cloud, so 3000 IOPS reads as a minimum requirement for MSDP ingest only, and the addition of an MSDP re-hydration workload would therefore demand a higher level of sustained IOPS.  Then again, Symantec/Veritas could be playing a very conservative game by quoting quite a high minimum.

.

As a separate point, here's something I've noticed when monitoring the disk underneath MSDP: the disk IO queue depth never gets very high.  It's as if the IO layer right at the bottom of MSDP is sensitive to response time, latency and queue depth, and actively avoids swamping the disk driver/interface/HBA layer with lots of outstanding, pending, incomplete queued disk IO.  Which says to me that you could have a situation where the disks don't look overly strained and respond at what appear to be fairly nice levels, but because the VNX IO is actually not that responsive (from your graphs, i.e. not sub-3ms and more around 10ms), the CPU isn't going to get swamped either - in all honesty it's spending most of its time waiting for the few disk IOs that have been queued/requested to complete.  And so the issue would appear not to be a disk issue, because you can't see oodles of disk IOs being delivered late or queuing up.  It's as if MSDP is actively trying to avoid tripping itself up.  I see MSDP make disks 100% busy, but the queue depth never gets very high - to me the software is avoiding requesting too many in-flight disk IOs - with the visible net effect of not looking overstrained and not looking like a problem, but at the same time not doing very much... and yet so very capable of doing so much more if only the disk sub-system were more responsive (<3ms).
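(If you want to watch that behaviour for yourself, the columns to keep an eye on in extended iostat output are the queue depth and the wait times - a rough sketch only, assuming the sysstat package is installed:

  iostat -xmt 5
  # avgqu-sz - average IO queue depth (the figure I see MSDP keep low)
  # await    - average time in ms an IO spends queued plus being serviced
  # %util    - how busy the device is

...watch it during a heavy duplication window and compare against a quiet period.)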

.

Another thing to look for... are the media servers re-hydrating and duplicating/sending between themselves?  A quick way to check is to look for 'media-server-A -> media-server-B' in the 'Media Server' column of the Activity Monitor for the duplication jobs.  If you see '->' then the re-hydrated data is being sent, full fat, across the LAN from one media server (MSDP source) to another media server (tape writer) - and this could slow things down horribly and could potentially swamp your LAN NICs and/or LAN switches, which are at the same time trying to move oodles of incoming backup data from clients.


33 REPLIES

sdo
Moderator
Partner    VIP    Certified

OS of the media servers?

backup-botw
Level 6

All 3 that are attached to disk are...

Red Hat Enterprise Linux Server release 6.7 (Santiago)

Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP Tue Mar 10 17:01:00 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

The master server is Solaris 10.

Current NetBackup version is 7.6.0.2.

Sorry, I should have posted all of that to begin with.

sdo
Moderator
Partner    VIP    Certified

No worries... IMO, in general, most performance issues nearly always come down to disk (either client side or media server side).   I know you've put a lot of detail up re buffers and settings and config, and that is really appreciated, as it really helps us understand the setup - in fact it really is a nice level of detail.

Now, the odd thing with performance investigations... is... that they are just that... "investigations".  The first step (you've done it above) is to establish the high-level application view of the base/original/starting point of the configuration.

We now need to dig a little bit deeper - and look at how that "captured config" is actually behaving... i.e. it's one thing to capture the settings... but another to capture the actual behaviour.

I'm not sure how you're going to take this... but I would first do some research on the "iostat" command and/or any other default tool that comes with your RHEL installation - and look at only one or two things initially (to rule them in or out - for now), during times of heavy load and lighter load (i.e. during the heavy monthly jobs, and the not-so-heavy weekly jobs)... specifically disk IO latency and disk IO queue lengths.  See if you can find some nice examples (in this forum, or on Google) of capturing the iostat numbers into a CSV file and then graphing them in Excel - or maybe there already exists a really nice tool for capturing and graphing?

I don't want to give any specific commands to run because, TBH, it would be much better - not only for you yourself, but also for the backup estate/environment and your business - if you research exactly which iostat commands you will use.  Once you start digging into the capabilities of these performance capture tools, you need to be able to adapt the switches/options of the tools to suit your environment - hence I'm not going to give you any specific advice like "you must run so-and-so command", etc.  It is better if you work out how best to use the performance reporting tools to suit your needs and to better match the config of the environment.
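(Purely to illustrate the shape of what I mean - not a command to copy blindly, you would still want to choose your own interval, duration, devices and output location - a capture skeleton might look something like:

  # extended, timestamped, per-device stats every 30 seconds for 8 hours,
  # into a file you can later pull into Excel or similar for graphing
  iostat -xmt 30 960 > /var/tmp/iostat_$(hostname)_monthly_full.txt

...but do read the iostat man page and make it your own.)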

In short - it is the disk behaviour I would look at first - because you have stated CPU and RAM are all good, and I think you'd kind of already know if you had any glaring network issues.

backup-botw
Level 6

Does something like this help? This is a report on the LUNs that make up the disk pool on this one media server. This is the one week average response time.

lunresponse008.png

sdo
Moderator
Partner    VIP    Certified

And please please take this the right way... when I see statements like:

"The media servers however have a ton of memory so I dont see any OS performance issues on them at all."

...we need to remember what it is that we mean by "OS" or "media server".  We know what an OS is (a set of binary files which can be executed by a CPU), and we know that an OS is nothing until it is executed... yet, to be executed, a general purpose commodity OS typically needs CPU, RAM and disk.  These are the fundamental building blocks of the modern general purpose sequential processing machine that we call a "computer".  Now, just because the media servers have a ton of memory does not necessarily mean they are capable of performing in an optimal manner.  Give a Cray supercomputer just a few shelves of slow disk - and said Cray will never perform to the expected level.

How about this... the best-used computer system is one that is so finely scoped/designed/spec'd/built/balanced/tuned that the CPU is 100% busy, AND the RAM is 100% used, AND the disks are 100% busy - yet it still gets the required job done within the required run window.  i.e. you've spent the minimum amount on hardware to get the job done.  A CPU being 100% busy is not necessarily a problem.  If it gets the job done, then is there a problem?  No.

On a heavily loaded system where our "expectations", of getting a job done within a specific time frame, are not being met... then usually at least one of the three basic fundamental building blocks will *always* be a bottleneck.  Now, whether you have cause to notice this bottleneck or not is another matter.  But, note this, there could be two (or more) concurrent bottlenecks, which is better phrased as... there could be two or more elements of the computing system which are both running at 100% at the same time.

This means that if one does not take a holistic view of the entire system, one can easily get lost chasing down one aspect of performance... and invest time and money in effort and hardware addressing the "discovered limitation"... only to find later on that the situation has not improved at all, because another (second) element was already also at around 100% utilization - and so improving only one area/facet/element resulted in no appreciable benefit.

sdo
Moderator
Partner    VIP    Certified

Ok - re that response time graph...

Firstly, I'm not really sure that I can (or should) say "this is good" or "that is bad"... I don't think that's the way we should play this.  I think it would be better if we think about helping you consider things, i.e. help you make the deductions and decisions.

Secondly, a performance investigation is more like a game of chess.  You have an opponent, and he/she is constantly changing position - and you need to keep an eye on the whole game whilst attacking certain specific areas.  What I'm trying to say is...

1) Smoothed average graphs over such a long period are great for managers - not good for us.  We need much, much finer detail.  Also, you don't state whether that graph covers a good time, a bad time, or an indifferent time.  The "weekly average" graph is like looking at a mountain from a distance and deciding that there is no gold to be mined, just because we can't see the rich seam from such a great distance.

2) We don't have any idea of how said storage is configured, connected or shared - nor what it really is.   Do you see why I'm so reluctant to actually say anything about that graph?  :)

3) A response time (latency) graph on its own is not much use (BTW - I'm not asking you for more graphs) - you need the average IO packet sizes (to better understand the application), the IO queue depth (to better understand the OS driver) and the average MB/s (to better understand throughput) - all together, but in much finer detail, covering the actual time periods of "acceptable load" and "unacceptable performance".

4) In short you need a raft of different graphs and views at your fingertips whilst you press on.

backup-botw
Level 6

To be perfectly honest I don't know how any of that helps me resolve my current situation. I am looking for help with tuning my current NetBackup environment based on the objects I have in it. I can provide information on the current CPU and RAM allocated to the media servers and report on their usage if required. I appreciate the responses, but so far I have not seen much by way of helping resolve the issue I am facing. Every search I have done for this issue has turned up buffer-related findings.

And I do agree here...

"Now, just because the media servers have a ton of memory does not necessarily mean they are capable of performing in an optimal manner."

What I am saying is that the server is functioning without issue, but the application installed on it is not. I need to tune/correct the issues with the application. In this instance the application installed on it is NetBackup.

sdo
Moderator
Partner    VIP    Certified

Look more closely at disk.  If the CPU is not 100% and the RAM is not 100%, then it's probably disk.  You might be getting 10ms response times - but if the IO queue depth is 2+, then for every IO in flight you already have at least another waiting.  Do you see how IO response times on their own can be misleading?
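As a rough back-of-the-envelope illustration (made-up numbers): at 10ms per IO with only one or two IOs in flight, a LUN can complete only around 100 to 200 IOs per second; if those are, say, 256KB reads, that is roughly 25 to 50 MB/s of re-hydration read from that LUN - which is uncomfortably close to the tape write speeds you are reporting.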

backup-botw
Level 6

What I am looking for is information on how to collect the required data in order to tune NetBackup to my environment.

backup-botw
Level 6

I don't, as I don't know how to ascertain any part of this sentence...

"You might be getting 10ms reponse times - but if the IO queue depth is like 2+ then for every IO in flight yopu already have at least another waiting."

I am attempting to be very clear and precise here that I am looking for assistance/guidance with how to get the needed information in order to tune my environment.

I am assuming you have been through this process before? Can you share what was done in order to tune your environment or one you have worked on?

sdo
Moderator
Partner    VIP    Certified

OK - 27 LUNs.  How big are the LUNs?  What is the connectivity between media server and disk storage?  DAS?  SAN?  iSCSI?  FC?  SAS?  SCSI?  What are the HBAs?  Speeds?  Multi-pathed?  Multi-fabric?  SAN switch topology?  Switch ports confirmed zero errors?  If SAN FC, have you confirmed that buffer to buffer credit exhaustion is not occurring?
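(Most of the HBA detail can be pulled straight from sysfs on RHEL without any extra tools - a rough sketch, assuming standard FC HBAs:

  cat /sys/class/fc_host/host*/speed        # negotiated link speed per initiator
  cat /sys/class/fc_host/host*/port_state   # should be "Online" for every path
  cat /sys/class/scsi_host/host*/proc_name  # driver in use, e.g. lpfc for Emulex

...the switch-side questions will need answers from whoever manages the fabric.)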

Depending upon the above the next set of questions changes.

WVT
Level 4
Partner Accredited Certified
You said the master has 262144 for the buffer size in ....../db/config - what about the three media servers?  Please confirm this: tape is slow, but when you bounce everything and the tape dupe jobs are the only thing running, it is much faster?

backup-botw
Level 6

The 3 media servers don't have any buffer settings configured. Only the two files that I mentioned in my initial post show up on all 3.

When we bounced everything today we had leftover replication jobs and duplication jobs yet to run. When they started running after the reboot yes we are indeed seeing them run faster at this time.

On a regular basis we have no issues with backup jobs, restores and replication jobs running simultaneously, but once we introduce the duplications to the mix during that first weekend of each month it seems to cause issues.

WVT
Level 4
Partner Accredited Certified
And that is the only time you cut fulls to tape, right?

backup-botw
Level 6

Yes, that is correct. We have a few policies that run on the weekend that dump some database stuff off to tape, but it's very minimal and we have never seen an issue on a normal full weekend. Only when those duplication jobs come in.

WVT
Level 4
Partner Accredited Certified
So you are rehydrating a lot of data in order to copy it to tape, and it has to dig through a big tree of segments to assemble the required data. Sorry, I don't have my correct glasses on, but are your PureDisk pools on an EMC array? Do you have dedicated spindles for your pools? Are you storing backup data on the same array that your clients use for applications?

backup-botw
Level 6

Yes we are on a VNX 5600.

We aren't using pools for this setup... just a bunch of RAID groups. 31 RAID groups... all RAID 6.

No client data is on this array...only the backup related storage.

backup-botw
Level 6

How big are the LUNs?  Sizes are in GB:
LUN 100  1000
LUN 101  5000
LUN 102  5000
LUN 200  1000
LUN 201  5000
LUN 202  5000
LUN 300  1000
LUN 301  5000
LUN 302  5000
LUN 400  1000
LUN 401  5000
LUN 402  5000
LUN 500  1000
LUN 501  5000
LUN 502  5000
LUN 1600 1000
LUN 1601 5000
LUN 1602 5000
LUN 1900 11264
LUN 1901 10747.571
LUN 2400 11264
LUN 2401 10747.571
LUN 2500 11264
LUN 2501 10747.571
LUN 2800 11264
LUN 2801 10747.571
LUN 3000 1000

What is the connectivity between media server and disk storage?  DAS?  SAN?  iSCSI?  FC?  SAS?  SCSI?
SAN FC

What are the HBAs?
Emulex AJ763B/AH403A FV1.11A5 DV10.2.8020.1

Speeds?
8Gb

Multi-pathed?
I believe it is VxDMP... not sure on this one or what you are looking for exactly. Each server does have 4 initiators.

Multi-fabric?
???

SAN switch topology?
???

Switch ports confirmed zero errors?
Yes

If SAN FC, have you confirmed that buffer to buffer credit exhaustion is not occurring?
I don't know what this is or how to get the information.

sdo
Moderator
Partner    VIP    Certified
Can you detail how the LUNs are formed into volumes at the host layer?  Is the VNX shared amongst several backup servers?  Are the backup server initiators and the storage array targets cabled in to the same SAN switch?  SAN switch model please?  When I asked for topology, I meant please describe/list exactly how the servers and storage are cabled and zoned.  If you do not have this level of detail then you will never be able to rule out poor patching.  Why are the LUNs numbered the way they are?  Why are the LUNs all different sizes?  Why are some of the LUNs so big?  You say there are 31 x RAID-6 parity groups - please can you explain how these are formed?  Was this array previously used for something else?  I ask because I would have expected to see some kind of uniformity and symmetry in a single-purpose storage array.  Does the VNX have any form of write caching, and are you able to monitor the consumption / usage levels of this write cache?  Do you have any monitoring tools to investigate:
  • the front end FC ports of the array
  • cache levels and hits
  • hot ports, hot LUNs, hot parity groups, hot back-end SAS buses
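(If it helps with the host-layer picture, something along these lines shows how the LUNs are presented and carved up - adjust for whichever volume manager is actually in use:

  df -hT                 # mounted filesystems, types and sizes
  cat /proc/partitions   # every block device the OS can see
  vxdisk list            # if Veritas Volume Manager / VxDMP is in use
  vxprint -ht            # VxVM disk group / volume / plex layout

...the array-side questions - RAID group layout, write cache, hot ports - will need the VNX management tools.)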