Backup policy recovering speed slowly (disk)

_charm_pi
Level 2

Dear NetBackupers,

We have a NetBackup 8.1.1 server and mostly Solaris clients. The server is connected to the clients over a fibre-optic LAN, and the backup storage, a Dell EMC Data Domain 2500, is mounted on the NetBackup server via NFS, also over fibre optics. Our issue is that two specific policies interleave, and the one that starts first loses its backup speed and never reaches its maximum potential, so a policy that usually finishes in 20 hours takes around 30 hours.

How could we configure client or server settings so that when one of the affected policies finishes, the other regains its maximum potential speed?
We have done server tuning as recommended by the backup storage vendor, as follows:

-----------------------------------------------------------------------------------------------------

✔Include the following lines in /etc/system

for TCP:

set ndd:tcp_wscale_always=1
set ndd:tcp_tstamp_if_wscale=1
set ndd:tcp_max_buf=16777216
set ndd:tcp_cwnd_max=8388608
set ndd:tcp_xmit_hiwat=2097152
set ndd:tcp_recv_hiwat=2097152

for NFS:

set nfs:nfs3_nra=16
set nfs:nfs3_max_transfer_size=1048576
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_threads=256
set rpcmod:clnt_max_conns=64

✔Add the NFS client mount options "rsize=1048576,wsize=1048576"


✔NFS patch 147268-01 or kernel patch 147440-05 (or later) is installed for Solaris.
----------------------------------------------------------------------------------------------------------------
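For reference, a sketch of how the rsize/wsize mount options above might appear in the media server's /etc/vfstab (the Data Domain hostname, export path and mount point below are placeholders; the other options are common Solaris NFS choices, not part of the vendor guidance):

# /etc/vfstab entry on the NetBackup media server - hostname, export and mount point are placeholders
dd2500:/data/col1/nbu  -  /ddnfs/nbu  nfs  -  yes  rw,hard,rsize=1048576,wsize=1048576,vers=3,proto=tcp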

The changes were made on the NetBackup media server because we had some trouble with policy speeds. Speeds improved afterwards, but we are still seeing the problem described above when those two policies interleave, and in my opinion the backup speed does not recover quickly enough once one of the policies finishes.

Should we make the recommended changes on the NetBackup clients as well, so that backup speeds recover more quickly?

TLDR: We can reschedule the policies with little effort, but how do we configure them to achieve maximum speed in less time?

 

4 REPLIES

sdo
Moderator
Partner    VIP    Certified

If job A runs on its own then it completes within 20 hours.

If job B runs on its own then it completes within 20 hours.

If job A and job B overlap and run together for any reasonable length of time, then one or both of them complete within 30 hours - and the overall average speed is reported as slower (i.e. the speed reported in the Activity Monitor of the NetBackup Admin Console).

If the above is true, then the lower reported average speed would be entirely normal, because the speed reported by NetBackup is only ever the total data moved divided by the time elapsed, and is not (AFAIK) the actual current speed.  During any backup job there will be times when the backup is actually running faster than that average, and times when it is running slower - but the reported average only changes gradually, the longer the backup runs faster or slower.
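A quick illustration with hypothetical numbers: if a job has averaged 50 MB/sec for 20 hours and then genuinely speeds up to 150 MB/sec for the next hour, the reported average only creeps up to about 55 MB/sec:

# running average after 20 h at 50 MB/sec followed by 1 h at 150 MB/sec (hypothetical numbers)
awk 'BEGIN { printf "%.1f MB/sec\n", (20*50 + 1*150) / 21 }'
# prints 54.8 MB/sec, even though the job is currently moving data at 150 MB/sec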

But also, look at it this way... if job A runs in 20 hours on its own, and job B runs in 20 hours on its own, but when job A and job B start at the same time they both take nearly 40 hours, then this tells us that there is at least one bottleneck at some point along the data movement path.

IMO, your only course of action is a thorough review of all elements of the infrastructure. I wouldn't bother tweaking any configuration settings of anything until I understood exactly where the bottleneck is.

e.g. along the entire data path, all the way from the source data on the source disk to the target backup data on the target storage unit, check all of these:

1) If the CPUs of any NetBackup server, any active NetBackup client, and the Data Domain are NOT maxed out during times when both job A and job B are running, then that rules out a CPU limitation.

2) then check memory

3) then check disk LUN throughput

4) then check throughput on every SAN interface

5) then check throughput on every NIC interface and bond


The truth is that no finite network system ever has limitless throughput capacity, i.e. all networks always have at least one bottleneck somewhere.
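As a sketch of what that review might look like on the Solaris side while job A and job B overlap (Solaris 11 commands; on older releases substitute netstat -i and kstat for dladm):

# run on the media server (and, where relevant, the clients) while both jobs are active
mpstat 5                  # CPU: are any/all CPUs pegged?
vmstat 5                  # memory: free memory and scan rate under pressure?
iostat -xnz 5             # disk LUNs: %b near 100, or service times climbing?
nfsstat -c                # NFS client counters for the Data Domain mount
dladm show-link -s -i 5   # per-NIC throughput (Solaris 11)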

Nicolai
Moderator
Partner    VIP   

How many streams are you running concurrently on the Data Domain?

What DDOS version are you running?

Do you have the Boost license?

From my experience, even though Dell EMC claims the system can handle many hundreds of concurrent connections, the ingest maximum is reached at around 60 concurrent connections.

One indication of a bottleneck is the average transfer speed: if all backups to the Data Domain have roughly the same transfer speed in KB/sec, then there is almost certainly an infrastructure bottleneck.
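A rough way to check: take the kilobytes written and the write time from each job's details, work out the average yourself, and compare across jobs that ran together (both values below are placeholders):

# average KB/sec for one job = kilobytes written / write time in seconds
# replace the placeholder values with the figures from each job's details
awk 'BEGIN { kbytes=1500000000; secs=36*3600 + 45*60 + 16; printf "%.0f KB/sec\n", kbytes/secs }'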

Will try as suggested and report back to the forum.

Hi Nicolai,

When the affected policies interleave, the maximum is 48 streams, or 24 per policy. We don't have a Boost licence; the DDOS version is 6.1.2.30-615547.

Nicolai, is there any way to configure the policies so that, if they do interleave for some reason, policy B, which is still running (policy A starts earlier than the other), recovers its maximum speed once policy A finishes?

This is what we are getting in the output of the finished job (policy A):

• begin writing
• Info bpbkar (pid=2579) 1 entries sent to bpdbm
• Info bpbkar (pid=2579) bpbkar waited 638919 times for empty buffer, delayed 116487240 times
• Info bptm (pid=7200) waited for full buffer 17241 times, delayed 63729 times
• Info bptm (pid=7200) EXITING with status 0 <....
• Info bpbrm (pid=2571) validating image for client enmlombs3
• Info bpbkar (pid=2579) done, status: 0: the requested operation was successfully completed
• end writing; write time: 36:45:16

It seems that the streams from policy A cannot reach their maximum potential when the policies interleave. Does this depend on the hardware?
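For reference, the bpbkar "waited for empty buffer" and bptm "waited for full buffer" counters above are the usual starting point for NetBackup buffer tuning. A minimal sketch of the media server touch files commonly involved for disk storage units such as an NFS-mounted Data Domain (the values are examples only, not a recommendation, and memory use grows with buffers x buffer size x concurrent streams):

# on the NetBackup media server - example values only, validate in a test window
echo 256     > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
echo 1048576 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK
# after changing, re-run the jobs and compare the wait counts: very high bpbkar
# "waited for empty buffer" counts (as in the output above) point at the media
# server/storage side, while high bptm "waited for full buffer" counts point at
# the client/network side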