
Media server hardware replacement leads to slower tape throughput

JeanB
Level 4

I replaced the hardware for a media server according to TECH77811 and am now getting much slower throughput on tape backups.  The master is RHEL 6.3 running NB 7.6.0.4.  The original media server was a Solaris 10 Sun Fire X4270 with a 4 x 1Gb aggregated NIC, running NB 7.5.0.7.  The new media server is an HP DL380p G8 with a 10Gb NIC, RHEL 6.3, running NB 7.6.0.4 (I actually installed 7.5.0.7 prior to the hardware swap, added the media server to the 7.6.0.4 master, upgraded to 7.6.0.1, and then to 7.6.0.4).  No changes on the LTO4 robot.

Clients are running a mix of 6.5.4 and 7.5.0.7.  What is happening is that many, if not all, backups are running at 7 MB/s, where they previously ran at 30 MB/s on the Sun media server.

On the new media server:

 sysctl -a | grep kernel.sem
kernel.sem = 300        307200  32      1024

I played around with SIZE_DATA_BUFFERS = 229376 and 262144, and NUMBER_DATA_BUFFERS = 10 and 128, respectively, but I haven't noticed any improvements.  Note that the original Sun media server had neither a SIZE_DATA_BUFFERS nor a NUMBER_DATA_BUFFERS file (so it was using defaults).
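For reference, SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS are just plain-text files under /usr/openv/netbackup/db/config on the media server; I set them with something like this (new bptm jobs pick the values up):

echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
echo 128 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS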

Backups of the media server's own files straight to tape give 80MB/s, so this does not appear to be an HBA issue.

I'm running out of ideas on where and what to look for.  Support says they can't help me unless I hire Support services.

 

26 REPLIES

JeanB
Level 4

Note that the new media server's hostname and IP remain the same as the previous media server's.

mph999
Level 6
Employee Accredited

OK, so:

"Backups of the media server's files straight to tape give 80MB/s so this does not appear to be a HBA issue."

So 80MB/s is 'reasonable' - perhaps not the best possible, but certainly not 'slow'.

A buffer size of 262144 and a buffer count of 128 or 256 (try both) should give reasonable performance.

So we'll call 80MB/s the baseline.

If client backups are going slower, it's not the buffer settings: the backup of the media server itself shows what is possible, so it's something between the client and the media server.

Usually this is either the read speed of the clients or the network.  Given that this is multiple clients, it's not going to be read speed, so the most likely culprit is the network.

The network is two parts:

1.  The switches and cables

and the part that is most often forgotten:

2.  The TCP 'stack' within the OS

Activity Monitor for a completed slow job should show the number of waits/delays.  I suspect it will show a large number for bptm waiting for a full buffer - which simply confirms the issue is on the client > media server side, as opposed to the media server > tape drive side.
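If bptm legacy logging is enabled on the media server (i.e. the log directory exists), you can also pull the counts straight from the log, something like:

grep "waited for full buffer" /usr/openv/netbackup/logs/bptm/log.*

A large 'waited for full buffer' count points at the client/network side; a large 'waited for empty buffer' count would point at the tape side.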

So questions:

In /usr/openv/netbackup/db/config, is there a file called NET_BUFFER_SZ, and if so, does it have a value in it?  If not, create the file and put 0 (zero) in it.
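For example, on the media server:

echo 0 > /usr/openv/netbackup/db/config/NET_BUFFER_SZ
cat /usr/openv/netbackup/db/config/NET_BUFFER_SZ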

Does that change things (better / worse / same)?

Another test to do is FTP some data from a 'slow' client to the media server and see how fast that goes.  Use a fairly decent sized file for this.
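Something like this, with the hostname as a placeholder - FTP reports its own transfer rate after the 'put', or you can time an scp (bearing in mind scp adds cipher overhead):

dd if=/dev/urandom of=/tmp/xfer_test.dat bs=1M count=1024
time scp /tmp/xfer_test.dat mediaserver:/tmp/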

Next:

Note - This is trial and error; it may be necessary to get a tcpdump between the client and media server and put that through Wireshark to really see what is going on ... but ...
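For example, capturing headers only on the media server's backup interface while a slow job runs (interface and client names are placeholders):

tcpdump -i eth2 -s 96 -w /tmp/slow_client.pcap host slowclient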

These settings are from Symantec Engineering for OS tuning on an NBU Appliance.  I know that's not what you have, but the Appliance runs Linux (SUSE) - so it is at least reasonable to consider these settings.

I'm not sure if these settings translate exactly the same to RHEL; I'll leave that for you to check (it's late here and I need to go to bed).

net.ipv4.tcp_max_orphans = 400000
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 1
 

Also increase the ring buffer RX setting on eth2 (or eth1, if that's what you use):

ethtool -G eth2 rx 4096
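You can check the current and maximum ring sizes first with:

ethtool -g eth2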

Hope this at least helps in some way.

Martin

Nicolai
Moderator
Moderator
Partner    VIP   

You likely have more than one issue.  First, make sure the network can actually carry the new load.

By default we add these variables to /etc/sysctl.conf.  Apply them with: sysctl -p /etc/sysctl.conf

net.ipv4.tcp_keepalive_time=1800
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.ipv4.tcp_rmem = 8192 524288 16777216
net.ipv4.tcp_wmem = 8192 524288 16777216
net.core.netdev_max_backlog = 30000
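
You can confirm the values took effect with, for example:

sysctl net.core.rmem_max net.ipv4.tcp_rmem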

Our media servers do 5-8GB with these settings, so they should work for you too.

Verify that the tape can write at full speed.  There is a random-data directive in NetBackup that can help you generate data in memory for performance base-lining:

http://www.symantec.com/docs/TECH75213

JeanB
Level 4

I did a GEN_DATA test before and after applying Nicolai's tuning params and got no change.  But...

But if I send the backup to another Sun media server (very similar to the one replaced), the numbers talk.

To the new media server I got 5MB/s; to the other media server I get 57MB/s (with an LTO4 drive).  The other media server has a 2Gb aggregate NIC (2 x 1Gb).

So the problem is the network.  I also had the MTU set to 1500 on the 10Gb NIC, which I changed to 9000... no effect.

 NET_BUFFER_SZ set to 0 has no effect either.

/usr/openv/netbackup/bin/bpbackup -p bd205-user -s User -w GEN_DATA GEN_KBSIZE=2000000 GEN_MAXFILES=10 GEN_PERCENT_RANDOM=100 

RonCaplinger
Level 6

Are the switch ports for the 10G connection set to 10G (not 10/100 default) ?

Are you using 10G SFP's for the network connections to this media server?

Is the network connection set to Full Duplex across switches and media server?

What about the NIC's report of dropped packets?  If the NIC is dropping packets, you might have an overheating NIC.
 

Nicolai
Moderator
Moderator
Partner    VIP   

JeanB - Talk to your network admin.

Backup servers require a non-blocking switch.  No Cisco FEX either.

Verify network performance by using iperf, and bring the numbers to the network admin if they reveal a problem.

http://www.symantec.com/docs/HOWTO64302
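
A basic test looks something like this (hostname is a placeholder):

# on the media server
iperf -s

# on a slow client: 30 second test with 5 second interim reports
iperf -c mediaserver -t 30 -i 5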

Do some cross tests across media servers - often the tests show that a certain switch, data center, or subnet does not deliver the required performance.  By providing these observations to the network admin, you show you did the homework before bugging him.

sdo
Moderator
Moderator
Partner    VIP    Certified

Saw that you, at one point, changed to using jumbo frames - this is a very bad idea on a network which is not all jumbo frames.  One of the requirements of a jumbo-frame network is that *ALL* devices on the network use jumbo frames.  Another thing is that not all devices can talk 9000-byte jumbo frames.  What you may find is that some devices can only talk 8500, or 8000 - and if so, then all devices on the network have to be set to the smallest of the supported jumbo frame sizes.  Anything else will always cause lots of overhead or lost/dropped packets.
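You can check whether a path really carries jumbo frames end-to-end with a don't-fragment ping from the media server (8972 bytes of payload = 9000 minus 28 bytes of IP/ICMP headers; hostname is a placeholder):

ping -M do -s 8972 -c 3 slowclient

If that fails while a normal ping works, something in the path is not passing 9000-byte frames.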

I like Nicolai's point re "non-block switch"... and I'll try to describe what he means... Not all LAN or SAN switches are created equal.  For bulk data movers like NetBackup - it is highly recommended to put NetBackup media servers (data movers) on non-blocking LAN and SAN switch ports - and usually as close to the core of the network as possible (but not always).

What we mean by non-blocking is ensuring that the switch port is never contended.  So, for example, let's assume you are using a LAN switch which has forty 10Gb ports, all forty are in use, and the NetBackup media server has one of them... what may be happening in the back-end is that the switch ports are in what are known as 'port groups', and it is the 'port group' which is capable of sustained 10Gb - so worst case, four 10Gb ports in a 4:1 port group all being busy means that no single port can ever sustain more than 2.5Gb.  And network traffic being network traffic, it is very bursty (at the millisecond level it always appears bursty) - so when a NetBackup media server NIC port resides in a contended port group on a switch, it is regularly unable to burst to the full 10Gb, and the switch has to work lots of overtime managing this.

In my experience, most models of Brocade SAN switches are non-blocking - i.e. all ports, and I do mean all ports, on a Brocade SAN switch can always run at max speed - e.g. Brocade SAN switches typically allow every single port to run fully sustained 8Gb transmit (Tx) and 8Gb receive (Rx), always and concurrently - i.e. they are not contended.

Whereas I have regularly seen LAN switches from other vendors where the ports are in contended port groups of, for example, 4:1 contention - and so what we typically see is, say, a LAN switch module of forty 10Gb ports but only ten actually connected - i.e. only every fourth port is plugged in, and thirty switch ports forever sit idle (with no SFPs plugged in :) - because we need to be sure that all 10Gb ports can always burst to, and sustain, 10Gb of data traffic, so that backup traffic is never contended - not ever.  Design thinking like this is the only way to build a network that is truly capable and can therefore never be impacted by traffic loads.

Always check your switch vendor documentation for port contention.  I saw 8:1 on a LAN switch once - but I'm not naming names.

sdo
Moderator
Moderator
Partner    VIP    Certified

Your old Sun server was using 4 x 1Gb aggregated NICs, and your new HP server is using at least one 10Gb NIC.  I assume therefore that the LAN switch is also different - if so, is it further (more hops) away from the backup clients?  Re: the inter-switch links (i.e. the route of hops) between your new 10Gb LAN switch and the switches where the clients are - are these trunks/links swamped?  Maybe even your core LAN switch is suffering port contention... e.g. if you have four 10Gb 'edge' LAN switches all plugged into one 'core' LAN switch port group, that is also a recipe for performance issues.

JeanB
Level 4

I can't explain why two backups from the same client to the same tape and same drive run at two very different speeds - one at 52MB/s and the other at 3MB/s.  Below are the session logs:

KB per second : 52785  KB written: 2040811

03/22/2015 07:16:37 - Info nbjm (pid=54914) starting backup job (jobid=288945) for client VL-MO-APG104-FAE2, policy AUTO_ORA_APG, schedule Default-App-6m
03/22/2015 07:16:37 - Info nbjm (pid=54914) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=288945, request id:{E81652C8-D084-11E4-A4CE-D1DE63791BF7})
03/22/2015 07:16:37 - requesting resource MO-BK005-SL500-1
03/22/2015 07:16:37 - requesting resource vl-mo-bk001.ip.videotron.ca.NBU_CLIENT.MAXJOBS.VL-MO-APG104-FAE2
03/22/2015 07:16:37 - requesting resource vl-mo-bk001.ip.videotron.ca.NBU_POLICY.MAXJOBS.AUTO_ORA_APG
03/22/2015 07:16:37 - granted resource  vl-mo-bk001.ip.videotron.ca.NBU_CLIENT.MAXJOBS.VL-MO-APG104-FAE2
03/22/2015 07:16:37 - granted resource  vl-mo-bk001.ip.videotron.ca.NBU_POLICY.MAXJOBS.AUTO_ORA_APG
03/22/2015 07:16:37 - granted resource  VL0524
03/22/2015 07:16:37 - granted resource  HP.ULTRIUM4-SCSI.005
03/22/2015 07:16:37 - granted resource  MO-BK005-SL500-1
03/22/2015 07:16:37 - estimated 0 kbytes needed
03/22/2015 07:16:37 - Info nbjm (pid=54914) started backup (backupid=VL-MO-APG104-FAE2_1427022997) job for client VL-MO-APG104-FAE2, policy AUTO_ORA_APG, schedule Default-App-6m on storage unit MO-BK005-SL500-1
03/22/2015 07:16:38 - Info bpbrm (pid=52669) VL-MO-APG104-FAE2 is the host to backup data from
03/22/2015 07:16:38 - Info bpbrm (pid=52669) telling media manager to start backup on client
03/22/2015 07:16:38 - Info bptm (pid=52671) using 229376 data buffer size
03/22/2015 07:16:38 - Info bptm (pid=52671) using 10 data buffers
03/22/2015 07:16:38 - Info bpbrm (pid=52669) spawning a brm child process
03/22/2015 07:16:38 - Info bpbrm (pid=52669) child pid: 11815
03/22/2015 07:16:39 - Info bpbrm (pid=52669) sending bpsched msg: CONNECTING TO CLIENT FOR VL-MO-APG104-FAE2_1427022997
03/22/2015 07:16:39 - Info bpbrm (pid=52669) listening for client connection
03/22/2015 07:16:39 - Info bpbrm (pid=52669) INF - Client read timeout = 300
03/22/2015 07:16:39 - connecting
03/22/2015 07:16:48 - Info bpbrm (pid=52669) accepted connection from client
03/22/2015 07:16:48 - Info dbclient (pid=0) Backup started
03/22/2015 07:16:48 - Info bpbrm (pid=52669) Sending the file list to the client
03/22/2015 07:16:48 - connected; connect time: 0:00:00
03/22/2015 07:16:48 - begin writing
03/22/2015 07:17:17 - Info dbclient (pid=0) done. status: 0
03/22/2015 07:17:24 - Info bpbrm (pid=52669) media manager for backup id VL-MO-APG104-FAE2_1427022997 exited with status 0: the requested operation was successfully completed
03/22/2015 07:17:24 - end writing; write time: 0:00:36
the requested operation was successfully completed  (0)

 

 

KB per second : 2986   KB written: 4518534

03/22/2015 07:33:57 - Info nbjm (pid=54914) starting backup job (jobid=288958) for client VL-MO-APG104-FAE2, policy AUTO_ORA_APG, schedule Default-App-6m
03/22/2015 07:33:57 - Info nbjm (pid=54914) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=288958, request id:{542F64B6-D087-11E4-BA03-D7D76731DDDB})
03/22/2015 07:33:57 - requesting resource MO-BK005-SL500-1
03/22/2015 07:33:57 - requesting resource vl-mo-bk001.ip.videotron.ca.NBU_CLIENT.MAXJOBS.VL-MO-APG104-FAE2
03/22/2015 07:33:57 - requesting resource vl-mo-bk001.ip.videotron.ca.NBU_POLICY.MAXJOBS.AUTO_ORA_APG
03/22/2015 07:33:58 - granted resource  vl-mo-bk001.ip.videotron.ca.NBU_CLIENT.MAXJOBS.VL-MO-APG104-FAE2
03/22/2015 07:33:58 - granted resource  vl-mo-bk001.ip.videotron.ca.NBU_POLICY.MAXJOBS.AUTO_ORA_APG
03/22/2015 07:33:58 - granted resource  VL0524
03/22/2015 07:33:58 - granted resource  HP.ULTRIUM4-SCSI.005
03/22/2015 07:33:58 - granted resource  MO-BK005-SL500-1
03/22/2015 07:33:58 - estimated 0 kbytes needed
03/22/2015 07:33:58 - Info nbjm (pid=54914) started backup (backupid=VL-MO-APG104-FAE2_1427024038) job for client VL-MO-APG104-FAE2, policy AUTO_ORA_APG, schedule Default-App-6m on storage unit MO-BK005-SL500-1
03/22/2015 07:33:58 - Info bptm (pid=52671) using 229376 data buffer size
03/22/2015 07:33:58 - Info bpbrm (pid=52669) VL-MO-APG104-FAE2 is the host to backup data from
03/22/2015 07:33:58 - Info bptm (pid=52671) using 10 data buffers
03/22/2015 07:33:58 - Info bpbrm (pid=52669) telling media manager to start backup on client
03/22/2015 07:33:58 - Info bpbrm (pid=52669) spawning a brm child process
03/22/2015 07:33:58 - Info bpbrm (pid=52669) child pid: 13513
03/22/2015 07:33:59 - Info bpbrm (pid=52669) sending bpsched msg: CONNECTING TO CLIENT FOR VL-MO-APG104-FAE2_1427024038
03/22/2015 07:33:59 - Info bpbrm (pid=52669) listening for client connection
03/22/2015 07:33:59 - Info bpbrm (pid=52669) INF - Client read timeout = 300
03/22/2015 07:33:59 - connecting
03/22/2015 07:34:08 - Info bpbrm (pid=52669) accepted connection from client
03/22/2015 07:34:08 - Info dbclient (pid=0) Backup started
03/22/2015 07:34:08 - Info bpbrm (pid=52669) Sending the file list to the client
03/22/2015 07:34:08 - connected; connect time: 0:00:00
03/22/2015 07:34:08 - begin writing
03/22/2015 07:59:18 - Info dbclient (pid=0) done. status: 0
03/22/2015 07:59:26 - Info bpbrm (pid=52669) media manager for backup id VL-MO-APG104-FAE2_1427024038 exited with status 0: the requested operation was successfully completed
03/22/2015 07:59:26 - end writing; write time: 0:25:18
the requested operation was successfully completed  (0)

 

 


sdo
Moderator
Moderator
Partner    VIP    Certified

Are you able to share the backup policy config?

bppllist AUTO_ORA_APG -U

RonCaplinger
Level 6

Nicolai, sdo, and I have given you some things to check; have you had a chance to check the new media server's connection to the network?  Since it sounds like you can back up the media server itself at a decent speed, there is nothing in the tape path that should be causing this.  So the logical place to look is the network infrastructure between your clients and the media server.

If you used to get 57MB/sec but are now only getting 5MB/sec, it looks like something in your network path is misconfigured, likely at the 10G connection on your switch.  The difference is about 10:1, which is what would happen if the switch port was defaulting to 100Mbit/sec.

Check the switch's port settings and make sure it is HARD CODED to 10G - not AUTO, not 10/100, not 1GB.  Please let us know what you find.
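On the media server side you can confirm what was actually negotiated with something like (interface name assumed):

ethtool eth2 | egrep 'Speed|Duplex|Link detected'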

sdo
Moderator
Moderator
Partner    VIP    Certified

1) You say the original server used 4 x 1 Gb bonding - but can I ask exactly what bonding type?

2) And you say the new server is using a 10Gb NIC - just the one?  Or bonded too?  If so, exactly which type of bonding?  And if bonded - on the same switch, or across physically diverse LAN switches, or across "stacked" LAN switches ("stacked" = "separate but not really separate")?

3) Are any of the servers (master, media, or client) in this scenario virtualised?  And did the media server change from physical to virtual?

JeanB
Level 4

Bonding on the old server was on the same switch as the new server and consisted of a port channel made up of 4 x 1Gb links.  The new media server is using a single 10Gb NIC without bonding.  No virtualization is involved on any clients or the media server; all servers are physical.

RonCaplinger
Level 6

Your first email indicated: "many, if not all, backups are throughputting at 7 MB/s, which were previously at 30 MB/s".  So do you have some backups going through the new RHEL media server that are running just as fast as they did on the old SunOS media server?  Or are all backups that are going through the new RHEL media server slow?

I'm wondering... maybe there is a hard-coded IP (like in the "hosts" file) or a network path that is routing some of the backups through the wrong ethernet interface on the media server, while other clients don't have that same configuration and come in through a different interface.  Can you check the output of the "ifconfig -a" command?  Is there more than one "eth" connection defined on the new media server?  And could one of them be a 100Mbit connection?
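On RHEL you can also ask the routing table which interface would carry traffic for a given client, and check how each client name resolves on the media server, e.g. (client IP is a placeholder):

ip route get 10.1.2.3
getent hosts VL-MO-APG104-FAE2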

JeanB
Level 4

Yes some clients are running just as fast as before on the new RHEL media server.

There is another NIC on the new media server, but it has a new IP and no backups are routed there.  Tcpdump on that NIC shows no traffic from backup clients.

JeanB
Level 4

Yes, I did some checks with the network admin and looked at the switch configs: the new media server is connected to a 10GB port.  Looking at backup speeds from several clients (before and after the new media server), there are many parameters and a lot of data to analyse.  I can't find anything striking, except that the new media server seems to perform worse than before with the slower clients.

For example, two bad clients that used to perform 10X better give the following iperf results:

Client A:    139 Mbits/sec,  299 Mbits/sec,  486 Mbits/sec

Client B:    795 Mbits/sec,  199 Mbits/sec,  146 Mbits/sec

Client C, which has not changed in speed since the upgrade, gives:

Client C:    610 Mbits/sec,  846 Mbits/sec,  844 Mbits/sec

Clients A and B have become slow, and they are on the same switch.  C is on another switch.  Perhaps some inter-switch link is saturating now that the new media server has a 10GB interface.

sdo
Moderator
Moderator
Partner    VIP    Certified

1) In this new configuration is it always the same selections/parts/databases of the same clients which now always run slow - or is the slowness intermittent and so appears to hop around different backup policies and/or different policy types and/or different backup clients?

2) Your iperf stats look quite variable.  Is it worth running an iperf test a couple of times a day, over a period of several days, and recording the numbers?  Do any patterns emerge?
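Something like this on a slow client would build a history you could look at for patterns (hostname and interval are just examples):

# one 30-second iperf run every 4 hours, appended to a log
while true; do
    echo "$(date '+%F %T') $(iperf -c mediaserver -t 30 -f m | tail -1)" >> /var/tmp/iperf_history.log
    sleep 14400
done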

JeanB
Level 4

Only specific clients appear to be slower than before, independent of the policy type (filesystem or Oracle database) and of the storage unit (we have both LTO4 tape and disk (Data Domain)), and in all cases, backup jobs on these clients always converge to 5-7 MB/s.  Sometimes, but not always, backup speeds are initially high, say 40MB/s, but within minutes they continuously slow down to 7 MB/s.  However, occasionally some jobs on these slow clients are erratically fast, as shown in the logs above: two jobs from the same client, 17 minutes apart, on the same storage unit, tape drive, and tape - one at 52MB/s and the other at 3 MB/s, each several GBytes in size.