I replaced the hardware for a media server according to TECH77811 and am now getting much slower throughput on tape backups. The master is Linux RHEL 6.3 running NB 126.96.36.199. The original media server was a Solaris 10 Sun Fire X4270 with a 4 x 1 Gb aggregated NIC, running NB 188.8.131.52. The new media server is an HP DL380p G8 with a 10 Gb NIC, RHEL 6.3, running NB 184.108.40.206 (I actually installed 220.127.116.11 prior to the hardware swap, added the media server to the 18.104.22.168 master, upgraded to 22.214.171.124, and then to 126.96.36.199). No changes were made on the LTO4 robot.
Clients are running a mix of 6.5.4 and 188.8.131.52. Many, if not all, backups now run at 7 MB/s; the same backups previously ran at 30 MB/s on the Sun media server.
On the new media server:
sysctl -a | grep kernel.sem
kernel.sem = 300 307200 32 1024
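For anyone reading along who is not used to this output: the four kernel.sem values are, in order, SEMMSL (max semaphores per set), SEMMNS (max semaphores system-wide), SEMOPM (max operations per semop call) and SEMMNI (max semaphore sets). A minimal Python sketch that labels them is below. The function name and the sanity check are only illustrative; they are not part of NetBackup or of the original post.

```python
def parse_kernel_sem(line):
    """Label the four fields of a 'kernel.sem = a b c d' sysctl line."""
    _, _, values = line.partition("=")
    names = ("semmsl", "semmns", "semopm", "semmni")
    numbers = [int(v) for v in values.split()]
    if len(numbers) != len(names):
        raise ValueError("expected four values after 'kernel.sem ='")
    return dict(zip(names, numbers))


sem = parse_kernel_sem("kernel.sem = 300 307200 32 1024")
for name, value in sem.items():
    print(name, value)

# Illustrative sanity check: the system-wide total cannot usefully exceed
# (semaphores per set) x (number of sets).
print("semmns within semmsl * semmni:",
      sem["semmns"] <= sem["semmsl"] * sem["semmni"])
```

With the values posted above, SEMMNS is exactly SEMMSL x SEMMNI, so the settings are at least internally consistent.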
I played around with SIZE_DATA_BUFFERS = 229376 and 262144, with NUMBER_DATA_BUFFERS = 10 and 128 respectively, but I haven't noticed any improvements. Note that the original Sun media server had neither a SIZE_DATA_BUFFERS nor a NUMBER_DATA_BUFFERS file (so it was using the defaults).
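To put those two settings in perspective, here is a rough Python sketch of how much shared memory each combination asks for. It assumes the commonly quoted rule of thumb (size x number x drives x multiplexing) and that SIZE_DATA_BUFFERS should be a multiple of 1024. The drive count and MPX value are hypothetical placeholders, since neither was given in the post.

```python
def buffer_shared_memory(size_data_buffers, number_data_buffers,
                         drives=1, mpx=1):
    """Estimate shared memory (bytes) used by tape data buffers.

    Assumed rule of thumb: size * number * drives * multiplexing.
    drives and mpx are placeholders, not values from the original post.
    """
    if size_data_buffers % 1024 != 0:
        raise ValueError("SIZE_DATA_BUFFERS is expected to be a multiple of 1024")
    return size_data_buffers * number_data_buffers * drives * mpx


small = buffer_shared_memory(229376, 10)    # first combination tried
large = buffer_shared_memory(262144, 128)   # second combination tried
print(small, "bytes =", small / 1024 / 1024, "MiB per drive")
print(large, "bytes =", large / 1024 / 1024, "MiB per drive")
```

Neither figure is large for a modern server, which fits the observation that changing the buffers made no difference: the bottleneck is probably somewhere before the data reaches the tape buffers.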
Backups of the media server's own files straight to tape give 80 MB/s, so this does not appear to be an HBA issue.
I'm running out of ideas on where and what to look for. Support says they can't help me unless I hire Support services.
Now looking for any correlations...
1) Are the clients which go slow... of the same NetBackup Client version level, or perhaps a sub-set of version levels/range, or a complete mix of different NetBackup Client versions?
2) Similar question, but this time re OS families... i.e. are all clients that go slow of the same OS version and patch level, or even are they all of the same OS family (e.g. Windows, RHEL, Solaris...) ?
3) And similar question yet again... are all the clients that go slow of the same hardware type? And by this I guess I mean do they all use the same NIC model, and same NIC firmware and same (notwithstanding OS family differences) NIC driver revision?
4) Do any of the clients which go slow still have 'jumbo' frames left enabled (from another exercise in the past) ?
5) I know you've said before that the clients that go slow are spread across different LAN switches, but are all of the clients which go slow on the same LAN switch 'blade' or LAN switch 'stack module'? By this I mean are all of the clients which go slow on just a few blades or modules - or are they spread all over the physical LAN infrastructure?
6) Are all of the clients that go slow in the same building/site/room but a different site/building/room to the NetBackup servers?
Tricky questions I know - but might be worth being able to answer.
We did some network sniffing between a slow Solaris client and the new media server. The network tech spotted a TCP WINDOW FULL warning, after which traffic appeared to slow down, and suggested that it might be related to the TCP window scale option being off. I checked many clients and the server, and the option is ON (OS default) everywhere. All this is quite new to me, but I'm wondering if this is due to the fact that the media server has a 10 Gb NIC and the client has a 1 Gb NIC. The default window size is 48K on the client and 512K on the server. Not sure what I need to tweak.
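As a rough way to reason about the window sizes mentioned above: a single TCP stream cannot move more than one window per round trip, so the ceiling is window / RTT. The Python sketch below works that out. The round-trip times are hypothetical, since no RTT was measured in this thread, and the function names are made up for illustration.

```python
def window_limited_throughput(window_bytes, rtt_us):
    """Upper bound on one TCP stream in bytes/second: one window per RTT."""
    return window_bytes * 1_000_000 / rtt_us


def window_needed(bandwidth_bits_per_s, rtt_us):
    """Window in bytes needed to keep a link full (bandwidth-delay product)."""
    return bandwidth_bits_per_s * rtt_us / 8 / 1_000_000


client_window = 48 * 1024  # the 48K client default quoted above

# Hypothetical round-trip times in microseconds; measure your own with ping.
for rtt_us in (500, 1000, 5000):
    ceiling = window_limited_throughput(client_window, rtt_us)
    print(f"RTT {rtt_us} us: at most {ceiling / 1e6:.1f} MB/s per stream")

# Window needed to fill a 1 Gb link at a hypothetical 1 ms RTT.
print(window_needed(1_000_000_000, 1000), "bytes")
```

If the real RTT under load is several milliseconds (queuing on the switch can add to it), a 48K window alone would cap a stream at single-digit MB/s, which is in the range being reported. That is only a hypothesis to test against the capture, not a diagnosis.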
TCP congestion handling is not something we server guys understand very easily. While congestion is not avoidable and will show up from time to time, seeing a lot of it indicates a bottleneck.
Also take a look at the distribution of IP packet sizes (wireshark/tcpdump can do this). In a well-working network, the majority of IP packets in a bulk transfer should be full-sized, i.e. around 1500 bytes on a standard-MTU link. If you see a large number of 64-byte packets, something is not working correctly.
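If you can export the packet lengths from the capture as a plain list of numbers, a few lines of Python are enough to see the split. This is only a sketch: the bucket thresholds (64 and 1500 bytes) are taken from the rule of thumb above, and the sample data is invented for illustration.

```python
from collections import Counter


def size_distribution(lengths, small_max=64, full_min=1500):
    """Return the fraction of packets that are small, mid-sized, or full-sized.

    small_max and full_min are assumed thresholds; adjust full_min if your
    capture tool reports lengths with or without link-layer headers.
    """
    counts = Counter()
    for n in lengths:
        if n <= small_max:
            counts["small"] += 1
        elif n >= full_min:
            counts["full"] += 1
        else:
            counts["mid"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"small": 0.0, "mid": 0.0, "full": 0.0}
    return {key: counts[key] / total for key in ("small", "mid", "full")}


# Invented sample: 7 full-sized packets, 2 tiny ones, 1 in between.
sample = [1500] * 7 + [64] * 2 + [576]
dist = size_distribution(sample)
for bucket, share in dist.items():
    print(f"{bucket}: {share:.0%}")
```

Run it separately on the client-to-server direction only, otherwise the small ACK packets going the other way will skew the result.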
I googled "TCP WINDOW FULL" and found these posts that seem to help explain this:
"the usual solution is to increase the receive window size"
After many weeks of tuning TCP parameters on clients and media servers, we still cannot arrive at a clear-cut identification of the problem. The only pattern that looks relatively well defined is that Solaris clients (mostly Solaris 10) perform badly with the new RedHat 6 media servers. RedHat clients perform very well, almost perfectly, at 108 MB/s. It appears from wireshark traces that RedHat clients deal better with packet drops than Solaris clients do, which react with a radical slowdown. It's as if the TCP stacks of the Solaris clients are incompatible with that of the RH media servers. That's my two cents, as tweaking TCP params is still quite new to me.
I have had a similar problem, albeit on an HP Windows media server with an HP 10Gb 530FLR-SFP+ against a Cisco Catalyst 6509 switch. In that case the solution was:
Enabling flow control on the switch port, and
On the network card, "TCP Connection Offload" should be unchecked, and
"Receive Buffers" increased to 3000
As I understood it, flow control means that the switch stops sending packets when the network card signals that it is full (a pause frame). That gave better overall performance, as the client(s) did not have to retransmit packets.