NB 7.6.0.1 - socket write failed (24)/file read failed (13)

X2 · ‎04-07-2014

Hi,

I'm testing backups for two 32-bit Windows servers and am getting either socket write failed (24) or file read failed (13). The testing is to make sure that our various servers work with 7.6.0.1 before we upgrade the remaining two NBU domains.

Environment:

Master: RHEL5 x64 running NB 7.6.0.1

Media: Win 2008, x64 SP1 NB 7.6.0.1

Clients: 1) Win 2003, x86 and 2) Win 2008 x86

Below is an excerpt from the bpbkar log on the client (full log attached as zip):

13:50:08.785 [6764.6984] <2> TransporterRemote::write[2](): DBG - | An Exception of type [SocketWriteException] has occured at: |   Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.55 $ , Function: TransporterRemote::write[2](), Line: 338 |   Local Address: [::]:0 |   Remote Address: [::]:0 |   OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

) |   Expected bytes: 262144 | (../TransporterRemote.cpp:338)

13:50:08.821 [6764.6984] <16> tar_tfi::processException:

An Exception of type [SocketWriteException] has occured at:

Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.55 $ , Function: TransporterRemote::write[2](), Line: 338

Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.91 $ , Function: Packer::getBuffer(), Line: 652

Module: tar_tfi::getBuffer, Function: D:\NB\NB_7.6.0.1\src\cl\clientpc\util\tar_tfi.cpp, Line: 311

Local Address: [::]:0

Remote Address: [::]:0

OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

)

Expected bytes: 262144

13:50:08.837 [6764.6984] <4> tar_base::V_vTarMsgW: INF - tar message received from tar_backup_tfi::processException
13:50:08.837 [6764.6984] <2> tar_base::V_vTarMsgW: FTL - socket write failed
13:50:08.844 [6764.6984] <16> dtcp_write: TCP - failure: send socket (384) (TCP 10053: Software caused connection abort)
13:50:08.844 [6764.6984] <16> dtcp_write: TCP - failure: attempted to send 26 bytes
13:50:08.844 [6764.6984] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
13:50:08.844 [6764.6984] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
13:50:08.844 [6764.6984] <2> tar_base::V_vTarMsgW: INF - Client completed sending data for backup

This also happened before I upgraded the client on both x86 servers from 7.5.0.5 to the latest version. The error occurs after a random duration (e.g. 5 minutes, 2 hours). It didn't show up last week, so I thought that a couple of changes I had made on the media server might have fixed the issue. Other servers, e.g. Win x64 and especially Linux, have not had this error during my testing; only the Win x86 servers show this erratic behaviour.

Has anyone seen this before, or during the upgrade to 7.6.0.1? Any suggestions will be appreciated.
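
For readers scanning a long bpbkar log, the two Winsock codes in the excerpt above (10060 and 10053) can be pulled out with a short script. This is only a sketch: the regular expression is written against the two line shapes shown in this post, the sample lines are abbreviated copies of them, and the function name is made up for illustration.

```python
import re

# Winsock error codes that appear in the excerpt, with their standard names.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED",  # software caused connection abort
    10060: "WSAETIMEDOUT",     # connection timed out
}

# Matches "OS Error: 10060" and "(TCP 10053: ...)" as seen in the excerpt.
# Assumption: other bpbkar versions may format these lines differently.
ERROR_PATTERN = re.compile(r"(?:OS Error:|TCP)\s+(\d{5})\b")


def find_winsock_errors(log_text):
    """Return (code, name) pairs for each error code found, in log order."""
    results = []
    for match in ERROR_PATTERN.finditer(log_text):
        code = int(match.group(1))
        results.append((code, WINSOCK_ERRORS.get(code, "unknown")))
    return results


# Abbreviated copies of two lines from the excerpt.
sample = (
    "13:50:08.785 [6764.6984] <2> TransporterRemote::write[2](): DBG - | OS Error: 10060 (A connection attempt failed\n"
    "13:50:08.844 [6764.6984] <16> dtcp_write: TCP - failure: send socket (384) (TCP 10053: Software caused connection abort)\n"
)

for code, name in find_winsock_errors(sample):
    print(code, name)
```

In this excerpt the timeout (10060) is logged first and the connection abort (10053) follows when the client tries to send again.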

SymTerry · ‎04-07-2014

Hello,

Are you getting these errors after you upgrade the test clients?

mph999 actually posted a great list for troubleshooting a status 24. While this list is large and might take a while to read, it covers the bases. He also mentions that he hardly ever sees a status 24 as the result of a NetBackup issue. I totally agree. This is usually an OS/hardware/network tuning issue. The trick is finding it, hence all the steps to troubleshoot.

You could also disable the TCP chimney/offload for the 2003 server as a start: http://www.symantec.com/docs/TECH60844

VerJD · ‎04-07-2014

@NCS-SAN,

SymTerry gave you some great info above, which might be what you need for the Windows 2003 Server. The other thing to remember here is that since you're dealing with two different clients with two different operating systems, that means there could also be two different solutions to these failures. Could you post a bpbkar log for the W2K3 server?

As for the bpbkar log, it appears to be from your Windows 2008 Server, which showed OS Error: 10060 and then failed with status 24. Technote TECH193354 appears to match those errors; let me know if you find it useful. Otherwise, do you have any corresponding issues in the OS event logs? Thanks in advance!

JD | Veritas NetBackup Support

X2 · ‎04-08-2014

Thanks for the information. I'm going through mph999's list now.

I also wanted to add that the two 32-bit servers giving the error are VMs. They are normally backed up using the VMware method, and that works fine. The error only happens when I try to back them up via the traditional method.

@SymTerry - no, this error happened earlier when I was testing the same clients with 7.5.0.5 and master/media upgraded to 7.6.0.1. There was quite some troubleshooting done in case #06149697 if you want to have a look. I might have to open a new case for this if this doesn't work.

Mark_Solutions · ‎04-08-2014

If you open the client's Host Properties - Windows Client - Client Settings,

what is the communication buffer size set to?

It looks high in the log (unless you have a NET_BUFFER_SZ set?)

Try it with a value of 32 and see if that helps

It could be a keep-alive setting though. Maybe the ESX server has its internal firewall set, but it seems to fail after 25 minutes (1500 seconds), which is an odd time.

Having said that, the logging level you have set may be holding the job up so much that it fails!
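
The size that "looks high in the log" is presumably the `Expected bytes: 262144` value from the bpbkar excerpt. A quick conversion shows what that is in kB and what the suggested value of 32 would be in bytes. This sketch assumes the Host Properties setting counts 1 kB as 1024 bytes.

```python
def kb_to_bytes(kilobytes):
    """Convert a buffer size in kB to bytes, assuming 1 kB = 1024 bytes."""
    return kilobytes * 1024


expected_bytes_in_log = 262144  # from "Expected bytes: 262144" in the excerpt

# The write that failed was a 256 kB buffer.
print(expected_bytes_in_log // 1024, "kB")

# The suggested test value of 32 kB, expressed in bytes.
print(kb_to_bytes(32), "bytes")
```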

X2 · ‎04-08-2014

Comm buffer size is set to: 256 kB (our standard value)

I also verified on the Win 2003 x86 server that the EnableTCPChimney and EnableRSS settings are set to 0 in the registry.

The logging level was set to (2,5) so as to see if there is an indication of the problem. Same behaviour when logging was set to (0,0).
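
The registry check described above can be scripted. The following is a sketch under stated assumptions: it uses the standard Windows 2003 Tcpip\Parameters key, it only reads the registry when run on Windows, and the value name EnableTCPA is an addition here that the post does not mention (the post names EnableTCPChimney and EnableRSS). The function names are hypothetical.

```python
import sys

# Standard location of the TCP/IP parameters on Windows.
TCPIP_PARAMETERS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# EnableTCPChimney and EnableRSS come from the post; EnableTCPA is an assumption.
OFFLOAD_VALUES = ("EnableTCPChimney", "EnableRSS", "EnableTCPA")


def not_explicitly_disabled(values):
    """Return the names whose value is not 0.

    A missing name (or None) is listed too, because the setting is then
    not explicitly disabled in the registry.
    """
    return [name for name in OFFLOAD_VALUES if values.get(name) != 0]


def read_offload_values():
    """Read the offload values from the registry (Windows only)."""
    import winreg  # imported here so the module still loads on other systems

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMETERS) as key:
        for name in OFFLOAD_VALUES:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # the value does not exist under this key
            result[name] = value
    return result


if sys.platform == "win32":
    print(not_explicitly_disabled(read_offload_values()))
else:
    # Example with hand-written values, for running the sketch elsewhere.
    print(not_explicitly_disabled({"EnableTCPChimney": 0, "EnableRSS": 0}))
```

An empty list means every listed value is present and set to 0.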

Mark_Solutions · ‎04-08-2014

Worth trying 32 to see if that helps. That used to be the standard in the good old days. Maybe there is something in the VMware NICs that doesn't like the larger value?

SymTerry · ‎04-08-2014

If you do open another case, let me know. I will track and make sure you get the attention needed.

X2 · ‎04-08-2014

Testing with smaller values of buffer size.

Set a 32 kB buffer size on one server. The tracker indicates the backup is running at a very slow 200 kB/s :(

Second server set with 128kB (from 256kB).

Update: the server with 32kB buffer size stopped with status 24 after writing about 1GB data! Second server still going on with 128kB buffer.

X2 · ‎04-08-2014

Both backups (with 32 kB and 128 kB buffers) failed with status: socket write failed (24)

One failed after 53m:49s, and second failed after 1h05m23s.

So is it safe to presume that neither buffer size nor chimney/offload are causing this?
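
To compare these failure times with the 25-minute (1500 second) mark mentioned earlier in the thread, the durations can be converted to seconds. This is a small sketch; the parser only handles the duration formats used in this thread, and the labels are made up for illustration.

```python
import re


def to_seconds(duration):
    """Parse durations such as '53m:49s', '1h05m23s' or '25m' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?:?(?:(\d+)m)?:?(?:(\d+)s)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Failure times reported in this post, plus the earlier 25-minute observation.
observed = {
    "first failure": "53m:49s",
    "second failure": "1h05m23s",
    "earlier observation": "25m",
}

for label, text in observed.items():
    print(f"{label}: {to_seconds(text)} s")
```

The two failures come out at 3229 s and 3923 s, so they match neither each other nor the 1500 s mark. That is consistent with the error not being tied to one fixed timeout, although two samples cannot rule one out.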

X2 · ‎04-24-2014

Thinking that the issue could be due to the VM infrastructure, I tested using a physical machine and got the same errors. A new case has been raised with Symantec Ent support.

06472906 - Unsuccessful backups for Win x86 systems (Win 2003 SP2 with NB 7.6.0.1)

mph999 · ‎04-24-2014

Try setting the comm buffer size to 0 ...

mph999 · ‎04-24-2014

... I meant net buffer size - for some reason I cannot edit my previous post

Kris_Skoglund · ‎04-29-2014

I’m also having this problem.

I lowered the client comm buffer to 16K with better results.

One interesting question is what NET_BUFFER_SZ value you have on the media server.

I’m currently testing to set it to 0.

_____

Kris Skoglund
Senior Storage Specialist
www.innovationgroup.se
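
A minimal sketch for checking the NET_BUFFER_SZ setting discussed above. It assumes the setting is stored as a plain-text file named NET_BUFFER_SZ holding a single integer, and that the file lives in the NetBackup installation directory. The directory differs per platform and is not given in this thread, so the path is passed in as a parameter and the demo uses a temporary directory instead of a real install path.

```python
import tempfile
from pathlib import Path


def read_net_buffer_sz(path):
    """Return the integer stored in a NET_BUFFER_SZ file, or None if absent."""
    file = Path(path)
    if not file.is_file():
        return None
    return int(file.read_text().strip())


# Demo against a temporary directory rather than a real NetBackup install.
with tempfile.TemporaryDirectory() as tmp:
    candidate = Path(tmp) / "NET_BUFFER_SZ"
    print(read_net_buffer_sz(candidate))  # file not created yet

    candidate.write_text("0\n")           # the value being tested in the thread
    print(read_net_buffer_sz(candidate))
```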

rnosal · ‎04-29-2014

I had a very similar problem, but only when backing up VMs.

In my case only one media server was affected (a 5200 NetBackup Media/Storage Appliance).

After looking at all of the buffers, we started to narrow it down to NICs on the appliance that may have been faulty.

We removed the media server from rotation and the problem cleared right up.

I'm still testing the appliance to see if it is the NICs or if there was a speed mismatch between the NICs and the switches.
