Looking for ideas for troubleshooting a client error 24: socket write failed

new2nbu
Level 4

Hello community, I have been troubleshooting a client issue for about 4 days and am looking to see if anyone has any additional ideas on resolving it. I'll do my best to explain and give some background information:

Client: Windows 2003 x64 Enterprise with NBU v7.1.0.4 client installed (physical server). The policy is a Windows filesystem policy with ALL_LOCAL_DRIVES (multistreaming not enabled).

Media and master servers: 3 media servers and 1 master server, all Windows 2008 Standard x64 SP2 with NBU client v7.1.0.4 (the media and master servers are all separate physical servers, 4 in total).

The error occurs when writing to either tape or a Data Domain device.

Brief timeline of events

  • Friday full and Monday differential ran successfully
  • Tuesday differential failed (server owner confirmed no changes were made). Since then, the differential and full have consistently been failing with Status 24: Socket Write Failed.
    • This appears in the bpbkar log:
      • 1:25:59.101 PM: [19572.32028] <2> TransporterRemote::write[2](): DBG - | An Exception of type [SocketWriteException] has occured at: | Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321 | Local Address: [0.0.0.0]:0 | Remote Address: [0.0.0.0]:0 | OS Error: 10053 (An established connection was aborted by the software in your host machine.

        ) | Expected bytes: 16384 | (../TransporterRemote.cpp:321)

        1:25:59.101 PM: [19572.32028] <16> tar_tfi::processException:

        An Exception of type [SocketWriteException] has occured at:

        Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54 $ , Function: TransporterRemote::write[2](), Line: 321

        Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.89 $ , Function: Packer::getBuffer(), Line: 656

        Module: tar_tfi::getBuffer, Function: H:\7104\src\cl\clientpc\util\tar_tfi.cpp, Line: 312

        Local Address: [0.0.0.0]:0

        Remote Address: [0.0.0.0]:0

        OS Error: 10053 (An established connection was aborted by the software in your host machine.

        )

        Expected bytes: 16384

    • and I also see this in bpbkar:

    • :26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - FS_DleBEAO::DeInit - exiting.

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedssql2.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsshadow.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsss.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsadgran.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsnt5.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsev.dll

      1:26:00.194 PM: [19572.32028] <2> ov_log::V_GlobalLog: INF - unloading bedsxese.dll

      1:26:00.194 PM: [19572.32028] <16> dtcp_read: TCP - failure: recv socket (592) (TCP 10053: Software caused connection abort)

      1:26:01.194 PM: [19572.32028] <16> dtcp_read: TCP - failure: recv socket (592) (TCP 10053: Software caused connection abort)

      1:26:02.194 PM: [19572.32028] <16> dtcp_read: TCP - failure: recv socket (592) (TCP 10053: Software caused connection abort)

      1:26:03.194 PM: [19572.32028] <16> dtcp_read: TCP - failure: recv socket (592) (TCP 10053: Software caused connection abort)

      1:26:03.194 PM: [19572.32028] <4> OVShutdown: INF - Shutdown wait finished

  • The following troubleshooting has been done:
    • NBU client restarted
    • Server was rebooted
    • Set TcpTimedWaitDelay to 30 seconds, then rebooted: http://www.symantec.com/business/support/index?page=content&id=TECH150369
    • Opened a case with Symantec support; tried increasing the client connect timeout and client read timeout from 600 to 1800 seconds
    • Ran an AppsCritical report and found ~21% packet reordering; following up with the network team to determine whether anything changed in the network
    • TCP Offload and Chimney on the client were already disabled
    • Confirmed NIC offload settings were disabled as well
    • No NIC teaming enabled
    • Confirmed netstat -a does not show a large number of connections in TIME_WAIT
    • Forward/reverse DNS resolution is fine
    • Ping is fine
    • bpclntcmd commands are all working fine:
      • bpclntcmd -ip (from both client and server)
      • bpclntcmd -hn
      • bpclntcmd -pn
      • bpcoverage -c clientname

  • Does anyone have any other suggestions to help troubleshoot? Or am I missing anything??
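For anyone comparing several failed runs, here is a minimal sketch of how the socket error codes in a bpbkar log could be tallied. It assumes the two line shapes shown in the excerpt above ("OS Error: 10053" and "TCP 10053:"); the function name and the sample list are only illustrative, and 10053 is the Winsock code WSAECONNABORTED.

```python
import re
from collections import Counter

# Matches the two forms seen in the excerpt: "OS Error: 10053" and "TCP 10053:".
# "TCP - failure" is not matched because no digits follow "TCP ".
ERROR_RE = re.compile(r"(?:OS Error: |TCP )(\d{4,5})\b")


def tally_socket_errors(lines):
    """Count Winsock error codes appearing in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for code in ERROR_RE.findall(line):
            counts[int(code)] += 1
    return counts


# Sample lines copied from the bpbkar excerpt in this post.
sample = [
    "1:25:59.101 PM: [19572.32028] <16> tar_tfi::processException:",
    "OS Error: 10053 (An established connection was aborted by the software in your host machine.",
    "1:26:00.194 PM: [19572.32028] <16> dtcp_read: TCP - failure: recv socket (592) (TCP 10053: Software caused connection abort)",
    "1:26:03.194 PM: [19572.32028] <4> OVShutdown: INF - Shutdown wait finished",
]

print(tally_socket_errors(sample))
```

To run it against a real log, pass an open file object instead of the sample list; a mix of different codes across runs would point in a different direction than a steady stream of 10053.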

Thank you for any help.

  •  
2 ACCEPTED SOLUTIONS

mph999
Level 6
Employee Accredited

Apologies, I have to go out so I have not read all the details you posted; I will do so later.

In general, status 24 is not caused by NBU, so you need to look at the OS level. In fact, I have never seen NBU cause a 24.

Check this post for comments, and in particular the post I made (sorry it is long, but it may give you a solution).

https://www-secure.symantec.com/connect/forums/netbackup-status-code-24-possible-parameters-check

Also this one:

https://www-secure.symantec.com/connect/forums/netbackup-solaris-10-media-server-issue

Make sure all interfaces are fully resolvable; unresolvable interfaces are a very common cause of 24s.
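As a rough illustration of what "fully resolvable" could mean in practice, the sketch below resolves a host name to all of its addresses and then checks that each address resolves back to the same name. It uses only Python's standard socket module and is not a NetBackup tool; the function name, the stub resolvers, and the host names and addresses in the demo are hypothetical.

```python
import socket


def check_resolution(name, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name_or_None, matches) for every address of name."""
    _, _, addresses = forward(name)
    wanted = name.lower()
    results = []
    for ip in addresses:
        try:
            rname, aliases, _ = reverse(ip)
        except OSError:  # socket.herror and socket.gaierror are OSError subclasses
            results.append((ip, None, False))
            continue
        names = {n.lower() for n in [rname, *aliases]}
        short_names = {n.split(".")[0] for n in names}
        # Accept either an exact match or a match on the short host name.
        ok = wanted in names or wanted.split(".")[0] in short_names
        results.append((ip, rname, ok))
    return results


# Demo with stub resolvers (hypothetical data) so the example runs offline.
def fake_forward(name):
    return (name, [], ["10.0.0.5", "10.0.1.5"])


def fake_reverse(ip):
    if ip == "10.0.0.5":
        return ("client1.example.com", [], [ip])
    raise socket.herror(1, "Unknown host")


for row in check_resolution("client1", fake_forward, fake_reverse):
    print(row)
```

Calling check_resolution with only a host name uses the real system resolver. Running it on the client, the master, and each media server, for every name the others use to reach that host, would show any interface that lacks a matching reverse record.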

Martin

sri_vani
Level 6
Partner

So you mean that the backup fails only for the C drive, right?

What is the free space and used space on C?

Please check this:

https://www-secure.symantec.com/connect/forums/incremental-backup-failing-only-d-drive-status-code-4224-full-backup-completing-successfully#comment-9435631

Check the C drive for fragmentation.

A large drive with heavy fragmentation may cause a timeout while 'walking' the filesystem looking for changed files.
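One hedged way to see where a filesystem walk spends its time is to time the enumeration of each top-level directory separately. The sketch below does that with the Python standard library; it measures directory enumeration only, not fragmentation itself and not what bpbkar does internally, and the function name is illustrative.

```python
import os
import time


def time_walk(root):
    """Walk each top-level subdirectory of root.

    Returns (path, entry_count, seconds) tuples, slowest first.
    """
    timings = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        start = time.perf_counter()
        count = 0
        # onerror swallows permission errors so one locked folder does not stop the walk.
        for _, dirs, files in os.walk(entry.path, onerror=lambda err: None):
            count += len(dirs) + len(files)
        timings.append((entry.path, count, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[2], reverse=True)
```

For example, printing the first few rows of time_walk("C:\\") on the client would show which directories hold the most entries and take the longest to enumerate, which may help narrow down where the backup stalls.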

Please post all the text in Job Details for the failed job, and ensure all of the following log folders exist:

On media server: bptm and bpbrm

On client: bpbkar and bpfis
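A small sketch for checking that those folders are present, assuming Python is available on the host. The logs path in the usage note is an assumption about a default Windows install and should be replaced with the actual install path; the function name is illustrative.

```python
import os


def missing_log_dirs(logs_root, names):
    """Return the names under logs_root that are not existing directories."""
    return [n for n in names if not os.path.isdir(os.path.join(logs_root, n))]


# Folder names taken from the lists above.
MEDIA_SERVER_LOGS = ["bptm", "bpbrm"]
CLIENT_LOGS = ["bpbkar", "bpfis"]
```

On the client this could be called as missing_log_dirs(r"C:\Program Files\Veritas\NetBackup\logs", CLIENT_LOGS), where the path is assumed rather than taken from this thread; an empty list means every folder exists.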

3 REPLIES

new2nbu
Level 4

Thank you for the reply.

I've gone through the links and we've also gone through the TCP Chimney settings, DNS resolutions etc.

The latest change in the symptoms is that I can now back up D:\ and the other non-OS partitions successfully. However, once the backup gets a certain way through C:\, it fails with a 24. That being said, I think it's safe to rule out a network issue or a DNS issue. I have also ruled out the virus scan directories and the NBU directories, but I am still seeing a failure.

 

Thanks for the feedback.

 
