Backup of particular drive is failing for windows servers with status code 24

tejas9024 · ‎08-19-2013

Hi,

NBU master server: 7.5.0.4

Client: 7.5.0.3

Full backups of a particular drive (a different drive on each client) have suddenly started failing with status code 24 (socket write failed) on many Windows clients. However, differential backups of the same drives complete successfully. The full backups were running fine until last week. No changes have been made on the NetBackup side or on the server side.

8/19/2013 11:59:06 AM - Info nbjm(pid=11272) starting backup job (jobid=236417) for client YYYY, policy XXXX, schedule Full

8/19/2013 11:59:06 AM - Info nbjm(pid=11272) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=236417, request id:{D438AC1E-26CD-455D-AAEF-7E341ED37B11})

8/19/2013 11:59:06 AM - requesting resource mediaNBU-STU

8/19/2013 11:59:06 AM - requesting resource masterNBU.NBU_CLIENT.MAXJOBS.YYYY

8/19/2013 11:59:06 AM - requesting resource masterNBU.NBU_POLICY.MAXJOBS.XXXX

8/19/2013 11:59:06 AM - granted resource masterNBU.NBU_CLIENT.MAXJOBS.YYYY

8/19/2013 11:59:06 AM - granted resource masterNBU.NBU_POLICY.MAXJOBS.XXXX

8/19/2013 11:59:06 AM - granted resource MediaID=@aaaac;DiskVolume=qqq-lsu;DiskPool=qqq-ost-dp;Path=qqq-lsu;StorageServer=ST_Server;MediaServer=mediaNBU

8/19/2013 11:59:06 AM - granted resource mediaNBU-STU

8/19/2013 11:59:07 AM - estimated 23000355 Kbytes needed

8/19/2013 11:59:07 AM - Info nbjm(pid=11272) resumed backup (backupid=YYYY_1376931376) job for client YYYY, policy XXXX, schedule Full on storage unit mediaNBU-STU

8/19/2013 11:59:08 AM - started process bpbrm (10736)

8/19/2013 11:59:15 AM - connecting

8/19/2013 11:59:17 AM - Info bpbrm(pid=10736) starting bpbkar32 on client

8/19/2013 11:59:17 AM - connected; connect time: 00:00:02

8/19/2013 11:59:33 AM - Info bpbkar32(pid=1056) Backup started

8/19/2013 11:59:33 AM - Info bptm(pid=10184) start

8/19/2013 11:59:33 AM - Info bptm(pid=10184) using 1048576 data buffer size

8/19/2013 11:59:33 AM - Info bptm(pid=10184) setting receive network buffer to 4195328 bytes

8/19/2013 11:59:33 AM - Info bptm(pid=10184) using 32 data buffers

8/19/2013 11:59:34 AM - Info bptm(pid=10184) start backup

8/19/2013 11:59:35 AM - Info bptm(pid=10184) backup child process is pid 10612.10536

8/19/2013 11:59:35 AM - Info bptm(pid=10612) start

8/19/2013 11:59:36 AM - begin writing

8/19/2013 11:59:49 AM - Info bpbkar32(pid=1056) change journal NOT enabled for <C:\>

8/19/2013 12:00:13 PM - Critical bpbrm(pid=10736) from client YYYY: FTL - socket write failed

8/19/2013 12:00:20 PM - end writing; write time: 00:00:44

socket write failed(24)

bpbkar and bpcd logs of the client are attached for your reference.
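As a rough aid for reading the job details above, here is a minimal Python sketch that computes how long the job had been writing before the failure was reported. The two log lines are copied from the job details; the helper names `parse_timestamp` and `seconds_between` are hypothetical and not part of NetBackup. For this job the gap works out to 37 seconds.

```python
from datetime import datetime

# Two lines copied verbatim from the job details in this post.
BEGIN_WRITING = "8/19/2013 11:59:36 AM - begin writing"
FAILURE = ("8/19/2013 12:00:13 PM - Critical bpbrm(pid=10736) "
           "from client YYYY: FTL - socket write failed")


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'M/D/YYYY H:MM:SS AM/PM' stamp of a job-detail line."""
    stamp = line.split(" - ", 1)[0]
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


def seconds_between(first: str, second: str) -> int:
    """Return the whole seconds elapsed from the first line to the second."""
    delta = parse_timestamp(second) - parse_timestamp(first)
    return int(delta.total_seconds())


elapsed = seconds_between(BEGIN_WRITING, FAILURE)
print(f"socket write failed {elapsed} s after 'begin writing'")
```

The same helper can be pointed at any two job-detail lines, for example to compare how far into a full backup each affected client gets before the failure.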

Will_Restore · ‎08-19-2013

Don't see any attachments.

From the Status Code Guide

A possible cause is a high network load. For example, this problem occurs with Cannot write to STDOUT when a Windows system that monitors network load detects a high load. It then sends an ICMP packet to other systems to inform them that the route those systems use was disconnected.

ashish_patil01 · ‎08-20-2013

Would also need bpbrm logs from the media server for further investigation.

Wiriadi_Wangsa · ‎08-20-2013

Hi Tejas9024,

It would be better to log a case with NetBackup Support. They may ask you to run the AppCritical tool to check the network health between the media server and the client.

tejas9024 · ‎08-20-2013

Hi All,

Thanks for the suggestions. Please find the log details from client.

Will_Restore · ‎08-20-2013

seeing a lot of these in your bpcd log

ABC is not a master server
ABC is not a media server either
FTL - BPCD EXIT STATUS 46
Server access denied

seems you need to add ABC to client YYYY's Server list

tejas9024 · ‎08-26-2013

hi

ABC is actually the domain name. Please ignore the previous logs and find the latest logs from the client and media server.

Weekend full backups have failed again with status 24; however, differentials complete without any issues.

Thanks

Will_Restore · ‎08-27-2013

bpbrm handle_backup: client CLIENT1 EXIT STATUS = 24: socket write failed

http://www.symantec.com/business/support/index?page=content&id=TECH150369

Solution

1. Change client read timeout parameter from 300 to 9600

2. Change Communication buffer size from 32K to 128K. Go to Host Properties > Clients > Client Properties > Windows Client > ClientSettings > Communication buffer size = 128

3. If antivirus software is running, disable it for troubleshooting purposes.

4. Disable autotuning and chimney features. From a command prompt, run:

netsh int tcp set global autotuning=disabled
(on Windows Server 2008) netsh int tcp set global chimney=disabled
(on Windows Server 2003) netsh int ip set chimney DISABLED

5. Create the registry key TcpTimedWaitDelay (of type REG_DWORD) in HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and set the value to 30 seconds.

Reference: http://technet.microsoft.com/en-us/library/cc757512(WS.10).aspx

6. Reboot the server.
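For step 5, one way to hand the change to a Windows team is as a `.reg` file. The following is only a sketch under stated assumptions: the function name `build_reg_file` is hypothetical, the key path and value name are taken from step 5 above, and the output should be reviewed before anyone imports it on a server. It mainly shows that a value of 30 seconds is stored as the DWORD `0000001e`.

```python
# Key path from step 5, written with the full hive name that .reg files use.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def build_reg_file(value_name: str, seconds: int) -> str:
    """Return .reg file text that sets a REG_DWORD value to `seconds`."""
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY_PATH}]\n"
        # .reg files store DWORDs as eight hexadecimal digits.
        f'"{value_name}"=dword:{seconds:08x}\n'
    )


reg_text = build_reg_file("TcpTimedWaitDelay", 30)
print(reg_text)
```

Nothing here touches the registry; the script only produces text, so it can be run and checked on any machine.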

tejas9024 · ‎08-29-2013

Thanks wr, we have forwarded the recommendations to Windows team.

However, I would like to highlight that around 10 of the 25 clients across 3 policies are failing with the same error. Sometimes the differential backups also fail with status 24 and complete only after multiple restarts. I am not able to find the root cause. Please help.

tejas9024 · ‎08-29-2013

Also, the following 2 errors in bpbkar are common in all those clients.

[20708.19300] <16> tar_tfi::processException:
An Exception of type [SocketWriteException] has occured at:
Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.54.126.1 $ , Function: TransporterRemote::write[2](), Line: 338
Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.90.44.1 $ , Function: Packer::getBuffer(), Line: 652
Module: tar_tfi::getBuffer, Function: D:\NB\NB_7.5.0.3\src\cl\clientpc\util\tar_tfi.cpp, Line: 311
Local Address: [0.0.0.0]:0
Remote Address: [0.0.0.0]:0
OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
)
Expected bytes: 131072

10:31:00.424 AM: [20708.19300] <2> tar_base::V_vTarMsgW: FTL - socket write failed
10:31:00.424 AM: [20708.19300] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
10:31:00.424 AM: [20708.19300] <4> tar_backup::backup_done_state: INF - number of file directives found: 3
10:31:00.424 AM: [20708.21620] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
10:31:00.424 AM: [20708.19300] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
10:31:00.424 AM: [20708.19300] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
1

===============================================

<16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)
5:57:14.528 AM: [6780.9912] <16> dtcp_write: TCP - failure: attempted to send 220 bytes
5:57:14.543 AM: [6780.9912] <16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)
5:57:14.543 AM: [6780.9912] <16> dtcp_write: TCP - failure: attempted to send 220 bytes
5:57:14.559 AM: [6780.9912] <16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)
5:57:14.559 AM: [6780.9912] <16> dtcp_write: TCP - failure: attempted to send 220 bytes
5:57:14.575 AM: [6780.9912] <16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)
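To compare clients, it may help to count which Winsock error codes show up in each bpbkar log. The sketch below is an illustration only: the regular expression is an assumption based on the two message shapes quoted above (`OS Error: 10060` and `TCP 10054:`), and `tally_winsock_errors` is a hypothetical helper name. 10060 and 10054 are the standard Winsock codes WSAETIMEDOUT and WSAECONNRESET.

```python
import re
from collections import Counter

# Standard Winsock names for the two codes seen in the excerpts above.
WINSOCK_NAMES = {
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
}

# Assumed message shapes: "OS Error: 10060" and "(TCP 10054: ...)".
ERROR_RE = re.compile(r"(?:OS Error: |TCP )(\d{5})\b")


def tally_winsock_errors(lines):
    """Count occurrences of each five-digit Winsock code in the given lines."""
    counts = Counter()
    for line in lines:
        for code in ERROR_RE.findall(line):
            counts[int(code)] += 1
    return counts


# A few lines shortened from the log excerpts in this post.
sample = [
    "OS Error: 10060 (A connection attempt failed because the connected party",
    "<16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)",
    "5:57:14.528 AM: [6780.9912] <16> dtcp_write: TCP - failure: attempted to send 220 bytes",
    "5:57:14.543 AM: [6780.9912] <16> dtcp_write: TCP - failure: send socket (1772) (TCP 10054: Connection reset by peer)",
]

counts = tally_winsock_errors(sample)
for code, n in sorted(counts.items()):
    print(code, WINSOCK_NAMES.get(code, "unknown"), n)
```

Run against the full log of each failing client, a tally like this shows whether the failures are mostly timeouts (10060) or resets (10054), which is useful detail to pass on to the network or Windows team.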
