08-04-2011 08:16 AM
Hi,
I have a backup solution with 1x Solaris master and 4x Solaris media servers (all 6.5.6). One of the servers (Windows 2003 SP2 32 bit w/ SQL 2008 Netbackup 6.5.4) has been backing up fine for sometime and has suddenly started ending with status '(24) socket write failed'.
Below is a snippet of the bpbkar log, has anyone seen this before? I've tried all sorts like turning off TCP chimney, turning of checksum offload, updating the NIC drivers etc.. Still fails! Nothing has changed server side either.
Mike
BPBKAR NetBackup Backup/Archive 6.5GA [May 1 2009]
Copyright 1993 - 2007 VERITAS Software Corporation
All Rights Reserved.
15:32:08.443: [5580.5240] <2> WinMain: DAT - _pgmptr = 'C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe'
15:32:08.443: [5580.5240] <2> WinMain: DAT - lpCmdLine = '-r 2419200 -ru root -dt 512225 -to 0 -clnt {REMOVED} -class 002_Clients -sched Diff-Incremental -st INCR -bpstart_to 300 -bpend_to 300 -read_to 300 -blks_per_buffer 512 -use_otm -fso -b {REMOVED}_1312468317 -kl 28 -WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0 -ct 13 -use_ofb '
15:32:08.443: [5580.5240] <2> date_debug: DAT - timezone: GMT Standard Time, offset=0, dst: GMT Daylight Time
15:32:08.443: [5580.5240] <2> date_debug: DAT - current time: 1312468328, 04/08/2011 15:32:08
15:32:08.443: [5580.5240] <2> date_debug: DAT - 01/01/94 UCT: 757382400, 01/01/1994 01:00:00
15:32:08.443: [5580.5240] <2> date_debug: DAT - 07/01/94 UCT: 773020800, 01/07/1994 01:00:00
15:32:08.443: [5580.5240] <2> WinMain: DAT - standard input handle = 536
15:32:08.443: [5580.5240] <2> WinMain: DAT - standard output handle = 1916
15:32:08.443: [5580.5240] <2> WinMain: DAT - standard error handle = 1856
15:32:08.474: [5580.5240] <4> tar_backup::V_SetupJobData: INF - dwJobData: ffffffff
15:32:08.474: [5580.5240] <4> tar_backup::V_SetupJobData: INF - dwJob: ffffffff
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - backup privileges enabled, previous = 0
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - restore privileges enabled, previous = 0
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - security privileges enabled, previous = 0
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - tcb privileges enabled, previous = 0
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - create token privileges enabled, previous = 0
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - user name: root
15:32:08.490: [5580.5240] <4> dos_backup::V_PreProcessing: INF - no access token: scheduled backup
15:32:08.506: [5580.5240] <2> ov_log::V_GlobalLog: INF - BEDS_Init() Enter InitFlags:0x0
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: C:
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: D:
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: E:
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Terminal Services
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Windows Network
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Web Client Network
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Oracle
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Shadow?Copy?Components
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: System?State
15:32:08.615: [5580.5240] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Active Directory Application Mode
15:32:10.849: [5580.5240] <2> tar_backupt_tfi::create: TAR - Incremental backup using the archive bit
15:32:10.928: [5580.5240] <2> tar_backup::V_SetupProcessContinue: TAR - CONTINUE BACKUP received
15:32:10.928: [5580.5240] <2> tar_backup_tfi::setupFileDirectives: TAR - backup filename = ALL_LOCAL_DRIVES
15:32:10.928: [5580.5668] <4> tar_base::keepaliveThread: INF - keepalive thread is active with an interval of 60 seconds
15:32:11.959: [5580.5240] <2> tar_backup_vxbsa::add: INF - called with '/'
15:32:11.959: [5580.5240] <4> dos_backup::V_VerifyFileSystem: INF - Added: C:\
15:32:11.959: [5580.5240] <4> dos_backup::V_VerifyFileSystem: INF - Added: D:\
15:32:11.959: [5580.5240] <4> dos_backup::V_VerifyFileSystem: INF - Added: E:\
15:32:11.959: [5580.5240] <4> dos_backup::V_VerifyFileSystem: INF - Added: Shadow Copy Components:\
15:32:11.959: [5580.5240] <4> dos_backup::V_VerifyFileSystem: INF - Added: System State:\
15:32:15.553: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'System State:\' --> 1002002d
15:32:15.553: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for '_BACKUP_SPECIAL_OBJECTS AFTER System State:' --> 1002002d
15:32:15.553: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'D:\' --> 10020002
15:32:15.553: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'E:\' --> 10020002
15:32:17.615: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'Shadow Copy Components:\' --> 10020027
15:32:17.615: [5580.5240] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'C:\' --> 10020002
15:32:19.459: [5580.5240] <4> VssSnapshot::V_VssSetMetadata(): INF - Unable to locate DFSR metadata
15:32:20.912: [5580.5240] <4> dos_backup::V_InitializeShadowCopy: INF - ============================================
15:32:20.912: [5580.5240] <4> dos_backup::V_InitializeShadowCopy: INF - ShadowCopy PrePost: Begin
15:32:22.599: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - Drive 'C:' added to VSS Volume Set
15:32:22.646: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:22.646: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "C:\"
15:32:22.662: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:40.084: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - VSS Volume Set snapped
15:32:40.084: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - About to commence post-snap unbounded association checks
15:32:40.115: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:40.115: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "C:\"
15:32:40.131: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:40.131: [5580.5240] <4> dos_backup::V_InitializeShadowCopy: INF - ShadowCopy PrePost: End
15:32:40.131: [5580.5240] <4> dos_backup::V_InitializeShadowCopy: INF - ============================================
15:32:40.131: [5580.5240] <4> VssSnapshot::V_VssSetMetadata(): INF - Unable to locate DFSR metadata
15:32:46.818: [5580.5240] <4> dos_backup::tfs_scanstart: INF - Backup Special Objects request AFTER System State:
15:32:48.021: [5580.5240] <4> dos_backup::tfs_findfirst: WRN - unable to find first file: 'System State:\'
15:32:48.021: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - Drive 'C:' added to VSS Volume Set
15:32:48.037: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:48.037: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "C:\"
15:32:48.053: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:48.053: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - Drive 'c:' added to VSS Volume Set
15:32:48.084: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:48.084: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "c:\"
15:32:48.099: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:56.881: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - VSS Volume Set snapped
15:32:56.881: [5580.5240] <4> VssSnapshot::V_VssVolumeSnapshot(): INF - About to commence post-snap unbounded association checks
15:32:56.896: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:56.896: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "C:\"
15:32:57.006: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:57.021: [5580.5240] <4> SnapshotAssociation::SnapshotAssociation: INF - VSS_MAXIMUM_STORAGE_SIZE is unavailable or the wrong type, staying with the default action of limiting unbounded associations
15:32:57.021: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - checking association status for drive "c:\"
15:32:57.037: [5580.5240] <4> SnapshotAssociation::AdjustUnboundedAssocations: INF - QueryDiffAreasForVolume returned failure status; no Diff areas for this volume
15:32:57.037: [5580.5240] <4> backup_create: INF - NetBackup Temp Directory: 'C:\Program Files\Veritas\\NetBackup\Temp'
15:32:58.224: [5580.5240] <2> ov_log::V_GlobalLogEx: INF - beds_ss_access::V_OpenForRead(): FS_OpenObj() Successful.
15:33:06.381: [5580.5240] <4> dos_backup::tfs_include: INF - folder (System Files) has been created recently (since 29/07/2011 16:15:05). It will be backed up in full.
15:33:06.381: [5580.5240] <2> ov_log::V_GlobalLogEx: INF - beds_ss_access::V_OpenForRead(): FS_OpenObj() Successful.
15:33:14.740: [5580.5240] <2> ov_log::V_GlobalLogEx: INF - beds_ss_access::V_OpenForRead(): FS_OpenObj() Successful.
15:38:15.162: [5580.5240] <32> TransporterRemote::write[2](): FTL - SocketWriteException: send() call failed, could not write data to the socket, possible broken connection.
15:38:15.162: [5580.5240] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::OperationFailedException] was thrown. Details about the exception follow...:
Error code = (-1008).
Src file = (D:\654\src\cl\clientpc\util\tar_tfi.cpp).
Src Line = (275).
Description = (%s getBuffer operation failed).
Operation type=().
)
15:38:15.162: [5580.5240] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::SocketWriteException] was thrown. Details about the exception follow...:
Error code = (-1027).
Src file = (TransporterRemote.cpp).
Src Line = (310).
Description = (send() call failed, could not write data to the socket, possible broken connection).
Local IP=(). Remote IP=(). Remote Port No.=(0).
No. of bytes to write=(32768) while No. of bytes written=(0).
)
15:38:15.162: [5580.5240] <2> tar_base::V_vTarMsgW: FTL - socket write failed
15:38:15.162: [5580.5240] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
15:38:15.162: [5580.5240] <4> tar_backup::backup_done_state: INF - number of file directives found: 4
15:38:15.162: [5580.5668] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
15:38:15.162: [5580.5240] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
15:38:15.162: [5580.5240] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
15:38:15.162: [5580.5240] <4> tar_backup::backup_done_state: INF - Not waiting for server status
15:38:15.334: [5580.5240] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
15:38:15.537: [5580.5240] <4> OVStopCmd: INF - EXIT - status = 0
15:38:15.537: [5580.5240] <2> tar_base::V_Close: closing...
15:38:15.537: [5580.5240] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
15:38:15.943: [5580.5240] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000001
15:38:29.959: [5580.5240] <4> OVShutdown: INF - Finished process
15:38:29.990: [5580.5240] <4> WinMain: INF - Exiting C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
15:38:31.990: [5580.5240] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\Veritas\NetBackup\logs\BPBKAR\080411.LOG
08-04-2011 09:29 AM
What NBU version on Client ?
Do u hav enough disk space for VSS ??
kindly run below command from master server & post o/p.
/usr/openv/netbackup/bin/admincmd$ ./bptestbpcd -client <client> -verbose -debug
08-04-2011 10:12 AM
Hi, versions are as mentioned above.. Client is 6.5.4, master/media are 6.5.6.
C:\ drive is 33.8GB total with 10GB free.
Output from bptestbpcd:
18:05:10.335 [7920] <2> bptestbpcd: VERBOSE = 0
18:05:10.376 [7920] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2048: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
18:05:10.376 [7920] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2062: service: bpcd
18:05:10.517 [7920] <2> logconnections: BPCD CONNECT FROM XXXXXXXX.35001 TO XXXXXXXX.13724
18:05:10.518 [7920] <2> vnet_connect_to_vnetd_extra: vnet_vnetd.c.182: msg: VNETD CONNECT FROM XXXXXXXX.35002 TO XXXXXXXX.13724 fd = 5
18:05:10.571 [7920] <2> vnet_vnetd_connect_forward_socket_begin: vnet_vnetd.c.535: VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
18:05:10.723 [7920] <2> vnet_vnetd_connect_forward_socket_begin: vnet_vnetd.c.552: ipc_string: 4344
1 1 1
XXXXXXXX:35001 -> XXXXXXXX:13724
XXXXXXXX:35002 -> XXXXXXXX:13724
18:05:10.842 [7920] <2> bpcr_get_peername_rqst: Server peername length = 27
18:05:10.902 [7920] <2> bpcr_get_hostname_rqst: Server hostname length = 14
18:05:10.962 [7920] <2> bpcr_get_clientname_rqst: Server client name length = 14
18:05:11.022 [7920] <2> bpcr_get_version_rqst: bpcd version: 06540000
18:05:11.082 [7920] <2> bpcr_get_platform_rqst: Server client platform length = 2
18:05:11.142 [7920] <2> bpcr_get_version_rqst: bpcd version: 06540000
18:05:11.202 [7920] <2> bpcr_patch_version_rqst: theRest == > <
18:05:11.203 [7920] <2> bpcr_get_version_rqst: bpcd version: 06540000
18:05:11.262 [7920] <2> bpcr_patch_version_rqst: theRest == > <
18:05:11.263 [7920] <2> bpcr_get_version_rqst: bpcd version: 06540000
PEER_NAME = {REMOVED}
HOST_NAME = {REMOVED}
CLIENT_NAME = {REMOVED}
VERSION = 0x06540000
PLATFORM = nt
PATCH_VERSION = 6.5.4.0
SERVER_PATCH_VERSION = 6.5.4.0
MASTER_SERVER = {REMOVED}
EMM_SERVER = {REMOVED}
18:05:11.323 [7920] <2> vnet_connect_to_vnetd_extra: vnet_vnetd.c.182: msg: VNETD CONNECT FROM XXXXXXXX.35011 TO XXXXXXXX.13724 fd = 6
18:05:11.358 [7920] <2> vnet_vnetd_connect_forward_socket_begin: vnet_vnetd.c.535: VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
18:05:11.623 [7920] <2> vnet_vnetd_connect_forward_socket_begin: vnet_vnetd.c.552: ipc_string: 4347
XXXXXXXX:35011 -> XXXXXXXX:13724
<2>bptestbpcd: EXIT status = 0
18:05:11.629 [7920] <2> bptestbpcd: EXIT status = 0
08-04-2011 10:36 AM
i hope below will help you .....
https://www-secure.symantec.com/connect/forums/socket-errors
http://www.symantec.com/business/support/index?page=content&id=TECH56950
08-04-2011 10:46 AM
As mentioned, I've tried turning off the TCP chimney, upgrading NIC drivers etc. Just checked the master, media server and the client they're all set to full duplex and the correct port speed.
No idea where to go or what to try next.. Mike.
08-04-2011 11:25 AM
This won't help too much, but comparing log snippets, this TechNote almost exactly describes your issue:
http://www.symantec.com/docs/TECH127930
The conclusion in this doc is that it isn't NetBackup, but something hardware-related in the network between your server and client. Probably not what you wanted to hear but it does seem to make sense, logically...
08-04-2011 01:09 PM
What is the Client Read Timeout on the media server backing up this client?
The default is 300 (5 minutes).
You will notice that there is 5 minutes between last activity on the client and the failure:
15:33:14.740: [5580.5240] <2> ov_log::V_GlobalLogEx: INF - beds_ss_access::V_OpenForRead(): FS_OpenObj() Successful.
15:38:15.162: [5580.5240] <32> TransporterRemote::write[2](): FTL - SocketWriteException: send() call failed, could not write data to the socket, possible broken connection.
Please also check bpbrm on media server for evidence of timeout.
Increase the timeout to 900 and see if that makes a difference.
08-05-2011 01:37 AM
I ran a backup over the prod LAN last night which completed fine, then immediately ran another job over the backup LAN which started however was extremely slow (~159KB/s) and failed with the same error as before. How odd.. Currently attempting another backup over the backup LAN and its running at around 600KB/s compared to it's usual 26,000KB/s+.
08-05-2011 02:13 AM
It just failed too.. Same entry as before in the bpbkar logs.
08-05-2011 02:15 AM
Can you check backup speed by changing STU ??
08-05-2011 03:31 AM
Another storage unit? It used the same storage unit and same tape drive as the prod LAN backup from last night.
I've got networks checking the switch, port speed and duplex settings again. Just for info, the NIC card settings are:
802.11p QOS - Disable
Checksum Offload - None
Ethernet@WireSpeed - Enable
Flow Control - Auto
Jumbo MTU 1500
Large Send Offload - Disable
Speed & Duplex - Auto
Wakeup Capabilities - Both
WOL Speed - Auto
09-16-2011 07:57 AM