cancel
Showing results forΒ 
Search instead forΒ 
Did you mean:Β 

majority of backups now failing

MarkLamont
Level 2
Hello,
This is my first post to a NBU forum so please be kind.
I have a Win 2K3 R2  64 bit master server running 6.5.5 and Win2K3 32 bit clients running a mix of 6.5.4 and 6.5.5.
This was all installed in early December and was working well backing up servers from 15Gb tp 20TB without fail.
Sometime over the xmas /new year period the majority stopped working. Manually running them produces various errors stating with 24 followed by 71 then 150.
When I run the job manually it usually follows this pattern;
starts the job
get files list
load tape(s)
start writing
write some data , usually between 8 and 15 Gb
error 24
error 71
error 150

I have tried switching of the tcp offloading and the tcp chimney as suggested in most of the documents i could find.


extract from the bpbkar log on one client;

15:27:17.183: [4180.6368] <2> WinMain: DAT - _pgmptr = 'C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe'
15:27:17.183: [4180.6368] <2> WinMain: DAT - lpCmdLine = '-r 1209600 -ru root -dt 0 -to 0 -clnt bhmss04 -class BHMSS04-data -sched full -st FULL -bpstart_to 300 -bpend_to 300 -read_to 300 -ckpt_time 900 -blks_per_buffer 1023 -stream_count 1 -stream_number 1 -jobgrpid 1427 -use_otm -use_ofb -b bhmss04_1262791620 -kl 28 -fso -WOFB_enabled -WOFB_fim 0 -WOFB_usage 0 -WOFB_error 0 -ct 13 '
15:27:17.183: [4180.6368] <2> date_debug: DAT - timezone: GMT Standard Time, offset=0, dst: GMT Daylight Time
15:27:17.183: [4180.6368] <2> date_debug: DAT - current time: 1262791637, 06/01/2010 15:27:17
15:27:17.183: [4180.6368] <2> date_debug: DAT - 01/01/94 UCT:  757382400, 01/01/1994 00:00:00
15:27:17.183: [4180.6368] <2> date_debug: DAT - 07/01/94 UCT:  773020800, 01/07/1994 00:00:00
15:27:17.183: [4180.6368] <2> WinMain: DAT - standard input handle = 508
15:27:17.183: [4180.6368] <2> WinMain: DAT - standard output handle = 1848
15:27:17.183: [4180.6368] <2> WinMain: DAT - standard error handle = 1856
15:27:17.261: [4180.6368] <4> tar_backup::V_SetupJobData: INF - dwJobData: ffffffff
15:27:17.261: [4180.6368] <4> tar_backup::V_SetupJobData: INF -     dwJob: ffffffff
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - backup privileges enabled, previous = 0
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - restore privileges enabled, previous = 0
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - security privileges enabled, previous = 0
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - tcb privileges enabled, previous = 0
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - create token privileges enabled, previous = 0
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - user name: root
15:27:17.261: [4180.6368] <4> dos_backup::V_PreProcessing: INF - no access token: scheduled backup
15:27:17.276: [4180.6368] <2> ov_log::V_GlobalLog: INF - BEDS_Init() Enter InitFlags:0x0
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: C:
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: D:
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: S:
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: T:
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Terminal Services
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Windows Network
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Web Client Network
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: bhmss04
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Shadow?Copy?Components
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: System?State
15:27:17.730: [4180.6368] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Active Directory Application Mode
15:27:17.870: [4180.6368] <2> tar_backup::V_SetupProcessContinue: TAR - CONTINUE BACKUP received
15:27:17.870: [4180.6368] <4> tar_backup_cpr::start: INF - checkpoint thread started
15:27:18.120: [4180.6368] <2> tar_backup_tfi::setupFileDirectives: TAR - backup filename = SET SNAP_ID=bhmss04_1262791531
15:27:18.120: [4180.6368] <2> tar_backup_tfi::setupFileDirectives: TAR - backup filename = BACKUP D:\ USING \\?\Volume{06399d45-5a1f-4696-a9e0-87bbfa29a2e7}\ OPTIONS:ALT_PATH_PREFIX=C:\Program Files\Veritas\NetBackup\temp\_vrts_frzn_img_612,FITYPE=MIRROR,MNTPOINT=D:\,FSTYPE=NTFS
15:27:18.120: [4180.7032] <4> tar_base::keepaliveThread: INF - keepalive thread is active with an interval of 60 seconds
15:27:19.198: [4180.6368] <2> tar_backup_vxbsa::add: INF - called with 'SET SNAP_ID=bhmss04_1262791531'
15:27:19.198: [4180.6368] <2> tar_backup_vxbsa::add: INF - called with 'BACKUP D:\ USING \\?\Volume{06399d45-5a1f-4696-a9e0-87bbfa29a2e7}\ OPTIONS:ALT_PATH_PREFIX=C:\Program Files\Veritas\NetBackup\temp\_vrts_frzn_img_612,FITYPE=MIRROR,MNTPOINT=D:\,FSTYPE=NTFS'
15:27:19.198: [4180.6368] <4> V_Snapshot::V_Snapshot_AddMappings: INF - Volume Snapshot Mapping: \\?\Volume{131aaa82-c60b-11dd-a97e-001321f2ba77}\ --> Volume{06399d45-5a1f-4696-a9e0-87bbfa29a2e7}
15:27:19.198: [4180.6368] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'D:\' --> 10020002
15:27:20.261: [4180.6368] <4> backup_create: INF - NetBackup Temp Directory: 'C:\Program Files\Veritas\\NetBackup\Temp'
15:27:21.698: [4180.6368] <2> ov_log::V_GlobalLogEx: INF - file_access (constructor): 1 non-NTFS volumes
15:33:22.725: [4180.6368] <32> TransporterRemote::write[2](): FTL - SocketWriteException: send() call failed, could not write data to the socket, possible broken connection.
15:33:22.740: [4180.6368] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::OperationFailedException] was thrown. Details about the exception follow...:
Error code  = (-1008).
Src file    = (D:\655\src\cl\clientpc\util\tar_tfi.cpp).
Src Line    = (275).
Description = (%s getBuffer operation failed).
Operation type=().
)
15:33:22.740: [4180.6368] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::SocketWriteException] was thrown. Details about the exception follow...:
Error code  = (-1027).
Src file    = (TransporterRemote.cpp).
Src Line    = (310).
Description = (send() call failed, could not write data to the socket, possible broken connection).
Local IP=(). Remote IP=(). Remote Port No.=(0).
No. of bytes to write=(32768) while No. of bytes written=(0).
)
15:33:22.740: [4180.6368] <2> tar_base::V_vTarMsgW: FTL - socket write failed
15:33:22.740: [4180.6368] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
15:33:22.740: [4180.6368] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
15:33:22.740: [4180.7032] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
15:33:22.740: [4180.6368] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
15:33:22.740: [4180.6368] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
15:33:22.740: [4180.6368] <4> tar_backup::backup_done_state: INF - Not waiting for server status
15:33:23.740: [4180.6368] <4> OVStopCmd: INF - EXIT - status = 0
15:33:23.740: [4180.6368] <2> tar_base::V_Close: closing...
15:33:23.740: [4180.6368] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
15:33:23.756: [4180.6368] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000101
15:33:23.756: [4180.6368] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): ubs specifics: 0x001d0000
15:33:23.787: [4180.6368] <16> dtcp_read: TCP - failure: recv socket (508) (TCP 10053: Software caused connection abort)
15:33:23.787: [4180.6368] <4> OVShutdown: INF - Finished process
15:33:23.787: [4180.6368] <4> WinMain: INF - Exiting C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
15:33:25.787: [4180.6368] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\Veritas\NetBackup\logs\BPBKAR\010610.LOG
5 REPLIES 5

lu
Level 6
Maybe not related to your problem, but why do you use VSP instead of VSS ?

MarkLamont
Level 2
I'm fairly new to NBU, what are vsp and vss?

SteveYu
Level 4
Employee Accredited Certified
If this were my case. I would ask you to do the following.

- What is the backup selection for one of these problem clients? is it all local drives?
- Do your failures now happen 100% of the time? If I were talking/webex'd with you could I ask you to run a manual backup and it fail 100% of the time?

My strategy would be to determine if the problem is on the client or the master/media side. You say that the majority stop working. This tends to say that either something on the master/media has changed or something for all the clients did. To test to see if the there is a problem on the client side with the file selectoin do the below

1. on the client go to <install path>\netbackup\logs\bpbkar
2. If there is a log file here for todays date and delete/move it
3. Execute technote http://support.veritas.com/docs/242918 for the exact same backup selection as you have in the policy for this client.
What the technote does is execute the backup read operation with no knowledge of the master. It simply asks the client to perform a backup of a selection and output it to /dev/null. The trick here is that it will recreate the bpbkar log file.
4. Post the new bpbkar to null log file here and also compare it with what you posted. Do you still have the same errors? If no then the problem is not with your clients If yes then we'll take it from there

MarkLamont
Level 2

Steve,
thanks for that.
backup to null works fine, log file below;
I'm convinced it's a network issue because we always get socket errors then the backup stops.



11:01:34.611 AM: [7672.7716] <4> ov_log::OVInit: INF - Starting log file: C:\Program Files\Veritas\NetBackup\logs\BPBKAR\010710.LOG

11:01:34.611 AM: [7672.7716] <4> ov_log::OVInit: GENERAL Log Level: 0
11:01:34.611 AM: [7672.7716] <4> ov_log::OVInit: TCP Log Level: 0
11:01:34.611 AM: [7672.7716] <4> ov_log::OVInit: INF - the log mutex: 1788
BPBKAR  NetBackup Backup/Archive  6.5GA  [Nov  5 2009]
Copyright 1993 - 2007 VERITAS Software Corporation
All Rights Reserved.

11:01:34.611 AM: [7672.7716] <2> WinMain: DAT - _pgmptr = 'C:\Program Files\VERITAS\NetBackup\bin\bpbkar32.exe'
11:01:34.611 AM: [7672.7716] <2> WinMain: DAT - lpCmdLine = '-nocont d:\ '
11:01:34.611 AM: [7672.7716] <2> date_debug: DAT - timezone: GMT Standard Time, offset=0, dst: GMT Daylight Time
11:01:34.611 AM: [7672.7716] <2> date_debug: DAT - current time: 1262862094, 1/7/2010 11:01:34 AM
11:01:34.611 AM: [7672.7716] <2> date_debug: DAT - 01/01/94 UCT:  757382400, 1/1/1994 12:00:00 AM
11:01:34.611 AM: [7672.7716] <2> date_debug: DAT - 07/01/94 UCT:  773020800, 7/1/1994 12:00:00 AM
11:01:34.611 AM: [7672.7716] <2> WinMain: DAT - standard input handle = -1
11:01:34.611 AM: [7672.7716] <4> WinMain: INF - not using socket for standard input handle
11:01:34.611 AM: [7672.7716] <2> WinMain: DAT - standard output handle = 136
11:01:34.611 AM: [7672.7716] <4> WinMain: INF - not using socket for standard output handle
11:01:34.611 AM: [7672.7716] <2> WinMain: DAT - standard error handle = 140
11:01:34.611 AM: [7672.7716] <4> WinMain: INF - not using socket for standard error handle
11:01:34.847 AM: [7672.7716] <4> tar_backup::V_SetupJobData: INF - dwJobData: ffffffff
11:01:34.847 AM: [7672.7716] <4> tar_backup::V_SetupJobData: INF -     dwJob: ffffffff
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - backup privileges enabled, previous = 0
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - restore privileges enabled, previous = 0
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - security privileges enabled, previous = 0
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - tcb privileges enabled, previous = 0
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - create token privileges enabled, previous = 0
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - user name:
11:01:34.847 AM: [7672.7716] <4> dos_backup::V_PreProcessing: INF - no access token: scheduled backup
11:01:34.860 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - BEDS_Init() Enter InitFlags:0x0
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: C:
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: D:
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: S:
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: T:
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Terminal Services
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Microsoft Windows Network
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Web Client Network
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: BHMSS04
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Shadow?Copy?Components
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: System?State
11:01:35.700 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - DumpDleInfo() DLE Device Name: Active Directory Application Mode
11:01:35.726 AM: [7672.7716] <2> tar_backup_tfi::setupFileDirectives: TAR - backup filename = d:\
11:01:35.726 AM: [7672.7748] <4> tar_base::keepaliveThread: INF - keepalive thread is active with an interval of 30 seconds
11:01:36.632 AM: [7672.7716] <2> tar_base::V_vTarMsgW: INF - Open File Backup license is not installed on the server.  Open File Backup usage will be disabled.
11:01:36.632 AM: [7672.7716] <2> tar_backup_vxbsa::add: INF - called with 'd:\'
11:01:36.632 AM: [7672.7716] <4> dos_backup::V_VerifyFileList: INF - UBS Local Type for 'd:\' --> 10020002
11:01:39.388 AM: [7672.7716] <4> backup_create: INF - NetBackup Temp Directory: 'C:\Program Files\Veritas\\NetBackup\Temp'
11:01:40.306 AM: [7672.7716] <4> folder_access::V_OpenForRead: INF - obtained disk quota data: D:
11:01:40.346 AM: [7672.7716] <2> ov_log::V_GlobalLogEx: INF - file_access (constructor): 1 non-NTFS volumes
11:06:28.651 AM: [7672.7716] <16> file_access::V_OpenForRead: ERR - CreateFile() failed: \\?\D:\Databases\PLAMS_Data.MDF (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.651 AM: [7672.7716] <2> tar_base::V_vTarMsgW: WRN - can't open file: D:\Databases\PLAMS_Data.MDF (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.680 AM: [7672.7716] <16> file_access::V_OpenForRead: ERR - CreateFile() failed: \\?\D:\Databases\PLUS.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.680 AM: [7672.7716] <2> tar_base::V_vTarMsgW: WRN - can't open file: D:\Databases\PLUS.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.680 AM: [7672.7716] <16> file_access::V_OpenForRead: ERR - CreateFile() failed: \\?\D:\Databases\PLUS_Staging.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.680 AM: [7672.7716] <2> tar_base::V_vTarMsgW: WRN - can't open file: D:\Databases\PLUS_Staging.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.886 AM: [7672.7716] <16> file_access::V_OpenForRead: ERR - CreateFile() failed: \\?\D:\Databases\tempdb.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:06:28.886 AM: [7672.7716] <2> tar_base::V_vTarMsgW: WRN - can't open file: D:\Databases\tempdb.mdf (WIN32 32: The process cannot access the file because it is being used by another process. )
11:21:51.248 AM: [7672.7716] <2> tar_base::backup_finish: TAR - backup:                     13872 files
11:21:51.248 AM: [7672.7716] <2> tar_base::backup_finish: TAR - backup:          file data:  129237395 bytes  51 gigabytes
11:21:51.248 AM: [7672.7716] <2> tar_base::backup_finish: TAR - backup:         image data:  174458880 bytes  51 gigabytes
11:21:51.248 AM: [7672.7716] <2> tar_base::backup_finish: TAR - backup:       elapsed time:       1212 secs     45326148 bps
11:21:51.248 AM: [7672.7716] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
11:21:51.248 AM: [7672.7716] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
11:21:51.248 AM: [7672.7748] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
11:21:51.248 AM: [7672.7716] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
11:21:51.248 AM: [7672.7716] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 1: the requested operation was partially successful
11:21:51.248 AM: [7672.7716] <8> tar_backup_tfi::cleanupTemp: WRN - will not cleanup 'temp' directory, backup id or 'temp' directory is not set
11:21:51.248 AM: [7672.7716] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
11:21:51.248 AM: [7672.7716] <4> OVStopCmd: INF - EXIT - status = 0
11:21:51.248 AM: [7672.7716] <2> tar_base::V_Close: closing...
11:21:51.248 AM: [7672.7716] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
11:21:51.279 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000101
11:21:51.279 AM: [7672.7716] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): ubs specifics: 0x001d0000
11:21:51.310 AM: [7672.7716] <4> OVShutdown: INF - Finished process
11:21:51.310 AM: [7672.7716] <4> WinMain: INF - Exiting C:\Program Files\VERITAS\NetBackup\bin\bpbkar32.exe
11:21:53.302 AM: [7672.7716] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\Veritas\NetBackup\logs\BPBKAR\010710.LOG
 

SteveYu
Level 4
Employee Accredited Certified
Mark,

You are correct in understanding the strategy. You now know that if you installed netbackup on that client as a master server and did a local disk backup to itself it would complete the operation.  Thanks for posting that log so now I know the backup would have completed with status code 1.

From this point I would ask you the customer the following questions.

- I want to double check something with you. I see you used "D:\" as the target for the backup to null. Is that also what your policy has as the backup selection for this client? If not then I would like for you to create a test policy for this client with d: as the backup selection and then clear the bpbkar log on the client and then execute the backup and note if the backup works or not

- How many media servers do you have?
- If you have more than one have you tried asking this client to backup to another media server?
- Do you have restrictive firewalls/blackboxes/etc/etc in your environment? If so where are they? between the master/media and clients ?