09-15-2014 04:19 AM
Hello all,
For some time now we have been getting intermittent failures in our Exchange 2010 DAG backup.
The failures only occur in the full backup (on the weekend); the incremental backups during the week run without problems.
We see this error in the NetBackup Activity Monitor: socket write failed(24)
Job details:
13-Sep-14 9:13:33 PM - begin writing
13-Sep-14 11:29:18 PM - Critical bpbrm(pid=26883) from client dag.xx.xx: FTL - socket write failed
13-Sep-14 11:29:20 PM - Error bptm(pid=27781) media manager terminated by parent process
13-Sep-14 11:44:58 PM - Info bpbkar(pid=14560) done. status: 24: socket write failed
13-Sep-14 11:44:58 PM - end writing; write time: 2:31:25
socket write failed(24)
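To put the job details on a timeline, the elapsed times can be worked out from the three timestamps above. This is only a sketch: the timestamps are copied from the job details, the variable names are made up for illustration, and the format string assumes an English locale for the month name and AM/PM marker.

```python
from datetime import datetime

# Timestamp layout used in the job details, e.g. "13-Sep-14 9:13:33 PM"
FMT = "%d-%b-%y %I:%M:%S %p"

begin_writing = datetime.strptime("13-Sep-14 9:13:33 PM", FMT)
socket_failure = datetime.strptime("13-Sep-14 11:29:18 PM", FMT)
end_writing = datetime.strptime("13-Sep-14 11:44:58 PM", FMT)

time_to_failure = socket_failure - begin_writing   # how far into the job it failed
total_write_time = end_writing - begin_writing     # should match "write time"
failure_to_exit = end_writing - socket_failure     # delay until bpbkar reported done

print("begin writing -> socket failure:", time_to_failure)
print("begin writing -> end writing:   ", total_write_time)
print("socket failure -> bpbkar done:  ", failure_to_exit)
```

The total agrees with the "write time: 2:31:25" reported by the job, and the socket failure falls a little over two and a quarter hours into the write.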
When looking in the application log on the Exchange client we see two errors at the time of the failure:
Application Log
13-Sep-14 11:29:18 PM
eventid 401
Instance 1: The physical consistency check has completed, but one or more errors were detected. The consistency check has terminated with error code of -106 (0xffffff96).
eventid 403
Instance 1: The physical consistency check successfully validated 4191658 out of 12526160 pages of database '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy15\MB014\MB014.edb'. Because some database pages were either not validated or failed validation, the consistency check has been considered unsuccessful.
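As a quick sanity check on the two events above, the numbers can be worked out in a few lines of Python. This is only a sketch: the page counts and the hexadecimal error code are copied from the event text, and the helper name `to_signed32` is made up for illustration.

```python
# Values copied from event IDs 401 and 403 above.
validated_pages = 4191658
total_pages = 12526160


def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed two's complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


fraction_validated = validated_pages / total_pages
error_code = to_signed32(0xFFFFFF96)

print(f"pages validated before the check stopped: {fraction_validated:.1%}")
print(f"0xffffff96 as a signed 32-bit integer: {error_code}")
```

So roughly a third of the pages had been validated when the check terminated, and 0xffffff96 is just the 32-bit representation of the -106 that also shows up in the bpbkar log.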
Netbackup Logging on the Client
In the bpbkar log on the Exchange client we see:
21:13:39.549 [8160.16316] <4> V_Snapshot::V_Snapshot_ExcludeRemoteFiles: INF - Excluding /\\?/Volume{4390bc2e-a934-11e2-8296-005056ac2864}/pagefile.sys
23:29:18.402 [14560.14712] <16> tar_tfi::processException:
An Exception of type [SocketWriteException] has occured at:
Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.55 $ , Function: TransporterRemote::write[2](), Line: 338
Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.91.94.2 $ , Function: Packer::getBuffer(), Line: 653
Module: tar_tfi::getBuffer, Function: D:\NB\NB_7.6.0.3\src\cl\clientpc\util\tar_tfi.cpp, Line: 311
Local Address: [::]:0
Remote Address: [::]:0
OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
)
Expected bytes: 524288
23:29:18.433 [14560.14712] <2> tar_base::V_vTarMsgW: FTL - socket write failed
23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives found: 5
23:29:18.433 [14560.12468] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
23:29:18.448 [14560.14712] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
23:29:18.464 [14560.14712] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
23:29:18.464 [14560.14712] <4> tar_backup::backup_done_state: INF - Not waiting for server status
23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -106.
23:29:18.464 [14560.14712] <2> exchange_shadowcopy_access::V_CloseForRead(): ERR - consistency check failed for 'Microsoft Information Store:\MB014\'
23:29:18.464 [14560.14712] <2> tar_base::V_vTarMsgW: WRN - Exchange Validation for 'Microsoft Information Store:\MB014\' failed. Please refer to the backup and application event logs for more details.
23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -1029.
23:29:18.480 [14560.14712] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:29:18.480 [14560.14712] <4> ov_log::OVLoop: Timestamp
23:29:18.480 [14560.14712] <4> OVStopCmd: INF - EXIT - status = 0
23:29:18.495 [14560.14712] <2> tar_base::V_Close: closing...
23:29:18.495 [14560.14712] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:29:18.604 [14560.14712] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000001
23:31:18.803 [14560.14712] <4> OVShutdown: INF - Finished process
23:31:18.803 [14560.14712] <4> WinMain: INF - Exiting C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
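When the bpbkar log is long, it can help to filter it down to the lines flagged ERR, FTL or WRN for a single process. The following is a minimal sketch, not a NetBackup tool: the line layout is inferred only from the excerpt above, and the names `LINE_RE` and `problem_lines` are made up for illustration.

```python
import re

# Layout inferred from the excerpt above:
# HH:MM:SS.mmm [pid.tid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<rest>.*)$"
)


def problem_lines(lines, pid=None):
    """Yield (time, message) for lines tagged ERR, FTL or WRN."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines, e.g. the exception details
        if pid is not None and int(m["pid"]) != pid:
            continue
        if re.search(r"\b(ERR|FTL|WRN) - ", m["rest"]):
            yield m["time"], m["rest"]


# A few lines copied from the log above.
sample = [
    "23:29:18.433 [14560.14712] <2> tar_base::V_vTarMsgW: FTL - socket write failed",
    "23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives found: 5",
    "23:29:18.433 [14560.12468] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)",
    "23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -106.",
]

hits = list(problem_lines(sample, pid=14560))
for time, message in hits:
    print(time, message)
```

To run it against a real log, pass an open file object instead of `sample`; the pid to filter on is the one shown in the job details for bpbkar.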
Symantec Tech Note
We found:
http://www.symantec.com/business/support/index?page=content&id=TECH136986
and we set the shadow copy storage limit to "No limit". We set this only on the disks holding the databases (both active and passive).
We did this three weeks ago.
The backups ran fine for two weeks and it looked as if this had solved the problem.
But no... this weekend the problem was back again.
Every time we had the failure, we reran the backup of the failed databases and the rerun always completed successfully.
Our environment:
Any help on where we can look to resolve this problem would be much appreciated.
David
09-15-2014 04:56 AM
Did the physical drive run out of space while the backup was running?
Have you tried diverting the VSS snapshot to a disk drive other than the original one?
See test 1 - step in this tech note:
http://www.symantec.com/docs/TECH47808
09-15-2014 05:11 AM
Hello,
NetBackup will run a consistency check on the database (as you can see in the application logs). What you need to investigate is why it is failing. What is wrong with the databases?
Otherwise you need to disable the check in the properties of the Exchange client. The option is called:
Perform consistency check before backup with Microsoft Volume Shadow Copy Service (VSS)
09-15-2014 05:39 AM
Get your Exchange Admin to investigate this:
23:29:18.464 [14560.14712] <2> exchange_shadowcopy_access::V_CloseForRead(): ERR - consistency check failed for 'Microsoft Information Store:\MB014\'
23:29:18.464 [14560.14712] <2> tar_base::V_vTarMsgW: WRN - Exchange Validation for 'Microsoft Information Store:\MB014\' failed. Please refer to the backup and application event logs for more details.
23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -1029.
NBU is reporting the error.
Not causing it.
09-15-2014 05:47 AM
Hello Nicolai,
We didn't see any evidence of space running out in the Windows event log.
In other cases where we had trouble with VSS snapshots, there was a clear message in the event log about the snapshot running out of space. Here we didn't see anything like that.
The disk holding the database is 2 TB in size with approximately 1.5 TB of free space.
The database (without logs) is about 400 GB in size.
If the VSS snapshot is on the same disk, there should be enough space for a complete copy.
David
09-15-2014 05:53 AM
Hello Marianne,
Our Exchange admin is (at this moment) unable to find anything wrong with the (original) database. The VSS copy has already been deleted, so we cannot investigate that anymore.
The Exchange event logs don't reveal any information other than that the check fails.
The backup runs OK on a rerun. That is the strange part.
I will ask the Exchange admin to see if he can investigate deeper and whether more logging could be turned on.
David
09-15-2014 05:56 AM
Hello Riaan,
It is only the consistency check of the copy that is failing. The original database seems to be OK.
Is disabling the consistency check a good idea?
David
09-15-2014 06:02 AM
That would still cause issues. If the copy is not consistent, that could also cause your logs not to be truncated. I would try to figure out why it's not consistent; after all, you'd want to use it if you have a failure in the active copy.
09-18-2014 05:57 AM
Hello all,
It seems that our Exchange environment was, and still is, experiencing performance and stability issues.
We are looking into that; it may have caused the backup failures.
David