cancel
Showing results for 
Search instead for 
Did you mean: 

backup fail with status 13

danguss
Level 4
hello guys,

I have a problem with a backup of a server operating system.

"file read failed (13)"

This backup is configured to copy C, D, and System State.

However the system state backup is configured not to run on the incremental differential schedule. Added to exclud list, but realized that it still backs up the EFI partition.

I checked the network settings, server overload and event viewer but found nothing.

Attached are the logs
bpcd, bpfis, bpbkar and active monitor.

Could someone help?
1 ACCEPTED SOLUTION

Accepted Solutions

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

You need to change Client Read Timeout on the Media Server, not the client. 

View solution in original post

18 REPLIES 18

sdo
Moderator
Moderator
Partner    VIP    Certified

Apologies, I've not looked at the logs... but when I consistently get status 13/14 for file-system style backups across SDN (software defined networking) (aka virtual networks) (e.g. Hyper-V NVGRE)... the one consistent solution for me (for quite a few (and I do mean like a 100+) guest VMs) has been to disable client-side-dedupe, and only very occaisionally to also disable accelerator.

What I mean is... that.. accelerator doesn't seem to be a problem for me... but client-side dedupe... hmmm... there's something, that I can't quite yet put my finger on, about client-side-dedupe that seems to not sit too happily with SDN, e.g. NAT across NVGRE.

So, without looking at your logs... maybe you could turn off client-side dedupe for the problem clients?

You don't need to make any policy adjustments... because client-side-dedupe is granular at the client-name level, e.g. :

bpclient -update -client blah.com -client_direct 0

Mike_Gavrilov
Moderator
Moderator
Partner    VIP    Accredited Certified

02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Trying to get lock to run BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Got lock - running BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attached [0x1f0d430] At this operation time out occurred on media server but the client doesn't know yet
03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Unattaching [0x1f0d430] 03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attempting to delete VSS snapshot set: {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.136 [18812.15328] <2> ov_log::V_GlobalLog: INF - VSS snapshot set deleted. Snapshot SetID = {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Freeing gathered writer metadata 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Releasing vss interface 0x1f0d430 03:02:41.277 [18812.15328] <8> tar_backup_tfi::cleanupTemp: WRN - will not cleanup 'temp' directory, debug level(s) are too high 03:02:41.277 [18812.15328] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed 03:02:41.277 [18812.15328] <4> dos_backup::V_PostCleanup: INF - ================================================================================

 

Try to start with  increasing CLIENT_READ_TIMEOUT  

@sdo 

Deduplication is always performed by the Media server. In the case of this policy the backup is directed to a stage area (basic disk).

I did some tests on changing the media server, increasing the client_read_timeout and it doesn't seem to work.

while analyzing the bpbkar logs I found the following messages:

 

01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2

01:27:24.806 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (740) (TCP 10058: Can't send after socket shutdown)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Shutdown wait finished
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stderr connection
01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (740)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stdout/stdin connection
01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (736)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Finished process

01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2

Do these messages mean anything?

 

 


  

@Mike_Gavrilov  I had already made this change and it seems to have no effect. = (

sdo
Moderator
Moderator
Partner    VIP    Certified

I think EFI messages can be ignored for now.

The first error is TCP status 10053, which is listed here:

https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2

As for why the underlying network TCP conversation failed, then I'm not sure that we can tell from the bpbkar log snippet.  I think you might have to google for causes of TCP 10053 and posisble solutions.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

@danguss 

You also need bpbrm and bptm logs on the media server. 

According to Activity Monitor, bpbrm on the media server failed at this point: 

15/10/2019 02:57:13 - Error bpbrm (pid=544) socket read failed: errno = 62 - Timer expired

bpbkar on the client however was at this point stil 'happily' generating backup stream until 03:02:

03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\logs\HardwareEvents.evtx HardwareEvents
03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\System32\winevt\Logs\Security.evtx Security
03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\Logs\System.evtx System
03:02:41.287 [18812.15328] <4> dos_backup::V_ShadowCopyBackupEnd: INF - foreign snapshot released.

At this point, bpbkar realised that connection with media server is broken: 

03:02:41.292 [18812.15328] <2> ov_log::V_GlobalLogEx: KeepAliveThread::isKeepAliveRunning keep alive not running: 2
03:02:41.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:42.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (100) (TCP 10058: Can't send after socket shutdown)

Level 3 logs on media server will give us this info:

bpbrm (pid=544) will tell you up to what time metadata from client was received and information about the timeout.
bptm ((pid=3595) and child process pid 4153 ) will tell you up to what time actual data was received from the client. 

Hamza_H
Moderator
Moderator
   VIP   

Hello,

Plus what Marianne and sdo mentionned, there is absolutly a communication problem between the client and the media, looking at timeout or your firewall (not the local one, but the general one) sometimes the session closes after a mount of time)..

 

Best of luck,

I performed another backup with a longer timeout and the backup now fails with status code 41. I increased the verbose of the attached media server.

The policy is set to "Allow Multiple data streams" and the odd thing is that only the job pointed to the "System state" that as mentioned is set in the client exclude list fails.

When you open this job, you measure something from the EFI volume. Could it be something related to this volume?

The Detailing of this job occurs between 10/17/2019 15:19:36 to 17/10/2019 16:29:15

 

 

Hamza_H
Moderator
Moderator
   VIP   

HELLO,

 

After reviewing the logs, there is a certain communication/timeout between the client and media server 

 

snippet from bpbkar :

16:24:10.803 [17432.41552] <2> ov_log::V_GlobalLog: INF - Attached [0x427a860]
16:44:39.080 [17432.41552] <2> ov_log::V_GlobalLog: INF - Unattaching [0x427a860]

 

as you can see there is a gap between the two lines (about 20 minutes)

from bpbrm logs :

15:20:08.927 [8328.9244] <2> bpbrm readline: sending keepalive
15:20:13.928 [8328.9244] <2> bpbrm mm_sig: received ready signal from media manager
15:21:03.929 [8328.9244] <2> bpbrm send_ping: PING
15:31:03.934 [8328.9244] <2> bpbrm send_ping: PING

media server is sending a keepalive and ping requests to say "hey i'm still here" 

but after a while it gives up.

 

You said that changed the client_read_timeout, what is the current value? because i'm still seeing 300 seconds which is 5 minutes which is the gap between the two lines here before it returns the error 

[8328.9244] <2> bpbrm send_ping: PING
16:01:03.938 [8328.9244] <2> bpbrm send_ping: PING
16:11:03.947 [8328.9244] <2> bpbrm send_ping: PING
16:21:03.948 [8328.9244] <2> bpbrm send_ping: PING
16:24:09.032 [8328.9244] <2> non_mpx_backup_archive_verify_import: ESTIMATE -1 -1 srv11420_1571336378
16:24:10.407 [8328.9244] <2> non_mpx_backup_archive_verify_import: ADDED FILES TO DB FOR srv11420_1571336378 1 /EFI System Partition/EFI/
16:29:10.887 [8328.9244] <2> bpbrm readline: bpbrm timeout after 300 seconds
16:29:10.887 [8328.9244] <2> bpbrm kill_child_process_Ex: start
16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: start
16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: child exit_status = 150
16:29:15.606 [8328.9244] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 0
16:29:15.606 [8328.9244] <2> set_job_details: Tfile (6251779): LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
16:29:15.606 [8328.9244] <2> send_job_file: job ID 6251779, ftype = 3 msg len = 79, msg = LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
16:29:15.606 [8328.9244] <2> bpbrm Exit: client snapshot EXIT STATUS 41: network connection timed out
16:29:15.621 [8328.9244] <2> vnet_close_socket_safely_ex: [vnet.c:1176] safe close 0 0x0
16:29:15.621 [8328.9244] <4> JobdSockList::UnregisterSocket: Unregister socket (624).
16:29:15.637 [8328.9244] <2> job_monitoring_exex: ACK disconnect
16:29:15.637 [8328.9244] <2> job_disconnect: Disconnected

 

try to increase timeout/ and let us know the results.

 

Good luck

Hamza_H
Moderator
Moderator
   VIP   

you mentioned " I did some tests on changing the media server, increasing the client_read_timeout and it doesn't seem to work." try changing the value on client and not media "client_read_timeout" from 300 to 3600.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

 


@danguss wrote:

The policy is set to "Allow Multiple data streams" and the odd thing is that only the job pointed to the "System state" that as mentioned is set in the client exclude list fails.


What exactly is recorded in this Client's Exclude List? 

I see the following in the bpbkar log that you posted the other day : 

00:23:04.925 [13752.29060] <2> ov_log::V_GlobalLog: INF - Successfully attached to Device 'Shadow?Copy?Components' BackupReason:0x800
00:23:04.925 [13752.29060] <2> ov_log::V_GlobalLog: INF - ATTACH_BACKUP_EXCLUDE: Shadow excludes will be processed Attach=0x800
...
I have extracted all lines containing 'exclude' in attached text file..... too many lines for me to go through.... 

I want to suggest that you log a Support call with Veritas - they have engineers with time and knowledge to sift through very large log files. 

This is strange, I really changed the client_read_timeout on the client to 3600 and stayed that way.

Due to a problem with system state incremental differential backups that were intermittent, it was necessary to add to an exclude list to perform the FULL backup only. I performed a new full backup "without the exclude list" and it returns with the same problem ...

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

You need to change Client Read Timeout on the Media Server, not the client. 

I changed the client read timeout of the media server to 3600 and the backup performed successfully.

Looking at Job Details which represents "Using 262144 data buffer size", "Using 30 data buffers" and "Waited for full buffer 23080 times delayed 444406 times"

Backup Time has increased sharply, before it was up to max 5 minutes and now it takes 2 hours.

What are the main points I need to analyze?

 

 

Mike_Gavrilov
Moderator
Moderator
Partner    VIP    Accredited Certified

"Waited for full buffer 23080 times delayed 444406 times" means that media-server had to wait for the client and either bpbkar on the client can't read faster or network is the bottleneck. But to troubleshoot backup performance we need to know size of the backup and so on.