10-16-2019 01:54 PM
hello guys,
I have a problem with a backup of a server operating system.
"file read failed (13)"
This backup is configured to copy C, D, and System State.
However the system state backup is configured not to run on the incremental differential schedule. Added to exclud list, but realized that it still backs up the EFI partition.
I checked the network settings, server overload and event viewer but found nothing.
Attached are the logs
bpcd, bpfis, bpbkar and active monitor.
Could someone help?
Solved! Go to Solution.
10-18-2019 07:31 AM
You need to change Client Read Timeout on the Media Server, not the client.
10-16-2019 04:29 PM - edited 10-16-2019 04:30 PM
Apologies, I've not looked at the logs... but when I consistently get status 13/14 for file-system style backups across SDN (software defined networking) (aka virtual networks) (e.g. Hyper-V NVGRE)... the one consistent solution for me (for quite a few (and I do mean like a 100+) guest VMs) has been to disable client-side-dedupe, and only very occaisionally to also disable accelerator.
What I mean is... that.. accelerator doesn't seem to be a problem for me... but client-side dedupe... hmmm... there's something, that I can't quite yet put my finger on, about client-side-dedupe that seems to not sit too happily with SDN, e.g. NAT across NVGRE.
So, without looking at your logs... maybe you could turn off client-side dedupe for the problem clients?
You don't need to make any policy adjustments... because client-side-dedupe is granular at the client-name level, e.g. :
bpclient -update -client blah.com -client_direct 0
10-16-2019 04:54 PM - edited 10-16-2019 04:56 PM
02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Trying to get lock to run BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Got lock - running BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attached [0x1f0d430] At this operation time out occurred on media server but the client doesn't know yet
03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Unattaching [0x1f0d430] 03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attempting to delete VSS snapshot set: {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.136 [18812.15328] <2> ov_log::V_GlobalLog: INF - VSS snapshot set deleted. Snapshot SetID = {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Freeing gathered writer metadata 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Releasing vss interface 0x1f0d430 03:02:41.277 [18812.15328] <8> tar_backup_tfi::cleanupTemp: WRN - will not cleanup 'temp' directory, debug level(s) are too high 03:02:41.277 [18812.15328] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed 03:02:41.277 [18812.15328] <4> dos_backup::V_PostCleanup: INF - ================================================================================
Try to start with increasing CLIENT_READ_TIMEOUT
10-17-2019 04:35 AM
Deduplication is always performed by the Media server. In the case of this policy the backup is directed to a stage area (basic disk).
I did some tests on changing the media server, increasing the client_read_timeout and it doesn't seem to work.
while analyzing the bpbkar logs I found the following messages:
01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2
01:27:24.806 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (740) (TCP 10058: Can't send after socket shutdown)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Shutdown wait finished
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stderr connection
01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (740)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stdout/stdin connection
01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (736)
01:27:25.807 [27532.19004] <4> OVShutdown: INF - Finished process
01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2
Do these messages mean anything?
10-17-2019 04:36 AM
@Mike_Gavrilov I had already made this change and it seems to have no effect. = (
10-17-2019 06:02 AM
I think EFI messages can be ignored for now.
The first error is TCP status 10053, which is listed here:
https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2
As for why the underlying network TCP conversation failed, then I'm not sure that we can tell from the bpbkar log snippet. I think you might have to google for causes of TCP 10053 and posisble solutions.
10-17-2019 06:42 AM
You also need bpbrm and bptm logs on the media server.
According to Activity Monitor, bpbrm on the media server failed at this point:
15/10/2019 02:57:13 - Error bpbrm (pid=544) socket read failed: errno = 62 - Timer expired
bpbkar on the client however was at this point stil 'happily' generating backup stream until 03:02:
03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\logs\HardwareEvents.evtx HardwareEvents
03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\System32\winevt\Logs\Security.evtx Security
03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\Logs\System.evtx System
03:02:41.287 [18812.15328] <4> dos_backup::V_ShadowCopyBackupEnd: INF - foreign snapshot released.
At this point, bpbkar realised that connection with media server is broken:
03:02:41.292 [18812.15328] <2> ov_log::V_GlobalLogEx: KeepAliveThread::isKeepAliveRunning keep alive not running: 2
03:02:41.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:42.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (100) (TCP 10058: Can't send after socket shutdown)
Level 3 logs on media server will give us this info:
bpbrm (pid=544) will tell you up to what time metadata from client was received and information about the timeout.
bptm ((pid=3595) and child process pid 4153 ) will tell you up to what time actual data was received from the client.
10-18-2019 02:55 AM
Hello,
Plus what Marianne and sdo mentionned, there is absolutly a communication problem between the client and the media, looking at timeout or your firewall (not the local one, but the general one) sometimes the session closes after a mount of time)..
Best of luck,
10-18-2019 04:33 AM
I performed another backup with a longer timeout and the backup now fails with status code 41. I increased the verbose of the attached media server.
10-18-2019 04:43 AM
The policy is set to "Allow Multiple data streams" and the odd thing is that only the job pointed to the "System state" that as mentioned is set in the client exclude list fails.
When you open this job, you measure something from the EFI volume. Could it be something related to this volume?
10-18-2019 04:47 AM
The Detailing of this job occurs between 10/17/2019 15:19:36 to 17/10/2019 16:29:15
10-18-2019 05:08 AM
HELLO,
After reviewing the logs, there is a certain communication/timeout between the client and media server
snippet from bpbkar :
16:24:10.803 [17432.41552] <2> ov_log::V_GlobalLog: INF - Attached [0x427a860]
16:44:39.080 [17432.41552] <2> ov_log::V_GlobalLog: INF - Unattaching [0x427a860]
as you can see there is a gap between the two lines (about 20 minutes)
from bpbrm logs :
15:20:08.927 [8328.9244] <2> bpbrm readline: sending keepalive
15:20:13.928 [8328.9244] <2> bpbrm mm_sig: received ready signal from media manager
15:21:03.929 [8328.9244] <2> bpbrm send_ping: PING
15:31:03.934 [8328.9244] <2> bpbrm send_ping: PING
media server is sending a keepalive and ping requests to say "hey i'm still here"
but after a while it gives up.
You said that changed the client_read_timeout, what is the current value? because i'm still seeing 300 seconds which is 5 minutes which is the gap between the two lines here before it returns the error
[8328.9244] <2> bpbrm send_ping: PING
16:01:03.938 [8328.9244] <2> bpbrm send_ping: PING
16:11:03.947 [8328.9244] <2> bpbrm send_ping: PING
16:21:03.948 [8328.9244] <2> bpbrm send_ping: PING
16:24:09.032 [8328.9244] <2> non_mpx_backup_archive_verify_import: ESTIMATE -1 -1 srv11420_1571336378
16:24:10.407 [8328.9244] <2> non_mpx_backup_archive_verify_import: ADDED FILES TO DB FOR srv11420_1571336378 1 /EFI System Partition/EFI/
16:29:10.887 [8328.9244] <2> bpbrm readline: bpbrm timeout after 300 seconds
16:29:10.887 [8328.9244] <2> bpbrm kill_child_process_Ex: start
16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: start
16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: child exit_status = 150
16:29:15.606 [8328.9244] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 0
16:29:15.606 [8328.9244] <2> set_job_details: Tfile (6251779): LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
16:29:15.606 [8328.9244] <2> send_job_file: job ID 6251779, ftype = 3 msg len = 79, msg = LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
16:29:15.606 [8328.9244] <2> bpbrm Exit: client snapshot EXIT STATUS 41: network connection timed out
16:29:15.621 [8328.9244] <2> vnet_close_socket_safely_ex: [vnet.c:1176] safe close 0 0x0
16:29:15.621 [8328.9244] <4> JobdSockList::UnregisterSocket: Unregister socket (624).
16:29:15.637 [8328.9244] <2> job_monitoring_exex: ACK disconnect
16:29:15.637 [8328.9244] <2> job_disconnect: Disconnected
try to increase timeout/ and let us know the results.
Good luck
10-18-2019 05:11 AM
you mentioned " I did some tests on changing the media server, increasing the client_read_timeout and it doesn't seem to work." try changing the value on client and not media "client_read_timeout" from 300 to 3600.
10-18-2019 05:23 AM
@danguss wrote:
The policy is set to "Allow Multiple data streams" and the odd thing is that only the job pointed to the "System state" that as mentioned is set in the client exclude list fails.
What exactly is recorded in this Client's Exclude List?
I see the following in the bpbkar log that you posted the other day :
00:23:04.925 [13752.29060] <2> ov_log::V_GlobalLog: INF - Successfully attached to Device 'Shadow?Copy?Components' BackupReason:0x800
00:23:04.925 [13752.29060] <2> ov_log::V_GlobalLog: INF - ATTACH_BACKUP_EXCLUDE: Shadow excludes will be processed Attach=0x800
...
I have extracted all lines containing 'exclude' in attached text file..... too many lines for me to go through....
I want to suggest that you log a Support call with Veritas - they have engineers with time and knowledge to sift through very large log files.
10-18-2019 07:22 AM
This is strange, I really changed the client_read_timeout on the client to 3600 and stayed that way.
10-18-2019 07:27 AM
Due to a problem with system state incremental differential backups that were intermittent, it was necessary to add to an exclude list to perform the FULL backup only. I performed a new full backup "without the exclude list" and it returns with the same problem ...
10-18-2019 07:31 AM
You need to change Client Read Timeout on the Media Server, not the client.
10-18-2019 10:16 AM
I changed the client read timeout of the media server to 3600 and the backup performed successfully.
Looking at Job Details which represents "Using 262144 data buffer size", "Using 30 data buffers" and "Waited for full buffer 23080 times delayed 444406 times"
Backup Time has increased sharply, before it was up to max 5 minutes and now it takes 2 hours.
What are the main points I need to analyze?
10-18-2019 01:00 PM
"Waited for full buffer 23080 times delayed 444406 times" means that media-server had to wait for the client and either bpbkar on the client can't read faster or network is the bottleneck. But to troubleshoot backup performance we need to know size of the backup and so on.