Forum Discussion

danguss's avatar
danguss
Level 4
5 years ago

backup fail with status 13

hello guys,

I have a problem with a backup of a server operating system.

"file read failed (13)"

This backup is configured to copy C, D, and System State.

However the system state backup is configured not to run on the incremental differential schedule. Added to exclud list, but realized that it still backs up the EFI partition.

I checked the network settings, server overload and event viewer but found nothing.

Attached are the logs
bpcd, bpfis, bpbkar and active monitor.

Could someone help?
  • You need to change Client Read Timeout on the Media Server, not the client. 

18 Replies

Replies have been turned off for this discussion
  • danguss 

    You also need bpbrm and bptm logs on the media server. 

    According to Activity Monitor, bpbrm on the media server failed at this point: 

    15/10/2019 02:57:13 - Error bpbrm (pid=544) socket read failed: errno = 62 - Timer expired

    bpbkar on the client however was at this point stil 'happily' generating backup stream until 03:02:

    03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\logs\HardwareEvents.evtx HardwareEvents
    03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\System32\winevt\Logs\Security.evtx Security
    03:02:41.287 [18812.15328] <4> dos_backup::OVDeInitSpecialInfo: INF - Special File: 1 C:\Windows\system32\winevt\Logs\System.evtx System
    03:02:41.287 [18812.15328] <4> dos_backup::V_ShadowCopyBackupEnd: INF - foreign snapshot released.

    At this point, bpbkar realised that connection with media server is broken: 

    03:02:41.292 [18812.15328] <2> ov_log::V_GlobalLogEx: KeepAliveThread::isKeepAliveRunning keep alive not running: 2
    03:02:41.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
    03:02:42.292 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
    03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (124) (TCP 10053: Software caused connection abort)
    03:02:43.293 [18812.15328] <16> dtcp_read: TCP - failure: recv socket (100) (TCP 10058: Can't send after socket shutdown)

    Level 3 logs on media server will give us this info:

    bpbrm (pid=544) will tell you up to what time metadata from client was received and information about the timeout.
    bptm ((pid=3595) and child process pid 4153 ) will tell you up to what time actual data was received from the client. 

    • Hamza_H's avatar
      Hamza_H
      Moderator

      Hello,

      Plus what Marianne and sdo mentionned, there is absolutly a communication problem between the client and the media, looking at timeout or your firewall (not the local one, but the general one) sometimes the session closes after a mount of time)..

       

      Best of luck,

      • danguss's avatar
        danguss
        Level 4

        The policy is set to "Allow Multiple data streams" and the odd thing is that only the job pointed to the "System state" that as mentioned is set in the client exclude list fails.

        When you open this job, you measure something from the EFI volume. Could it be something related to this volume?

    • danguss's avatar
      danguss
      Level 4

      I performed another backup with a longer timeout and the backup now fails with status code 41. I increased the verbose of the attached media server.

      • Hamza_H's avatar
        Hamza_H
        Moderator

        HELLO,

         

        After reviewing the logs, there is a certain communication/timeout between the client and media server 

         

        snippet from bpbkar :

        16:24:10.803 [17432.41552] <2> ov_log::V_GlobalLog: INF - Attached [0x427a860]
        16:44:39.080 [17432.41552] <2> ov_log::V_GlobalLog: INF - Unattaching [0x427a860]

         

        as you can see there is a gap between the two lines (about 20 minutes)

        from bpbrm logs :

        15:20:08.927 [8328.9244] <2> bpbrm readline: sending keepalive
        15:20:13.928 [8328.9244] <2> bpbrm mm_sig: received ready signal from media manager
        15:21:03.929 [8328.9244] <2> bpbrm send_ping: PING
        15:31:03.934 [8328.9244] <2> bpbrm send_ping: PING

        media server is sending a keepalive and ping requests to say "hey i'm still here" 

        but after a while it gives up.

         

        You said that changed the client_read_timeout, what is the current value? because i'm still seeing 300 seconds which is 5 minutes which is the gap between the two lines here before it returns the error 

        [8328.9244] <2> bpbrm send_ping: PING
        16:01:03.938 [8328.9244] <2> bpbrm send_ping: PING
        16:11:03.947 [8328.9244] <2> bpbrm send_ping: PING
        16:21:03.948 [8328.9244] <2> bpbrm send_ping: PING
        16:24:09.032 [8328.9244] <2> non_mpx_backup_archive_verify_import: ESTIMATE -1 -1 srv11420_1571336378
        16:24:10.407 [8328.9244] <2> non_mpx_backup_archive_verify_import: ADDED FILES TO DB FOR srv11420_1571336378 1 /EFI System Partition/EFI/
        16:29:10.887 [8328.9244] <2> bpbrm readline: bpbrm timeout after 300 seconds
        16:29:10.887 [8328.9244] <2> bpbrm kill_child_process_Ex: start
        16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: start
        16:29:15.606 [8328.9244] <2> bpbrm wait_for_child: child exit_status = 150
        16:29:15.606 [8328.9244] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 0
        16:29:15.606 [8328.9244] <2> set_job_details: Tfile (6251779): LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
        16:29:15.606 [8328.9244] <2> send_job_file: job ID 6251779, ftype = 3 msg len = 79, msg = LOG 1571340555 4 bpbkar32 17432 done. status: 41: network connection timed out
        16:29:15.606 [8328.9244] <2> bpbrm Exit: client snapshot EXIT STATUS 41: network connection timed out
        16:29:15.621 [8328.9244] <2> vnet_close_socket_safely_ex: [vnet.c:1176] safe close 0 0x0
        16:29:15.621 [8328.9244] <4> JobdSockList::UnregisterSocket: Unregister socket (624).
        16:29:15.637 [8328.9244] <2> job_monitoring_exex: ACK disconnect
        16:29:15.637 [8328.9244] <2> job_disconnect: Disconnected

         

        try to increase timeout/ and let us know the results.

         

        Good luck

  • Apologies, I've not looked at the logs... but when I consistently get status 13/14 for file-system style backups across SDN (software defined networking) (aka virtual networks) (e.g. Hyper-V NVGRE)... the one consistent solution for me (for quite a few (and I do mean like a 100+) guest VMs) has been to disable client-side-dedupe, and only very occaisionally to also disable accelerator.

    What I mean is... that.. accelerator doesn't seem to be a problem for me... but client-side dedupe... hmmm... there's something, that I can't quite yet put my finger on, about client-side-dedupe that seems to not sit too happily with SDN, e.g. NAT across NVGRE.

    So, without looking at your logs... maybe you could turn off client-side dedupe for the problem clients?

    You don't need to make any policy adjustments... because client-side-dedupe is granular at the client-name level, e.g. :

    bpclient -update -client blah.com -client_direct 0
    • danguss's avatar
      danguss
      Level 4

      sdo 

      Deduplication is always performed by the Media server. In the case of this policy the backup is directed to a stage area (basic disk).

      I did some tests on changing the media server, increasing the client_read_timeout and it doesn't seem to work.

      while analyzing the bpbkar logs I found the following messages:

       

      01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
      01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2

      01:27:24.806 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
      01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (736) (TCP 10053: Software caused connection abort)
      01:27:25.807 [27532.19004] <16> dtcp_read: TCP - failure: recv socket (740) (TCP 10058: Can't send after socket shutdown)
      01:27:25.807 [27532.19004] <4> OVShutdown: INF - Shutdown wait finished
      01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stderr connection
      01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (740)
      01:27:25.807 [27532.19004] <4> OVShutdown: INF - Closing stdout/stdin connection
      01:27:25.807 [27532.19004] <2> dtcp_close: TCP - success: close socket (736)
      01:27:25.807 [27532.19004] <4> OVShutdown: INF - Finished process

      01:38:01.856 [18812.15328] <2> ov_log::V_GlobalLog: INF - EFI Partition mount access point is GLOBALROOT\Device\HarddiskVolume2!
      01:38:01.868 [18812.15328] <2> ov_log::V_GlobalLog: INF - CreateFile(\\.\EFI>System>Partition) failed. return code: 2

      Do these messages mean anything?

       

       


        


  • 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Trying to get lock to run BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - VssSnapshotVolume::BackupComplete(): Got lock - running BackupComplete 02:42:13.321 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attached [0x1f0d430] At this operation time out occurred on media server but the client doesn't know yet
    03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Unattaching [0x1f0d430] 03:02:40.443 [18812.15328] <2> ov_log::V_GlobalLog: INF - Attempting to delete VSS snapshot set: {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.136 [18812.15328] <2> ov_log::V_GlobalLog: INF - VSS snapshot set deleted. Snapshot SetID = {81b8f0aa-336e-44a9-a84d-2686026a0f91} 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Freeing gathered writer metadata 03:02:41.137 [18812.15328] <2> ov_log::V_GlobalLog: INF - Releasing vss interface 0x1f0d430 03:02:41.277 [18812.15328] <8> tar_backup_tfi::cleanupTemp: WRN - will not cleanup 'temp' directory, debug level(s) are too high 03:02:41.277 [18812.15328] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed 03:02:41.277 [18812.15328] <4> dos_backup::V_PostCleanup: INF - ================================================================================

     

    Try to start with  increasing CLIENT_READ_TIMEOUT