Forum Discussion

MatBams
Level 4
4 years ago

NDMP NetApp backup failing with error 40

Hi everyone,

I have an issue with my NDMP backup. The same job often fails with status 40 (network connection broken). The volume being backed up is large and contains a lot of small files.

My NetApp filer is connected to a Brocade switch over Fibre Channel, as is my media server.

The attachment shows the error in my NBU environment.

Here is the bpbrm log:

14:51:57.685 [9504.11716] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:51:57.685 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host X.X.X.X, query type 78
14:51:57.701 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{F0DEB798-DF7A-4CDF-9023-3C0F2E6D9DD3}:OUTBOUND
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.763 [9504.11716] <2> logconnections: BPDBM CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.763 [9504.11716] <2> db_begin: auth only query 78, socket 628 is not proxied
14:51:57.763 [9504.11716] <2> put_n_bytes_abs: write to socket failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> ts_put_length_string_optimized: put n bytes failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> db_senddata: ts_put_string_handle(): connection dropped or not connected, An established connection was aborted by the software in your host machine. , (10053)
14:51:57.763 [9504.11716] <2> db_startrequest: db_sendrequest() failed: network connection broken
14:51:57.763 [9504.11716] <16> db_begin: db_startrequest() failed: network connection broken
14:51:57.763 [9504.11716] <2> db_FLISTsend: db_begin() failed: network connection broken
14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)
14:51:57.779 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host X.X.X.X, query type 1
14:51:57.795 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.795 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{5AB060C3-509E-4F2C-A804-B33A74081867}:OUTBOUND
14:51:57.795 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.857 [9504.11716] <2> logconnections: PROXY CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.857 [9504.11716] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.57774 TO 127.0.0.1.57775 fd = 628
14:51:57.857 [9504.11716] <2> db_end: Need to collect reply
14:51:57.873 [9504.11716] <2> signal_ndmpagent: sending signal=1,status=40, to ndmpagent on media.server, client_pid=12232
14:51:57.873 [9504.11716] <2> bpcr_send_signal: Ignoring connect_opts = 0x01030202
14:51:57.873 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master.server, query type 223
14:51:57.888 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.888 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{66C6FBB5-61BB-4AA6-A182-82CDFE9ED60F}:OUTBOUND
14:51:57.888 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.951 [9504.11716] <2> logconnections: PROXY CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.951 [9504.11716] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.57778 TO 127.0.0.1.57779 fd = 628
14:51:57.966 [9504.11716] <2> db_CLIENTsend: reset client protocol version from 0 to 9
14:51:57.966 [9504.11716] <2> db_end: Need to collect reply
14:51:57.998 [9504.11716] <2> logconnections: BPCD CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:58.029 [9504.11716] <2> bpcr_get_version_rqst: bpcd version: 08100000
14:51:58.029 [9504.11716] <2> bpcr_get_version_rqst: bpcd version: 08100000
14:51:58.029 [9504.11716] <2> bpcr_send_signal: CLIENT_CMD_SOCK from bpcr = 628
14:51:58.029 [9504.11716] <2> bpcr_send_signal: CLIENT_STAT_SOCK from bpcr = 864
14:52:01.234 [9504.11716] <2> signal_ndmpagent: from client Harmonie: INF - EXIT STATUS 150: termination requested by administrator
14:52:01.234 [9504.11716] <2> bpbrm kill_child_process_Ex: start
14:53:07.624 [9504.11716] <2> bpbrm wait_for_child: start
14:53:07.624 [9504.11716] <2> bpbrm wait_for_child: child exit_status = 150
14:53:07.624 [9504.11716] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 150
14:53:07.624 [9504.11716] <2> bpbrm Exit: client backup EXIT STATUS 40: network connection broken
14:53:07.624 [9504.11716] <4> JobdSockList::UnregisterSocket: Unregister socket (764).
14:53:07.640 [9504.11716] <2> job_monitoring_exex: ACK disconnect
14:53:07.640 [9504.11716] <2> job_disconnect: Disconnected
14:57:56.310 [10208.7500] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:57:56.310 [10208.7500] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master.server, query type 78
14:57:56.326 [10208.7500] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:57:56.326 [10208.7500] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{0973C69E-6CB1-4C07-B075-FDA7E3B14020}:OUTBOUND
14:57:56.326 [10208.7500] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:57:56.388 [10208.7500] <2> logconnections: BPDBM CONNECT FROM X.X.X.X TO X.X.X.X fd = 796
14:57:56.388 [10208.7500] <2> db_begin: auth only query 78, socket 796 is not proxied
14:57:56.420 [10208.7500] <2> db_end: Need to collect reply
14:57:57.123 [10208.7500] <2> non_mpx_backup_archive_verify_import: end write catalog
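
To pick the failing calls out of a long bpbrm log like the one above, a small script can help. This is only a minimal Python sketch, not a NetBackup tool. It assumes every line follows the layout visible in the excerpt (time, [pid.tid], a number in angle brackets, function name, message), and it treats that number as a severity where 16 or higher is taken to mean an error, which matches the two <16> lines above. The names LINE_RE, parse_line and errors are made up for this example.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt:
#   HH:MM:SS.mmm [pid.tid] <severity> function: message
LINE_RE = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<func>.+?): "      # shortest text up to the first ": "
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["sev"] = int(entry["sev"])
    entry["time"] = datetime.strptime(entry["time"], "%H:%M:%S.%f").time()
    return entry


def errors(lines, min_severity=16):
    """Return the parsed entries whose severity is at least min_severity.

    The threshold of 16 is an assumption based on the excerpt, where the
    lines that report the failure are tagged <16>.
    """
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["sev"] >= min_severity:
            found.append(entry)
    return found
```

Feeding the log lines to errors() returns one dictionary per matching line, so the process id, function name and message of each failure can be listed together and compared across jobs.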

 

If someone can help me, I would be grateful.

  • MatBams 

    How much time has elapsed since the start of the backup?

    Can you please post all text in Job Details of a failed job?

    If you really need to replace hostnames or IP addresses, please replace with some generic hostnames/IPs,
    e.g. master 10.10.10.1
    media1 (if different from master) 10.10.10.2
    filer1 10.10.10.3

    With X.X.X.X in both the 'from' and the 'to' (BPDBM CONNECT FROM X.X.X.X TO X.X.X.X), we cannot tell whether the master and the media server are the same host.

    Depending on what we see in Job Details, it might be the 8-hour timeout issue on NDMP backups:
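
One way to mask host names consistently before posting a log is to substitute each real name or address with a fixed alias instead of X.X.X.X. The following is a minimal Python sketch of that idea. The function name anonymize and the addresses in the usage note are hypothetical; the aliases follow the master / media1 / filer1 suggestion above.

```python
import re


def anonymize(text, aliases):
    """Replace every real host name or IP in `text` with its generic alias.

    `aliases` maps a real name or address to the alias to show instead.
    All keys are matched in a single pass, longest first, so a short name
    that is a prefix of a longer one does not clobber it. The lookarounds
    keep a key from matching inside a longer name or address.
    """
    if not aliases:
        return text
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.-])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: aliases[m.group(1)], text)
```

For example, with aliases = {"192.0.2.10": "master", "192.0.2.20": "media1"}, a line such as "BPDBM CONNECT FROM 192.0.2.20.57769 TO 192.0.2.10.1556" keeps its port numbers but shows which side is the media server and which is the master.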
    https://www.veritas.com/support/en_US/article.10000860

    There is also this article about failure after 2 hours:
    https://www.veritas.com/support/en_US/article.100004901

    I don't see the same error in ndmp log, but still worth a look:
    https://www.veritas.com/support/en_US/article.100015404

    • MatBams
      Level 4

      Marianne 

      Approximately 18 hours.

      I have edited the job details and the log to use more explicit host names.

      14:50:35.674 [10208.7500] <2> non_mpx_backup_archive_verify_import: start to write catalog
      14:50:35.674 [10208.7500] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master, query type 78
      14:50:35.689 [10208.7500] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
      14:50:35.689 [10208.7500] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{AE789DE6-4E78-4ADD-A9A1-BB490413B536}:OUTBOUND
      14:50:35.689 [10208.7500] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
      14:50:35.752 [10208.7500] <2> logconnections: BPDBM CONNECT FROM media.57746 TO master.1556 fd = 1028
      14:50:35.768 [10208.7500] <2> db_begin: auth only query 78, socket 1028 is not proxied
      14:50:35.799 [10208.7500] <2> db_end: Need to collect reply
      14:50:36.659 [10208.7500] <2> non_mpx_backup_archive_verify_import: end write catalog
      14:51:57.685 [9504.11716] <2> non_mpx_backup_archive_verify_import: start to write catalog
      14:51:57.685 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master, query type 78
      14:51:57.701 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
      14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{F0DEB798-DF7A-4CDF-9023-3C0F2E6D9DD3}:OUTBOUND
      14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
      14:51:57.763 [9504.11716] <2> logconnections: BPDBM CONNECT FROM media.57769 TO master.1556 fd = 628
      14:51:57.763 [9504.11716] <2> db_begin: auth only query 78, socket 628 is not proxied
      14:51:57.763 [9504.11716] <2> put_n_bytes_abs: write to socket failed: An established connection was aborted by the software in your host machine. (10053)
      14:51:57.763 [9504.11716] <2> ts_put_length_string_optimized: put n bytes failed: An established connection was aborted by the software in your host machine. (10053)
      14:51:57.763 [9504.11716] <2> db_senddata: ts_put_string_handle(): connection dropped or not connected, An established connection was aborted by the software in your host machine. , (10053)
      14:51:57.763 [9504.11716] <2> db_startrequest: db_sendrequest() failed: network connection broken
      14:51:57.763 [9504.11716] <16> db_begin: db_startrequest() failed: network connection broken
      14:51:57.763 [9504.11716] <2> db_FLISTsend: db_begin() failed: network connection broken
      14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)

      I asked the network team, and we may have a network issue that is affecting many things on our network.

      Sometimes the same job fails with error 233 (premature eof encountered).

      pats_729 

      Yes, it is going to tape.

      There are a lot of small files on the volumes that failed.

      At the moment we can't run this backup to a dedup pool; we don't have the space on the MSDP.

      • MatBams
        Level 4

        Hi,

        We are still investigating. Our network issue is resolved now, but the backups failed again.

        Maybe it's the connection between my media server and my master server? Can I increase the timeout in the host properties of my media and master servers?
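
To get a first impression of the media-to-master path, a repeated TCP connect test can be left running from the media server. This is a minimal Python sketch under some assumptions: the host name master and port 1556 are taken from the "TO master.1556" entries in the log above, and the function names probe and watch are made up for this example. It only shows whether new connections can be opened and how long that takes; it does not reproduce a long-running connection being dropped, so a clean result does not rule out the network.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None


def watch(host, port, count, interval=1.0):
    """Probe `count` times, `interval` seconds apart, and collect the results."""
    results = []
    for attempt in range(count):
        results.append(probe(host, port))
        if attempt + 1 < count:
            time.sleep(interval)
    return results
```

Calling watch("master", 1556, count=60, interval=60.0) from the media server would give one result per minute for an hour; any None in the list marks a moment when the connect failed.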

        Have a nice day.

  • Is this backup going to tape, and is it local NDMP perhaps?

    It looks like the small files are causing this...

    I recommend running these backups to a dedup pool.

    You may still struggle with the first backup, but subsequent backups will go swiftly.