NDMP NetApp backup failing with error 40

MatBams
Level 4

Hi everyone,

I have an issue with my NDMP backup. The same job often fails with status 40 (network connection broken). The volume being backed up is large and contains a lot of small files.

My NetApp filer is connected to a Brocade switch over Fibre Channel, as is my media server.

The attachment shows the error in my NBU environment.

Here is the bpbrm log:

14:51:57.685 [9504.11716] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:51:57.685 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host X.X.X.X, query type 78
14:51:57.701 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{F0DEB798-DF7A-4CDF-9023-3C0F2E6D9DD3}:OUTBOUND
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.763 [9504.11716] <2> logconnections: BPDBM CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.763 [9504.11716] <2> db_begin: auth only query 78, socket 628 is not proxied
14:51:57.763 [9504.11716] <2> put_n_bytes_abs: write to socket failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> ts_put_length_string_optimized: put n bytes failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> db_senddata: ts_put_string_handle(): connection dropped or not connected, An established connection was aborted by the software in your host machine. , (10053)
14:51:57.763 [9504.11716] <2> db_startrequest: db_sendrequest() failed: network connection broken
14:51:57.763 [9504.11716] <16> db_begin: db_startrequest() failed: network connection broken
14:51:57.763 [9504.11716] <2> db_FLISTsend: db_begin() failed: network connection broken
14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)
14:51:57.779 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host X.X.X.X, query type 1
14:51:57.795 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.795 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{5AB060C3-509E-4F2C-A804-B33A74081867}:OUTBOUND
14:51:57.795 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.857 [9504.11716] <2> logconnections: PROXY CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.857 [9504.11716] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.57774 TO 127.0.0.1.57775 fd = 628
14:51:57.857 [9504.11716] <2> db_end: Need to collect reply
14:51:57.873 [9504.11716] <2> signal_ndmpagent: sending signal=1,status=40, to ndmpagent on media.server, client_pid=12232
14:51:57.873 [9504.11716] <2> bpcr_send_signal: Ignoring connect_opts = 0x01030202
14:51:57.873 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master.server, query type 223
14:51:57.888 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.888 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{66C6FBB5-61BB-4AA6-A182-82CDFE9ED60F}:OUTBOUND
14:51:57.888 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.951 [9504.11716] <2> logconnections: PROXY CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:57.951 [9504.11716] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.57778 TO 127.0.0.1.57779 fd = 628
14:51:57.966 [9504.11716] <2> db_CLIENTsend: reset client protocol version from 0 to 9
14:51:57.966 [9504.11716] <2> db_end: Need to collect reply
14:51:57.998 [9504.11716] <2> logconnections: BPCD CONNECT FROM X.X.X.X TO X.X.X.X fd = 628
14:51:58.029 [9504.11716] <2> bpcr_get_version_rqst: bpcd version: 08100000
14:51:58.029 [9504.11716] <2> bpcr_get_version_rqst: bpcd version: 08100000
14:51:58.029 [9504.11716] <2> bpcr_send_signal: CLIENT_CMD_SOCK from bpcr = 628
14:51:58.029 [9504.11716] <2> bpcr_send_signal: CLIENT_STAT_SOCK from bpcr = 864
14:52:01.234 [9504.11716] <2> signal_ndmpagent: from client Harmonie: INF - EXIT STATUS 150: termination requested by administrator
14:52:01.234 [9504.11716] <2> bpbrm kill_child_process_Ex: start
14:53:07.624 [9504.11716] <2> bpbrm wait_for_child: start
14:53:07.624 [9504.11716] <2> bpbrm wait_for_child: child exit_status = 150
14:53:07.624 [9504.11716] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 150
14:53:07.624 [9504.11716] <2> bpbrm Exit: client backup EXIT STATUS 40: network connection broken
14:53:07.624 [9504.11716] <4> JobdSockList::UnregisterSocket: Unregister socket (764).
14:53:07.640 [9504.11716] <2> job_monitoring_exex: ACK disconnect
14:53:07.640 [9504.11716] <2> job_disconnect: Disconnected
14:57:56.310 [10208.7500] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:57:56.310 [10208.7500] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master.server, query type 78
14:57:56.326 [10208.7500] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:57:56.326 [10208.7500] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{0973C69E-6CB1-4C07-B075-FDA7E3B14020}:OUTBOUND
14:57:56.326 [10208.7500] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:57:56.388 [10208.7500] <2> logconnections: BPDBM CONNECT FROM X.X.X.X TO X.X.X.X fd = 796
14:57:56.388 [10208.7500] <2> db_begin: auth only query 78, socket 796 is not proxied
14:57:56.420 [10208.7500] <2> db_end: Need to collect reply
14:57:57.123 [10208.7500] <2> non_mpx_backup_archive_verify_import: end write catalog
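
For anyone reading a long bpbrm log like this one, a minimal Python sketch for pulling out the error lines may help. It assumes the line layout seen above (`time [pid.tid] <severity> function: message`) and assumes that severity 16 and higher marks errors; the function names `parse_line`, `find_errors` and `winsock_codes` are made up for this illustration and are not part of NetBackup.

```python
import re

# Layout assumed from the excerpt: time, [pid.tid], <severity>, function, message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<func>.+?): (?P<msg>.*)$"
)
# Five-digit codes in parentheses, such as the Winsock error (10053)
WINSOCK_RE = re.compile(r"\((\d{5})\)")


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["sev"] = int(rec["sev"])
    return rec


def find_errors(lines, min_sev=16):
    """Keep only records whose severity is at least min_sev (assumed: 16 = error)."""
    return [r for r in map(parse_line, lines) if r and r["sev"] >= min_sev]


def winsock_codes(lines):
    """Collect the distinct five-digit error codes mentioned in the lines."""
    codes = set()
    for line in lines:
        codes.update(int(c) for c in WINSOCK_RE.findall(line))
    return codes


SAMPLE = [
    "14:51:57.763 [9504.11716] <2> db_begin: auth only query 78, socket 628 is not proxied",
    "14:51:57.763 [9504.11716] <2> put_n_bytes_abs: write to socket failed: An established connection was aborted by the software in your host machine. (10053)",
    "14:51:57.763 [9504.11716] <16> db_begin: db_startrequest() failed: network connection broken",
    "14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)",
]

for rec in find_errors(SAMPLE):
    print(rec["time"], rec["func"], "->", rec["msg"])
print(sorted(winsock_codes(SAMPLE)))
```

On the four sample lines, this keeps the two `<16>` entries and reports the code 10053, which is the point where the write to bpdbm was aborted.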

 

If someone can help me, I would be grateful.

10 REPLIES

Marianne
Moderator
Partner    VIP    Accredited Certified

@MatBams 

How much time has elapsed since the start of the backup?

Can you please post all text in Job Details of a failed job?

If you really need to replace hostnames or IP addresses, please replace with some generic hostnames/IPs,
e.g. master 10.10.10.1
media1 (if different from master) 10.10.10.2
filer1 10.10.10.3

With X.X.X.X on both sides of the connection (BPDBM CONNECT FROM X.X.X.X TO X.X.X.X), we cannot tell whether the master and the media server are the same host.

Depending on what we see in Job Details, it might be the 8-hour timeout issue on NDMP backups:
https://www.veritas.com/support/en_US/article.10000860

There is also this article about failure after 2 hours:
https://www.veritas.com/support/en_US/article.100004901

I don't see the same error in the NDMP log, but this one is still worth a look:
https://www.veritas.com/support/en_US/article.100015404

pats_729
Level 6
Employee
Is this backup going to tape, perhaps as local NDMP?

It looks like the small files are causing this...

I recommend running these backups to a dedup pool.

You may still struggle with the first backup, but subsequent backups will go swiftly.

@Marianne 

Approximately 18 hours.

I edited the job details and the log to use more explicit host names.

14:50:35.674 [10208.7500] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:50:35.674 [10208.7500] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master, query type 78
14:50:35.689 [10208.7500] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:50:35.689 [10208.7500] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{AE789DE6-4E78-4ADD-A9A1-BB490413B536}:OUTBOUND
14:50:35.689 [10208.7500] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:50:35.752 [10208.7500] <2> logconnections: BPDBM CONNECT FROM media.57746 TO master.1556 fd = 1028
14:50:35.768 [10208.7500] <2> db_begin: auth only query 78, socket 1028 is not proxied
14:50:35.799 [10208.7500] <2> db_end: Need to collect reply
14:50:36.659 [10208.7500] <2> non_mpx_backup_archive_verify_import: end write catalog
14:51:57.685 [9504.11716] <2> non_mpx_backup_archive_verify_import: start to write catalog
14:51:57.685 [9504.11716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host master, query type 78
14:51:57.701 [9504.11716] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\{F0DEB798-DF7A-4CDF-9023-3C0F2E6D9DD3}:OUTBOUND
14:51:57.701 [9504.11716] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
14:51:57.763 [9504.11716] <2> logconnections: BPDBM CONNECT FROM media.57769 TO master.1556 fd = 628
14:51:57.763 [9504.11716] <2> db_begin: auth only query 78, socket 628 is not proxied
14:51:57.763 [9504.11716] <2> put_n_bytes_abs: write to socket failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> ts_put_length_string_optimized: put n bytes failed: An established connection was aborted by the software in your host machine. (10053)
14:51:57.763 [9504.11716] <2> db_senddata: ts_put_string_handle(): connection dropped or not connected, An established connection was aborted by the software in your host machine. , (10053)
14:51:57.763 [9504.11716] <2> db_startrequest: db_sendrequest() failed: network connection broken
14:51:57.763 [9504.11716] <16> db_begin: db_startrequest() failed: network connection broken
14:51:57.763 [9504.11716] <2> db_FLISTsend: db_begin() failed: network connection broken
14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)
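
To make the contrast between the two processes in this excerpt easier to see, here is a rough Python sketch that tracks, per `[pid.tid]`, whether a "start to write catalog" was followed by "end write catalog" or by the `db_FLISTsend failed` line. It only relies on the message texts visible above; the function name `catalog_write_status` and the status labels are hypothetical.

```python
import re

# time, [pid.tid], <severity>, then the rest of the line (layout assumed from the excerpt)
HEAD_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+\.\d+)\] <\d+> (.*)$")


def catalog_write_status(lines):
    """Map each pid.tid to 'open', 'completed' or 'failed' for its last catalog write."""
    status = {}
    for line in lines:
        m = HEAD_RE.match(line.strip())
        if not m:
            continue
        _, proc, rest = m.groups()
        if "start to write catalog" in rest:
            status[proc] = "open"
        elif "end write catalog" in rest and status.get(proc) == "open":
            status[proc] = "completed"
        elif "db_FLISTsend failed" in rest and status.get(proc) == "open":
            status[proc] = "failed"
    return status


SAMPLE = [
    "14:50:35.674 [10208.7500] <2> non_mpx_backup_archive_verify_import: start to write catalog",
    "14:50:36.659 [10208.7500] <2> non_mpx_backup_archive_verify_import: end write catalog",
    "14:51:57.685 [9504.11716] <2> non_mpx_backup_archive_verify_import: start to write catalog",
    "14:51:57.779 [9504.11716] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: network connection broken (40)",
]

print(catalog_write_status(SAMPLE))
```

On these lines, process 10208.7500 finishes its catalog write in about a second, while 9504.11716 fails on the same step a little over a minute later, which suggests the master was reachable from the media server at that time and only the one connection was dropped.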

I asked the network team, and we may have a network issue that affects a lot of things on our network.

Sometimes the same job fails with status 233 (premature eof encountered).

 

@pats_729 

Yes, it's going to tape.

There are a lot of small files on the volumes that failed.

For now we can't run this backup to a dedup pool; we don't have the space on the MSDP.

Hi,

We are still investigating. Our network issue is resolved now, but the backups failed again.

Maybe it's the connection between my media server and my master? Can I increase the timeout in the host properties of my media server and master?

Have a nice day.

Marianne

@MatBams 

I shared a bunch of articles 2 weeks ago.

Have you had a look at them?

*** EDIT ***

It seems that the 1st TN was removed in the meantime....
I found another TN that also describes the 8-hour timeout, but with a different status code:
https://www.veritas.com/support/en_US/article.100008602

Please have a look at this one too:

https://www.veritas.com/support/en_US/article.100045218

Sorry, I forgot to answer you.

I had already tried adding the MAX_ENTRIES file, but it did not help.

I added the NDMP file and the option in the server.conf file, then restarted my backup.

The issue still exists.

I will try to inspect my Brocade switch.

Marianne

@MatBams 

Maybe best to have a good look at the full set of logs to determine where the break in communication is.

If you want this community to assist, we will need a full set of logs at logging level 3.
On master: bpdbm (NBU needs to be restarted for the logging change to take effect)
On media server: bpbrm, bptm and ndmpagent.

If you do not feel comfortable uploading logs here, then it is best to log a call with Veritas Support. They will ask for level 5 logs.

 

OK, so I set the verbose level to 3 as you asked. I will upload the log files on Monday, after the weekend's NDMP backup.

Hi, 

I may have found a solution. I modified the NDMP_PROGRESS_TIMEOUT file in the /db/config folder on the media server and set it to 2400.
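
As a quick sanity check on that value, here is a small Python sketch. It assumes the number in NDMP_PROGRESS_TIMEOUT is read as minutes and that the default corresponds to the 8-hour timeout mentioned earlier in this thread (480 minutes); neither is confirmed here, so check the Veritas article for your version. The helper names are made up for the example.

```python
def timeout_hours(minutes):
    """Convert a timeout given in minutes (assumed unit) to hours."""
    return minutes / 60


def covers_runtime(timeout_minutes, runtime_hours):
    """True if the timeout is at least as long as the observed backup runtime."""
    return timeout_minutes >= runtime_hours * 60


RUNTIME_HOURS = 18     # "approximately 18 hours", as reported above
ASSUMED_DEFAULT = 480  # 8 hours expressed in minutes (assumption)
NEW_VALUE = 2400       # value written to NDMP_PROGRESS_TIMEOUT

print(timeout_hours(NEW_VALUE))                        # 40.0
print(covers_runtime(ASSUMED_DEFAULT, RUNTIME_HOURS))  # False
print(covers_runtime(NEW_VALUE, RUNTIME_HOURS))        # True
```

Under those assumptions, 2400 would give a 40-hour window, comfortably above the roughly 18 hours the job had been running, whereas an 8-hour default would not cover it.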