The strange socket read failed 13

liuyl · ‎03-07-2022

Enviroment: NBU 8.1.1 (Master/Media/Client) on RHEL 6.9

Symptom:

We have a large Oracle DB client with approximately 50TB, and every week once full backup and the other six days archive log backups.
And they almost all work fine, just except for only few archive log backup failed occasionally with the following similar job detail ...

03/04/2022 02:19:36 - Info nbjm(pid=21023) starting backup job (jobid=3375140) for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup
03/04/2022 02:19:36 - Info nbjm(pid=21023) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3375140\
request id:{7BC315D6-9B1E-11EC-9BB9-3BDF5C092EF3})
03/04/2022 02:19:36 - requesting resource jc41nbubk7-stu
03/04/2022 02:19:36 - requesting resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource MediaID=@aaabT;DiskVolume=/jc41nbubk7_nbudata1;DiskPool=jc41nbubk7_dp;Path=/jc41nbubk7_nbudata1;StorageServer=jc41nbubk7;MediaServer=jc41nbubk7
03/04/2022 02:19:37 - granted resource jc41nbubk7-stu
03/04/2022 02:19:37 - estimated 0 kbytes needed
03/04/2022 02:19:37 - Info nbjm(pid=21023) started backup (backupid=jcyc01db01bak_1646331577) job for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup on storage unit jc41nbubk7-stu
03/04/2022 02:19:37 - started process bpbrm (4140324)
03/04/2022 02:19:38 - connecting
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) jcyc01db01bak is the host to backup data from
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) reading file list for client
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) listening for client connection
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) INF - Client read timeout = 3600
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) accepted connection from client
03/04/2022 02:18:31 - Info dbclient(pid=107296) Backup started
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) bptm pid: 4140327
03/04/2022 02:19:40 - connected; connect time: 000:00:02
03/04/2022 02:18:32 - Info bptm(pid=4140327) start
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 262144 data buffer size
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 30 data buffers
03/04/2022 02:18:33 - Info bptm(pid=4140327) start backup
03/04/2022 02:18:33 - Info bptm(pid=4140327) backup child process is pid 4140354
03/04/2022 02:18:33 - Info dbclient(pid=107296) dbclient(pid=107296) wrote first buffer(size=262144)
03/04/2022 02:19:42 - begin writing
03/04/2022 02:37:04 - Info bptm(pid=4140327) waited for full buffer 5690 times\
delayed 71496 times
03/04/2022 02:37:04 - Info bptm(pid=4140327) EXITING with status 0 <----------
03/04/2022 02:46:14 - Error bpbrm(pid=4140324) socket read failed: errno = 104 - Connection reset by peer
03/04/2022 02:46:15 - Info dbclient(pid=107296) done. status: 13: file read failed
03/04/2022 02:47:23 - end writing; write time: 000:27:41
36382721
0
3375140
32751

So now I wonder why the bpbrm would fail at 02:46:14 after the bptm “successfully” finished at 02:37:04.
Or in other words, what the matter would illustrate between the “successful“” bptm and the failed bpbrm ...？

Nicolai · ‎03-07-2022

@mph999 Can you help Martin ?

Yogesh_Patil · ‎03-08-2022

@liuyl

Backup size is large.
As per provided job details error is : socket read failed: errno = 104 - Connection reset by peer

This error can be came due to network connection break for some second.

Please confirm, Is firewall connectivity between client & Media server? and How is the network connectivity (LAN/SAN)?
If firewall in between, please ask firewall admins to monitor network traffic between the media server and client when backup is kicked off.

Please provide the media server bpbrm & bptm if possible.

Hamza_H · ‎03-08-2022

hello,

also check tcp configuration on all servers, especially timeouts.

I think bpbrm had already closed the session after a tcp timeout.

liuyl · ‎03-08-2022

Neither the firewall blocked nor the LAN network broken, but just failed occasionally ..., it mostly works fine！

1). bpbrm.log:

02:38:59.384 [4142698.4142698] <2> send_job_file: job ID 3375211, ftype = 3 msg len = 84, msg = LOG 1646332739 4 dbclient 70797 dbclient(pid=70797) wrote first buffer(size=262144)
02:38:59.386 [4142698.4142698] <2> bpbrm send_ping: PING
02:38:59.408 [4142698.4142698] <2> Bpbrm_siginfo_print: 0: delay 0 signo SIGUSR1:10 code 0 pid 4142701
02:38:59.408 [4142698.4142698] <2> bpbrm mm_sig: received ready signal from media manager
02:45:07.047 [4140489.4140489] <2> bpbrm send_ping: PING
02:46:14.993 [4140324.4140324] <16> bpbrm readline: socket read failed: errno = 104 - Connection reset by peer
02:46:14.993 [4140324.4140324] <2> set_job_details: Tfile (3375140): LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset by peer
02:46:14.993 [4140324.4140324] <2> send_job_file: job ID 3375140, ftype = 3 msg len = 91, msg = LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset
by peer
02:46:14.993 [4140324.4140324] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
02:46:14.993 [4140324.4140324] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
02:46:14.993 [4140324.4140324] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1

02:46:15.022 [4140324.4140324] <2> is_connection_INBOUND: Connection-Is-INBOUND:[FALSE]
02:46:15.023 [4140324.4140324] <2> logconnections: PROXY CONNECT FROM 10.131.40.168.34907 TO 10.131.33.60.1556 fd = 7
02:46:15.023 [4140324.4140324] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.39024 TO 127.0.0.1.45834 fd = 7
02:46:15.023 [4140324.4140324] <2> vnet_check_vxss_client_magic_with_info_ex: [vnet_vxss_helper.c:1138] Ignoring VxSS authentication 2 0x2
02:46:15.023 [4140324.4140324] <2> db_getdata: timeout is 7200 seconds
02:46:15.066 [4140324.4140324] <2> db_end: Need to collect reply
02:46:15.066 [4140324.4140324] <2> db_getdata: timeout is 7200 seconds
02:46:15.077 [4140324.4140324] <2> isDONE: reply DONE 0 <?xml version="1.0" encoding="utf-8"?><StatusMsgDoc><StatusMsg NBUStatus="0" Severity="0" VXULoid="271" VXULtid="0"><MsgStr
ing>the requested operation was successfully completed</MsgString></StatusMsg></StatusMsgDoc>
02:46:15.077 [4140324.4140324] <2> isDONE: reply DONE 0
02:46:15.078 [4140324.4140324] <2> bpbrm kill_child_process: start
02:46:15.078 [4140324.4140324] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 0
02:46:15.078 [4140324.4140324] <2> set_job_details: Tfile (3375140): LOG 1646333175 4 dbclient 107296 done. status: 13: file read failed
02:46:15.078 [4140324.4140324] <2> send_job_file: job ID 3375140, ftype = 3 msg len = 68, msg = LOG 1646333175 4 dbclient 107296 done. status: 13: file read failed
02:46:15.078 [4140324.4140324] <2> bpbrm Exit: client backup EXIT STATUS 13: file read failed
02:46:15.083 [4140324.4140324] <2> vnet_close_socket_safely_ex: [vnet.c:1176] safe close 0 0x0
02:46:15.083 [4140324.4140324] <4> JobdSockList::UnregisterSocket: Unregister socket (5).
02:46:15.088 [4140324.4140324] <2> job_monitoring_exex: ACK disconnect
02:46:15.088 [4140324.4140324] <2> job_disconnect: Disconnected
02:47:00.964 [4122683.4122683] <2> bpbrm send_ping: PING

2). bptm.log: (It would be also a bit strange that there was almost no any error at the above time point 02:46:14~15 )

02:44:23.798 [4142701] <2> JobInst::sendIrmMsg: returning
02:44:36.208 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 63001344 3375142
02:44:36.208 [4140494] <2> JobInst::sendIrmMsg: starting
02:44:36.219 [4140494] <2> JobInst::sendIrmMsg: returning
02:45:23.799 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 42000896 3375211
02:45:23.800 [4142701] <2> JobInst::sendIrmMsg: starting
02:45:23.806 [4142701] <2> JobInst::sendIrmMsg: returning
02:45:48.803 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 66001408 3375142
02:45:48.804 [4140494] <2> JobInst::sendIrmMsg: starting
02:45:48.813 [4140494] <2> JobInst::sendIrmMsg: returning
02:46:16.129 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 45000960 3375211
02:46:16.129 [4142701] <2> JobInst::sendIrmMsg: starting
02:46:16.134 [4142701] <2> JobInst::sendIrmMsg: returning
02:46:57.948 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 48001024 3375211
02:46:57.948 [4142701] <2> JobInst::sendIrmMsg: starting
02:46:57.953 [4142701] <2> JobInst::sendIrmMsg: returning
02:47:06.437 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 69001472 3375142
02:47:06.437 [4140494] <2> JobInst::sendIrmMsg: starting
02:47:06.441 [4140494] <2> JobInst::sendIrmMsg: returning
02:47:31.865 [4140521] <2> fill_buffer: [4140494] socket is closed, waited for empty buffer 2130 times, delayed 2142 times, read 69732384 Kbytes
02:47:31.869 [4140494] <2> write_data: writing block shorter than BUFF_SIZE, 32768 bytes
02:47:31.869 [4140494] <2> write_data: writing short block, 32768 bytes, remainder 0
02:47:31.869 [4140494] <2> wait_for_sigcld: waiting for child to exit, timeout is 3600
02:47:31.870 [4140494] <2> child_wait: waitpid returned zero, no children waiting (../common.c:6179)
02:47:31.892 [4140521] <8> JobdSockList::~JobdSockList: 1 socket left.
02:47:31.892 [4140521] <4> JobdSockList::UnregisterSocket: Unregister socket (1).
02:47:31.893 [4140521] <2> put_length_bytes_nonblocking: cannot write data to network: Bad file descriptor (9)
02:47:31.893 [4140521] <2> job_disconnect: Failed to send request
02:47:31.893 [4140521] <2> vnet_proxy_get_logs: fstat(1) failed: 9
02:47:31.904 [4140494] <2> Media_siginfo_print: 0: delay 0 signo SIGCHLD:17 code 1 pid 4140521
02:47:31.904 [4140494] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=4140521 (../common.c:6179)
02:47:31.904 [4140494] <2> write_data: waited for full buffer 9067 times, delayed 107216 times
02:47:31.904 [4140494] <2> set_job_details: Tfile (3375142): LOG 1646333251 4 bptm 4140494 waited for full buffer 9067 times, delayed 107216 times
02:47:31.904 [4140494] <2> send_job_file: job ID 3375142, ftype = 3 msg len = 86, msg = LOG 1646333251 4 bptm 4140494 waited for full buffer 9067 times, delayed 107216 times
02:47:31.904 [4140494] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = 0
02:47:31.904 [4140494] <2> write_backup: tp.tv_sec = 1646333251, stp.tv_sec = 1646331560, tp.tv_usec = 869878, stp.tv_usec = 722860, et = 1691147, mpx_total_kbytes[TWIN_INDEX = 0]
= 69732384
02:47:31.905 [4140494] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
02:47:31.905 [4140494] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
02:47:31.905 [4140494] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
02:47:31.905 [4140494] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcbak] name2:[localhost]
02:47:31.908 [4140494] <2> vnet_sortaddrs: [vnet_addrinfo.c:4170] sorted addrs: 1 0x1
02:47:31.908 [4140494] <2> vnet_get_pref_netconnection: [vnet_addrinfo.c:5173] using interface ANY
02:47:31.908 [4140494] <2> vnet_sortaddrs: [vnet_addrinfo.c:4170] sorted addrs: 1 0x1
02:47:31.908 [4140494] <2> vnet_get_pref_netconnection: [vnet_addrinfo.c:5173] using interface ANY
02:47:31.908 [4140494] <2> async_connect: [vnet_connect.c:1880] connect in progress 1 0x1
02:47:31.908 [4140494] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
02:47:31.909 [4140494] <2> is_connection_INBOUND: non-proxy, localhost connection

liuyl · ‎03-08-2022

Neither the firewall blocked nor the LAN network broken, but just failed occasionally ..., it mostly works fine！

Notes: please see my attachment for the bpbrm & bptm logfiles

liuyl · ‎03-08-2022

Here what the corresponding tcp timeout parameter should be ..... ？

Yogesh_Patil · ‎03-08-2022

@liuyl

As per my observation this issues occurred due to not sufficient Buffer size on media server.
File read failed and disconnected the connection.
As per bptm log showing buffer size issue.
Please check what is the current BUFFER size and increase it.

Please refer the Buffer size reference:
https://www.veritas.com/support/en_US/article.100016112
It will best if you check what is media server memory size/utilization accordingly you will increase.

@Nicolai @Marianne : Please correct me in case any other troubleshoot and action.

Nicolai · ‎03-08-2022

Hi @liuyl and @Yogesh_Patil

I don't believe buffer size is the issue here. However I do recommend using a higer NUMBER_DATA_BUFFERS. A value of 128 or 256 would be my recommendation.

If there is a firewall between client and media server, please implement TCP keep alive. It is a quick implementation and doesn't require any reboots. Basically TCP keep alive sends idle packages with a predefined interval to prevent firewall from closing them. Cisco firewalls closes idle connection after one hour per default. I am not 100% sure this is the root cause, but it is my best guess.

https://www.veritas.com/support/en_US/article.100028680

liuyl · ‎03-09-2022

Today I found a possible important cue in the following bptm.log at the timestamp with 02:37:06.464:

02:37:06.410 [4140204] <2> db_getdata: timeout is 7200 seconds
02:37:06.452 [4140204] <2> db_end: Need to collect reply
02:37:06.452 [4140204] <2> db_getdata: timeout is 7200 seconds
02:37:06.463 [4140204] <2> isDONE: reply DONE 0 <?xml version="1.0" encoding="utf-8"?><StatusMsgDoc><StatusMsg NBUStatus="0" Severity="0" VXULoid="271" VXULtid="0"><MsgString>the&
#32;requested&#32;operation&#32;was&#32;successfully&#32;completed</MsgString></StatusMsg></StatusMsgDoc>
02:37:06.464 [4140204] <2> isDONE: reply DONE 0
02:37:06.464 [4140204] <16> catch_signal: media manager terminated by parent process <<< Here the bptm process was killed unexpectedly ！
02:37:06.465 [4140204] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 4140231 (../common.c:1622)
02:37:06.465 [4140204] <2> wait_for_sigcld: waiting for child to exit, timeout is 3600
02:37:06.465 [4140204] <2> child_wait: waitpid returned zero, no children waiting (../common.c:6179)
02:37:06.465 [4140231] <2> Media_child_dispatch_signal: calling child_catch_signal for 12 (../child.c:791) delay 0 seconds
02:37:06.466 [4140231] <2> child_catch_signal: ../child.c.641: Entered: child_catch_signal
02:37:06.466 [4140231] <2> Media_siginfo_print: 0: delay 0 signo SIGUSR2:12 code 0 pid 4140204
02:37:06.494 [4140231] <8> JobdSockList::~JobdSockList: 1 socket left.
02:37:06.494 [4140231] <4> JobdSockList::UnregisterSocket: Unregister socket (1).
02:37:06.494 [4140231] <2> put_length_bytes_nonblocking: cannot write data to network: Bad file descriptor (9)
02:37:06.494 [4140231] <2> job_disconnect: Failed to send request
02:37:06.495 [4140231] <2> vnet_proxy_get_logs: fstat(1) failed: 9
02:37:06.507 [4140204] <2> Media_siginfo_print: 1: delay 0 signo SIGCHLD:17 code 1 pid 4140231

liuyl · ‎03-09-2022

Today I found a possible important cue in the following attachment with bptm_killed.log at the timestamp 02:37:06.464:

VOX

The strange socket read failed 13