Forum Discussion

liuyl's avatar
liuyl
Level 6
3 years ago

The strange socket read failed 13

Enviroment:  NBU 8.1.1 (Master/Media/Client) on RHEL 6.9

Symptom:

We have a large Oracle DB client with approximately 50TB, and every week once full backup and the other six days archive log backups.
And they almost all work fine,  just except for only few archive log backup failed occasionally with the following similar job detail ...

03/04/2022 02:19:36 - Info nbjm(pid=21023) starting backup job (jobid=3375140) for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup
03/04/2022 02:19:36 - Info nbjm(pid=21023) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3375140\
request id:{7BC315D6-9B1E-11EC-9BB9-3BDF5C092EF3})
03/04/2022 02:19:36 - requesting resource jc41nbubk7-stu
03/04/2022 02:19:36 - requesting resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource MediaID=@aaabT;DiskVolume=/jc41nbubk7_nbudata1;DiskPool=jc41nbubk7_dp;Path=/jc41nbubk7_nbudata1;StorageServer=jc41nbubk7;MediaServer=jc41nbubk7
03/04/2022 02:19:37 - granted resource jc41nbubk7-stu
03/04/2022 02:19:37 - estimated 0 kbytes needed
03/04/2022 02:19:37 - Info nbjm(pid=21023) started backup (backupid=jcyc01db01bak_1646331577) job for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup on storage unit jc41nbubk7-stu
03/04/2022 02:19:37 - started process bpbrm (4140324)
03/04/2022 02:19:38 - connecting
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) jcyc01db01bak is the host to backup data from
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) reading file list for client
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) listening for client connection
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) INF - Client read timeout = 3600
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) accepted connection from client
03/04/2022 02:18:31 - Info dbclient(pid=107296) Backup started
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) bptm pid: 4140327
03/04/2022 02:19:40 - connected; connect time: 000:00:02
03/04/2022 02:18:32 - Info bptm(pid=4140327) start
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 262144 data buffer size
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 30 data buffers
03/04/2022 02:18:33 - Info bptm(pid=4140327) start backup
03/04/2022 02:18:33 - Info bptm(pid=4140327) backup child process is pid 4140354
03/04/2022 02:18:33 - Info dbclient(pid=107296) dbclient(pid=107296) wrote first buffer(size=262144)
03/04/2022 02:19:42 - begin writing
03/04/2022 02:37:04 - Info bptm(pid=4140327) waited for full buffer 5690 times\
delayed 71496 times
03/04/2022 02:37:04 - Info bptm(pid=4140327) EXITING with status 0 <----------
03/04/2022 02:46:14 - Error bpbrm(pid=4140324) socket read failed: errno = 104 - Connection reset by peer
03/04/2022 02:46:15 - Info dbclient(pid=107296) done. status: 13: file read failed
03/04/2022 02:47:23 - end writing; write time: 000:27:41
36382721
0
3375140
32751


So now I wonder why the bpbrm would fail at 02:46:14 after the bptm “successfully” finished at 02:37:04.
Or in other words,  what the matter would illustrate between the “successful“” bptm and the failed bpbrm ...?

10 Replies

  • liuyl

    Backup size is large.
    As per provided job details error is : socket read failed: errno = 104 - Connection reset by peer

    This error can be came due to network connection break for some second. 

    Please confirm, Is firewall connectivity between client & Media server? and How is the network connectivity (LAN/SAN)?
    If firewall in between, please ask firewall admins to monitor network traffic between the media server and client when backup is kicked off.

    Please provide the media server bpbrm & bptm if possible.

     

    • Hamza_H's avatar
      Hamza_H
      Moderator

      hello,

      also check tcp configuration on all servers, especially timeouts.

      I think bpbrm had already closed the session after a tcp timeout.

       

      • liuyl's avatar
        liuyl
        Level 6

        Here what the corresponding tcp timeout parameter should be ..... ?

    • liuyl's avatar
      liuyl
      Level 6

      Neither the firewall blocked nor the LAN network broken,  but just failed occasionally ...,  it mostly works fine!


      1). bpbrm.log:

      02:38:59.384 [4142698.4142698] <2> send_job_file: job ID 3375211, ftype = 3 msg len = 84, msg = LOG 1646332739 4 dbclient 70797 dbclient(pid=70797) wrote first buffer(size=262144)
      02:38:59.386 [4142698.4142698] <2> bpbrm send_ping: PING
      02:38:59.408 [4142698.4142698] <2> Bpbrm_siginfo_print: 0: delay 0 signo SIGUSR1:10 code 0 pid 4142701
      02:38:59.408 [4142698.4142698] <2> bpbrm mm_sig: received ready signal from media manager
      02:45:07.047 [4140489.4140489] <2> bpbrm send_ping: PING
      02:46:14.993 [4140324.4140324] <16> bpbrm readline: socket read failed: errno = 104 - Connection reset by peer
      02:46:14.993 [4140324.4140324] <2> set_job_details: Tfile (3375140): LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset by peer
      02:46:14.993 [4140324.4140324] <2> send_job_file: job ID 3375140, ftype = 3 msg len = 91, msg = LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset
      by peer
      02:46:14.993 [4140324.4140324] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
      02:46:14.993 [4140324.4140324] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
      02:46:14.993 [4140324.4140324] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1

       

      02:46:15.022 [4140324.4140324] <2> is_connection_INBOUND: Connection-Is-INBOUND:[FALSE]
      02:46:15.023 [4140324.4140324] <2> logconnections: PROXY CONNECT FROM 10.131.40.168.34907 TO 10.131.33.60.1556 fd = 7
      02:46:15.023 [4140324.4140324] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.39024 TO 127.0.0.1.45834 fd = 7
      02:46:15.023 [4140324.4140324] <2> vnet_check_vxss_client_magic_with_info_ex: [vnet_vxss_helper.c:1138] Ignoring VxSS authentication 2 0x2
      02:46:15.023 [4140324.4140324] <2> db_getdata: timeout is 7200 seconds
      02:46:15.066 [4140324.4140324] <2> db_end: Need to collect reply
      02:46:15.066 [4140324.4140324] <2> db_getdata: timeout is 7200 seconds
      02:46:15.077 [4140324.4140324] <2> isDONE: reply DONE 0 <?xml version="1.0" encoding="utf-8"?><StatusMsgDoc><StatusMsg NBUStatus="0" Severity="0" VXULoid="271" VXULtid="0"><MsgStr
      ing>the&#32;requested&#32;operation&#32;was&#32;successfully&#32;completed</MsgString></StatusMsg></StatusMsgDoc>
      02:46:15.077 [4140324.4140324] <2> isDONE: reply DONE 0
      02:46:15.078 [4140324.4140324] <2> bpbrm kill_child_process: start
      02:46:15.078 [4140324.4140324] <2> inform_client_of_status: COMM_SOCK == INVALID_SOCKET, 0
      02:46:15.078 [4140324.4140324] <2> set_job_details: Tfile (3375140): LOG 1646333175 4 dbclient 107296 done. status: 13: file read failed
      02:46:15.078 [4140324.4140324] <2> send_job_file: job ID 3375140, ftype = 3 msg len = 68, msg = LOG 1646333175 4 dbclient 107296 done. status: 13: file read failed
      02:46:15.078 [4140324.4140324] <2> bpbrm Exit: client backup EXIT STATUS 13: file read failed
      02:46:15.083 [4140324.4140324] <2> vnet_close_socket_safely_ex: [vnet.c:1176] safe close 0 0x0
      02:46:15.083 [4140324.4140324] <4> JobdSockList::UnregisterSocket: Unregister socket (5).
      02:46:15.088 [4140324.4140324] <2> job_monitoring_exex: ACK disconnect
      02:46:15.088 [4140324.4140324] <2> job_disconnect: Disconnected
      02:47:00.964 [4122683.4122683] <2> bpbrm send_ping: PING

       

      2). bptm.log: (It would be also a bit strange that there was almost no any error at the above time point 02:46:14~15 )

      02:44:23.798 [4142701] <2> JobInst::sendIrmMsg: returning
      02:44:36.208 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 63001344 3375142
      02:44:36.208 [4140494] <2> JobInst::sendIrmMsg: starting
      02:44:36.219 [4140494] <2> JobInst::sendIrmMsg: returning
      02:45:23.799 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 42000896 3375211
      02:45:23.800 [4142701] <2> JobInst::sendIrmMsg: starting
      02:45:23.806 [4142701] <2> JobInst::sendIrmMsg: returning
      02:45:48.803 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 66001408 3375142
      02:45:48.804 [4140494] <2> JobInst::sendIrmMsg: starting
      02:45:48.813 [4140494] <2> JobInst::sendIrmMsg: returning
      02:46:16.129 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 45000960 3375211
      02:46:16.129 [4142701] <2> JobInst::sendIrmMsg: starting
      02:46:16.134 [4142701] <2> JobInst::sendIrmMsg: returning
      02:46:57.948 [4142701] <2> send_MDS_msg: KBYTES_WRITTEN 0 {5537C4CC-9B21-11EC-BF30-ED672D6DB55D} 2743121 1 48001024 3375211
      02:46:57.948 [4142701] <2> JobInst::sendIrmMsg: starting
      02:46:57.953 [4142701] <2> JobInst::sendIrmMsg: returning
      02:47:06.437 [4140494] <2> send_MDS_msg: KBYTES_WRITTEN 0 {97BD8AD2-9B1E-11EC-8676-1308BD1804E7} 2743057 1 69001472 3375142
      02:47:06.437 [4140494] <2> JobInst::sendIrmMsg: starting
      02:47:06.441 [4140494] <2> JobInst::sendIrmMsg: returning
      02:47:31.865 [4140521] <2> fill_buffer: [4140494] socket is closed, waited for empty buffer 2130 times, delayed 2142 times, read 69732384 Kbytes
      02:47:31.869 [4140494] <2> write_data: writing block shorter than BUFF_SIZE, 32768 bytes
      02:47:31.869 [4140494] <2> write_data: writing short block, 32768 bytes, remainder 0
      02:47:31.869 [4140494] <2> wait_for_sigcld: waiting for child to exit, timeout is 3600
      02:47:31.870 [4140494] <2> child_wait: waitpid returned zero, no children waiting (../common.c:6179)
      02:47:31.892 [4140521] <8> JobdSockList::~JobdSockList: 1 socket left.
      02:47:31.892 [4140521] <4> JobdSockList::UnregisterSocket: Unregister socket (1).
      02:47:31.893 [4140521] <2> put_length_bytes_nonblocking: cannot write data to network: Bad file descriptor (9)
      02:47:31.893 [4140521] <2> job_disconnect: Failed to send request
      02:47:31.893 [4140521] <2> vnet_proxy_get_logs: fstat(1) failed: 9
      02:47:31.904 [4140494] <2> Media_siginfo_print: 0: delay 0 signo SIGCHLD:17 code 1 pid 4140521
      02:47:31.904 [4140494] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=4140521 (../common.c:6179)
      02:47:31.904 [4140494] <2> write_data: waited for full buffer 9067 times, delayed 107216 times
      02:47:31.904 [4140494] <2> set_job_details: Tfile (3375142): LOG 1646333251 4 bptm 4140494 waited for full buffer 9067 times, delayed 107216 times
      02:47:31.904 [4140494] <2> send_job_file: job ID 3375142, ftype = 3 msg len = 86, msg = LOG 1646333251 4 bptm 4140494 waited for full buffer 9067 times, delayed 107216 times
      02:47:31.904 [4140494] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = 0
      02:47:31.904 [4140494] <2> write_backup: tp.tv_sec = 1646333251, stp.tv_sec = 1646331560, tp.tv_usec = 869878, stp.tv_usec = 722860, et = 1691147, mpx_total_kbytes[TWIN_INDEX = 0]
      = 69732384
      02:47:31.905 [4140494] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
      02:47:31.905 [4140494] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
      02:47:31.905 [4140494] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
      02:47:31.905 [4140494] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcbak] name2:[localhost]
      02:47:31.908 [4140494] <2> vnet_sortaddrs: [vnet_addrinfo.c:4170] sorted addrs: 1 0x1
      02:47:31.908 [4140494] <2> vnet_get_pref_netconnection: [vnet_addrinfo.c:5173] using interface ANY
      02:47:31.908 [4140494] <2> vnet_sortaddrs: [vnet_addrinfo.c:4170] sorted addrs: 1 0x1
      02:47:31.908 [4140494] <2> vnet_get_pref_netconnection: [vnet_addrinfo.c:5173] using interface ANY
      02:47:31.908 [4140494] <2> async_connect: [vnet_connect.c:1880] connect in progress 1 0x1
      02:47:31.908 [4140494] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
      02:47:31.909 [4140494] <2> is_connection_INBOUND: non-proxy, localhost connection

       

    • liuyl's avatar
      liuyl
      Level 6

      Neither the firewall blocked nor the LAN network broken, but just failed occasionally ..., it mostly works fine!

      Notes: please see my attachment for the bpbrm & bptm  logfiles 

  • liuyl 

    As per my observation this issues occurred due to not sufficient Buffer size on media server.
    File read failed and disconnected the connection.
    As per bptm log showing buffer size issue.
    Please check what is the current BUFFER size and increase it.

    Please refer the Buffer size reference:
    https://www.veritas.com/support/en_US/article.100016112
    It will best if you check what is media server memory size/utilization accordingly you will increase.

    Nicolai Marianne : Please correct me in case any other troubleshoot and action.

  • Hi liuyl  and Yogesh_Patil 

    I don't believe buffer size is the issue here. However I do recommend using a higer NUMBER_DATA_BUFFERS. A value of 128 or 256 would be my recommendation.

    If there is a firewall between client and media server, please implement TCP keep alive. It is a quick implementation and doesn't require any reboots. Basically TCP keep alive sends idle packages with a predefined interval to prevent firewall from closing them. Cisco firewalls closes idle connection after one hour per default. I am not 100% sure this is the root cause, but it is my best guess.

    https://www.veritas.com/support/en_US/article.100028680

     

    • liuyl's avatar
      liuyl
      Level 6

      Today I found a possible important cue in the following bptm.log at the timestamp with 02:37:06.464:

       

      02:37:06.410 [4140204] <2> db_getdata: timeout is 7200 seconds
      02:37:06.452 [4140204] <2> db_end: Need to collect reply
      02:37:06.452 [4140204] <2> db_getdata: timeout is 7200 seconds
      02:37:06.463 [4140204] <2> isDONE: reply DONE 0 <?xml version="1.0" encoding="utf-8"?><StatusMsgDoc><StatusMsg NBUStatus="0" Severity="0" VXULoid="271" VXULtid="0"><MsgString>the&
      #32;requested&#32;operation&#32;was&#32;successfully&#32;completed</MsgString></StatusMsg></StatusMsgDoc>
      02:37:06.464 [4140204] <2> isDONE: reply DONE 0
      02:37:06.464 [4140204] <16> catch_signal: media manager terminated by parent process <<< Here the bptm process was killed unexpectedly !
      02:37:06.465 [4140204] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 4140231 (../common.c:1622)
      02:37:06.465 [4140204] <2> wait_for_sigcld: waiting for child to exit, timeout is 3600
      02:37:06.465 [4140204] <2> child_wait: waitpid returned zero, no children waiting (../common.c:6179)
      02:37:06.465 [4140231] <2> Media_child_dispatch_signal: calling child_catch_signal for 12 (../child.c:791) delay 0 seconds
      02:37:06.466 [4140231] <2> child_catch_signal: ../child.c.641: Entered: child_catch_signal
      02:37:06.466 [4140231] <2> Media_siginfo_print: 0: delay 0 signo SIGUSR2:12 code 0 pid 4140204
      02:37:06.494 [4140231] <8> JobdSockList::~JobdSockList: 1 socket left.
      02:37:06.494 [4140231] <4> JobdSockList::UnregisterSocket: Unregister socket (1).
      02:37:06.494 [4140231] <2> put_length_bytes_nonblocking: cannot write data to network: Bad file descriptor (9)
      02:37:06.494 [4140231] <2> job_disconnect: Failed to send request
      02:37:06.495 [4140231] <2> vnet_proxy_get_logs: fstat(1) failed: 9
      02:37:06.507 [4140204] <2> Media_siginfo_print: 1: delay 0 signo SIGCHLD:17 code 1 pid 4140231