Restore on Unix failing - cannot write data to socket, 10054

philalbert
Level 4

Hi,

I'm trying to restore a backup from one Unix (Ubuntu) server to another Ubuntu server. On multiple occasions I get the errors "cannot write data to socket, 10054" and "media manager for backup id xxx exited with status 24: socket write failed".

I've been looking at the bpbrm log file on the server and found this section. I'm not sure whether it's normal.

14:34:43.770 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA READY
14:34:43.770 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Media Ready to bpbrm child 20536
14:34:45.314 [20536.24204] <2> bpbrm mm_sig: received ready signal from media manager
14:39:44.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 1

14:39:44.134 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:44:45.060 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 2

14:44:45.185 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:49:46.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 3

14:49:46.111 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:54:47.037 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 4

14:54:47.099 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:59:48.025 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 5

14:59:48.150 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:04:49.045 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 6

15:04:49.107 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:09:50.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 7

15:09:50.174 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:14:51.037 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 8

15:14:51.099 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:19:52.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 9

15:19:52.135 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:24:53.061 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 10

15:24:53.185 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:29:54.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 11

15:29:54.174 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:34:55.053 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 12

15:34:55.178 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:39:56.057 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 13

15:39:56.182 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA NOT READY
15:42:04.056 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Media Ready to bpbrm child 20536
15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: EXIT licqc06_1447567228 24
15:42:04.056 [25708.26436] <2> bpbrm process_media_msg: media manager for backup id licqc06_1447567228 exited with status 24: socket write failed
15:42:04.056 [25708.26436] <2> bpbrm kill_bpbrm_child: terminating bpbrm child 20536 jobid=4295267
15:42:04.056 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Terminate to bpbrm child 20536
15:42:06.474 [20536.24204] <2> bpbrm mm_sig: received not ready signal from media manager
15:42:06.474 [20536.24204] <2> bpbrm check_for_terminate: unexpected terminate
15:42:06.474 [20536.24204] <2> bpbrm kill_child_process_Ex: start
15:42:06.474 [20536.24204] <2> job_monitoring_exex: ACK disconnect
15:42:06.474 [20536.24204] <2> job_disconnect: Disconnected
15:42:06.474 [25708.26436] <2> bpbrm brm_child_done: child done, status 150
15:42:06.474 [25708.26436] <2> bpbrm brm_child_done: bpbrm child 20536 terminated by bpbrm parent
15:42:06.474 [25708.26436] <2> bpbrm send_status_to_parent: EXIT licqc06_1447567228 24 sent to parent process for jobid = 4295267.
15:42:06.536 [25708.26436] <2> bpbrm read_parent_msg: read from parent TERMINATE
15:42:06.536 [25708.26436] <2> bpbrm tell_mm: sending media manager msg: TERMINATE
15:42:06.536 [25708.26436] <2> job_monitoring_exex: ACK disconnect
15:42:06.536 [25708.26436] <2> job_disconnect: Disconnected

 

What I'm wondering is whether all the ACK_KEEP_ALIVE messages are normal. It looks like it's timing out.

 

Thanks

Philip
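
To make the timing in the excerpt easier to read, here is a minimal Python sketch that parses a few of the bpbrm lines quoted above and reports the gap between KEEP_ALIVE messages and the time from MEDIA READY to the EXIT message. The regular expression and the helper names (`parse`, `keep_alive_gaps`) are assumptions made for this illustration, based only on the line layout visible in the post; they are not part of NetBackup.

```python
import re
from datetime import datetime, timedelta

# A few lines copied from the bpbrm excerpt above (shortened for the example).
LOG = """\
14:34:43.770 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA READY
14:39:44.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 1
14:44:45.060 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 2
14:49:46.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 3
15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: EXIT licqc06_1447567228 24
"""

# Assumed layout: HH:MM:SS.mmm [pid.tid] <level> message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\.\d+\] <\d+> (.*)$")


def parse(log_text):
    """Return (timestamp, message) pairs for every line that matches the layout."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            entries.append((stamp, match.group(2)))
    return entries


def keep_alive_gaps(entries):
    """Return the time between consecutive KEEP_ALIVE messages sent to the parent."""
    times = [t for t, msg in entries if "send_parent_msg: KEEP_ALIVE" in msg]
    return [later - earlier for earlier, later in zip(times, times[1:])]


entries = parse(LOG)
gaps = keep_alive_gaps(entries)
ready = next(t for t, msg in entries if msg.endswith("MEDIA READY"))
failed = next(t for t, msg in entries if " EXIT " in msg)

print("seconds between keep-alives:", [round(g.total_seconds()) for g in gaps])
print("MEDIA READY to EXIT:", failed - ready)
```

On the quoted lines the keep-alives are about five minutes apart and each one is acknowledged, which suggests the bpbrm parent/child messaging itself is healthy; the failure arrives a little over an hour after MEDIA READY.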

29 REPLIES

philalbert
Level 4

It is not consistent, Nicolai. Yesterday, 3 restores worked, but 1 failed.

 

 

Marianne, a lot of restores and backups ran at the same time as my test, so it's not easy to look at the logs. I'm going to run a new restore now (when nothing else is running), so the logs should be easier to read. Sorry for the delay.

philalbert
Level 4

Here are the log files.

The tar is pretty empty.

Nicolai
Moderator
Partner    VIP   

Something is messing with the network communication:

 <16> write_to_out: cannot write data to socket, 10054

You need to verify whether network firewalls, anti-virus or endpoint protection software is closing the connection on purpose. Alternatively, restore mail to another location and ask the admin to move them. I know it's not a "smoking gun", but this is what I was able to extract from the logs.

According to https://msdn.microsoft.com/en-us/library/windows/desktop/ms740668(v=vs.85).aspx, code 10054 is:

Connection reset by peer.

An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.

 

philalbert
Level 4

You're on the right track Nicolai.

I checked with the firewall team and there's one between the media server and the destination client.

The firewall is dropping some connections and I'm seeing these messages:

TCP packet out of state: First packet isn't SYN; tcp_flags: PUSH-ACK

TCP packet out of state: First packet isn't SYN; tcp_flags: RST-ACK

TCP packet out of state: First packet isn't SYN; tcp_flags: RST-ACK

 

Anyone seen this behavior before and know what to do to fix it?

 

Thanks!

 

philalbert
Level 4

Forgot to mention that we see those messages at the same time that my restores fail.

Marianne
Level 6
Partner    VIP    Accredited Certified
You need to speak to firewall admins. Ask them to increase firewall timeouts. Another option would be to reduce KeepAlive settings on media server and client.

philalbert
Level 4

Yes I will talk to them.

But if I reduce the KeepAlive setting, won't the timeout happen earlier?

Marianne
Level 6
Partner    VIP    Accredited Certified
KeepAlive will be sent at shorter intervals to prevent the timeout. A timeout happens when no activity is seen on the connection. We often see this with large files being written: the acknowledgement is only sent upon completion of the write operation.
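
As an illustration of what a shorter keepalive interval does at the socket level, here is a minimal Python sketch, assuming the KeepAlive in question is the operating system's TCP keepalive. NetBackup itself is tuned through OS settings rather than code like this, and the function name `enable_keepalive` and the timing values are placeholders chosen for the example; the idle time has to be shorter than the firewall's idle timeout, which is not stated in this thread.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Ask the OS to send keepalive probes on an otherwise idle TCP connection.

    idle_s, interval_s and probes are illustrative values, not recommendations.
    """
    # Turn keepalive on for this socket (works on all major platforms).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if hasattr(socket, "TCP_KEEPIDLE"):
        # Linux: per-socket idle time, probe interval and probe count, in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    elif hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keepalive enabled:", bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```

With probes going out more often than the firewall's idle timer expires, the firewall keeps seeing traffic on the connection even while a long write is in progress, so it has no reason to drop the session from its state table.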

philalbert
Level 4

Oh ok, I understand.

Thanks!

philalbert
Level 4

Hi,

 

I modified the value of the KeepAliveTime key under TCPIP in the registry on all of my media servers. The restore worked without any issue.

The key is:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters

DWORD: KeepAliveTime, value 900000 (decimal, in milliseconds = 15 min)

Thank you very much Marianne!
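
For anyone picking a different interval, KeepAliveTime is expressed in milliseconds, so the arithmetic behind the value above can be sketched in Python as follows. The helper name `keepalive_dword` is made up for this example; the hexadecimal form is shown because regedit displays DWORD values in hex by default.

```python
def keepalive_dword(minutes):
    """Convert an idle time in minutes to the millisecond value KeepAliveTime expects."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return int(minutes * 60 * 1000)


value = keepalive_dword(15)
print(value)             # 900000, the value used in this thread
print(f"{value:#x}")     # 0xdbba0, the same value as regedit shows it in hex
print(keepalive_dword(120))  # 7200000, the Windows default of two hours
```

Whatever value is chosen needs to be below the idle timeout configured on the firewall between the media server and the client.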