Restore on Unix failing - cannot write data to socket, 10054

philalbert
Level 4

Hi,

I'm trying to restore a backup from one Unix (Ubuntu) server to another Ubuntu server. On multiple occasions I get the errors "cannot write data to socket, 10054" and "media manager for backup id xxx exited with status 24: socket write failed".

I've been looking at the bpbrm log file on the server and found this section. I'm not sure whether it's normal.

14:34:43.770 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA READY
14:34:43.770 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Media Ready to bpbrm child 20536
14:34:45.314 [20536.24204] <2> bpbrm mm_sig: received ready signal from media manager
14:39:44.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 1

14:39:44.134 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:44:45.060 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 2

14:44:45.185 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:49:46.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 3

14:49:46.111 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:54:47.037 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 4

14:54:47.099 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
14:59:48.025 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 5

14:59:48.150 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:04:49.045 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 6

15:04:49.107 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:09:50.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 7

15:09:50.174 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:14:51.037 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 8

15:14:51.099 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:19:52.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 9

15:19:52.135 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:24:53.061 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 10

15:24:53.185 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:29:54.049 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 11

15:29:54.174 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:34:55.053 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 12

15:34:55.178 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:39:56.057 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 13

15:39:56.182 [25708.26436] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE
15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA NOT READY
15:42:04.056 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Media Ready to bpbrm child 20536
15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: EXIT licqc06_1447567228 24
15:42:04.056 [25708.26436] <2> bpbrm process_media_msg: media manager for backup id licqc06_1447567228 exited with status 24: socket write failed
15:42:04.056 [25708.26436] <2> bpbrm kill_bpbrm_child: terminating bpbrm child 20536 jobid=4295267
15:42:04.056 [25708.26436] <2> bpbrm signal_bpbrm_child: sending Terminate to bpbrm child 20536
15:42:06.474 [20536.24204] <2> bpbrm mm_sig: received not ready signal from media manager
15:42:06.474 [20536.24204] <2> bpbrm check_for_terminate: unexpected terminate
15:42:06.474 [20536.24204] <2> bpbrm kill_child_process_Ex: start
15:42:06.474 [20536.24204] <2> job_monitoring_exex: ACK disconnect
15:42:06.474 [20536.24204] <2> job_disconnect: Disconnected
15:42:06.474 [25708.26436] <2> bpbrm brm_child_done: child done, status 150
15:42:06.474 [25708.26436] <2> bpbrm brm_child_done: bpbrm child 20536 terminated by bpbrm parent
15:42:06.474 [25708.26436] <2> bpbrm send_status_to_parent: EXIT licqc06_1447567228 24 sent to parent process for jobid = 4295267.
15:42:06.536 [25708.26436] <2> bpbrm read_parent_msg: read from parent TERMINATE
15:42:06.536 [25708.26436] <2> bpbrm tell_mm: sending media manager msg: TERMINATE
15:42:06.536 [25708.26436] <2> job_monitoring_exex: ACK disconnect
15:42:06.536 [25708.26436] <2> job_disconnect: Disconnected

 

What I'm wondering is whether all the ACK_KEEP_ALIVE messages are normal. It looks like it's timing out.

 

Thanks

Philip

1 ACCEPTED SOLUTION

Marianne
Level 6
Partner    VIP    Accredited Certified
KeepAlive messages are sent at shorter intervals to prevent a timeout. A timeout happens when no activity is seen on the connection. We often see this when large files are being written, because the acknowledgement is only sent upon completion of the write operation.
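
To check the timing in a bpbrm excerpt like the one in the question, a small script can measure the gaps between KEEP_ALIVE messages and the time from MEDIA READY to the media manager EXIT. This is only a sketch: it assumes the line layout `HH:MM:SS.mmm [pid.tid] <level> message` seen in the posted log, the function names are made up for illustration, and it is not a NetBackup tool.

```python
import re
from datetime import datetime

# Assumed layout, taken from the posted excerpt:
#   14:39:44.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 1
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\.\d+\] <\d+> (.*)$")


def parse(lines):
    """Yield (timestamp, message) for every line that matches the assumed layout."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield datetime.strptime(m.group(1), "%H:%M:%S.%f"), m.group(2)


def keepalive_gaps(lines):
    """Seconds between successive 'send_parent_msg: KEEP_ALIVE n' entries."""
    stamps = [t for t, msg in parse(lines) if "send_parent_msg: KEEP_ALIVE" in msg]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


def ready_to_exit_seconds(lines):
    """Seconds from the first MEDIA READY to the media manager EXIT, or None.

    The log carries no date, so a job that crosses midnight is not handled.
    """
    ready = exit_time = None
    for t, msg in parse(lines):
        if ready is None and msg.endswith("read from media manager: MEDIA READY"):
            ready = t
        elif "read from media manager: EXIT" in msg:
            exit_time = t
    if ready is None or exit_time is None:
        return None
    return (exit_time - ready).total_seconds()


# A few lines copied from the excerpt in the question.
sample = [
    "14:34:43.770 [25708.26436] <2> bpbrm read_media_msg: read from media manager: MEDIA READY",
    "14:39:44.010 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 1",
    "14:44:45.060 [25708.26436] <2> bpbrm send_parent_msg: KEEP_ALIVE 2",
    "15:42:04.056 [25708.26436] <2> bpbrm read_media_msg: read from media manager: EXIT licqc06_1447567228 24",
]

if __name__ == "__main__":
    print(keepalive_gaps(sample))
    print(ready_to_exit_seconds(sample))
```

Run against the full excerpt, the KEEP_ALIVE gaps come out at about 301 seconds each, and the media manager exits about 4040 seconds after MEDIA READY.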


29 REPLIES

Nicolai
Moderator
Partner    VIP   

Are there any firewalls between the master/media server and the client?

The bpbkar log on the media server and the tar log on the client are also worth a look.

 

Marianne
Level 6
Partner    VIP    Accredited Certified

Do you have all relevant logging enabled?

On media server: bpbrm and bptm

On destination client: tar  (If file-level restore)

Level 3 should be sufficient for now.

Please tell us more about the restore - is anything actually written to the client before it fails?
Are you restoring large files (that could maybe cause the timeout while bpbrm and bptm are waiting for acknowledgement from client)?

What is Client Read Timeout set to on the media server?

Please copy full logs to .txt files (e.g. bptm.txt) and upload as File attachments.

Michael_G_Ander
Level 6
Certified

Since this is a Linux variant, I am thinking it could be the local OS firewall. If there is a firewall, it is worth checking whether there are dropped packets or connections.

I would start by going through the connectivity troubleshooting steps, to be sure that name resolution works and that all required ports are open between the client and the NetBackup server(s).

I have found that the bpcd log often contains indications of the problem(s) with issues like this.

 

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

philalbert
Level 4

Since I don't have access to the Linux client, I've asked someone to create the tar log directory or, if it already exists, to send me the logs. I've also asked him whether anything was written to the destination server.

The bpbkar, bpbrm and bptm logs on the media server already exist; I've attached them to this message. The restore began around 14:30. The source client is licqc06 and the destination is ligqc05.

The client read timeout is set to 1800 sec.

Marianne
Level 6
Partner    VIP    Accredited Certified

We really need the tar log on the client - we need to follow the entire process flow.

bpbkar on a media server is not involved in client restore.

Please remember to increase logging level to 3.
Client as well.

Seems logging level is still at 0:

(VERBOSE = 0)

Will_Restore
Level 6

OP log looks a lot like this one: https://www.veritas.com/community/forums/restore-fails-when-using-change-dest-works-ok-if-no-change-dest

 

Solution
 

Check permissions at the destination.
Maybe you aren't allowed to write there

philalbert
Level 4

The Linux guy found a tar log; here's the content.

 

10:22:52 (4312399.001) INF - TAR STARTED 27445
10:22:52 (4312399.001) **LOCALE ERROR** locale <en_CA.UTF-8> not found in file </usr/openv/msg/.conf>
10:22:52 (4312399.001) Setting network receive buffer size to 32032 bytes
10:23:29 (4312399.001) Write interrupted by SIGPIPE.
10:23:29 (4312399.001) INF - TAR EXITING WITH STATUS = 40
10:23:29 (4312399.001) INF - TAR RESTORED 0 OF 0 FILES SUCCESSFULLY
10:23:29 (4312399.001) INF - TAR KEPT 0 EXISTING FILES
10:23:29 (4312399.001) INF - TAR PARTIALLY RESTORED 0 FILES

10:30:18 (4312402.001) INF - TAR STARTED 31253
10:30:18 (4312402.001) **LOCALE ERROR** locale <en_CA.UTF-8> not found in file </usr/openv/msg/.conf>
10:30:18 (4312402.001) Setting network receive buffer size to 32032 bytes
10:30:43 (4312402.001) Write interrupted by SIGPIPE.
10:30:43 (4312402.001) INF - TAR EXITING WITH STATUS = 40
10:30:43 (4312402.001) INF - TAR RESTORED 0 OF 0 FILES SUCCESSFULLY
10:30:43 (4312402.001) INF - TAR KEPT 0 EXISTING FILES
10:30:43 (4312402.001) INF - TAR PARTIALLY RESTORED 0 FILES
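
The two attempts in a tar log like this can be summarised per attempt. The sketch below is illustrative only: it assumes the layout `HH:MM:SS (id.nnn) message` seen above, treats the number before the dot in the parentheses as an identifier for the restore attempt (an assumption), and uses a made-up function name.

```python
import re
from datetime import datetime

# Assumed layout, taken from the posted tar log:
#   10:22:52 (4312399.001) INF - TAR STARTED 27445
TAR_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \((\d+)\.\d+\) (.*)$")
STATUS_RE = re.compile(r"TAR EXITING WITH STATUS = (\d+)")


def summarize_tar_log(lines):
    """Return {attempt_id: {'seconds', 'status', 'sigpipe'}} for each attempt."""
    attempts = {}
    for line in lines:
        m = TAR_RE.match(line.strip())
        if not m:
            continue  # blank lines or lines in another layout
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        attempt_id, msg = m.group(2), m.group(3)
        info = attempts.setdefault(
            attempt_id,
            {"first": stamp, "last": stamp, "status": None, "sigpipe": False},
        )
        info["last"] = stamp
        status = STATUS_RE.search(msg)
        if status:
            info["status"] = int(status.group(1))
        if "SIGPIPE" in msg:
            info["sigpipe"] = True
    return {
        attempt_id: {
            "seconds": (info["last"] - info["first"]).total_seconds(),
            "status": info["status"],
            "sigpipe": info["sigpipe"],
        }
        for attempt_id, info in attempts.items()
    }


# First attempt copied from the log above.
sample = [
    "10:22:52 (4312399.001) INF - TAR STARTED 27445",
    "10:22:52 (4312399.001) Setting network receive buffer size to 32032 bytes",
    "10:23:29 (4312399.001) Write interrupted by SIGPIPE.",
    "10:23:29 (4312399.001) INF - TAR EXITING WITH STATUS = 40",
]

if __name__ == "__main__":
    print(summarize_tar_log(sample))
```

For the first attempt this reports 37 seconds between TAR STARTED and the exit with status 40, with a SIGPIPE seen.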

 

How do I change the verbose level on a Linux client? Is it the same command: nbsetconfig TAR_VERBOSE = 3?

Thanks

philalbert
Level 4

OK, I found out how to change the logging level on Linux.

philalbert
Level 4

I can confirm that there's no firewall installed on that Linux box.

Also, nothing was written to the destination server.

 

Marianne
Level 6
Partner    VIP    Accredited Certified
We need logs for the same restore attempt. Hope you will be able to collect a full set and upload it.

philalbert
Level 4

Yes I know Marianne. I was waiting to receive the tapes.

I'll be running the restore tonight and I'll be able to upload all the logs. (Might be only tomorrow though)

 

Thanks

revarooo
Level 6
Employee

I bet there is firewall software installed on the client; it may not have any rules, but it will be there.

Get a root admin to run:

iptables -L 

Then post the output here.

philalbert
Level 4

The weird thing is that some restores work and some don't. I'll ask an admin to run the command.

revarooo
Level 6
Employee

Phil, are you restoring them all to the same place? It could be a permissions issue on the directories/filesystems you are restoring to.

philalbert
Level 4

Yes, they're all going into the same parent folder.

 

/srv/mail/mysite.com/Restaure/xxx

The xxx part is the only part changing.

revarooo
Level 6
Employee

VERBOSE 5 logs required as mentioned above - all of them needed.

philalbert
Level 4

Hi revarooo, here's the result of the iptables -L command.

~$ iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination        

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination        

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination        

Chain SUBNET_CHECK (0 references)
target     prot opt source               destination         
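
For anyone who wants to check output like this mechanically, here is a minimal Python sketch that counts the rules per chain in pasted `iptables -L` text. The function name is hypothetical, and it only covers the filter table, which is what `iptables -L` prints by default.

```python
import re

# Chain headers look like "Chain INPUT (policy ACCEPT)" for built-in chains
# and "Chain SUBNET_CHECK (0 references)" for user-defined ones.
CHAIN_RE = re.compile(r"^Chain (\S+) \((?:policy (\w+))?")


def summarize_iptables(text):
    """Return {chain: {'policy': str or None, 'rules': int}} from `iptables -L` text."""
    chains = {}
    current = None
    for line in text.splitlines():
        m = CHAIN_RE.match(line)
        if m:
            current = m.group(1)
            chains[current] = {"policy": m.group(2), "rules": 0}
        elif current and line.strip() and line.split()[:2] != ["target", "prot"]:
            # Any non-blank line that is not the column header counts as a rule.
            chains[current]["rules"] += 1
    return chains


# The output posted above.
sample = """Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain SUBNET_CHECK (0 references)
target     prot opt source               destination
"""

if __name__ == "__main__":
    for chain, info in summarize_iptables(sample).items():
        print(chain, info)
```

On the output above, every chain has zero rules and the three built-in chains have policy ACCEPT.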

Marianne
Level 6
Partner    VIP    Accredited Certified
Have you been able to collect logs?

Nicolai
Moderator
Partner    VIP   

Is this error consistent? If you restore a file from /tmp, do you still get the same error?