04-22-2015 03:06 AM
Hi,
In our environment, backups are failing with EC 24 (status code 24, socket write failed).
Windows Client Server - 2008 R2 -- NBU Version 7.6
Netbackup Master Server - Solaris - 7.6
Netbackup Media Server - Solaris 7.6
Below is the detailed status:
04/22/2015 03:15:07 - Info nbjm (pid=16618) starting backup job (jobid=3161892) for client XXXXXXXXXX, policy XXXXXXXXXX_test, schedule Full
04/22/2015 03:15:07 - Info nbjm (pid=16618) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3161892, request id:{4DE7C35A-E8BF-11E4-9428-00C0DD1C0E45})
04/22/2015 03:15:07 - requesting resource XXXXXXXXXX_dd_nfs001_dsu_st
04/22/2015 03:15:07 - requesting resource asprd537-ebr.XXXXXXXXXX.NBU_CLIENT.MAXJOBS.XXXXXXXXXX
04/22/2015 03:15:07 - requesting resource asprd537-ebr.XXXXXXXXXX.NBU_POLICY.MAXJOBS.XXXXXXXXXX_test
04/22/2015 03:15:07 - granted resource asprd537-ebr.XXXXXXXXXX.NBU_CLIENT.MAXJOBS.XXXXXXXXXX
04/22/2015 03:15:07 - granted resource asprd537-ebr.XXXXXXXXXX.NBU_POLICY.MAXJOBS.XXXXXXXXXX_test
04/22/2015 03:15:07 - granted resource MediaID=@aaaa6;Path=/opt/app/ebr/XXXXXXXXXX/nfs_stu001/nbu_dsu_st;MediaServer=XXXXXXXXXX.XXXXXXXXXX
04/22/2015 03:15:07 - granted resource XXXXXXXXXX_dd_nfs001_dsu_st
04/22/2015 03:15:08 - estimated 0 kbytes needed
04/22/2015 03:15:08 - Info nbjm (pid=16618) started backup (backupid=XXXXXXXXXX_1429686907) job for client XXXXXXXXXX, policy XXXXXXXXXX_test, schedule Full on storage unit XXXXXXXXXX_dd_nfs001_dsu_st
04/22/2015 03:15:12 - started process bpbrm (pid=9051)
04/22/2015 03:15:13 - Info bpbrm (pid=9051) XXXXXXXXXX is the host to backup data from
04/22/2015 03:15:14 - Info bpbrm (pid=9051) reading file list for client
04/22/2015 03:15:14 - connecting
04/22/2015 03:15:16 - Info bpbrm (pid=9051) starting bpbkar on client
04/22/2015 03:15:16 - connected; connect time: 0:00:00
04/22/2015 03:15:17 - Info bpbkar (pid=4896) Backup started
04/22/2015 03:15:17 - Info bpbrm (pid=9051) bptm pid: 9057
04/22/2015 03:15:17 - Info bpbkar (pid=4896) change time comparison:<disabled>
04/22/2015 03:15:17 - Info bpbkar (pid=4896) archive bit processing:<enabled>
04/22/2015 03:15:18 - Info bpbkar (pid=4896) not using change journal data for <C:\>: not enabled
04/22/2015 03:15:18 - Info bpbkar (pid=4896) not using change journal data for <D:\>: not enabled
04/22/2015 03:15:18 - Info bpbkar (pid=4896) not using change journal data for <E:\>: not enabled
04/22/2015 03:15:18 - Info bpbkar (pid=4896) not using change journal data for <F:\>: not enabled
04/22/2015 03:15:19 - Info bptm (pid=9057) start
04/22/2015 03:15:20 - Info bptm (pid=9057) using 262144 data buffer size
04/22/2015 03:15:20 - Info bptm (pid=9057) using 32 data buffers
04/22/2015 03:15:31 - Info bptm (pid=9057) start backup
04/22/2015 03:15:33 - Info bptm (pid=9057) backup child process is pid 9093
04/22/2015 03:15:33 - begin writing
04/22/2015 03:32:09 - Critical bpbrm (pid=9051) from client XXXXXXXXXX: FTL - socket write failed
04/22/2015 03:32:09 - Error bptm (pid=9093) system call failed - Connection reset by peer (at child.c.1306)
04/22/2015 03:32:10 - Error bptm (pid=9093) unable to perform read from client socket, connection may have been broken
04/22/2015 03:32:11 - Error bptm (pid=9057) media manager terminated by parent process
04/22/2015 03:32:49 - Error bpbrm (pid=9051) could not send server status message
04/22/2015 03:32:51 - Info bpbkar (pid=4896) done. status: 24: socket write failed
04/22/2015 03:32:51 - end writing; write time: 0:17:18
socket write failed (24)
-- Checked the following things -
04-22-2015 09:56 AM
Experts : Could you please help to fix this EC 24 issue ?
04-22-2015 11:36 AM
There is a significant time lag shown in the details between the "begin writing" and the "socket write failed" message. Over 16 minutes in fact. What is the client connection timeout value? What is the policy type and options being used for the backup?
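The lag mentioned above can be checked directly from the job details. The sketch below is only an illustration: it takes the "begin writing" line and the first failure line from the original post and subtracts their timestamps. The helper names (`job_timestamp`, `lag_seconds`) are made up for this example and are not part of NetBackup.

```python
from datetime import datetime

# Two lines copied from the detailed status in the original post.
BEGIN_LINE = "04/22/2015 03:15:33 - begin writing"
FAIL_LINE = ("04/22/2015 03:32:09 - Critical bpbrm (pid=9051) from client "
             "XXXXXXXXXX: FTL - socket write failed")


def job_timestamp(line):
    """Parse the leading 'MM/DD/YYYY HH:MM:SS' stamp (19 characters) of a job detail line."""
    return datetime.strptime(line[:19], "%m/%d/%Y %H:%M:%S")


def lag_seconds(start_line, end_line):
    """Whole seconds elapsed between two job detail lines."""
    delta = job_timestamp(end_line) - job_timestamp(start_line)
    return int(delta.total_seconds())


lag = lag_seconds(BEGIN_LINE, FAIL_LINE)
print(f"{lag // 60} min {lag % 60} s between 'begin writing' and the failure")
```

Comparing this figure with the configured client timeouts is one way to judge whether a timeout or an abrupt disconnect is the more likely trigger.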
For diagnostic work I would enable/verify debug logging of bpbkar and bpcd on the client and see what it shows for any level of progress. I would do same for bpbrm and bptm on the Media Server.
The message "Connection reset by peer" typically means that the client process that bptm was talking to (bpbkar) died unexpectedly. It could also mean that the network connection itself went down. The bptm process has no idea why, only that it can no longer "talk" to the client process. That is analogous to talking to somebody on a cell phone and having the call drop on you. Was it because the other person hung up (application crash), or because the link to the cell tower failed (network failure)?
So, strictly as a first guess, the problem appears to be happening on the client.
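To make the "Connection reset by peer" explanation concrete, here is a small loopback sketch that has nothing to do with NetBackup itself. One end closes abruptly (forced by setting `SO_LINGER` to zero so the OS sends a TCP RST), and the other end's `recv()` fails with a reset. It assumes a POSIX-style `struct linger` of two ints; the function name `demo_reset` is hypothetical.

```python
import socket
import struct


def demo_reset():
    """Return what the receiving end observes when its peer closes abruptly."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free port on loopback
    listener.listen(1)
    port = listener.getsockname()[1]

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(("127.0.0.1", port))
    receiver, _ = listener.accept()
    receiver.settimeout(5)               # never hang forever in the demo

    # Linger on with a zero timeout: close() discards the connection and
    # sends RST instead of the normal FIN handshake (an "unexpected death").
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    sender.close()

    try:
        data = receiver.recv(1024)
        outcome = "clean close" if data == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"                # errno ECONNRESET
    except socket.timeout:
        outcome = "timeout"
    finally:
        receiver.close()
        listener.close()
    return outcome


print(demo_reset())
```

The receiving side only sees the reset. Nothing in the error says whether the peer process crashed, was killed, or a device in the path tore the connection down, which is why the logs on both ends are needed.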
04-22-2015 01:34 PM
Another option is to grab a TCP dump on both the media server and the client and run it through Wireshark.
Look at the logs as suggested by Jaime first though.
04-22-2015 02:16 PM
netsh int tcp show global
Run this on the client and the media server.
If TCP autotuning is enabled, disable it:
netsh int tcp set global autotuning=disabled
STATUS CODE 24: Socket write failed
Article: TECH150369
Updated: July 22, 2014
Article URL: http://www.symantec.com/docs/TECH150369
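If the check has to be repeated on several Windows clients, the autotuning level can be pulled out of the `netsh int tcp show global` output with a few lines of Python. This is a sketch under assumptions: the sample text is an abbreviated illustration of the English-locale output, not a capture from the poster's system, and the function name `autotuning_level` is hypothetical.

```python
import re


def autotuning_level(netsh_output):
    """Return the 'Receive Window Auto-Tuning Level' value in lowercase, or None."""
    match = re.search(r"Receive Window Auto-Tuning Level\s*:\s*(\S+)",
                      netsh_output, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Abbreviated, illustrative output (assumed English locale).
SAMPLE = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
Receive Window Auto-Tuning Level    : normal
"""

level = autotuning_level(SAMPLE)
if level != "disabled":
    print(f"autotuning is '{level}'; the post above suggests disabling it")
```

On a live Windows host the text could come from `subprocess.run(["netsh", "int", "tcp", "show", "global"], capture_output=True, text=True).stdout`.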
04-22-2015 06:50 PM
I already tried the netsh int tcp set global autotuning=disabled command, and also checked the bptm and bpbrm logs on the media server and the bpbkar and bpcd logs on the client, but no luck. The EC 24 issue still persists.
04-22-2015 09:58 PM
Getting the error below in the bpbkar log file:
21:04:05.127 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:06.141 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:07.155 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:08.169 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:09.183 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:10.197 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:11.211 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
Any suggestions ?
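When a log repeats the same line many times, a quick tally makes it easier to see what is actually failing. The sketch below parses lines in the shape shown above (time, `[pid.tid]`, `<severity>`, function, message) and counts them per function and Winsock code. The regular expressions are inferred from these sample lines only, so treat the format as an assumption; the name `summarize` is hypothetical. Code 10058 is WSAESHUTDOWN in Winsock, meaning the socket had already been shut down when the call was made.

```python
import re
from collections import Counter

# Layout inferred from the bpbkar lines quoted in this thread.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<severity>\d+)> "
    r"(?P<func>\w+): (?P<message>.*)$"
)
CODE_RE = re.compile(r"\(TCP (\d+):")


def summarize(lines):
    """Count log lines per (function, Winsock error code) pair."""
    counts = Counter()
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue                      # skip lines in another format
        code = CODE_RE.search(parsed.group("message"))
        key = (parsed.group("func"), int(code.group(1)) if code else None)
        counts[key] += 1
    return counts


SAMPLE = [
    "21:04:05.127 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) "
    "(TCP 10058: Can't send after socket shutdown)",
    "21:04:06.141 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) "
    "(TCP 10058: Can't send after socket shutdown)",
]

for (func, code), n in summarize(SAMPLE).items():
    print(f"{func}: Winsock {code} seen {n} times")
```

The repeated 10058 lines are a symptom logged after the connection was already gone, so the entries just before the first one are usually the more useful ones to look at.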
04-22-2015 10:28 PM
bpbkar logs: getting the error below:
21:04:05.127 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:06.141 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:07.155 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:08.169 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:09.183 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:10.197 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
21:04:11.211 [13332.13884] <16> dtcp_read: TCP - failure: recv socket (612) (TCP 10058: Can't send after socket shutdown)
Please advise!
04-23-2015 12:09 AM
OK, when did this problem start? (It seems to affect only one client?)
Has anyone applied any OS patches or something similar to this server?
NBU doesn't cause status 24s !!!
Suggest you get the logs (though I think they may only tell us the error we already know) , and then look at the TCP dumps in Wireshark.
04-23-2015 02:36 AM
1) How many clients in total in the whole environment?
2) How many clients fail with status 24?
3) How long has this been happening for?
4) Is the master server also a media server? i.e. is it a 'master/media' server?
5) How many media servers are there in the NetBackup domain which is experiencing this issue?
05-06-2015 08:54 AM
It's not a NetBackup issue. The Windows SA changed some settings on the OS side and the issue has been resolved.
Thanks all for your valuable suggestions.
05-07-2015 12:34 AM
Hi dixit47 - any chance you could share with us some detail around the actual problem, and the actual solution? Many thanks.