I'm having an issue with various clients throwing the errors above.
I've gone through a number of troubleshooting steps, more than I can remember at this point, including changing timeout values and other settings on the master server to try to resolve the issue.
That being said, I can solve the issue by rebooting the system. After a reboot the system will back up without error and without any settings being changed on it.
The problem is that the issue starts to recur within a week or so, requiring another reboot of the system.
If a single reboot were all that was required to fix the issue, that would be fine, but weekly reboots are not something I can have occurring.
The errors are the same across systems:
11/18/2018 16:53:31 - Critical bpbrm (pid=296) from client : FTL - socket write failed
11/18/2018 16:53:31 - Error bptm (pid=5848) socket operation failed - 10054 (at ../child.c.1276)
11/18/2018 16:53:31 - Error bptm (pid=5848) unable to perform read from client socket, connection may have been broken
11/18/2018 16:53:31 - Info bptm (pid=3896) EXITING with status 42 <----------
11/18/2018 16:53:31 - Error bpbrm (pid=296) could not send server status message
11/18/2018 04:40:08 - Critical bpbrm (pid=6188) from client : FTL - socket write failed
11/18/2018 04:40:09 - Error bptm (pid=5636) socket operation failed - 10054 (at ../child.c.1276)
11/18/2018 04:40:09 - Error bptm (pid=5636) unable to perform read from client socket, connection may have been broken
11/18/2018 04:40:09 - Error bpbrm (pid=6188) could not send server status message
Looking for next steps on how to run this issue down.
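As a starting point for running this down, the job-detail lines can be tallied by process and by socket error code, so it is easier to see whether every failure carries the same code. This is only a minimal sketch: the line layout is inferred from the samples quoted in this post, and the function names are made up for illustration. The one hard fact it relies on is that Winsock error 10054 is WSAECONNRESET (connection reset by peer).

```python
import re
from collections import Counter

# Winsock error codes and their symbolic names (10054 is the one seen above).
WINSOCK_NAMES = {
    10053: "WSAECONNABORTED",
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
}

# Layout assumed from the quoted lines:
# MM/DD/YYYY HH:MM:SS - Severity process (pid=N) message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) - "
    r"(?P<severity>\w+) (?P<process>\w+) \(pid=(?P<pid>\d+)\) (?P<message>.*)$"
)
SOCKET_RE = re.compile(r"socket operation failed - (\d+)")


def parse_line(line):
    """Return the fields of one job-detail line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count Error/Critical lines per process and socket error codes by name."""
    by_process = Counter()
    socket_codes = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["severity"] in ("Error", "Critical"):
            by_process[entry["process"]] += 1
        code_match = SOCKET_RE.search(entry["message"])
        if code_match:
            code = int(code_match.group(1))
            socket_codes[WINSOCK_NAMES.get(code, str(code))] += 1
    return by_process, socket_codes


# The first failure quoted above, used as sample input.
SAMPLE = [
    "11/18/2018 16:53:31 - Critical bpbrm (pid=296) from client : FTL - socket write failed",
    "11/18/2018 16:53:31 - Error bptm (pid=5848) socket operation failed - 10054 (at ../child.c.1276)",
    "11/18/2018 16:53:31 - Error bptm (pid=5848) unable to perform read from client socket, connection may have been broken",
    "11/18/2018 16:53:31 - Info bptm (pid=3896) EXITING with status 42 <----------",
    "11/18/2018 16:53:31 - Error bpbrm (pid=296) could not send server status message",
]

if __name__ == "__main__":
    processes, codes = summarize(SAMPLE)
    print(dict(processes), dict(codes))
```

Feeding it the job details from several failed clients would show whether the reset (10054) is the only code involved or whether some clients fail differently.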
You mean that rebooting the master/media server solves the problem, and then backups start failing again after a while?
Do you have any anti-virus software configured on your master, or any other software (Bit9 or Carbon Black) that could be causing these issues?
I am battling with the OS and NBU versions that you have selected.
W2012 with 7.1.x and earlier?
Not possible as support for W2012 started much later.
Status 24 is never an NBU issue, so a good understanding of the environment is crucial, especially of the problematic clients.
There are technotes for W2003 clients, but best if you give us correct info.
Herewith an extract from an excellent post:
I describe the 23/24/25 status codes as follows:
RC=23: Server A sent an IP packet to valid server B, and is waiting for a response packet. It fails to get the response packet within the TIMEOUT window and raises rc=23.
RC=25: Server A tried to send an IP packet to invalid server B. No connection is made, so Server A sets rc=25.
RC=24: Server A sends a packet to server B and gets a response within the TIMEOUT window. But something happens that drops the connection between them.
I make an analogy of this communication environment using phone calls:
The person on Phone A calls phone number B, which connects, and leaves a voice mail asking for a call back. They wait for a call back that does not come and, after a specified time, they quit. RC=23.
Person A calls what he thinks is a valid number for Phone B. The call does not go through and he hears the message "The number you have dialed is not a working number". RC=25.
Person A calls Person B, the call is picked up, but the line connection somehow gets dropped unexpectedly while communication is in progress. RC=24.
All of these are communication errors of some kind.
For RC=25, the source server may have the wrong target server name in its environment or an invalid/wrong IP address for the target server.
For RC=23, A can talk to B but B cannot talk to A. B may not recognize the source server, or it may be using the wrong IP address to respond to. Possible bad host name to IP resolution.
RC 24: The toughest of the bunch. A and B know each other correctly. They just can't keep the call going.
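The three cases in the extract can be loosely illustrated with ordinary TCP sockets. The sketch below maps Python socket exceptions onto the status code each one most resembles under that description. It is an analogy only, not how NBU itself assigns status codes, and the names `analogous_status` and `try_exchange` are invented for this example.

```python
import socket


def analogous_status(exc):
    """Map a socket exception to the status code it most resembles, or None."""
    # Connected, then the "call" was dropped mid-conversation.
    if isinstance(exc, (ConnectionResetError, ConnectionAbortedError, BrokenPipeError)):
        return 24
    # No connection could be made: unknown name or nothing answering.
    if isinstance(exc, (socket.gaierror, ConnectionRefusedError)):
        return 25
    # The other side was expected to answer but did not within the timeout.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return 23
    return None


def try_exchange(host, port, payload=b"ping", timeout=5.0):
    """Connect, send a few bytes, wait for a reply; return 0 or an analogous code."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            conn.recv(1024)
        return 0
    except OSError as exc:
        return analogous_status(exc)
```

The point of the mapping is the same as the phone analogy: 23 and 25 fail at or before the handshake and usually point at names, addresses, or routing, whereas 24 only shows up after both ends were already talking.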
You may also want to go through this post by @mph999:
Unfortunately most of the Symantec URLs are no longer working...
What about OS and NBU version(s) on problematic clients?
How many clients are affected? The same or different clients each time?
Does this happen only during peak backup times?
If so, have you tried to stagger backup schedule times?
The 2nd post that I have referred to lists quite a lot of possible reasons for network issues during backup window.
So the OS is Windows 2012 R2 on all of them.
Basically, the sequence of events is: the problem occurs on a machine, I reboot the machine, the problem goes away for a few days, and then the problem returns. It is not on every client, but I am getting to the point where a new client has the problem every night. I'd say this is affecting about 10 - 15 different machines.
No, it happens at off-peak times as well. Once the (24) and (42) errors start occurring, the clients will not complete a successful backup at any time until I reboot them.
As an example I have 2 clients having the issue right now and I haven't rebooted them. I just tried the suggestions in this article:
Since I added the clients to the Resilient Network they haven't failed, but they haven't completed either. Normally these clients can complete a backup in 20 minutes to an hour, and right now both are entering hour 5 and still have not completed their backup. This is at off-peak times, with no other backups running besides these two clients.
The only error I see in the detailed status of each is:
Error bpbrm (pid=7904) from client : ERR - Send bpfis state to CPASP-NBMSTR2 failed. status = 25
Sorry for not responding sooner.
Okay - this puts a different spin on things:
ERR - Send bpfis state to CPASP-NBMSTR2 failed. status = 25
For Windows Filesystem backup, the client needs connectivity to the master.
At the end of the backup, the client needs to notify the master server that the snapshot backup is done.
If CPASP-NBMSTR2 is not the master, or if there is a connection problem on port 1556 between the client and master, the backup will hang for a long time before it actually fails.
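A quick way to rule out the connection side of this is to test, from the client, whether a TCP connection to the master on port 1556 can be opened at all. The following is a minimal sketch using only the Python standard library; `can_connect` and `MASTER_PORT` are names made up for this example, and a successful connect only shows that the port is reachable, not that the name in the client's configuration really is the master.

```python
import socket

MASTER_PORT = 1556  # the port mentioned above


def can_connect(host, port=MASTER_PORT, timeout=5.0):
    """Try to open a TCP connection; return (ok, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as exc:
        # The host name could not be resolved to an address.
        return False, f"name resolution failed: {exc}"
    except (socket.timeout, TimeoutError):
        # Nothing answered within the timeout (e.g. packets silently dropped).
        return False, "timed out"
    except OSError as exc:
        # Refused, unreachable, or another socket-level failure.
        return False, f"connection failed: {exc}"


if __name__ == "__main__":
    # Run this on an affected client; the host name is the one from the error.
    print(can_connect("CPASP-NBMSTR2"))
```

Running it on a client while it is in the failing state and again after a reboot would show whether the reboot changes anything about reachability of the master.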
Seems you are getting different error codes when this happens.
I have experienced this as status 156 some years ago:
I think that link has solved my issue.
I'm not backing up any open files, so I went as far as to just outright remove the clients from that list, and the errors have stopped without having to reboot the client.
Thanks for the help
So while this did work on one client, I'm still running into the issue pretty much daily.
I've now got 5 clients with the same errors. Disabling the open file backup and restarting the NetBackup services removes the BPFIS error, but I'm still getting the socket read 24 errors and my only solution is a reboot of the client.
12/18/2018 10:27:36 - Critical bpbrm (pid=6432) from client : FTL - socket write failed
12/18/2018 10:27:36 - Error bptm (pid=4120) socket operation failed - 10054 (at ../child.c.1276)
12/18/2018 10:27:36 - Error bptm (pid=4120) unable to perform read from client socket, connection may have been broken
12/18/2018 10:27:36 - Error bpbrm (pid=6432) could not send server status message
Status 24 is probably the worst to troubleshoot.
The problem is that there is nothing within NBU that we can do or check to find the cause.
NBU is merely reporting the problem.
'Something' in your network or the Client's TCP stack happens to cause the drop in connection.
In W2003 days, we had to disable TCP Chimney to resolve status 24.
Since W2008, the recommendation is to leave TCP Chimney enabled (but then I found another TN today that says to disable it!).
If you type this in Google
netbackup windows status 24
you will find a bunch of TNs and forum posts.
You may want to check what the NICs on problematic clients have in common.
If we ask ourselves what a reboot actually restarts/refreshes in the TCP stack, it might be worth the effort to look for new firmware or drivers for the NIC cards.
Please also read again through the links that I posted a month ago.