Forum Discussion

Kasra_Hashemi
5 years ago

Network Connection Timeout and Socket WriteFailed


  • Marianne, thanks by the way. You have always helped me and I'm so thankful. You and all the other members are true humans and true engineers, because you help people without expecting anything in return.

    Today I was able to solve this issue by reinstalling the agent and changing the backup path (I presented another LUN from the SAN storage).


  • Curious to know why you have 'Follow NFS' and 'Cross Mountpoints' selected. 

    Is /backup an NFS mount? With nested mountpoints?

    The timeout seems to be a network read timeout of 30 minutes.
    This means that the media server received NO data for 30 minutes. 

    Do you have bpbkar log folder on the client?
    And bpbrm and bptm log folders on the media server?

    All of these logging levels need to be higher than 0. I suggest level 3. 

    bpbkar on the client will show when files/data is sent to the media server.
    bptm on the media server will show each time a data block is received.
    bpbrm will show metadata received from client's bpbkar.

    bpbkar should also show file sizes - if there are large backup images that take very long to transmit, then it is quite possible that you will see timeouts. 
    Here you will need to look at TCP KeepAlive settings on the master, media server and client.

    Let us look at logs first.
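Once the logs exist, one way to locate the silent period is to scan a log for long gaps between consecutive entries. The following is only a minimal sketch: it assumes each log line starts with a time stamp of the form HH:MM:SS.mmm, which is an assumption about the log layout and should be checked against the real files. The names `parse_stamp` and `find_gaps` are hypothetical helpers, not part of NetBackup.

```python
import re
from datetime import timedelta

# Assumed line prefix: "HH:MM:SS.mmm " (an assumption about the log layout).
_STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s")


def parse_stamp(line):
    """Return the time of day as a timedelta, or None if the line has no stamp."""
    match = _STAMP.match(line)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Yield (previous_line, next_line, gap) for consecutive stamped lines
    that are at least `threshold` apart. Unstamped lines are skipped."""
    prev_line, prev_time = None, None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue
        if prev_time is not None:
            gap = stamp - prev_time
            if gap < timedelta(0):
                # The stamps carry no date, so treat a negative gap
                # as the log rolling past midnight.
                gap += timedelta(days=1)
            if gap >= threshold:
                yield prev_line, line, gap
        prev_line, prev_time = line, stamp
```

Feeding it the lines of a bpbkar, bpbrm or bptm log (for example `find_gaps(open(path))`, where `path` is whichever log file you are inspecting) should point to the entries on either side of a 30-minute silence, which can then be matched against the timestamps in Job Details.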

    • Kasra_Hashemi
      Level 5

      Marianne 

      I have unchecked all the options you mentioned (there are no mount points or NFS; those settings were only for testing).
      I changed the logging level to 3, and to make things clear I have replaced my server name with (NETBACKUP) and the Oracle client name with (CLIENT).
      I couldn't find the bpkar log (in /usr/openv/netbackup/logs/bpkar; I created the bpkar directory). I'm still searching for the solution and will attach that log soon.
      But here are the bptm and bpbrm log files.

      • Marianne
        Level 6

        The client directory name must be bpbkar. 
        You need to create the folder if it does not exist. 

        Incorrect folder name will not log anything.

        It is important to have all 3 logs - bpbkar on the client, bpbrm and bptm.
        All three logs are necessary in order to follow the process flow.
        We will also need all text in Job Details to obtain timestamps and PID. 
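Because a misspelled folder name silently produces no log, it can help to check the exact directory names before rerunning the backup. This is a minimal sketch, assuming the logs live under /usr/openv/netbackup/logs as mentioned in this thread; `missing_log_dirs` and `REQUIRED` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Log directory names taken from this thread: bpbkar on the client,
# bpbrm and bptm on the media server.
REQUIRED = {
    "client": ["bpbkar"],
    "media server": ["bpbrm", "bptm"],
}


def missing_log_dirs(log_root, names):
    """Return the names from `names` that are not directories under log_root.

    The comparison is exact, so a folder called "bpkar" does not satisfy
    a requirement for "bpbkar".
    """
    root = Path(log_root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed default location; adjust if your installation differs.
    log_root = "/usr/openv/netbackup/logs"
    for role, names in REQUIRED.items():
        missing = missing_log_dirs(log_root, names)
        print(f"{role}: missing {missing if missing else 'nothing'}")
```

Run it on the client and on the media server; any name it reports as missing needs to be created with exactly that spelling.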

  • There is never a single common problem underlying status 41.  The status 41 can be caused by a very wide range of situations.  My advice would be to check that all elements of the network stack and network path do NOT have any strange custom configuration.


    For example, I post this here purely as an example of how strange network configuration settings can cause strange networking errors for networking applications:

    https://www.veritas.com/support/en_US/article.100020853

    ...i.e. I am not trying to suggest that your problem is this, merely to demonstrate an example of a strange configuration setting having strange results.


    However, having said all that, one of the most common culprits behind networking errors is "TCP keepalive". I'll let you do some searching on that topic. Remember, keepalive is a feature at every point along the network path: source, carrier, target. That is, you must check the keepalive at all network contact points, which means: client, switch, media, switch, storage, switch, master.
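As a starting point for the hosts that run Linux, the kernel's keepalive values can be read from /proc/sys/net/ipv4. The sketch below only covers the operating system side and says nothing about switches or firewalls in between. `read_keepalive` and `keepalive_exceeds` are hypothetical helper names, and `idle_limit_seconds` stands in for whatever idle-session limit a device on your path enforces, which you would have to find out yourself.

```python
from pathlib import Path

# Standard Linux sysctl names for TCP keepalive.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Return a dict of the keepalive settings; a value is None if unreadable."""
    settings = {}
    for key in KEEPALIVE_KEYS:
        try:
            settings[key] = int((Path(proc_dir) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings


def keepalive_exceeds(settings, idle_limit_seconds):
    """True if the first keepalive probe would be sent only after the given
    idle limit has already passed, i.e. too late to keep the session open."""
    idle = settings.get("tcp_keepalive_time")
    return idle is not None and idle > idle_limit_seconds


if __name__ == "__main__":
    current = read_keepalive()
    print(current)
    # 1800 seconds is used here only because this thread involves a
    # 30-minute timeout; substitute the limit that applies to your network.
    print("first probe later than limit:", keepalive_exceeds(current, 1800))
```

Comparing the output between the two failing clients and the two working ones is a quick way to see whether their keepalive settings actually match.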

    • There are four Oracle Linux NetBackup clients in the environment and only two of them have this issue, so I think TCP keepalive must be configured correctly; otherwise the other two would have the problem as well.

       

      • Hamza_H
        Moderator

        Hello,

        Did you try to enable logs as suggested by Marianne?

        sdo's post is interesting. It's obvious the error is about the timeout, which is 30 minutes, meaning the media server waited all that time and didn't receive any data. So what you need to do is enable logs and look closely at what causes this delay, because I believe that even if you increase the timeout to 3600 you will just get the error after 1 hour. You need to dig into this.

        Also, you said this issue affects 2 of the 4 Oracle clients, so you have to look at what kind of data the failing clients are backing up and how big it is.

         

        Good luck