
Network Connection Timeout and Socket WriteFailed



Re: Network Connection Timeout and Socket WriteFailed

There is no single root cause behind status 41; it can result from a very wide range of situations.  My advice would be to check that no element of the network stack or network path has any unusual custom configuration.

.

For example, I post this here purely as an example of how strange network configuration settings can cause strange networking errors for networking applications:

https://www.veritas.com/support/en_US/article.100020853

...i.e. I am not trying to suggest that your problem is this, merely to demonstrate an example of a strange configuration setting having strange results.

.

However, having said all that, one of the usual culprits with networking errors is "TCP keepalive".  I'll let you do some searching on that topic.  Remember, keepalive is a setting at every point along the network path: source, carrier, target.  In other words, you must check the keepalive at all network contact points, which means: client, switch, media, switch, storage, switch, master.
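
As a starting point for the host side of that check, here is a minimal sketch that reads the kernel's TCP keepalive defaults on a Linux host (the clients in this thread are Oracle Linux). The function names and the `proc_dir` parameter are hypothetical, introduced only for illustration; the three file names are the standard Linux sysctl entries. These values only affect sockets that have keepalive enabled, and switches or firewalls along the path have idle timers of their own that this cannot see.

```python
from pathlib import Path

# Standard Linux sysctl entries for TCP keepalive (values are in seconds,
# except tcp_keepalive_probes, which is a count).
KEEPALIVE_FILES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Return {setting: int or None} for the three keepalive sysctls."""
    values = {}
    for name in KEEPALIVE_FILES:
        try:
            values[name] = int((Path(proc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # missing, unreadable, or not a Linux host
    return values


def seconds_until_dead_peer(values):
    """Idle time before the first probe plus the time for every probe to go unanswered."""
    if None in values.values():
        return None
    return (values["tcp_keepalive_time"]
            + values["tcp_keepalive_intvl"] * values["tcp_keepalive_probes"])


if __name__ == "__main__":
    settings = read_keepalive()
    for name, value in settings.items():
        print(f"{name} = {value}")
    print("seconds until a dead peer is detected:", seconds_until_dead_peer(settings))
```

Running this on the client, media server and master makes it easy to compare the three hosts side by side before looking at the devices in between.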


Re: Network Connection Timeout and Socket WriteFailed

Curious to know why you have 'Follow NFS' and 'Cross Mountpoints' selected. 

Is /backup an NFS mount? With nested mount points?

The timeout seems to be a network read timeout of 30 minutes.
This means that the media server received NO data for 30 minutes. 

Do you have bpbkar log folder on the client?
And bpbrm and bptm log folders on the media server?

All of these logging levels need to be higher than 0. I suggest level 3. 

bpbkar on the client will show when files/data is sent to the media server.
bptm on the media server will show each time a data block is received.
bpbrm will show metadata received from client's bpbkar.

bpbkar should also show file sizes - if there are large backup images that take very long to transmit, then it is quite possible that you will see timeouts. 
Here you will need to look at TCP KeepAlive settings on the master, media server and client.

Let us look at logs first.


Re: Network Connection Timeout and Socket WriteFailed

There are four Oracle Linux NetBackup clients in the environment and only two of them have this issue, so I think TCP keepalive must be configured correctly; otherwise the other two would have the problem as well.

 

Netdigest

Re: Network Connection Timeout and Socket WriteFailed

Hello,

Did you try to enable logs as suggested by @Marianne ?

@sdo's post is interesting. The error is clearly about the timeout, which is 30 minutes, meaning the media server waited all that time without receiving any data. So what you need to do is enable the logs and look closely at what causes this delay. I believe that even if you increase the timeout to 3600 you will simply get the error after 1 hour, so you need to dig into this.

Also, you said this issue affects 2 of the 4 Oracle clients, so look at what kind of data these failing clients back up and how big it is.

 

Good luck

 


Re: Network Connection Timeout and Socket WriteFailed

@Marianne 

I have unchecked all the options you mentioned (there are no mount points or NFS; those settings were only there for testing).
I changed the logging level to 3. To keep things clear, I have replaced my server name with (NETBACKUP) and the Oracle client with (CLIENT).
I couldn't find the bpkar log (in //usr/openv/netbackup/logs.bpkar; I created the bpkar directory). I'm still searching for the solution and will attach that log soon.
Here are the bptm and bpbrm log files.

Netdigest

Re: Network Connection Timeout and Socket WriteFailed

The client directory name must be bpbkar. 
You need to create the folder if it does not exist. 

Incorrect folder name will not log anything.

It is important to have all 3 logs - bpbkar on the client, bpbrm and bptm.
All three logs are necessary in order to follow the process flow.
We will also need all text in Job Details to obtain timestamps and PID. 
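
To avoid losing a run to a misnamed folder, a small check along these lines could be run on each host before the next backup. This is only a sketch: the logs root is the path quoted earlier in the thread, while `check_log_dirs`, the role labels and the 0.8 similarity cutoff are assumptions made for illustration, not part of NetBackup.

```python
import difflib
from pathlib import Path

# Directory names as given in this thread: bpbkar on the client,
# bpbrm and bptm on the media server.
REQUIRED_DIRS = {
    "client": ["bpbkar"],
    "media": ["bpbrm", "bptm"],
}
ALL_REQUIRED = {name for names in REQUIRED_DIRS.values() for name in names}


def check_log_dirs(role, logs_root="/usr/openv/netbackup/logs"):
    """Report which required log directories exist and flag likely misspellings."""
    try:
        present = sorted(p.name for p in Path(logs_root).iterdir() if p.is_dir())
    except OSError:
        present = []  # logs root is missing or unreadable

    report = {}
    for name in REQUIRED_DIRS[role]:
        # Near-misses such as "bpkar" are listed so they can be renamed.
        lookalikes = [
            d for d in difflib.get_close_matches(name, present, n=3, cutoff=0.8)
            if d not in ALL_REQUIRED
        ]
        report[name] = {"exists": name in present, "lookalikes": lookalikes}
    return report


if __name__ == "__main__":
    for role in REQUIRED_DIRS:
        print(role, check_log_dirs(role))
```

A directory called bpkar would show up under `lookalikes` for bpbkar, with `exists` still false, which is exactly the situation described above.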



Re: Network Connection Timeout and Socket WriteFailed

We're not support.  We do this for fun (sic!), sort of.

Have you not been through the logs yourself first, to pick out what you think is important, to pre-filter them?


Re: Network Connection Timeout and Socket WriteFailed

@sdo 

ok 

sorry

Indeed, I did look at those logs, but I thought it might be better not to filter them, so as to keep all the data that could be important for you.

 

Netdigest

Re: Network Connection Timeout and Socket WriteFailed

Sorry dude, please by all means yes do upload the full logs in case anyone here does actually want/need to take a look and step line by line through large raw logs for themselves.  What I meant was... have you looked at them yourself?  What have you identified (filtered) as potentially being pertinent in them for yourself?


Re: Network Connection Timeout and Socket WriteFailed

@Kasra_Hashemi 

My workload is quite hectic lately - all I can commit to is to have a look if and when time permits.

Accepted Solution!

Re: Network Connection Timeout and Socket WriteFailed

@Marianne, thank you, by the way. You have always helped me and I'm so thankful. You and all the other members are truly kind people and true engineers, because you help people without expecting anything in return.

Today I was able to solve this issue by reinstalling the agent and changing the backup path (I presented another LUN from the SAN storage).

Netdigest


Re: Network Connection Timeout and Socket WriteFailed

@Kasra_Hashemi 

Glad to hear all is fine now.

I had a look at the logs over the weekend, but did not have time to respond.

The bpbkar logging level was too low. It was at level 0, so we could not see what was happening with regard to data read and send.

Nothing was logged on the client between 08:32 and 09:24 (by the looks of it, a sparse file was encountered here, but the logging was not sufficient to tell). There is a cryptic entry at 09:24 and then nothing again until 10:17, when the write failed because the media server had terminated the job:

08:32:59.661 [100613] <2> mount build_mount_list: INF - Processing (tracefs) tracefs on /sys/kernel/debug/tracing
08:32:59.662 [100613] <4> check_file_sparseness: Device changing from 0 to 2049
09:24:29.527 [100613] <2> fscp_is_tracked: disabled tla_init
10:17:04.277 [100613] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
10:17:04.278 [100613] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
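
Silent stretches like these are easy to miss in a long log, so here is a minimal sketch of a gap finder, using the five bpbkar lines above as sample input. `find_gaps` and its threshold are hypothetical names for illustration. It assumes every log line starts with an `HH:MM:SS.mmm` timestamp, as in the excerpt, and that the log does not cross midnight.

```python
import re
from datetime import datetime

# Matches the leading "HH:MM:SS.mmm [" of a legacy log line, as in the excerpt.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[")

SAMPLE = [
    "08:32:59.661 [100613] <2> mount build_mount_list: INF - Processing (tracefs) tracefs on /sys/kernel/debug/tracing",
    "08:32:59.662 [100613] <4> check_file_sparseness: Device changing from 0 to 2049",
    "09:24:29.527 [100613] <2> fscp_is_tracked: disabled tla_init",
    "10:17:04.277 [100613] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe",
    "10:17:04.278 [100613] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed",
]


def find_gaps(lines, threshold_seconds=1800):
    """Return (start, end, seconds) for each silence of at least threshold_seconds."""
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip continuation lines without a timestamp
        current = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((previous.strftime("%H:%M:%S"),
                             current.strftime("%H:%M:%S"),
                             delta))
        previous = current
    return gaps


if __name__ == "__main__":
    for start, end, seconds in find_gaps(SAMPLE):
        print(f"no log entries between {start} and {end} ({seconds / 60:.1f} minutes)")
```

On the sample it reports two gaps, each longer than the 1800-second limit that appears in the bpbrm log further down.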

In bpbrm we see metadata received at periodic intervals, and then nothing after 09:34. 
Timeout was recorded at 10:14 because no data was received from the client. 

09:29:47.517 [12772.9820] <2> non_mpx_backup_archive_verify_import: ADDED FILES TO DB FOR 10.0.12.62_1587787268 1 /backup/rman/full/full_AML_14649_1_9puug7jp_1_1.bkf
09:34:30.950 [12772.9820] <2> non_mpx_backup_archive_verify_import: ADDED FILES TO DB FOR 10.0.12.62_1587787268 1 /backup/rman/full/full_AML_14650_1_9quug7jq_1_1.bkf
09:44:56.723 [12772.9820] <2> bpbrm send_ping: PING
10:14:56.790 [12772.9820] <2> bpbrm readline: bpbrm timeout after 1800 seconds
10:14:56.790 [12772.9820] <2> bpbrm kill_child_process_Ex: start
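
To tie the numbers together, the sketch below pulls the configured timeout out of the bpbrm line and compares it with the time elapsed since the last PING. In this excerpt the two agree to within a fraction of a second; whether bpbrm always counts from the ping is not something the logs here can confirm, so treat that as an observation about this one job. The function and pattern names are hypothetical.

```python
import re
from datetime import datetime

STAMP = r"(\d{2}:\d{2}:\d{2}\.\d{3})"
PING_RE = re.compile(rf"^{STAMP} .*bpbrm send_ping: PING")
TIMEOUT_RE = re.compile(rf"^{STAMP} .*bpbrm timeout after (\d+) seconds")

SAMPLE_BPBRM = [
    "09:44:56.723 [12772.9820] <2> bpbrm send_ping: PING",
    "10:14:56.790 [12772.9820] <2> bpbrm readline: bpbrm timeout after 1800 seconds",
]


def parse_stamp(text):
    """Parse an HH:MM:SS.mmm timestamp (the date is not present in these logs)."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def timeout_summary(lines):
    """Return the configured timeout and the seconds elapsed since the last PING."""
    last_ping = None
    for line in lines:
        ping = PING_RE.match(line)
        if ping:
            last_ping = parse_stamp(ping.group(1))
            continue
        timeout = TIMEOUT_RE.match(line)
        if timeout:
            fired = parse_stamp(timeout.group(1))
            elapsed = (fired - last_ping).total_seconds() if last_ping else None
            return {"configured_seconds": int(timeout.group(2)),
                    "elapsed_since_ping": elapsed}
    return None  # no timeout line found


if __name__ == "__main__":
    print(timeout_summary(SAMPLE_BPBRM))
```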

I have not had time to check bptm to see up to when actual data was received.

So, although backups are running fine now, the problem may re-appear, depending on what is causing the client to be slow with data read and send. 

Please be sure that you always have bpbkar log on the client and that logging level is higher than 0. Level 1 will probably be okay for basic troubleshooting, but I find level 3 to be most useful. 
(Only ever enable logging level 5 when asked by Veritas Support to do so. They have dedicated time and tools to analyze massive log files.)