10-29-2015 01:02 PM
Hi All,
I am looking for a possible solution to a problem in a customer's environment.
I have read a lot of articles and posts and did not find a solution. None of the suggested fixes (e.g. keepalive timeout) works for us.
We have a lot of servers behind a firewall. These servers host Oracle databases.
The problem is that the parent job terminates after about 2 hours and the backup ends with status 636 (read from input socket failed). All child jobs end with status 0.
The cause is the firewall session timeout, which is set to 14400 half-seconds (2 hours). The terminated session is the one established between the client and the media server.
I know the recommended solution is to increase the inactive-session timeout on the firewall, but the LAN administrators don't want to do this.
So here is my question: is there any way to keep this TCP session "active" for the parent job during the whole backup session? Maybe some output from the RMAN script could be redirected to the media server?
The media server runs AIX; its tcp_* settings are below:
tcp_keepcnt = 8
tcp_keepidle = 28800
tcp_keepinit = 150
tcp_keepintvl = 150
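As a quick sanity check of these values, here is a minimal sketch that converts the AIX tcp_keepidle tunable (which AIX specifies in half-seconds) to seconds and compares it with the 2-hour firewall idle timeout described above. The function names are hypothetical helpers, not part of NetBackup or AIX.

```python
# Sketch: compare an AIX tcp_keepidle value (half-seconds) with a firewall idle timeout.
# The 2-hour timeout comes from this thread; the helper names are illustrative only.

FIREWALL_IDLE_TIMEOUT_S = 2 * 60 * 60  # 7200 seconds


def half_seconds_to_seconds(value):
    """Convert an AIX tcp_keep* timer value from half-seconds to seconds."""
    return value / 2


def first_probe_beats_firewall(tcp_keepidle_half_s,
                               firewall_timeout_s=FIREWALL_IDLE_TIMEOUT_S):
    """True if the first keepalive probe is sent strictly before the firewall timeout."""
    return half_seconds_to_seconds(tcp_keepidle_half_s) < firewall_timeout_s


if __name__ == "__main__":
    for keepidle in (28800, 14400, 1800):
        idle_min = half_seconds_to_seconds(keepidle) / 60
        verdict = first_probe_beats_firewall(keepidle)
        print(f"tcp_keepidle={keepidle}: first probe after {idle_min:.0f} min, "
              f"before firewall timeout: {verdict}")
```

With tcp_keepidle = 28800 the first probe would only go out after 4 hours of idle time, well after the firewall has dropped the session.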
Other timeout settings on the master and clients are also set above 2 hours.
We have tested on servers that are not behind the firewall, and there the parent job runs longer than 2 hours. So we are sure that the problem is the firewall.
Master server: RHEL, NBU 7.6.1.1
Media server: AIX 7.1, NBU 7.5.0.3
Clients: various versions, 7.5.0.4 to 7.6.1.1; most of them are AIX.
Any suggestions would be appreciated.
Regards
Madej
10-29-2015 01:38 PM
10-29-2015 02:38 PM
Set media server and master server keepalive timeout to that of the firewall or lower? Or is that what you already tried?
-edit-
Noticed you posted your media server settings:
tcp_keepidle = 28800
Set that to 14400, and do the same on your master.
10-30-2015 01:40 AM
Yes, that was tested. The value 14400 was set previously.
On the master I have:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
As I wrote, other servers that are not behind the firewall are not affected, so I guess the problem is not with the AIX or Linux TCP parameter values.
Revaroo, the LAN admin doesn't want to do that and there is no discussion. We tried many times.
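For the Linux master values quoted in this post, the rough timeline of an idle connection can be worked out as below. This is a sketch under the assumption that the firewall drops a session after 7200 seconds of inactivity, as stated earlier in the thread; the helper names are hypothetical.

```python
# Sketch: timeline of Linux TCP keepalive on an idle connection.
# Inputs mirror net.ipv4.tcp_keepalive_time / _probes / _intvl from the post above.


def keepalive_timeline(time_s, probes, intvl_s):
    """Return (first_probe_at_s, given_up_at_s) for an idle peer that never answers.

    The first probe is sent after time_s of idle time; further probes follow every
    intvl_s, and the connection is given up after roughly probes * intvl_s more.
    """
    return time_s, time_s + probes * intvl_s


def margin_before_firewall(time_s, firewall_timeout_s):
    """Seconds between the first probe and the firewall idle timeout (<= 0 is too late)."""
    return firewall_timeout_s - time_s


if __name__ == "__main__":
    firewall_timeout_s = 7200  # assumption: 2-hour idle timeout from this thread
    for time_s in (7200, 900):
        first, given_up = keepalive_timeline(time_s, probes=9, intvl_s=75)
        margin = margin_before_firewall(time_s, firewall_timeout_s)
        print(f"keepalive_time={time_s}: first probe at {first}s, "
              f"gives up at about {given_up}s, margin {margin}s")
```

A keepalive time equal to the firewall timeout leaves no margin: the first probe is due at the same moment the firewall expires the session, so the value needs to be comfortably lower.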
10-30-2015 01:52 AM
From your initial post I understand that the master and media servers are in the same network and the affected clients are on the "other side" of the firewall.
If this is correct, I suggest you try two things.
stefanos
10-30-2015 05:55 AM
You need to configure TCP keepalive on the master and media servers with a keepalive time of 15 minutes. The network admin won't change the parameter on the firewall because it is against best practice.
On a Red Hat host, add the following to /etc/sysctl.conf:
net.ipv4.tcp_keepalive_time=900
Apply the setting with sysctl -p
The OS will then keep idle sessions alive by sending keepalive "ping" packets every 15 minutes, thereby preventing the firewall from closing the sessions because of idle time. The keepalive setting is not just for NetBackup but applies to all applications on the host.
Please see this tech note for configuring AIX hosts:
DOCUMENTATION: COMM_FAILURE as a consequence of reusing a transport that has been inactive across a firewall
http://www.veritas.com/docs/000005752
Hint: be aware of the unit of measure (AIX tcp_keepidle is specified in half-seconds).
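To illustrate what the kernel setting does, here is a minimal Python sketch that enables keepalive on a single socket with a 900-second idle time. It is only an illustration of the mechanism: the system-wide value acts as the default for sockets that have SO_KEEPALIVE switched on, and NetBackup itself is not configured this way. The function name and default values are assumptions for the example, and the TCP_KEEP* constants are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=900, interval_s=75, probes=9):
    """Turn on TCP keepalive for one socket and override the kernel defaults.

    idle_s     : idle seconds before the first probe (like net.ipv4.tcp_keepalive_time)
    interval_s : seconds between probes (like net.ipv4.tcp_keepalive_intvl)
    probes     : unanswered probes before giving up (like net.ipv4.tcp_keepalive_probes)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides are platform-dependent; skip any the platform lacks.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        enabled = bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
        print("SO_KEEPALIVE enabled:", enabled)
```

Applications that do not expose such options rely on the system-wide defaults, which is why the sysctl change above is the practical route here.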
11-06-2015 01:30 AM
Hello,
Sorry for being absent for so long. I had to wait for the administrator.
Nicolai, your advice finally solved the problem. Thank you.
The administrator set net.ipv4.tcp_keepalive_time=900 on the master server.
Regards
Madej
11-06-2015 03:42 AM
Glad to hear :)
Thanks for marking a solution.