Inactive TCP session - error 636 (firewall)
Hi All,
I am looking for a possible solution to a problem in a customer's environment.
I have read a lot of articles and posts and have not found a solution. None of the suggested fixes (e.g. keepalive timeout) works for us.
We have a lot of servers behind the firewall. These servers host Oracle databases.
The problem is that the parent job terminates after about 2 hours and the backup ends with error 636 (read from input socket failed). All child jobs end with status 0.
The cause is the firewall's idle-session timeout, which is set to 14400 half-seconds (2 hours). The terminated session is the one established between the client and the media server.
I know the recommended solution is to increase the inactive-session timeout on the firewall, but the LAN administrators don't want to do this.
So here is my question: is there any way to keep this TCP session "active" for the parent job during the whole backup session? Maybe some output from the RMAN script could be redirected to the media server?
The media server is running AIX; its tcp_* settings are below:
tcp_keepcnt = 8
tcp_keepidle = 28800
tcp_keepinit = 150
tcp_keepintvl = 150
Other timeout settings on the master and the clients are also set above 2 hours.
We have tested on servers that are not behind the firewall, and there the parent job runs longer than 2 hours. So we are sure that the problem is the firewall.
Master server: RHEL, NBU 7.6.1.1
Media server: AIX 7.1, NBU 7.5.0.3
Clients: various versions from 7.5.0.4 to 7.6.1.1; most of them are AIX.
Any suggestions would be appreciated.
Regards
Madej
You need to configure TCP keepalive on the master and media servers with a keepalive time of 15 minutes. The network admin won't change the parameter on the firewall because it's against best practice.
On a Red Hat host, add the following to /etc/sysctl.conf:
net.ipv4.tcp_keepalive_time=900
Apply the setting with sysctl -p
The OS will then keep sessions alive by sending "ping" (keepalive probe) packets once a session has been idle for 15 minutes, thereby preventing the firewall from closing the sessions because of idle time. The keepalive setting applies not just to NetBackup but to all applications on the host.
Please see this tech note for configuring AIX hosts:
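One caveat worth keeping in mind: the kernel only sends keepalive probes on sockets where the application has turned on SO_KEEPALIVE. The sketch below is only an illustration of that mechanism in Python, not NetBackup code; the helper name enable_keepalive is made up for the example, and TCP_KEEPIDLE (the per-socket counterpart of net.ipv4.tcp_keepalive_time) is assumed to be available only on platforms that expose it, such as Linux.

```python
import socket


def enable_keepalive(sock, idle_seconds=900):
    """Turn on TCP keepalive for one socket (hypothetical helper).

    Returns True if the socket reports SO_KEEPALIVE as enabled afterwards.
    """
    # Without SO_KEEPALIVE the system-wide keepalive time is never used
    # for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket idle time before the first probe; not every platform
    # exposes this constant, so it is applied only when present.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)

    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


# No connection is made here; the options are set on an unconnected socket.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    keepalive_on = enable_keepalive(s)
```

If a session still drops after the sysctl change, it may be worth checking whether the process that owns that socket actually enables keepalive.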
DOCUMENTATION: COMM_FAILURE as a consequence of reusing a transport that has been inactive across a firewall
http://www.veritas.com/docs/000005752
Hint: be aware of the unit of measure.
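To make the unit hint concrete, here is a rough arithmetic sketch. It assumes that AIX expresses tcp_keepidle in half-seconds (as the AIX documentation for this tunable states) and takes the firewall value of 14400 half-seconds from the original post; the function and variable names are just for illustration.

```python
HALF_SECONDS_PER_SECOND = 2


def half_seconds_to_seconds(value):
    """Convert a value given in half-second units to whole seconds."""
    return value // HALF_SECONDS_PER_SECOND


def seconds_to_half_seconds(seconds):
    """Convert seconds to half-second units."""
    return seconds * HALF_SECONDS_PER_SECOND


# Values quoted in the question.
firewall_timeout_s = half_seconds_to_seconds(14400)  # 7200 s, i.e. 2 hours
aix_keepidle_s = half_seconds_to_seconds(28800)      # 14400 s, i.e. 4 hours

# Keepalive only helps if the first probe goes out before the firewall
# gives up on the idle session.
probe_before_firewall_drop = aix_keepidle_s < firewall_timeout_s

# A 15-minute keepalive time expressed in half-second units.
keepidle_for_15_min = seconds_to_half_seconds(15 * 60)

print(firewall_timeout_s, aix_keepidle_s, probe_before_firewall_drop, keepidle_for_15_min)
```

Under these assumptions, the current tcp_keepidle of 28800 means the first probe would be sent after 4 hours of idle time, well after the firewall has already dropped the session at 2 hours; a 15-minute keepalive time corresponds to 1800 in half-second units.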