Solved: Error 636

Limberth1 · ‎05-16-2016

Hello, we are having trouble with one Solaris client: when the backup has been running for exactly 4 hours and 10 minutes, it finishes with the following error: "read from input socket failed (636)". I have read the other posts but I can't find the problem. The firewall team told me they don't have policies or filters that could affect the job in that way.

I really appreciate your comments, and sorry for my English.


Limberth1 · ‎05-16-2016

I have re-checked, and the media server and DB server are in the same network segment.

sdo · ‎05-16-2016

Has the backup job only recently started taking more than 250 minutes, or has it always been a very long-running backup job? Would it be possible to break the job into smaller pieces? Or is the one failing element a single file system?

Michael_G_Ander · ‎05-16-2016

A couple of questions:

Is any data being sent?

Is it a database backup?

Are there any long wait periods in the backup?

As you are running through a firewall, have you implemented the recommended keepalive value of 4 minutes, based on most firewalls closing idle sessions after 5 minutes?

Is the backup system busy when you get this error?

What happens if you run the backup at another time?

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue
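
To illustrate the keepalive question above: the idea is that keepalive probes must start before the firewall decides the session is idle and drops it. Below is a minimal Python sketch of that comparison. It assumes a 5-minute firewall idle timeout (the rule of thumb above, not a value measured in this environment) and compares the suggested 4 minutes against the 2-hour keepalive that many operating systems ship as a default. The function name is made up for illustration.

```python
def keepalive_beats_firewall(keepalive_seconds: float, firewall_idle_seconds: float) -> bool:
    """Return True if keepalive probes start before the firewall idle timeout expires."""
    return keepalive_seconds < firewall_idle_seconds


# Assumption: 5-minute idle timeout, taken from the rule of thumb, not from a real firewall.
FIREWALL_IDLE_SECONDS = 5 * 60

# 4 minutes is the suggested value; 120 minutes is a common OS default.
for minutes in (4, 120):
    ok = keepalive_beats_firewall(minutes * 60, FIREWALL_IDLE_SECONDS)
    verdict = "probes arrive in time" if ok else "session may be dropped first"
    print(f"keepalive {minutes} min: {verdict}")
```

The real idle timeout has to come from the firewall team; the sketch only shows why a long keepalive does not help when the connection sits idle.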

Marianne · ‎05-17-2016

Have you seen this TN?

http://www.veritas.com/docs/000018102

And this TN for the Solaris client:
http://www.veritas.com/docs/000027815

Handy NetBackup Links

Limberth1 · ‎05-17-2016

Thank you for your comments.

I can't divide it into smaller pieces because it is an RMAN backup, or at least I don't know how :S

Is any data being sent?

I think the backup is already done when it fails, because the parent job is the only one failing; all the child tasks end with status 0.

Is it a database backup? Yes, it is.

Are there any long wait periods in the backup? I really don't know.

As you are running through a firewall, have you implemented the recommended keepalive value of 4 minutes, based on most firewalls closing idle sessions after 5 minutes? I have added a keepalive value of 15 minutes, but only on the media server.

Is the backup system busy when you get this error? No.

What happens if you run the backup at another time? I already did, but I got the same result: error 636 at 4 hours and 10 minutes.

Marianne · ‎05-17-2016

KeepAlive settings should be configured on both the master and media servers.

See: https://www.veritas.com/community/forums/sql-backups-run-fine-parent-job-ends-error-636#comment-8085...

and: https://www.veritas.com/community/forums/having-trouble-636-status-code#comment-10550201


Michael_G_Ander · ‎05-17-2016

In addition to the master and media servers, I have found it worthwhile to implement a low TCP keepalive on clients behind firewalls, especially database servers, as a lot of the connections are initiated by the database server in the case of database backups.

Sounds like it is either the control file backup or the final validation by RMAN that is failing. Check the stdout and stderr files under the dbclient folder in netbackup/logs on the client; create dbclient with 777 permissions if you do not have it already.

I think you are confusing the NetBackup CLIENT_READ_TIMEOUT and the OS TCP keepalive settings.

Are you using the _%t parameter in the RMAN script for improved speed of the validation?

Also get the DBA to check the Oracle alert log; there can be indications of why the backup/validation does not work that can't be seen in the NetBackup logs.

If it is an incremental backup, talk with the DBA about the possibility of using Block Change Tracking. It makes the incremental faster, as it does not have to scan for the changed blocks, but it has some caveats on the Oracle side of things.
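
The difference between a read timeout and TCP keepalive mentioned above can be seen on a single socket. This is only a sketch in Python: it is not how NetBackup or Solaris is configured (CLIENT_READ_TIMEOUT is a NetBackup setting and keepalive is normally tuned system-wide in the OS), and the function name and the values are assumptions chosen for illustration.

```python
import socket


def make_socket(read_timeout_s: float, keepalive_idle_s: int) -> socket.socket:
    """Create a TCP socket with both an application read timeout and OS keepalive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Application-level read timeout: how long a blocking recv() waits for data
    # before giving up. It sends nothing over the wire.
    sock.settimeout(read_timeout_s)

    # OS-level keepalive: the kernel sends probes on an idle connection, which
    # is what keeps a firewall from treating the session as dead.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # TCP_KEEPIDLE (idle seconds before the first probe) is platform-specific;
    # it exists on Linux and is skipped here where the constant is missing.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepalive_idle_s)

    return sock


# Illustrative values only: 300 s read timeout, probes after 240 s of idle time.
sock = make_socket(read_timeout_s=300.0, keepalive_idle_s=240)
print(sock.gettimeout())
sock.close()
```

Raising the read timeout only makes the application wait longer; it does not generate traffic, so an idle connection can still be dropped by a firewall in between.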


sdo · ‎05-17-2016

The RMAN piece name "_%t" meta-field that Michael is referring to is described here:

https://www.veritas.com/support/en_US/article.000087057

Limberth1 · ‎05-17-2016

Yes, we use "_%t". As you told me, I added the KeepAliveTime on the master server and restarted the server. The result this time was different: the parent job ends at one hour and 20 minutes with the same 636 error, but the child tasks continue backing up the database.

Marianne · ‎05-18-2016

You did not tell us what values you used for KeepAlive?

See this TN: http://www.veritas.com/docs/000020036

The TCP KeepAliveTime value on master server which was already reduced to 900,000 ms was still too high for this environment.

After reducing the TCP KeepAliveTime setting on Master server to 300,000 ms (5 mins), followed by a reboot of the master server, SQL Server parent backup jobs were now able to complete successfully when backing up to the affected media servers.
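
Since the KeepAliveTime values in the TN are given in milliseconds while the earlier posts talk in minutes, a quick conversion helps keep the two straight. A small Python sketch, with a helper name made up for illustration:

```python
def ms_to_minutes(milliseconds: int) -> float:
    """Convert a keepalive value in milliseconds to minutes."""
    return milliseconds / 60_000


# The two values mentioned in the TN.
for value_ms in (900_000, 300_000):
    print(f"{value_ms} ms = {ms_to_minutes(value_ms):g} minutes")
# prints:
# 900000 ms = 15 minutes
# 300000 ms = 5 minutes
```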


Limberth1 · ‎05-18-2016

You're right, I didn't tell you. At first I used 900,000 ms on the master and media servers, but the problem continued. I have just reduced it to 300,000 ms and rebooted the master server; I'll let you know how it goes.

Is it necessary that KeepAliveTime is the same on the master and media servers?

Marianne · ‎05-18-2016

Over here it is recommended that KeepAlive be made the same: https://www.veritas.com/community/forums/having-trouble-636-status-code#comment-10550201


Limberth1 · ‎05-30-2016

Thank you all. As you told me, I changed the keepalive time on both the media and master servers to 300000 and it works!!!