
Error 636

Limberth1
Level 3

Hello, we are having trouble with one Solaris client: when the backup has been running for exactly 4 hours and 10 minutes, it finishes with the following error: "read from input socket failed (636)". I have read the other posts but I can't find the problem, and the firewall team told me they don't have any policies or filters that could affect the job in that way.

I really appreciate your comments, and sorry for my English.

 


Limberth1
Level 3

I have rechecked, and the media server and DB server are in the same network segment.

sdo
Moderator
Partner    VIP    Certified
Has the backup job only recently started taking more than 250 minutes, or has it always been a very long running backup job? Would it be possible to break the job into smaller pieces? Or is the one failing element a single file system?

Michael_G_Ander
Level 6
Certified

Couple of questions:

Is any data being sent?

Is it a database backup?

Are there any long wait periods in the backup?

As you are running through a firewall, have you implemented the recommended keepalive value of 4 minutes, since most firewalls close idle sessions after 5 minutes?

Is the backup system busy when you get this error?

What happens if you run the backup at another time?
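To make the keepalive reasoning concrete: a TCP keepalive probe only keeps a firewall session open if it is sent before the firewall's idle timer expires. Below is a minimal Python sketch of that check, assuming (as in the question above) a firewall that closes idle sessions after 5 minutes; the real timeout has to be confirmed with the firewall team, and the helper name is hypothetical. Values are in milliseconds, the unit used for the KeepAliveTime figures in this thread.

```python
# Hypothetical helper: does a TCP keepalive idle time (in ms) trigger a probe
# before the firewall closes an idle session?

def keepalive_fires_in_time(keepalive_ms: int, firewall_idle_s: int) -> bool:
    """Return True if the first keepalive probe is sent strictly before the
    firewall idle timeout expires, so the session timer gets refreshed."""
    return keepalive_ms < firewall_idle_s * 1000


# Assumption: a 5-minute (300 s) firewall idle timeout, as mentioned above.
FIREWALL_IDLE_S = 5 * 60

for minutes in (4, 15):
    keepalive_ms = minutes * 60 * 1000
    ok = keepalive_fires_in_time(keepalive_ms, FIREWALL_IDLE_S)
    print(f"{minutes} min = {keepalive_ms} ms -> {'OK' if ok else 'too long'}")
# 4 min = 240000 ms -> OK
# 15 min = 900000 ms -> too long
```

With these assumptions, a 15-minute (900,000 ms) keepalive never fires before a 5-minute firewall timeout, which is why the lower value is recommended.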

The standard questions: Have you checked 1) what has changed, 2) the manual, 3) whether there are any tech notes or VOX posts regarding the issue?

Marianne
Moderator
Partner    VIP    Accredited Certified

Have you seen this TN? 

http://www.veritas.com/docs/000018102

And this TN for the Solaris client:
http://www.veritas.com/docs/000027815

 

Limberth1
Level 3

Thank you for your comments.

I can't divide it into smaller pieces because it is an RMAN backup, or at least I don't know how :S

 

Is any data being sent?

I think the backup is already done when it fails, because only the parent job fails; all the child tasks end with status 0.

Is it a database backup? Yes, it is.

Are there any long wait periods in the backup? I really don't know.

As you are running through a firewall, have you implemented the recommended keepalive value of 4 minutes, since most firewalls close idle sessions after 5 minutes? I have set a keepalive value of 15 minutes, but only on the media server.

Is the backup system busy when you get this error? No.

What happens if you run the backup at another time? I already tried that, but got the same result: error 636 at 4 hours and 10 minutes.

 

Marianne
Moderator
Partner    VIP    Accredited Certified

Michael_G_Ander
Level 6
Certified

In addition to the master and media servers, I have found it worthwhile to set the low TCP keepalive on clients behind firewalls as well, especially database servers, since in database backups many of the connections are initiated by the database server.

It sounds like either the control file backup or the final validation by RMAN is failing. Check the stdout and stderr files under the dbclient folder in netbackup/logs on the client; create the dbclient folder with 777 permissions if you do not have it already.
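A minimal Python sketch of that log-folder step, assuming the default UNIX install location /usr/openv/netbackup/logs (adjust if NetBackup lives elsewhere). The helper names are hypothetical, and matching files by "stdout"/"stderr" in the name is an assumption about how the files are named, so check the folder contents by hand as well.

```python
import os
from pathlib import Path

# Assumption: default NetBackup install location on a UNIX/Solaris client.
DEFAULT_LOGS_DIR = Path("/usr/openv/netbackup/logs")


def prepare_dbclient_log_dir(logs_dir: Path = DEFAULT_LOGS_DIR) -> Path:
    """Create the dbclient log folder if it is missing and give it 777
    permissions so the database user can write to it."""
    dbclient = logs_dir / "dbclient"
    dbclient.mkdir(parents=True, exist_ok=True)
    # chmod explicitly: the mode passed to mkdir is filtered by the umask.
    os.chmod(dbclient, 0o777)
    return dbclient


def rman_output_files(dbclient: Path) -> list[Path]:
    """List files whose names mention stdout or stderr (naming is an assumption)."""
    return sorted(
        p for p in dbclient.iterdir()
        if p.is_file() and ("stdout" in p.name or "stderr" in p.name)
    )


# Only touch the real path on a machine that actually has NetBackup installed.
if __name__ == "__main__" and DEFAULT_LOGS_DIR.is_dir():
    folder = prepare_dbclient_log_dir()
    for f in rman_output_files(folder):
        print(f"==> {f}")
        print(f.read_text(errors="replace")[-2000:])  # tail of each file
```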

I think you are confusing the NetBackup CLIENT_READ_TIMEOUT setting with the OS TCP keepalive settings.
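To illustrate the distinction, here is a generic Python socket sketch (not NetBackup code). An application-level read timeout, similar in spirit to CLIENT_READ_TIMEOUT, only limits how long the program waits for data and sends nothing on the wire. OS-level TCP keepalive makes the kernel send probes on an idle connection, which is what keeps a firewall session from being dropped. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are not available on every platform, so the sketch checks for them; the 240 s idle value is an assumption following the 4-minute recommendation earlier in the thread.

```python
import socket

# Assumptions: idle time below a 5-minute firewall timeout, illustrative probe settings.
KEEPALIVE_IDLE_S = 240
KEEPALIVE_INTERVAL_S = 60
KEEPALIVE_PROBES = 5


def make_socket(read_timeout_s: float) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # 1) Application-level timeout: blocking calls such as recv() raise a
    #    timeout error if nothing arrives in time. No packets are sent, so a
    #    firewall still sees the connection as idle.
    s.settimeout(read_timeout_s)

    # 2) OS-level TCP keepalive: after the connection has been idle, the
    #    kernel sends probe packets, refreshing the firewall's session timer.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; these constants are platform-dependent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_IDLE_S)
    if hasattr(socket, "TCP_KEEPINTVL"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPALIVE_INTERVAL_S)
    if hasattr(socket, "TCP_KEEPCNT"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPALIVE_PROBES)
    return s
```

Usage would be `s = make_socket(300)` followed by `s.connect((host, port))` with your own host and port. The system-wide equivalents (for example the KeepAliveTime value discussed in this thread) are set at the OS level rather than per socket.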

Are you using the _%t parameter in the RMAN script to speed up the validation?

Also get the DBA to check the Oracle alert log; it may contain indications of why the backup or validation does not work that can't be seen in the NetBackup logs.

If it is an incremental backup, talk with the DBA about using Block Change Tracking: it makes incrementals faster because the backup does not have to scan for the changed blocks, but it has some caveats on the Oracle side of things.

 


sdo
Moderator
Partner    VIP    Certified

The RMAN piece name "_%t" meta-field that Michael is referring to is described here:

https://www.veritas.com/support/en_US/article.000087057

 

Limberth1
Level 3

Yes, we use "_%t". As you told me, I added the KeepAliveTime on the master server and restarted the server. The result this time was different: the parent job ended at 1 hour and 20 minutes with the same 636 error, but the child tasks continued backing up the database.

Marianne
Moderator
Partner    VIP    Accredited Certified

You did not tell us what values you used for KeepAlive?

See this TN: http://www.veritas.com/docs/000020036 

The TCP KeepAliveTime value on master server which was already reduced to 900,000 ms was still too high for this environment.

After reducing the TCP KeepAliveTime setting on Master server to 300,000 ms (5 mins), followed by a reboot of the master server, SQL Server parent backup jobs were now able to complete successfully when backing up to the affected media servers.

 

Limberth1
Level 3

You're right, I didn't tell you. At first I used 900,000 ms on the master and media servers, but the problem continued. I have just reduced it to 300,000 ms and rebooted the master server; I'll let you know how it goes.

 

Is it necessary for KeepAliveTime to be the same on the master and media servers?

 

 

Marianne
Moderator
Partner    VIP    Accredited Certified
Over here it is recommended that KeepAlive be made the same: https://www.veritas.com/community/forums/having-trouble-636-status-code#comment-10550201
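A small Python sketch of that recommendation: collect the keepalive value configured on each server and flag any mismatch. The server names are hypothetical and the values are illustrative, loosely based on the 900,000 and 300,000 ms figures in this thread; how the values are read from each OS is not shown.

```python
def keepalive_mismatches(settings_ms: dict[str, int]) -> dict[int, list[str]]:
    """Group servers by their keepalive value (ms). Return the groups if more
    than one value is in use, or an empty dict if all servers agree."""
    groups: dict[int, list[str]] = {}
    for server, value in settings_ms.items():
        groups.setdefault(value, []).append(server)
    return groups if len(groups) > 1 else {}


# Hypothetical server names and illustrative values.
before = {"master": 300_000, "media": 900_000}
after = {"master": 300_000, "media": 300_000}

print(keepalive_mismatches(before))  # {300000: ['master'], 900000: ['media']}
print(keepalive_mismatches(after))   # {}
```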

Limberth1
Level 3

Thank you all! As you told me, I changed the keepalive time on both the media and master servers to 300000 ms, and it works!!! :)