RMAN restore random error
Hi,
We have an intermittent issue: our Oracle DB team is getting restore errors when restoring 2 databases from our Release environment to their Prod environment.
Sometimes one fails, sometimes the other. The RMAN restores are controlled by them.
I have dbclient logs that suggest a timeout. I have already increased the timeout on the master, media servers and clients from 1800 to 7200 seconds.
I also thought it might be the control file, but that gets restored as part of the restore process. I can't find any difference when comparing with the other environments (Dev and QA); everything seems to be consistent. Does anyone have suggestions on where to look?
Here is a snippet from the dbclient log file:
07:56:39.657 [8867] <16> readCommFile: ERR - timed out after 7200 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/8867.0.1435200986
07:56:39.657 [8867] <32> serverResponse: ERR - could not read from comm file </usr/openv/netbackup/logs/user_ops/dbext/logs/8867.0.1435200986>
07:56:39.657 [8867] <16> RestoreFileObjects: ERR - serverResponse() failed
07:56:39.657 [8867] <4> closeApi: entering closeApi.
07:56:39.657 [8867] <4> closeApi: INF - EXIT STATUS 5: the restore failed to recover the requested files
07:56:39.657 [8867] <16> VxBSAGetObject: ERR - System error occurred trying to retrieve object in RestoreFileObject. Status: 3
07:56:39.658 [8867] <2> xbsa_ProcessError: INF - entering
07:56:39.658 [8867] <2> xbsa_ProcessError: INF - leaving
07:56:39.658 [8867] <16> xbsa_GetObject: ERR - VxBSAGetObject: Failed with error:
Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the serve
07:56:39.658 [8867] <2> xbsa_GetObject: INF - leaving (3)
07:56:39.658 [8867] <16> int_StartJob: ERR - Failed to open backup file for restore.
07:56:39.658 [8867] <2> int_StartJob: INF - leaving
07:56:39.658 [8867] <2> sbtrestore: INF - leaving
07:56:39.658 [8867] <2> sbterror: INF - entering
07:56:39.658 [8867] <2> sbterror: INF - Error=7501: Failed to open backup file for restore.
.
07:56:39.658 [8867] <2> sbterror: INF - leaving
Master server: NetBackup 7.5.0.5
Client on the Oracle DB server: NetBackup 7.5.0.3
OS: Red Hat (kernel 2.6.18)
Many thanks
You can replace all names with generic ones. For example, replace the master server hostname (including the domain name) with 'master', the media server with 'media1', and the client name with 'client1'.
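If you sanitize logs before posting, a small script helps keep the replacements consistent across files. A minimal Python sketch, assuming you know the real names up front; the hostnames used below (nbu01.example.com, med01.example.com, dbhost7) are hypothetical placeholders for your own:

```python
import re


def sanitize_log(text: str, replacements: dict[str, str]) -> str:
    """Replace each real hostname in `text` with its generic placeholder."""
    # Longest names first, so 'nbu01.example.com' becomes 'master' as a whole
    # instead of 'nbu01' being rewritten and leaving '.example.com' behind.
    for real in sorted(replacements, key=len, reverse=True):
        placeholder = replacements[real]
        # re.escape makes the dots in domain names literal; the lookarounds stop
        # 'nbu01' from matching inside a longer name such as 'nbu012'.
        pattern = re.compile(rf"(?<![\w-]){re.escape(real)}(?![\w-])", re.IGNORECASE)
        # A function replacement inserts the placeholder verbatim (no backslash escapes).
        text = pattern.sub(lambda _m: placeholder, text)
    return text


if __name__ == "__main__":
    # Hypothetical names; substitute your own.
    names = {
        "nbu01.example.com": "master",
        "nbu01": "master",
        "med01.example.com": "media1",
        "dbhost7": "client1",
    }
    print(sanitize_log("bprd on nbu01.example.com: request from dbhost7", names))
    # prints: bprd on master: request from client1
```

Listing both the short and the fully qualified name matters, because logs mix the two forms.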
Unfortunately, it is not possible to troubleshoot this without seeing the logs.
What I normally do is look for <16> and then look at the lines above it to get a feel for what led up to the failure.
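That scan is easy to script. A minimal Python sketch, assuming the text log layout shown in the dbclient excerpt (severity in angle brackets, e.g. <16>); the function name and the amount of context printed are arbitrary choices, not anything NetBackup provides:

```python
import sys
from collections import deque


def errors_with_context(lines, before=5, marker="<16>"):
    """Yield (line_number, context) for every line that contains `marker`.

    `context` is a list holding up to `before` preceding lines followed by
    the matching line itself, i.e. what led up to the error.
    """
    window = deque(maxlen=before)  # keeps only the most recent `before` lines
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if marker in line:
            yield number, list(window) + [line]
        window.append(line)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python errors_context.py <path to dbclient log file>
    with open(sys.argv[1], errors="replace") as log:
        for number, context in errors_with_context(log, before=10):
            print(f"--- error at line {number} ---")
            print("\n".join(context))
```

A bounded deque keeps memory flat even on a full day's log, since only the last few lines are ever held.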
bprd is not that easy: here you need to look for the connection request from the client's IP address, and then at how the request was interpreted and handled by the master.
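One way to follow a request through the bprd log is to find the lines that mention the client's IP address, take the process ID in brackets from those lines, and then pull every line logged by that process. A Python sketch, assuming bprd log lines start with a timestamp followed by [pid] like the dbclient excerpt, and that the process that logs the connection also handles the request; check both assumptions against your own bprd log:

```python
import re
import sys

# Assumed line layout, as in the dbclient excerpt: "HH:MM:SS.mmm [pid] <level> ...".
# Only the leading digits inside the brackets are captured, so "[pid.tid]" also works.
PID_RE = re.compile(r"^\S+ \[(\d+)")


def trace_client_requests(lines, client_ip):
    """Return every log line written by a process that mentions `client_ip`."""
    # The lookarounds stop 10.0.0.1 from matching inside 10.0.0.12 or 110.0.0.1.
    ip_re = re.compile(rf"(?<!\d)(?<!\d\.){re.escape(client_ip)}(?!\.?\d)")
    lines = [line.rstrip("\n") for line in lines]

    # Step 1: which processes logged something about this client's address?
    pids = set()
    for line in lines:
        m = PID_RE.match(line)
        if m and ip_re.search(line):
            pids.add(m.group(1))

    # Step 2: everything those processes logged, in the original order, shows
    # how the request was interpreted and handled.
    return [line for line in lines if (m := PID_RE.match(line)) and m.group(1) in pids]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python trace_bprd.py <path to bprd log file> <client IP>
    with open(sys.argv[1], errors="replace") as log:
        print("\n".join(trace_client_requests(log, sys.argv[2])))
```

Process IDs can be reused over a long log, so a match may pull in unrelated lines; narrowing to the time window of the failed restore keeps the output focused.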