Forum Discussion

CoMurphy
10 years ago

Backups failing nightly

Hello, our production database backups are failing nightly. Previously we were in a good spot, with two or fewer failures over a three-day period. Now we have 2-4 failures a night.

We are getting a few "Cannot connect on socket" errors. The machines can talk to each other, since rerunning the backup works.

We are also getting a slew of "the backup has failed to back up the requested files" errors. This may be an error on the RMAN side, but we would like a second opinion. Here is an example of what it looks like on the RMAN side:

validation failed for archived log
archived log file name=/u01/app/oracle/admin/DB2/arch/arch_1_741300_582858846.arc RECID=85578 STAMP=853620093

validation failed for archived log
archived log file name=/u01/app/oracle/admin/DB2/arch/arch_1_741301_582858846.arc RECID=85579 STAMP=853620358
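
Each failure above is reported as a two-line pair, so the affected logs can be pulled out of a night's RMAN output and handed to the DBA team to compare against the files actually present in the archive directory. Below is a minimal Python sketch, assuming the RMAN output is available as plain text in exactly the format shown; the names `FailedLog`, `failed_archived_logs` and `SAMPLE` are hypothetical helpers, not part of NetBackup or RMAN.

```python
import re
from typing import List, NamedTuple


class FailedLog(NamedTuple):
    """One archived log that RMAN reported as failing validation."""
    path: str
    recid: int
    stamp: int


# Detail line RMAN prints after each validation message, e.g.
# "archived log file name=/path/arch.arc RECID=85578 STAMP=853620093"
DETAIL = re.compile(r"archived log file name=(\S+)\s+RECID=(\d+)\s+STAMP=(\d+)")


def failed_archived_logs(rman_output: str) -> List[FailedLog]:
    """Collect the archived logs whose detail line follows a validation failure."""
    failed: List[FailedLog] = []
    after_failure = False
    for raw in rman_output.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines may separate the two halves of a message
        if line == "validation failed for archived log":
            after_failure = True
            continue
        match = DETAIL.search(line)
        if after_failure and match:
            path, recid, stamp = match.groups()
            failed.append(FailedLog(path, int(recid), int(stamp)))
        after_failure = False  # any other non-blank line ends the pending failure
    return failed


# The two failures quoted in this post, used as sample input.
SAMPLE = """\
validation failed for archived log

archived log file name=/u01/app/oracle/admin/DB2/arch/arch_1_741300_582858846.arc RECID=85578 STAMP=853620093

validation failed for archived log

archived log file name=/u01/app/oracle/admin/DB2/arch/arch_1_741301_582858846.arc RECID=85579 STAMP=853620358
"""

if __name__ == "__main__":
    for log in failed_archived_logs(SAMPLE):
        print(log.recid, log.path)
```

A state flag is used rather than one multi-line regex so that blank lines between the two halves (as in the post) do not break the pairing, while a detail line that follows a "validation succeeded" message is ignored.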

We are also getting "network read failed" errors, but these occur in the morning, when we do not perform network maintenance.

All of our master and media servers run Windows Server 2008 R2. We have Quantum tape drives and are running NetBackup 7.1.0.4.

9 Replies

  • Do you have "CROSSCHECK ARCHIVELOG" in your script? That would produce the validation error. Check with your DBA; they (or someone else) might have removed or cleaned up the logs.
  • Hello,

    Yes, we have CROSSCHECK in the script, and our DBA team is looking into it. However, this only started recently, and it would not explain the socket and network errors either.

    Our full backups shouldn't fail because of these file moves, since that has not happened in the past. I am not sure what would have caused this.

    Thanks,

    CM

  • RMAN-00571: ===========================================================
    RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
    RMAN-00571: ===========================================================
    RMAN-03002: failure of backup plus archivelog command at 08/18/2014 21:29:03

    RMAN-03009: failure of backup command on ch1 channel at 08/18/2014 20:26:57
    ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
    ORA-19511: Error received from media manager layer, error text:
       VxBSAEndTxn: Failed with error:
       The transaction was aborted.

    RMAN> RMAN>

    Recovery Manager complete.
    ORA-20079: full resync from primary database is not done

    doing automatic resync from primary
    resyncing from database with DB_UNIQUE_NAME BED2

    Script /usr/openv/netbackup/ext/db_ext/oracle/production/rman/hotbackup_gdb2.sh

    Above is the full error output from a recent failure.

    Thanks,

    CM

  • Here is a tech note that matches some of the RMAN error text you have.

    The tech note is quite lengthy, so take the time to read it and analyze the evidence provided.

    http://www.symantec.com/docs/TECH73130

  • Thank you,

    I will read through this and update this thread if I have any further questions.

    Thanks,

    CM

  • Hi,

    I have not seen this one before either:

    ORA-20079: full resync from primary database is not done

    You must be using some "advanced" commands in your scripts.

  • Hello Nicolai,

    Thanks for that tech note. We may have to consider upgrading to 7.5.5.

    Here is another RMAN error we are getting; we are seeing different errors across the board.

    RMAN-00571: ===========================================================
    RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
    RMAN-00571: ===========================================================
    RMAN-03002: failure of backup plus archivelog command at 08/16/2014 16:39:35
    RMAN-03014: implicit resync of recovery catalog failed
    RMAN-03009: failure of partial resync command on default channel at 08/16/2014 16:39:35
    ORA-17629: Cannot connect to the remote database server
    ORA-17627: ORA-01017: invalid username/password; logon denied
    ORA-17629: Cannot connect to the remote database server

    I know for sure that the database was up during this time.

    RMAN-00571: ===========================================================
    RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
    RMAN-00571: ===========================================================
    RMAN-03002: failure of backup plus archivelog command at 08/16/2014 13:09:57
    RMAN-03014: implicit resync of recovery catalog failed
    RMAN-03009: failure of partial resync command on default channel at 08/16/2014 13:09:57
    ORA-17629: Cannot connect to the remote database server
    ORA-17627: ORA-01017: invalid username/password; logon denied
    ORA-17629: Cannot connect to the remote database server

    RMAN> RMAN>

    Recovery Manager complete.
    ORA-20079: full resync from primary database is not done

    doing automatic resync from primary
    resyncing from database with DB_UNIQUE_NAME DFW33

    Script /usr/openv/netbackup/ext/db_ext/oracle/production/rman/hotbackup_gdb33.sh

    The same goes for this one; the database was up at the time. Is there a way on the NetBackup side to check what is going on at these times, to see why a connection error would occur?

    Thanks,

    CM

  • It is difficult to assist when one discussion covers a bunch of different errors.

    It would be best to start a new discussion for each error so that each one can be looked at separately.

    We have not yet seen any evidence (the information in the Details tab of the failed job) for the "Cannot connect on socket" errors. The logs needed here are bpbrm on the media server and bpcd on the client. (The log folders need to be created first.)

    There are also two different RMAN errors.

    This one indicates an error on the NBU side:

    RMAN-03009: failure of backup command on ch1 channel at 08/18/2014 20:26:57
    ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
    ORA-19511: Error received from media manager layer, error text:
       VxBSAEndTxn: Failed with error:
       The transaction was aborted.

    Here we need to see the error in the NBU job. All the text in the Details tab of the failed job will help, since it should indicate where the failure occurred. Based on this info, we can tell you which log folders to create to assist with troubleshooting.

    This error is 100% an Oracle issue; only Oracle DBAs can troubleshoot and fix it:

    ORA-17629: Cannot connect to the remote database server
    ORA-17627: ORA-01017: invalid username/password; logon denied

  • Hi Marianne,

    We have contacted our own DBA team and also opened a case with Oracle. We were not sure whether anyone else had experienced this issue. Currently our full RMAN backups are failing because a data file is being added while the backup runs.

    Thanks,

    CM
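
Marianne's triage above (errors surfacing through the media manager layer go to the NetBackup side, while catalog login and connection errors go to the Oracle DBAs) can be written down as a small lookup over an RMAN error stack, which may help when several jobs fail with different errors on the same night. The following is a minimal Python sketch; the `OWNER` table covers only the codes quoted in this thread and is a hypothetical routing heuristic, not an official Oracle or NetBackup classification, and `triage` is an illustrative name.

```python
import re
from typing import Dict, List

# Hypothetical routing table, built only from the codes quoted in this thread.
OWNER = {
    "ORA-27192": "netbackup",  # sbtclose2 failed in the media management (SBT) layer
    "ORA-19511": "netbackup",  # error received from the media manager layer
    "RMAN-03014": "oracle",    # implicit resync of recovery catalog failed
    "ORA-17629": "oracle",     # cannot connect to the remote database server
    "ORA-17627": "oracle",     # wraps the error returned by the remote database
    "ORA-01017": "oracle",     # invalid username/password; logon denied
}

# ORA and RMAN message codes carry a five-digit number, e.g. ORA-01017, RMAN-03009.
CODE = re.compile(r"\b(?:ORA|RMAN)-\d{5}\b")


def triage(error_stack: str) -> Dict[str, List[str]]:
    """Group the codes in an RMAN error stack by the team most likely to own them."""
    groups: Dict[str, List[str]] = {"netbackup": [], "oracle": [], "unclassified": []}
    for code in CODE.findall(error_stack):
        bucket = groups[OWNER.get(code, "unclassified")]
        if code not in bucket:
            bucket.append(code)  # keep first-seen order and drop repeats
    return groups


if __name__ == "__main__":
    stack = """\
RMAN-03009: failure of backup command on ch1 channel at 08/18/2014 20:26:57
ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: Error received from media manager layer, error text:
"""
    print(triage(stack))
```

Run against the 08/16 stacks, everything except the generic RMAN header and wrapper codes (RMAN-00571, RMAN-00569, RMAN-03002, RMAN-03009) lands in the `oracle` group, which matches the advice that the ORA-01017 failures are for the DBAs.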