Forum Discussion

jls987
Level 3
7 years ago

Oracle Backup channels failing

We are having a problem when one or more of the channels for an RMAN backup fails. The backup starts with 6 channels; if one fails, the backup continues with 5. This past weekend, 3 channels going to the same drive/tape all failed with a media error, and the backup continued with only 3 channels, which is not enough bandwidth to complete the backup in the allowed time. As a stopgap we're going to increase the number of channels so that if something fails, we'll still have enough to continue effectively. Is there a better solution that lets a failed channel be restarted or reused?


6 Replies

  • Job control for an RMAN backup is handled by RMAN, so if a channel fails, it's not NBU's job to restart it. I'd focus on digging into why the channels fail. I know customers running literally thousands of Oracle jobs daily with a 100% success rate, which shows that NBU is reliable software for Oracle backups; if it's an infrastructure issue, that has to be fixed first.

    • jls987
      Level 3

      Mouse, in this case we know what the error was: tape media. It's just the price of using tapes. Our environment is slowly moving to a disk-based storage solution, but it's going to take time. I know how the backup works from the NBU side. I was hoping for something that could be added to the RMAN script that might help. RMAN scripting is not something I do much of.

      Marianne, it's nice to interact with you again. It's been a while and a few jobs later. That link you provided may be helpful. I have not yet seen the script they are using, so I don't know what they are doing. We have a meeting in a little while to discuss this. We did talk about changing the MPX settings as you mentioned, but we have not made those changes yet. As I stated above, we'll be moving to disk; it's a slow process around here. The new environment is stood up (8.x, BOOST with Data Domain); it's just a matter of when all the clients get moved over. The first priority is the large VM environment without DB or other special needs.

      I'm going to leave this "open" for a few more days in case an RMAN expert sees it and has any suggestions.

      • Genericus
        Moderator

        I do not know of any mechanism to automatically restart channels; if one fails, the job continues with one fewer channel.

        Our only solution is to run the backup with a SYSDATE-X cutoff, so the job can be restarted without attempting to back up pieces that are already done. Also, we had our OPS staff monitor the job and kill and restart it if the number of child jobs fell below a certain level.
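One way to read the SYSDATE-X approach is RMAN's restartable-backup clause, NOT BACKED UP SINCE TIME, which skips files that already have a backup inside the given window. The sketch below is an assumption about what such a script could look like, not the script from the linked thread. It uses Python only to generate the RMAN command text; the function name, the channel names (ch00, ch01, ...) and the one-day default window are hypothetical.

```python
def build_restartable_rman_script(channels: int, window_days: int = 1) -> str:
    """Return RMAN command text for a backup that can be rerun after a failure.

    Assumption: 'SYSDATE-X' in the thread maps to RMAN's
    NOT BACKED UP SINCE TIME clause. Names here are illustrative only.
    """
    if channels < 1:
        raise ValueError("at least one channel is required")
    names = [f"ch{i:02d}" for i in range(channels)]
    lines = ["RUN {"]
    # One SBT (media manager) channel per parallel backup stream.
    lines += [f"  ALLOCATE CHANNEL {n} TYPE 'SBT_TAPE';" for n in names]
    # On a rerun, files already backed up within the window are skipped.
    lines.append(
        f"  BACKUP NOT BACKED UP SINCE TIME 'SYSDATE-{window_days}' DATABASE;"
    )
    lines += [f"  RELEASE CHANNEL {n};" for n in names]
    lines.append("}")
    return "\n".join(lines)
```

Calling `build_restartable_rman_script(6)` produces a RUN block with six channels. Under the assumption above, rerunning that same block after a channel failure should pick up only the files that did not complete; how this interacts with a particular NetBackup policy would need to be checked in your environment.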

        Most of our issues went away once we moved to Data Domain; so far the appliances accept the data as fast as we can send it, and we were able to increase the number of channels.

        BEWARE! If you are sending packets over the LAN, there is CPU overhead caused by encapsulating the TCP/IP packets. I was able to push my system to 100% CPU by increasing the number of channels too high. That is why we use Fibre Channel for most of our large databases. We are implementing 10G networking; we shall see how that goes...
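The manual watchdog described in this reply (OPS staff killing and restarting the job when the number of child jobs drops below a level) could be automated along the following lines. This is only a sketch under stated assumptions: `watch_backup`, `count_active_channels` and `restart_backup` are hypothetical names, and how the count is obtained (for example from the NetBackup job list or from the database sessions) is left to the caller.

```python
import time
from typing import Callable


def watch_backup(
    count_active_channels: Callable[[], int],  # hypothetical: returns live child jobs
    restart_backup: Callable[[], None],        # hypothetical: kills and reruns the job
    min_channels: int,
    max_polls: int,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the backup and restart it once too few channels remain.

    Returns True if a restart was triggered, False if every poll
    saw at least min_channels active channels.
    """
    for _ in range(max_polls):
        if count_active_channels() < min_channels:
            restart_backup()
            return True
        sleep(poll_seconds)
    return False
```

Passing the two callables in keeps the threshold logic separate from site-specific commands. A restart is only cheap if the backup script itself is restartable, which is the point of the SYSDATE-X approach mentioned in this reply.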

         

  • Have a look at this post https://vox.veritas.com/t5/NetBackup/Oracle-Application-Backups/m-p/843736 for replies by Genericus.
    You will see how an RMAN script can be customized to rerun failed jobs.

    If you have problematic drives or media, it is not a good idea to have multiple channels going to the same drive/media. Look at the schedule MPX levels.

    You may also want to consider introducing disk as the first stage of your backups,
    preferably with dedupe if the budget allows.
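The MPX point can be made concrete with a little arithmetic. The sketch below assumes that every channel multiplexed onto a drive is lost when that drive or its tape fails, which matches the incident in the original post (6 channels, 3 of them on the drive that failed). The function name and parameters are hypothetical.

```python
def surviving_channels(total_channels: int, channels_per_drive: int,
                       failed_drives: int) -> int:
    """Channels still running after some drives fail.

    Assumption: a drive or media failure takes down every channel
    multiplexed onto that drive, and nothing else.
    """
    if min(total_channels, channels_per_drive, failed_drives) < 0:
        raise ValueError("arguments must be non-negative")
    lost = min(total_channels, channels_per_drive * failed_drives)
    return total_channels - lost


# The incident from the original post: 3 channels shared one drive.
print(surviving_channels(6, channels_per_drive=3, failed_drives=1))
# The same failure with one channel per drive.
print(surviving_channels(6, channels_per_drive=1, failed_drives=1))
```

With three channels per drive, one bad tape halves the backup's streams; with one channel per drive, the same fault costs a single stream, at the price of needing more drives.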