Solved: Oracle Backup channels failing

jls987 · ‎12-18-2017

We are having a problem when one or more of the channels for an RMAN backup fail. The backup starts with 6 channels; if one fails, the backup continues with 5. This past weekend, 3 channels going to the same drive/tape all failed with a media error, and the backup continued with only 3 channels, which is not enough bandwidth to complete the backup in the allowed time. As a stopgap, we're going to increase the number of channels so that if something fails, we'll still have enough to continue effectively. Is there a better solution that lets a failed channel be restarted or reused?
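A rough way to size the stopgap is to work out how many channels the backup window needs and then add spares for the failures you want to survive. The sketch below assumes every channel sustains the same throughput and uses decimal units; the function name, the 48-hour window, and the 50 MB/s per-channel rate are illustrative assumptions, not figures from this environment (only the 43TB size appears later in the thread).

```python
import math


def channels_needed(size_tb: float, window_hours: float,
                    per_channel_mb_s: float, spare: int = 0) -> int:
    """Channels required to move size_tb within window_hours, plus spares.

    Assumes each channel sustains per_channel_mb_s and 1 TB = 1,000,000 MB.
    """
    required_mb_s = size_tb * 1_000_000 / (window_hours * 3600)
    base = math.ceil(required_mb_s / per_channel_mb_s)
    return base + spare


if __name__ == "__main__":
    # Hypothetical inputs: 43 TB, 48-hour window, 50 MB/s per channel,
    # and room to lose 3 channels (one drive) without missing the window.
    print(channels_needed(43, 48, 50, spare=3))
```

Measure the real per-channel rate before relying on the result; as noted later in the thread, adding channels over the LAN also adds CPU load.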

Mouse · ‎12-18-2017

Job control in RMAN is controlled by RMAN, so if a channel fails, it's not NBU's job to restart it. I'd focus on going down the rabbit hole of why the channels fail. I know customers with literally thousands of Oracle jobs daily and a 100% success rate, which just highlights that NBU is reliable software when it comes to backing up Oracle, so if it's an infrastructure issue, it has to be fixed first.

Marianne · ‎12-18-2017

Have a look at this post https://vox.veritas.com/t5/NetBackup/Oracle-Application-Backups/m-p/843736 for replies by @Genericus.
You will see how an RMAN script can be customized to rerun failed jobs.

If you have problematic drives or media, then it is not a good idea to have multiple channels going to the same drive/media. Look at the schedule MPX (multiplexing) levels.

You may also want to consider introducing disk as the first stage of your backups.
Dedupe preferably, if budget allows.

jls987 · ‎12-19-2017

Mouse, in this case we know what the error was: tape media. It's just the price of using tapes. Our environment is slowly moving to a disk-based storage solution, but it's going to take time. I know how the backup works from the NBU side. I was hoping for something that could be added to the RMAN script that might help. RMAN scripting is not something I do much of.

Marianne, it's nice to interact with you again. It's been a while and a few jobs later. That link you provided may be helpful. I have not yet seen the script they are using, so I don't know what they are doing. We have a meeting in a little while to discuss this. We did talk about making changes to the MPX as you mentioned, but we have not made those changes yet. As I stated above, we'll be moving to disk; it's a slow process around here. The new environment is stood up (8.x, BOOST with Data Domain); it's just a matter of when all the clients get moved over. The first priority is the large VM environment without DB or other special needs.

I'm going to leave this "open" for a few more days in case an RMAN expert sees it and has any suggestions.

Genericus · ‎12-19-2017

I do not know of any mechanism to automatically restart channels; if one fails, the job continues with one fewer channel.

Our only solution is to run the backup as SYSDATE-X, so the job can be restarted without attempting to back up pieces that are already done. Also, we had our OPS staff monitor the job, and kill and restart it if the number of child jobs fell below a certain level.
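As an illustration of the SYSDATE-X approach, the sketch below generates an RMAN command file that uses RMAN's `NOT BACKED UP SINCE TIME` clause, so a restarted run skips files that already have a backup newer than the cutoff. This is a minimal sketch, not the script from the linked thread: the function name, channel names, and default values are assumptions, and a real NetBackup script would normally also pass policy and schedule settings to each channel.

```python
def build_rman_script(channels: int = 6, since_days: int = 1,
                      filesperset: int = 1) -> str:
    """Return RMAN command-file text for a restartable database backup.

    Channel names (ch01, ch02, ...) and the defaults are illustrative only.
    """
    lines = ["RUN {"]
    for i in range(1, channels + 1):
        # One SBT (media manager) channel per parallel backup stream.
        lines.append(f"  ALLOCATE CHANNEL ch{i:02d} TYPE 'SBT_TAPE';")
    lines.append("  BACKUP")
    # Skip files already backed up within the last `since_days` days,
    # so a kill-and-restart does not redo completed pieces.
    lines.append(f"    NOT BACKED UP SINCE TIME 'SYSDATE-{since_days}'")
    lines.append(f"    FILESPERSET {filesperset}")
    lines.append("    DATABASE;")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rman_script())
```

Choose X so that it reaches back to the start of the run being restarted but not to the previous backup cycle, otherwise the new run would skip files it is supposed to back up.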

Most of our issues went away once we went to the Data Domains; so far they accept the data as fast as we can send it. We were able to increase the number of channels.

BEWARE! If you are sending packets over the LAN, there is CPU overhead caused by encapsulating the TCP/IP packets. I was able to push my system to 100% CPU by increasing the number of channels too high. That is why we use Fibre Channel for most of our large databases. We are implementing 10G networking; we shall see how that goes...

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS

jls987 · ‎12-19-2017

Genericus, I was hoping you might chime in. I saw the other thread that Marianne linked to. Unfortunately, we don't have a 24x7 ops staff to monitor this stuff, so if there was a failure, it's not until Monday that we find out something happened and the backup is still running on fewer channels. I'm looking at the script(s) to see what they are doing and to have them add that SYSDATE parameter. I spoke to teammates about just moving it to the new env, but what they really want to do is have RMAN go direct to Data Domain for the larger DBs, and this one is 43TB. So we have some options we're exploring. Thanks, everyone, for your input.
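Without round-the-clock operators, the manual check Genericus describes (restart when the number of child jobs falls below a threshold) could in principle be automated. The sketch below shows only the decision loop; `count_active` and `restart` are hypothetical callables that you would have to implement against your own job monitoring and restart procedure, and nothing here calls a real NetBackup or RMAN command.

```python
import time
from typing import Callable


def watch_channels(count_active: Callable[[], int],
                   restart: Callable[[], None],
                   minimum: int,
                   polls: int,
                   interval_s: float = 0.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll the active channel count and restart when it drops too low.

    count_active and restart are placeholders supplied by the caller.
    Returns the number of restarts that were triggered.
    """
    restarts = 0
    for _ in range(polls):
        if count_active() < minimum:
            # Too few channels left to finish in the window: kill and rerun.
            restart()
            restarts += 1
        sleep(interval_s)
    return restarts


if __name__ == "__main__":
    # Simulated readings: the run drops from 6 channels to 3, then recovers.
    readings = iter([6, 5, 3, 6])
    n = watch_channels(lambda: next(readings),
                       lambda: print("restarting backup"),
                       minimum=4, polls=4)
    print(n, "restart(s)")
```

A restart like this only pays off if the backup itself is restartable, which is what the SYSDATE-X setting is for.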

Genericus · ‎12-22-2017

I send mine directly to my Data Domains via Fibre Channel and get excellent throughput (and do not have those pesky tape issues), although it was not until recently that they supported FC BOOST, so I was forced to use VTL. If you do VTL, make sure to use smaller tapes if you need to duplicate the images. I set mine as LTO3 and 50GB, so I get less tape contention when I duplicate them...

I find I get better deduplication using FILESPERSET=1, although I get better throughput at FILESPERSET=3...

BTW - my 40TB database restores from my Data Domain to my backup server FASTER than it backs up, which is nice...

