Oracle RMAN backup status 13, timer expired

road · ‎03-16-2020

Hi all

Occasionally, but not regularly, one of our large Oracle backups ends with status 13.

Archive log backups always succeed, and so do the FULLs. The failures only affect our diffs.

We have a few thousand jobs that back up to a 5330HA cluster. Our Oracle clients run NBU 8.1.2 on RHEL 7. The media server is Appliance 3.1.2, with the latest MSDP EEB bundle.

Here is an excerpt from the Job Details of the parent job:

16.mar.2020 03:39:20 - Info bphdb (pid=120725) INF - input datafile file number=00029 name=+DATA1/PXCDBL_DBM/82CDF1F23C1993F2E053C443F80AE973/DATAFILE/datex_lob.2830.1001260637
16.mar.2020 03:39:20 - Info bphdb (pid=120725) INF - input datafile file number=00024 name=+DATA1/PXCDBL_DBM/82CDF1F23C1993F2E053C443F80AE973/DATAFILE/system.2824.1001259231
16.mar.2020 03:39:21 - Info bphdb (pid=120725) INF - input datafile file number=00027 name=+DATA1/PXCDBL_DBM/82CDF1F23C1993F2E053C443F80AE973/DATAFILE/users.2828.1001259241
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - released channel: ch00
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - released channel: ch01
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - RMAN-00571: ===========================================================
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - RMAN-00571: ===========================================================
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - RMAN-03009: failure of backup command on ch00 channel at 03/16/2020 07:25:33
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - ORA-19511: non RMAN, but media manager or vendor specific failure, error text:
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - Failed to process backup file <bk_dPXCDBL_un2ur6tn8_s29410_p1_t1035171560>
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - ORA-19502: write error on file "bk_dPXCDBL_un2ur6tn8_s29410_p1_t1035171560", block number 1 (block size=8192)
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - ORA-27030: skgfwrt: sbtwrite2 returned error
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - ORA-19511: non RMAN, but media manage
16.mar.2020 07:25:50 - Info bphdb (pid=120725) INF - Recovery Manager complete.
16.mar.2020 07:25:55 - Info bphdb (pid=120725) INF - End of Recovery Manager output.
16.mar.2020 07:25:55 - Info bphdb (pid=120725) INF - End Oracle Recovery Manager.

Here is an excerpt from the Job Details of the failing job:

16.mar.2020 03:39:35 - Info bptm (pid=233514) start backup
16.mar.2020 03:40:38 - Info bptm (pid=233514) backup child process is pid 235121
16.mar.2020 03:40:38 - begin writing
16.mar.2020 06:40:39 - Error bpbrm (pid=233505) socket read failed: errno = 62 - Timer expired
16.mar.2020 06:40:41 - Error bptm (pid=233514) media manager terminated by parent process

Obviously there is a timeout involved here, but where? The failing job always times out after exactly 3 hrs.
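To double-check the "exactly 3 hrs" observation, the gap between "begin writing" and the socket read failure can be computed straight from the job details. This is only a rough sketch: the `parse_stamp` and `elapsed` helpers are made up for illustration, and the month table is an assumption (the log only shows "mar"; the other abbreviations are guessed).

```python
import re
from datetime import datetime, timedelta

# Month abbreviations as they might appear in the job details.
# ASSUMPTION: only "mar" is seen in the log; the rest are guessed.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "mai": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "des": 12}

# Matches the leading "16.mar.2020 03:40:38" part of a job detail line.
STAMP = re.compile(r"^(\d{1,2})\.([a-z]{3})\.(\d{4}) (\d{2}):(\d{2}):(\d{2})")


def parse_stamp(line: str) -> datetime:
    """Extract the timestamp at the start of a job detail line."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    day, mon, year, hh, mm, ss = m.groups()
    return datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def elapsed(start_line: str, end_line: str) -> timedelta:
    """Time between two job detail lines."""
    return parse_stamp(end_line) - parse_stamp(start_line)


# The two lines below are copied from the failing job's details.
begin = "16.mar.2020 03:40:38 - begin writing"
fail = ("16.mar.2020 06:40:39 - Error bpbrm (pid=233505) "
        "socket read failed: errno = 62 - Timer expired")

gap = elapsed(begin, fail)
print(gap, int(gap.total_seconds()))  # 3:00:01 10801
```

So the read fails one second after a 10800-second window, which suggests a timer of exactly 3 hours somewhere between the client and the media server.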

Michal_Mikulik1 · ‎03-16-2020

Hello,

This kind of issue is better solved with support; however, here are some hints:

- Does 3 hrs correspond to any timeout in the NBU configuration (Client Read Timeout etc.)?

- During these 3 hrs, is Bytes Written in the corresponding Job Details increasing, stuck at some value, or stuck at zero?

- If possible, try switching to client-side dedup. It is usually quicker, so the backup is more likely to complete before any timeout.

- Is it a Copilot backup or a traditional RMAN backup? Consider Copilot (incremental merge).
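The first hint can be checked roughly like this: collect the timeout settings as `KEY = value` text and see whether any of them is close to the observed 3 hours. This is a sketch only. The helper names are made up, and the sample values are assumptions for illustration (300 being the usual default), not taken from the affected hosts.

```python
def parse_conf(text: str) -> dict:
    """Parse 'KEY = value' lines, keeping only integer-valued settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value.isdigit():
            settings[key.strip()] = int(value)
    return settings


def matching_timeouts(settings: dict, observed_seconds: int, tolerance: int = 5) -> list:
    """Names of settings whose value is within `tolerance` seconds of the observed gap."""
    return [name for name, value in settings.items()
            if abs(value - observed_seconds) <= tolerance]


# ASSUMPTION: illustrative values only, paste the real settings here.
SAMPLE = """
CLIENT_READ_TIMEOUT = 300
CLIENT_CONNECT_TIMEOUT = 300
"""

observed = 3 * 60 * 60  # the job fails 3 hours after "begin writing"
print(matching_timeouts(parse_conf(SAMPLE), observed))
```

An empty list means none of the listed settings explains the 3 hours, which would point to something outside these values (for example a network device in the path).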

Regards

Michal

road · ‎03-17-2020

Thank you for your hints!

I suspect that the job is queued in NetBackup and is waiting until resources become available. We have 2 streams per Oracle Intelligent Policy, with a high priority setting in the policy. Other streams running in the same policy exit with status 0.

I have asked the firewall team whether they have any settings that could explain the timeout after exactly 3 hrs.

Client Read Timeout is set to the default of 300 sec.

No data is written for the failed stream.

Client-side dedupe is not used, and neither is Copilot.

I will raise a support case if necessary.

Thanks again!

liuyl · ‎06-16-2021

I have the same issue!

Increasing client_read_timeout to 7200 or higher on the media server does not resolve the problem.
