Oracle Database Backup Failures

backup-botw · ‎12-10-2015

We are seeing some issues with DB backups on a large Oracle database. It was working without issue before, but for some reason is now having issues. Nothing has changed with the policy and the failures started prior to our 7.7.1 upgrade earlier this month. The DBA's seem to be having trouble figuring out what is going on and Oracle keeps pointing them back to the backup software vendor so I figured I would reach out here to try and track something down.

Master Server: Solaris 10 NB version 7.7.1

Media Server: Same as above

Client: Linux xp53db015b 3.8.13-68.3.2.el6uek.x86_64 #2 SMP Tue Jun 9 17:07:32 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Client NB Version: 7.6.0.2 (Debating on upgrading the client version today to 7.7.1

Detailed Job Status:

1: (54) timed out connecting to client

12/09/2015 19:38:29 - Info nbjm (pid=12665) starting backup job (jobid=4098157) for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental
12/09/2015 19:38:29 - Info nbjm (pid=12665) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=4098157, request id:{B70CDF3E-9EDE-11E5-B8D5-002128A6F97C}) 
12/09/2015 19:38:29 - requesting resource xp53tape001-hcart-robot-tld-1
12/09/2015 19:38:29 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53db015b 
12/09/2015 19:40:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001 
12/09/2015 19:42:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001
12/09/2015 19:44:27 - awaiting resource xp53tape001-hcart-robot-tld-1. Waiting for resources. 
          Reason: Drives are in use, Media server: xp53tape001, 
          Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A, 
          Volume Pool: Daily, Storage Unit: xp53tape001-hcart-robot-tld-1, Drive Scan Host: N/A, 
          Disk Pool: N/A, Disk Volume: N/A  
12/09/2015 20:50:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001
12/09/2015 20:50:35 - awaiting resource xp53tape001-hcart-robot-tld-1. Maximum job count has been reached for the storage unit. 
12/09/2015 20:50:41 - awaiting resource xp53tape001-hcart-robot-tld-1. Waiting for resources. 
          Reason: Drives are in use, Media server: xp53tape001, 
          Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A, 
          Volume Pool: Daily, Storage Unit: xp53tape001-hcart-robot-tld-1, Drive Scan Host: N/A, 
          Disk Pool: N/A, Disk Volume: N/A 
12/10/2015 04:04:30 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001
12/10/2015 04:05:30 - granted resource  xp53tape001.NBU_CLIENT.MAXJOBS.xp53db015b
12/10/2015 04:05:30 - granted resource  OM2136
12/10/2015 04:05:30 - granted resource  HP.ULTRIUM4-SCSI.004
12/10/2015 04:05:30 - granted resource  xp53tape001-hcart-robot-tld-1
12/10/2015 04:05:31 - estimated 68183692 kbytes needed
12/10/2015 04:05:31 - Info nbjm (pid=12665) started backup (backupid=xp53db015b_1449741930) job for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental on storage unit xp53tape001-hcart-robot-tld-1
12/10/2015 04:05:32 - Info bpbrm (pid=12610) xp53db015b is the host to backup data from
12/10/2015 04:05:32 - Info bpbrm (pid=12610) reading file list for client
12/10/2015 04:05:32 - started process bpbrm (pid=12610)
12/10/2015 04:05:32 - connecting
12/10/2015 04:05:34 - Info bpbrm (pid=12610) listening for client connection
12/10/2015 04:05:37 - Info bpbrm (pid=12610) INF - Client read timeout = 14400
12/10/2015 08:05:37 - Error bpbrm (pid=12610) listen for client timeout during accept from data listen socket after 60 seconds
12/10/2015 08:05:37 - Info dbclient (pid=0) done. status: 54: timed out connecting to client
12/10/2015 08:05:37 - end writing
timed out connecting to client  (54)

I have attached logs from the client. If anything else would be needed please let me know.

Michal_Mikulik1 · ‎12-10-2015

Hello,

I see that it is an incremetal backup schedule. Maybe Oracle has problems to identify incremental changes in the database - consult your DBA.

Full database backups work? And what about filesystem backups? Controlfile backups?

Please attach job details as attachments, or at least omit unimportant messages. There are thousands of "Waiting in NetBackup scheduler work queue " messages printed here which have no relevance for the issue.

Regards

Michal

areznik · ‎12-10-2015

The logs you attached cut off at 12/10/15 06:52; the job details you posted have the final error at 08:05... so is the log incomplete or is there just nothing posted in the log at that time? (specifically im looking at dbclient log)

backup-botw · ‎12-10-2015

My DBA is pointing the finger at the backup software. They have no idea whats going on.

We upgraded to 7.7.1 on Dec 1st and now we are not able to run reports via the console and get any data priot to that date. I will check and see what the last full did in OpsCenter if its available there. Filesystem backups finish with no issue. I believe the controlfile backups are done via the Oracle Intelligent Policy as well...I would have to go through the jobs to find the one that did the control file, but in the few jobs that failed I can see its not the controlfile.

Ill do that. The new forum setup seems to post it differently. It used to put the data in a code snippet box.

backup-botw · ‎12-10-2015

My guess would be that the log is incomplete. I pulled all the directories for the logs and then zipped them up. When I did that the job had already completed so one would think the logs would be complete as well.

It's worth noting that not all the jobs fail on this database. Typically I only see 2 out of the many that actually have an error.

backup-botw · ‎12-10-2015

Looks like the full backup had an issue as well.

Arshad_Khateeb · ‎12-10-2015

How many types of backups do you have for this database. (Ex Weekly Full/Daily Incremental/Monthly Full/etc)?

Is this an issue with all these backups?

Have you tried backup with different schedules. Most of such issues occurs from client end scripts, incase someone plays with it or something like the server patched, etc.

Get System Administrator from the respective Client OS (I guess Linux in this case) and have him diff the scripts. This will help finding the differences in the DB scripts for different backup schedules. If there are differences then hit the ball in DBA court and have them fix the issue.

Let DBA create a new fresh script for this database and give it a try from your end.

backup-botw · ‎12-10-2015

Daily Incremental and a Full that happens the 3rd Wednesday of every month.

This is an issue with all backups.

Can you add more detail as to what scripts you would recommend checking to be precise?

Marianne · ‎12-10-2015

I see this: 12/10/2015 04:05:32 - connecting 12/10/2015 04:05:34 - Info bpbrm (pid=12610) listening for client connection Never Connected. Although bpcd seems fine with connections from this media server. I see a couple of <16>'s in bpcd log but it seems to be unrelated. I was once battling with status 54's on Oracle client but every single connection test was successful. Someone suggested reinstallation of NBU Client and Oracle agent (it was separate in those days ). Well, it worked. No explanation of why. It just did. So, you may as well schedule client upgrade as this will remove old binaries and install new ones. Another thought - when you see this Connecting' in job details without 'Connected' - please check 'netstat -a' output on the client. There may simply so many tcp port connections that the client cannot handle any more. But then I would expect bpcd to reflect something to this effect....

Handy NetBackup Links

Michael_G_Ander · ‎12-10-2015

Increasing the CLIENT_READ_TIMEOUT and CLIENT_READ_TIMEOUT on the master, media servers and client can often help with timeout issues on Oracle backups.

Looking at the job details, the underlying problem seems to that the backup ois waiting an available tape drive, so moving other backups away from the time slot could also be an option.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Marianne · ‎12-10-2015

Timeouts seem to be already very high:

Client read timeout = 14400

Handy NetBackup Links

Michael_G_Ander · ‎12-11-2015

Whoops, one of the timeouts should have been CLIENT_CONNECT_TIMEOUT, which seems to have more impact on database backups than file backups, have used it to avoid status 54 on Oracle backups on very active backup systems.

The timeout is very high, but on the other hand the backup seems to be waiting all night on a tape resource all night from 19:44:27 to 04:05:30 so it might not be enough.

In my experience Oracle backups needs to get the write media inside a reasonable time to avoid timeout.

To me this look like a resource/scheduling issue in the backup system

Besides that I would look at block change tracking in Oracle if it is not already used, as it minimize the time Oracle uses before sending data.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

backup-botw · ‎12-11-2015

Thank you much for the recommendation. I went ahead and updated the NB client on both servers in their cluster and I am going to see if that has made a difference at all.

I also spoke with the DBA and they are actually replacing the 2 old nodes with 3 new ones so those 3 have a fresh 7.7.1 client install on them.

backup-botw · ‎12-11-2015

Client read is 14400 on the master, media and clients. Client connect is 14400 on the master and media.

backup-botw · ‎12-11-2015

And today its a different error...


12/10/2015 23:00:01 - Info nbjm (pid=12665) starting backup job (jobid=4105333) for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental
12/10/2015 23:00:01 - Info nbjm (pid=12665) requesting MEDIA_SERVER_ONLY resources from RB for backup job (jobid=4105333, request id:{08999DA2-9FC4-11E5-9AEE-002128A6F97C})
12/10/2015 23:00:01 - requesting resource xp53tape001-hcart-robot-tld-1
12/10/2015 23:00:01 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_xp53db015b_pdst
12/10/2015 23:02:30 - granted resource  xp53tape001.NBU_POLICY.MAXJOBS.Omaha_xp53db015b_pdst
12/10/2015 23:02:30 - estimated 0 kbytes needed
12/10/2015 23:02:30 - Info nbjm (pid=12665) started backup (backupid=xp53db015b_1449810150) job for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental on storage unit 
12/10/2015 23:02:32 - started process bpbrm (pid=3510)
12/10/2015 23:02:33 - Info bpbrm (pid=3510) xp53db015b is the host to backup data from
12/10/2015 23:02:33 - Info bpbrm (pid=3510) reading file list for client
12/10/2015 23:02:33 - connecting
12/10/2015 23:02:35 - Info bpbrm (pid=3510) starting bphdb on client
12/10/2015 23:02:35 - Info bphdb (pid=24368) Backup started
12/10/2015 23:02:35 - Info bphdb (pid=24368) Waiting for the child status
12/10/2015 23:02:35 - connected; connect time: 0:00:00
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - ************************************************************
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - Backup Information:
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - 	Client: xp53db015b
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - 	ORACLE_SID: pdst2
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - 	ORACLE_HOME: /u1/app/oracle/product/11.2.0/db_1
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - 	Online (Hot) Stream backup requested
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - 	Objects being backed up:
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - 		Whole Database
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - 		Archive Logs
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - 		Control File
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - 	Number of channels to allocate (database): 8
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - 	Number of channels to allocate (archive logs): 4
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - 	Oracle Authentication is being used
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - 	A Recovery Catalog is being used
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - 	RMAN Default Limits will be used (Database)
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - 	RMAN Default Limits will be used (Archive Logs)
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - 	Backup Set Identifier: pdst_full_date
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - 	Backup File Name Formats:
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - 		Database: bk_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - 		Archive Log: arch_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - 		Control File: ctrl_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - End of Backup Information
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - ************************************************************
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - Database status is: SHUTDOWN
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - Database open mode is: NOT MOUNTED
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - ARCHIVE LOG mode is: NOARCHIVELOG
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - CDB mode: NOT ENABLED
12/10/2015 23:03:55 - Error bpbrm (pid=3510) from client xp53db015b: ERR - Script exited with status = -5407 <Database is in the wrong state (must be OPEN) for the requested action>
12/10/2015 23:04:02 - Error bpbrm (pid=3510) from client xp53db015b: ERR - bphdb exit status = 5407: Database is in the wrong state (must be OPEN) for the requested action
12/10/2015 23:04:03 - Info bphdb (pid=24368) done. status: 5407: Database is in the wrong state (must be OPEN) for the requested action
12/10/2015 23:04:03 - end writing
Database is in the wrong state (must be OPEN) for the requested action  (5407)

areznik · ‎12-11-2015

12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - Database status is: SHUTDOWN
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - Database open mode is: NOT MOUNTED

Now that is an error for your DBA to fix, DB has to be up to take a hot backup.

I'm guessing you updated the version of the client? They greatly improved error reporting for the different policy types in the recent versions of NB, so this may be the same condition you were getting before except the new version of netbackup is better able to recognize whats going on.

Marianne · ‎12-11-2015

To get back to the original error (status 54) - is this happening when backup is waiting for a tape for extended periods? I think Michael is right - the processes on the client where the backup was initiated many hours ago may have timed out. Long timeouts on media servers won't help here.

Handy NetBackup Links

Tape_Archived · ‎12-11-2015

A thought - Please make sure you have all the required log directories created on client under /usr/openv/netbackup/logs

punbs · ‎12-12-2015

Please ensure that database is in Open State & Mounted. To perfrom the Log Backup ensure that Log style is kept in the ARCHIVELOG if its in NOARCHIVELOG archive log backup will be not possible. DB Admin can log in to instace & give the details. Please cross-check once.

Nicolai · ‎12-14-2015

If there is a firewall between master/media servers, ensure TCP_KEEPALIVE is configured to 15 minutes on master & media servers.

Basically, if there firewall will close close the connection between master/media and client after 30 minutes of no activity - when the client then try to send data it will discover the connection is closed and produce a status 54.

See this technote under "resolution"

https://www.veritas.com/support/en_US/article.TECH125896