β12-10-2015 07:18 AM
We are seeing some issues with DB backups on a large Oracle database. It was working without issue before, but for some reason is now having issues. Nothing has changed with the policy and the failures started prior to our 7.7.1 upgrade earlier this month. The DBA's seem to be having trouble figuring out what is going on and Oracle keeps pointing them back to the backup software vendor so I figured I would reach out here to try and track something down.
Master Server: Solaris 10 NB version 7.7.1
Media Server: Same as above
Client: Linux xp53db015b 3.8.13-68.3.2.el6uek.x86_64 #2 SMP Tue Jun 9 17:07:32 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux
Client NB Version: 7.6.0.2 (Debating on upgrading the client version today to 7.7.1
Detailed Job Status:
1: (54) timed out connecting to client
12/09/2015 19:38:29 - Info nbjm (pid=12665) starting backup job (jobid=4098157) for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental 12/09/2015 19:38:29 - Info nbjm (pid=12665) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=4098157, request id:{B70CDF3E-9EDE-11E5-B8D5-002128A6F97C}) 12/09/2015 19:38:29 - requesting resource xp53tape001-hcart-robot-tld-1 12/09/2015 19:38:29 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53db015b 12/09/2015 19:40:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001 12/09/2015 19:42:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001 12/09/2015 19:44:27 - awaiting resource xp53tape001-hcart-robot-tld-1. Waiting for resources. Reason: Drives are in use, Media server: xp53tape001, Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A, Volume Pool: Daily, Storage Unit: xp53tape001-hcart-robot-tld-1, Drive Scan Host: N/A, Disk Pool: N/A, Disk Volume: N/A 12/09/2015 20:50:29 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001 12/09/2015 20:50:35 - awaiting resource xp53tape001-hcart-robot-tld-1. Maximum job count has been reached for the storage unit. 12/09/2015 20:50:41 - awaiting resource xp53tape001-hcart-robot-tld-1. Waiting for resources. Reason: Drives are in use, Media server: xp53tape001, Robot Type(Number): TLD(1), Media ID: N/A, Drive Name: N/A, Volume Pool: Daily, Storage Unit: xp53tape001-hcart-robot-tld-1, Drive Scan Host: N/A, Disk Pool: N/A, Disk Volume: N/A 12/10/2015 04:04:30 - Info nbjm (pid=12665) Waiting in NetBackup scheduler work queue on server xp53tape001 12/10/2015 04:05:30 - granted resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53db015b 12/10/2015 04:05:30 - granted resource OM2136 12/10/2015 04:05:30 - granted resource HP.ULTRIUM4-SCSI.004 12/10/2015 04:05:30 - granted resource xp53tape001-hcart-robot-tld-1 12/10/2015 04:05:31 - estimated 68183692 kbytes needed 12/10/2015 04:05:31 - Info nbjm (pid=12665) started backup (backupid=xp53db015b_1449741930) job for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental on storage unit xp53tape001-hcart-robot-tld-1 12/10/2015 04:05:32 - Info bpbrm (pid=12610) xp53db015b is the host to backup data from 12/10/2015 04:05:32 - Info bpbrm (pid=12610) reading file list for client 12/10/2015 04:05:32 - started process bpbrm (pid=12610) 12/10/2015 04:05:32 - connecting 12/10/2015 04:05:34 - Info bpbrm (pid=12610) listening for client connection 12/10/2015 04:05:37 - Info bpbrm (pid=12610) INF - Client read timeout = 14400 12/10/2015 08:05:37 - Error bpbrm (pid=12610) listen for client timeout during accept from data listen socket after 60 seconds 12/10/2015 08:05:37 - Info dbclient (pid=0) done. status: 54: timed out connecting to client 12/10/2015 08:05:37 - end writing timed out connecting to client (54)
I have attached logs from the client. If anything else would be needed please let me know.
β12-10-2015 08:41 AM
Hello,
I see that it is an incremetal backup schedule. Maybe Oracle has problems to identify incremental changes in the database - consult your DBA.
Full database backups work? And what about filesystem backups? Controlfile backups?
Please attach job details as attachments, or at least omit unimportant messages. There are thousands of "Waiting in NetBackup scheduler work queue " messages printed here which have no relevance for the issue.
Regards
Michal
β12-10-2015 08:51 AM
The logs you attached cut off at 12/10/15 06:52; the job details you posted have the final error at 08:05... so is the log incomplete or is there just nothing posted in the log at that time? (specifically im looking at dbclient log)
β12-10-2015 09:10 AM
My DBA is pointing the finger at the backup software. They have no idea whats going on.
We upgraded to 7.7.1 on Dec 1st and now we are not able to run reports via the console and get any data priot to that date. I will check and see what the last full did in OpsCenter if its available there. Filesystem backups finish with no issue. I believe the controlfile backups are done via the Oracle Intelligent Policy as well...I would have to go through the jobs to find the one that did the control file, but in the few jobs that failed I can see its not the controlfile.
Ill do that. The new forum setup seems to post it differently. It used to put the data in a code snippet box.
β12-10-2015 09:12 AM
My guess would be that the log is incomplete. I pulled all the directories for the logs and then zipped them up. When I did that the job had already completed so one would think the logs would be complete as well.
It's worth noting that not all the jobs fail on this database. Typically I only see 2 out of the many that actually have an error.
β12-10-2015 09:28 AM
Looks like the full backup had an issue as well.
β12-10-2015 11:41 AM
How many types of backups do you have for this database. (Ex Weekly Full/Daily Incremental/Monthly Full/etc)?
Is this an issue with all these backups?
Have you tried backup with different schedules. Most of such issues occurs from client end scripts, incase someone plays with it or something like the server patched, etc.
Get System Administrator from the respective Client OS (I guess Linux in this case) and have him diff the scripts. This will help finding the differences in the DB scripts for different backup schedules. If there are differences then hit the ball in DBA court and have them fix the issue.
Let DBA create a new fresh script for this database and give it a try from your end.
β12-10-2015 12:32 PM
Daily Incremental and a Full that happens the 3rd Wednesday of every month.
This is an issue with all backups.
Can you add more detail as to what scripts you would recommend checking to be precise?
β12-10-2015 10:17 PM
β12-10-2015 11:07 PM
Increasing the CLIENT_READ_TIMEOUT and CLIENT_READ_TIMEOUT on the master, media servers and client can often help with timeout issues on Oracle backups.
Looking at the job details, the underlying problem seems to that the backup ois waiting an available tape drive, so moving other backups away from the time slot could also be an option.
β12-10-2015 11:16 PM
β12-11-2015 12:58 AM
Whoops, one of the timeouts should have been CLIENT_CONNECT_TIMEOUT, which seems to have more impact on database backups than file backups, have used it to avoid status 54 on Oracle backups on very active backup systems.
The timeout is very high, but on the other hand the backup seems to be waiting all night on a tape resource all night from 19:44:27 to 04:05:30 so it might not be enough.
In my experience Oracle backups needs to get the write media inside a reasonable time to avoid timeout.
To me this look like a resource/scheduling issue in the backup system
Besides that I would look at block change tracking in Oracle if it is not already used, as it minimize the time Oracle uses before sending data.
β12-11-2015 06:25 AM
Thank you much for the recommendation. I went ahead and updated the NB client on both servers in their cluster and I am going to see if that has made a difference at all.
I also spoke with the DBA and they are actually replacing the 2 old nodes with 3 new ones so those 3 have a fresh 7.7.1 client install on them.
β12-11-2015 06:41 AM
Client read is 14400 on the master, media and clients. Client connect is 14400 on the master and media.
β12-11-2015 07:24 AM
And today its a different error...
12/10/2015 23:00:01 - Info nbjm (pid=12665) starting backup job (jobid=4105333) for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental
12/10/2015 23:00:01 - Info nbjm (pid=12665) requesting MEDIA_SERVER_ONLY resources from RB for backup job (jobid=4105333, request id:{08999DA2-9FC4-11E5-9AEE-002128A6F97C})
12/10/2015 23:00:01 - requesting resource xp53tape001-hcart-robot-tld-1
12/10/2015 23:00:01 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_xp53db015b_pdst
12/10/2015 23:02:30 - granted resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_xp53db015b_pdst
12/10/2015 23:02:30 - estimated 0 kbytes needed
12/10/2015 23:02:30 - Info nbjm (pid=12665) started backup (backupid=xp53db015b_1449810150) job for client xp53db015b, policy Omaha_xp53db015b_pdst, schedule Incremental on storage unit
12/10/2015 23:02:32 - started process bpbrm (pid=3510)
12/10/2015 23:02:33 - Info bpbrm (pid=3510) xp53db015b is the host to backup data from
12/10/2015 23:02:33 - Info bpbrm (pid=3510) reading file list for client
12/10/2015 23:02:33 - connecting
12/10/2015 23:02:35 - Info bpbrm (pid=3510) starting bphdb on client
12/10/2015 23:02:35 - Info bphdb (pid=24368) Backup started
12/10/2015 23:02:35 - Info bphdb (pid=24368) Waiting for the child status
12/10/2015 23:02:35 - connected; connect time: 0:00:00
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - ************************************************************
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - Backup Information:
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - Client: xp53db015b
12/10/2015 23:02:36 - Info bphdb (pid=24368) INF - ORACLE_SID: pdst2
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - ORACLE_HOME: /u1/app/oracle/product/11.2.0/db_1
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - Online (Hot) Stream backup requested
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - Objects being backed up:
12/10/2015 23:02:44 - Info bphdb (pid=24368) INF - Whole Database
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - Archive Logs
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - Control File
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - Number of channels to allocate (database): 8
12/10/2015 23:02:56 - Info bphdb (pid=24368) INF - Number of channels to allocate (archive logs): 4
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - Oracle Authentication is being used
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - A Recovery Catalog is being used
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - RMAN Default Limits will be used (Database)
12/10/2015 23:03:17 - Info bphdb (pid=24368) INF - RMAN Default Limits will be used (Archive Logs)
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - Backup Set Identifier: pdst_full_date
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - Backup File Name Formats:
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - Database: bk_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:39 - Info bphdb (pid=24368) INF - Archive Log: arch_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - Control File: ctrl_d%d_u%u_s%s_p%p_t%t
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - End of Backup Information
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - ************************************************************
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - Database status is: SHUTDOWN
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - Database open mode is: NOT MOUNTED
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - ARCHIVE LOG mode is: NOARCHIVELOG
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - CDB mode: NOT ENABLED
12/10/2015 23:03:55 - Error bpbrm (pid=3510) from client xp53db015b: ERR - Script exited with status = -5407 <Database is in the wrong state (must be OPEN) for the requested action>
12/10/2015 23:04:02 - Error bpbrm (pid=3510) from client xp53db015b: ERR - bphdb exit status = 5407: Database is in the wrong state (must be OPEN) for the requested action
12/10/2015 23:04:03 - Info bphdb (pid=24368) done. status: 5407: Database is in the wrong state (must be OPEN) for the requested action
12/10/2015 23:04:03 - end writing
Database is in the wrong state (must be OPEN) for the requested action (5407)
β12-11-2015 02:05 PM
12/10/2015 23:03:47 - Info bphdb (pid=24368) INF - Database status is: SHUTDOWN
12/10/2015 23:03:55 - Info bphdb (pid=24368) INF - Database open mode is: NOT MOUNTED
Now that is an error for your DBA to fix, DB has to be up to take a hot backup.
I'm guessing you updated the version of the client? They greatly improved error reporting for the different policy types in the recent versions of NB, so this may be the same condition you were getting before except the new version of netbackup is better able to recognize whats going on.
β12-11-2015 07:45 PM
β12-11-2015 10:29 PM
A thought - Please make sure you have all the required log directories created on client under /usr/openv/netbackup/logs
β12-12-2015 10:37 AM
Please ensure that database is in Open State & Mounted. To perfrom the Log Backup ensure that Log style is kept in the ARCHIVELOG
if its in NOARCHIVELOG
archive log backup will be not possible. DB Admin can log in to instace & give the details. Please cross-check once.
β12-14-2015 01:01 AM
If there is a firewall between master/media servers, ensure TCP_KEEPALIVE is configured to 15 minutes on master & media servers.
Basically, if there firewall will close close the connection between master/media and client after 30 minutes of no activity - when the client then try to send data it will discover the connection is closed and produce a status 54.
See this technote under "resolution"
https://www.veritas.com/support/en_US/article.TECH125896