
Oracle Database backup - system error occurred (130). Stumped!

Toddman214
Level 6

Master and media servers Windows 2008r2 standard

Netbackup 7.5.0.4

 

 

Hello all.

I have run into an issue that started at random, and of course no one knows of any changes made, including me. I've dug into this as far as I can, including considerable phone time with Symantec support, our Oracle DBAs and our Unix team (the server running this database is a Linux media server, to which I have no access).

 

One of our Oracle databases suddenly stopped backing up, showing the following error. These two lines are the entire job detail log. It doesn't even get far enough to generate a full set of logs on my end.

 

5/1/2014 9:04:24 PM - Error bprd(pid=3416) Unable to write progress log </usr/openv/netbackup/logs/user_ops/dbext/logs/49349.0.1398996262> on client pdc00orao533l. Policy=APP_ORA11_OHLPROD3 Sched=Daily_Full 
5/1/2014 9:04:24 PM - Error bprd(pid=3416) CLIENT pdc00orao533l  POLICY APP_ORA11_OHLPROD3  SCHED Daily_Full  EXIT STATUS 130 (system error occurred)

 

The file system on the server backs up fine, but different processes are used there. I kept pushing toward a permission issue, but our Unix engineer is looking and assures me that the oracle user owns the files and CAN write the data. His response is:

"""""[jfarrar@pdc00orao533l ~]$ ls -l /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278

-rw-r--r-- 1 oracle dba 42 May  1 11:54 /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278

[jfarrar@pdc00orao533l ~]$ cat /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278

Backup started Thu May  1 11:54:38 2014

[jfarrar@pdc00orao533l ~]$

The log is clearly inaccurate because the Oracle user owns and wrote to the file.  The Oracle application cannot make a setuid assertion fail.  Rebooting a host because they can't think of anything else to try is not an option.  This is a Production system."""""""

 

The disagreement comes from statements like the ones below in the log files, which I am being told are not accurate:

bpcd (on client):

12:58:22.914 [76635] <16> process_requests: can't become user oracle and group dba

12:58:22.914 [76635] <16> process_requests: setuid failed for user oracle

 

bprd:

11:54:40.001 [5372.4076] <2> append_to_client_log: can't become user oracle on pdc00orao533l.ohlogistics.com

11:54:40.001 [5372.4076] <8> bkarfiles: failure writing progress log on client pdc00orao533l in log /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278:  system error occurred (130)

 

I looked into the possibility that inetd does not have the needed rights, in which case neither would bpcd, but bpcd is running standalone under root and has full root access.

I'm drawing a total blank at this point. The urgency is that we've been digging at this for a couple of days now, and the archive log directory is filling. If it fills, it will shut down the database, and we'll have RF guns and warehouse management systems go offline nationwide. This is why they refuse to restart the client or Oracle, which I have to agree with. Of course, the database stopping is going to have the exact same effect.

 

Please let me know what added details I can provide.

 

Thank you.

 

Todd

1 ACCEPTED SOLUTION


Toddman214
Level 6

Update - The suspected issue in my previous post turned out to be accurate. It was an issue on the Linux client where HugePages was using up enough of the available RAM that the db backup processes would instantly die. They took a planned outage late on Saturday night to adjust the needed parameters, and the backups are running well again.


8 REPLIES 8

Nicolai
Moderator
Partner    VIP   

Some tech notes that also cover the messages you have found:

http://www.symantec.com/docs/TECH77533 (inetd issue)

Does bp.conf on the Oracle client contain VERBOSE = 5 ?

Usually <16> indicates an error where NetBackup can't continue.

Update:

One more - titled "Using VERITAS NetBackup for Oracle the RMAN backup fails with status 130"

http://www.symantec.com/docs/TECH20785

RMAN backup failure - Status 130 (system error occurred) - can't become user oracle on 'client name'

http://www.symantec.com/docs/TECH145990
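
Since the <16> entries are the ones to look at, here is a minimal sketch for pulling them out of a legacy log such as bpcd's. It assumes the line layout seen in this thread (timestamp, [pid], <severity>, message). The function name filter_severity and the log file name in the usage comment are hypothetical, not something from the tech notes.

```shell
#!/bin/sh
# Print only the log lines at a given severity, e.g. 16.
# Expected line layout (as in this thread):
#   HH:MM:SS.mmm [pid] <severity> function: message
# Reads the named files, or stdin when no file is given.
filter_severity() {
    sev="$1"; shift
    grep -E "^[0-9:.]+ \[[0-9.]+\] <${sev}> " "$@"
}

# Hypothetical usage on the client (the log file name is an assumption):
# filter_severity 16 /usr/openv/netbackup/logs/bpcd/log.050114
```

Run against the bpcd lines quoted above, this would keep the two "setuid failed" / "can't become user" entries and drop the <2> debug chatter.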

 

carlos_jimenez
Level 3
Employee

I've troubleshot this issue before.

A status 130 is always related to a system error on the operating system where the function call is made.  Notice the error is not coming from dbclient per se, but from bprd, which runs on the master:

bprd:

11:54:40.001 [5372.4076] <2> append_to_client_log: can't become user oracle on pdc00orao533l.ohlogistics.com

11:54:40.001 [5372.4076] <8> bkarfiles: failure writing progress log on client pdc00orao533l in log /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278:  system error occurred (130)

 

So in this case the master server is reporting that it can't perform a function, but the function runs on the client.  Then notice the setuid failure.

 

bpcd (on client):

12:58:22.914 [76635] <16> process_requests: can't become user oracle and group dba

12:58:22.914 [76635] <16> process_requests: setuid failed for user oracle

 

The master server connects to the client via NBJM, and at some points bpbrm, to write to the progress log in user_ops on the client.  The fact that all you have in user_ops is "Backup started" means the oracle user can write to the log; the next entry is actually written by the master server via NBJM.  When the master server makes a remote connection to the client to update the progress log, it must switch user (the setuid OS call) to run as the oracle user. That call fails, so the progress log can't be updated, and that causes your backup failure.

I've run into this issue about a dozen times.   

I've seen it involve the ulimit for max user processes being hit when it's set too low, i.e. 1024 or less.
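
A rough sketch of how the Unix team could check that on the client. It compares the number of threads owned by the oracle user (on Linux, threads count toward "max user processes") with the soft limit shown in a process's limits file. Which process's limit actually governs the failing setuid call is an assumption here; I'm using bpcd's. All three function names are made up for illustration.

```shell
#!/bin/sh
# Count threads owned by a user; each one counts toward ulimit -u on Linux.
user_thread_count() {
    ps -L -u "$1" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

# Read the soft "Max processes" value from a /proc/<pid>/limits style file.
soft_nproc_limit() {
    awk '/^Max processes/ { print $3 }' "$1"
}

# Print the remaining headroom for a given count and limit.
nproc_headroom() {
    count="$1"; limit="$2"
    if [ "$limit" = "unlimited" ]; then
        echo "unlimited"
    else
        echo $((limit - count))
    fi
}

# Hypothetical usage as root on the client:
# pid=$(pgrep -o -x bpcd)
# nproc_headroom "$(user_thread_count oracle)" "$(soft_nproc_limit /proc/$pid/limits)"
```

A headroom at or near zero would point toward the limit; a large positive number would suggest looking elsewhere.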

You also need to make sure your RMAN script has an entry for switching to the oracle user via su, an example of this can be seen in 

 http://www.symantec.com/docs/TECH20785

The majority of the time it has to do with upgrading the client without bringing down the Oracle server processes, which causes what's in memory to conflict with the new setuid calls.  Fixing that can be problematic, as it requires downtime for the Oracle instance.  Before you do anything like that, make sure you open a case with NetBackup Technical Support so we can verify, as you would need a maintenance window.

Toddman214
Level 6

Thank you, Nicolai and Carlos. I'll check out those ideas. I did finally get a full bpcd log from the Linux client. The issues start toward the bottom, at 11:03.

carlos_jimenez
Level 3
Employee

I looked at the bpcd log you posted, but you've already provided what we need.  The issue is that the master can't switch user via a setuid OS call on the client.  The remote connection goes BPRD > VNETD - PBX > BPCD.  BPCD runs a request from BPRD:

11:06:01.975 [43506] <2> process_requests: BPCD_BECOME_USER_RQST

 

That request translates to the setuid() system call, which changes the user ID the process runs as.  The setuid call fails, and the backup fails because NBJM can't update the progress log via bprd > bpcd.

The issue will be outside of the NetBackup application at this point.  This would be the same troubleshooting done for any instance of the setuid call failing.

Toddman214
Level 6

OK, interesting finding from my Unix engineer. This MAY turn into a resource allocation issue on the client side and could be outside of my scope as the backup dude. It appears that memory allocation is failing when new processes are forked. According to the Unix team: the non-working client has nearly double the AnonHugePages count of a working host with the same amount of RAM.  The non-working host is also actively using HugePages for Oracle (which allocates even more RAM that NetBackup cannot use).  An outage is being scheduled if the business unit will allow it. They really have no choice!
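
For anyone who wants to make the same comparison between a working and a non-working host, here is a minimal sketch that pulls the relevant counters out of /proc/meminfo. The field names are the standard Linux ones; the two function names are hypothetical, and this is not necessarily how our Unix team gathered their numbers.

```shell
#!/bin/sh
# Print the memory and huge page counters from a meminfo-format file
# (defaults to /proc/meminfo), with whitespace normalised.
hugepage_summary() {
    awk '/^(MemTotal|MemFree|AnonHugePages|HugePages_Total|HugePages_Free|Hugepagesize):/ {
        $1 = $1   # rebuild the line with single spaces
        print
    }' "${1:-/proc/meminfo}"
}

# Print the RAM reserved for explicit huge pages, in kB
# (HugePages_Total multiplied by Hugepagesize).
hugepage_reserved_kb() {
    awk '/^HugePages_Total:/ { n = $2 }
         /^Hugepagesize:/    { sz = $2 }
         END { print n * sz }' "${1:-/proc/meminfo}"
}

# Hypothetical usage, run on both hosts and compared side by side:
# hugepage_summary
# hugepage_reserved_kb
```

Setting the reserved figure and AnonHugePages against MemTotal gives a quick feel for how much RAM is left over for everything else, including the NetBackup processes.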

carlos_jimenez
Level 3
Employee

Thanks for the update.  Let us know if that is the root cause please.

Dip
Level 4

Please stop and start the NetBackup client services on the client (I believe the client is the Oracle server pdc00orao533l) using the following commands. Then have the DBA try a backup.

Run this command as root:    /usr/openv/netbackup/bin/bp.kill_all 

Then run /usr/openv/netbackup/bin/bp.start_all

 

 
