Oracle Database backup - system error occurred (130). Stumped!
Master and media servers: Windows Server 2008 R2 Standard
NetBackup 7.5.0.4
Hello all.
I have run into an issue that started at random, and of course no one, myself included, knows of any changes being made. I've dug into this as far as I can, including considerable phone time with Symantec support, our Oracle DBAs, and our Unix team (the server running this database is a Linux media server to which I have no access).
One of our Oracle databases suddenly stopped backing up, showing the following error. These two lines are the entire job detail log. It doesn't even get far enough to generate a full set of logs on my end.
5/1/2014 9:04:24 PM - Error bprd(pid=3416) Unable to write progress log </usr/openv/netbackup/logs/user_ops/dbext/logs/49349.0.1398996262> on client pdc00orao533l. Policy=APP_ORA11_OHLPROD3 Sched=Daily_Full
5/1/2014 9:04:24 PM - Error bprd(pid=3416) CLIENT pdc00orao533l POLICY APP_ORA11_OHLPROD3 SCHED Daily_Full EXIT STATUS 130 (system error occurred)
The file system on the server backs up fine, but different processes are used there. I kept pushing toward a permission issue, but our Unix engineer has looked into it and assures me that the oracle user owns the files and CAN write the data. His response is:
"""""[jfarrar@pdc00orao533l ~]$ ls -l /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278
-rw-r--r-- 1 oracle dba 42 May 1 11:54 /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278
[jfarrar@pdc00orao533l ~]$ cat /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278
Backup started Thu May 1 11:54:38 2014
[jfarrar@pdc00orao533l ~]$
The log is clearly inaccurate because the Oracle user owns and wrote to the file. The Oracle application cannot make a setuid assertion fail. Rebooting a host because they can't think of anything else to try is not an option. This is a Production system."""""""
The disagreement comes from statements like the ones below in the log files, which I am being told are not accurate:
bpcd (on client):
12:58:22.914 [76635] <16> process_requests: can't become user oracle and group dba
12:58:22.914 [76635] <16> process_requests: setuid failed for user oracle
bprd:
11:54:40.001 [5372.4076] <2> append_to_client_log: can't become user oracle on pdc00orao533l.ohlogistics.com
11:54:40.001 [5372.4076] <8> bkarfiles: failure writing progress log on client pdc00orao533l in log /usr/openv/netbackup/logs/user_ops/dbext/logs/31013.0.1398963278: system error occurred (130)
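For anyone combing through larger bpcd and bprd logs for the same symptoms, here is a minimal Python sketch that pulls out lines like the ones above. The line layout (time, [pid], <severity>, function, message) is only inferred from these excerpts, and the keyword list and the function name find_suspect_lines are my own choices for illustration, not anything NetBackup provides.

```python
import re

# Layout assumed from the excerpts above:
#   HH:MM:SS.mmm [pid or pid.tid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>[\d.]+)\] "
    r"<(?P<severity>\d+)> "
    r"(?P<function>\w+): "
    r"(?P<message>.*)$"
)

# Hypothetical keyword list, taken from the messages quoted in this post.
KEYWORDS = ("can't become user", "setuid failed", "system error occurred")


def find_suspect_lines(lines):
    """Return parsed fields for every log line whose message contains a keyword."""
    hits = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not in the assumed layout, skip it
        if any(keyword in match.group("message") for keyword in KEYWORDS):
            hits.append(match.groupdict())
    return hits
```

It matches on message text rather than on the number in angle brackets, because the bprd line reporting the user switch failure above is logged at <2>.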
I looked into the possibility that inetd does not have the needed rights, in which case neither would bpcd, but bpcd is running standalone under root and has full root access.
I'm drawing a total blank at this point. The urgency is that we've been digging at this for a couple of days now, and the archive log directory is filling. If it fills, it will shut down the database, and we'll have RF guns and warehouse management systems go offline nationwide. This is why they refuse to restart the client or Oracle, which I have to agree with. Of course, the database stopping is going to have the exact same effect.
Please let me know what added details I can provide.
Thank you.
Todd
Update - The suspected issue in my previous post turned out to be accurate. It was an issue on the Linux client where HugePages was consuming enough of the available RAM that the database backup processes would die instantly. They took a planned outage late on Saturday night to adjust the needed parameters, and the backups are running well again.
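For anyone who lands here with the same symptoms, this is a rough Python sketch of the kind of check that would have pointed at the memory problem sooner. It reads /proc/meminfo-style text and works out how much RAM is reserved for huge pages, which is memory ordinary processes cannot use. The field names (MemTotal, HugePages_Total, HugePages_Free, Hugepagesize) are the standard Linux ones; the function names are hypothetical, and what counts as "too much" is something you would have to judge for your own system.

```python
import re


def parse_meminfo(text):
    """Return {field name: integer value} from /proc/meminfo-style text (units dropped)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def hugepage_summary(text):
    """Summarise how much RAM is reserved for huge pages, in kB."""
    info = parse_meminfo(text)
    page_kb = info["Hugepagesize"]                      # size of one huge page, kB
    reserved_kb = info["HugePages_Total"] * page_kb     # set aside for huge pages
    unused_kb = info["HugePages_Free"] * page_kb        # reserved but not yet in use
    total_kb = info["MemTotal"]
    return {
        "reserved_kb": reserved_kb,
        "unused_reserved_kb": unused_kb,
        "reserved_fraction": reserved_kb / total_kb,
        "left_for_everything_else_kb": total_kb - reserved_kb,
    }
```

On the client, the input would be the contents of /proc/meminfo, for example hugepage_summary(open("/proc/meminfo").read()). A reserved fraction close to 1 means very little memory is left for anything that does not use huge pages, backup processes included.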