cancel
Showing results for 
Search instead for 
Did you mean: 

Oracle (RMAN) job fails, but NetBackup job is ok

konrad
Level 4
Partner Accredited

Hi All

 

I have a strange situation. The RMAN backup job ends with this error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 11/21/2010 02:50:37
ORA-00604: error occurred at recursive SQL level 1
ORA-06502: PL/SQL: numeric or value error
ORA-19506: failed to create sequential file, name="<name>", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  client process aborted 
 

 

but the NetBackup job has status of "0". This happens during the archive log backup in most cases.

The master, media and clients are on RHEL 5.2. The NetBackup is at 6.5.6. CLIENT_READ_TIMEOUT on a client is set to 5400. The backup job is lunched from a client via cron. The backup data are written to a BasicDisk.

In a dbclient log I can see this:

 

  02:32:06.773 [29038] <2> int_CloseImage: INF - Backup - closing <name>

02:47:12.014 [29038] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/29038.0.1286324889 

It is like the image cannot be closed and the timeout appears, but why after 900s whet CLIENT_READ_TIMEOUT is 5400?

Do you have any idea what is wrong? 

1 ACCEPTED SOLUTION

Accepted Solutions

konrad
Level 4
Partner Accredited

The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.

The solution was:

1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time

2. Do some TCP/IP tuning on master server (RHEL 5.2)

net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216 
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 30000
 
3.  The errors seems to appear rarely. But finely we had to replace a Master Server with a new (powerfull) server box to give more performace

After that the problems ware gone.

View solution in original post

9 REPLIES 9

Marianne
Level 6
Partner    VIP    Accredited Certified

CLIENT_READ_TIMEOUT on the media server is used to determine the timeout.

The timeout error could be a 'red herring'. You will need to look at the entire dbclient log as well as the corresponding period in bprd log on the master server.

konrad
Level 4
Partner Accredited

The backup job was created - I found it. I thought that the backup was not even initiated.

But in nbpem it looks like this

 11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm

 

11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client>  POLICY <policy>  SCHED <sch_name> EXIT STATUS 25 (cannot connect on socket)

(..)

11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket) 
 
I can see some errors in vnetd on a client, but I am not sure they are relevant.
 
 02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1557: vnet_read_service_file failed: 19 0x00000013

02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1548: service_name: inetd_bpcd 
 
 
In a dbclient log I can see the following:
 
 02:44:52.595 [5658] <4> CreateNewImage: INF - backing up File:</DGGFDF_full_12358_1_7762349491>

02:44:53.491 [5658] <2> vnet_async_connect: vnet_vnetd.c.4228: connect: async CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724 fd = 24

02:44:53.491 [5658] <2> logconnections: BPRD CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724

02:44:53.491 [5658] <4> serverResponse: initial client_read_timeout = <5400>

02:44:53.491 [5658] <4> readCommMessages: Entering readCommMessages

02:45:03.491 [5658] <4> serverResponse: read comm file:<02:44:53 Initiating backup>

02:45:03.492 [5658] <4> readCommMessages: Entering readCommMessages

02:50:28.499 [5658] <4> serverResponse: read comm file:<02:50:19 INF - Server status = 50>

02:50:28.499 [5658] <16> serverResponse: ERR - server exited with status 50: client process aborted

02:50:28.499 [5658] <16> CreateNewImage: ERR - serverResponse() failed 
 
 
Any ideas why is it so?

Marianne
Level 6
Partner    VIP    Accredited Certified
CreateNewImage: ERR - serverResponse() failed 

You need bprd on master to troubleshoot this. Look for incoming request from client's IP.

Check if master is able to resolve IP correctly to hostname that corresponds with name in Policy.

konrad
Level 4
Partner Accredited

Hi

 

Thank you for your sugestions.

I have looked into bprd log on the master and the request was successfully proccessed.

I fond a problem in nbpem, look at this:

 

 11/21/2010 02:44:56.876 [Application] NB 51216 nbpem 116 PID:16059 TID:4096248720 File ID:116 [jobid=-1 job_group_id=-1 client=<client> type=2 server= task=ID:0xf36d9e6c policy=<policy>] [Info] V-116-82 received start user backup request for client=<client>, policy=<policy>, schedule=<schedule>, stream #=0, client type=Oracle (4), parent jobid=-1
11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm
11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client> POLICY <policy>  SCHED <schedule> EXIT STATUS 25 (cannot connect on socket)
11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket)
11/21/2010 02:50:26.756 [Application] NB 51216 nbpem 116 PID:16059 TID:4108856208 File ID:116 [No context] [Warning] V-116-226 received completion status 50 for jobid=2242935 but know nothing of the job, will finalize the job 

 

Do you have any idea what does it mean? The server is overloaded and cannot handle connections between nbpem and nbjm? The server statistics seems to be fine.

mpatt
Level 4
Certified

Konrad,

Did you ever this resolved. I have atleast 10 clients which started failing with similar error. My timeout settings on the media server are 

 

CLIENT_READ_TIMEOUT = 4800
CLIENT_CONNECT_TIMEOUT = 4800
SERVER_CONNECT_TIMEOUT = 900
 
Would be great if u can post the solution.

Yogesh9881
Level 6
Accredited

I have atleast 10 clients whi --- is this database level backup or file level backup ?

make sure etc-host entry  in master / client and database backup scripts.

192.168.100.15   myserver01    myserver01.mydomain.com

Jim-90
Level 6

If server is running Linux - check to see it IPtables is not running.

If running make sure the data owner between source and backup target is set to the owner of the data and rman script (usaully oracle).  If its set to root then the backups will fail becasue root does not own the RMAN backup and transmission of backup data. 

Stumpr2
Level 6

First 3 things I check when a system level backup is good but the oracle fails:

1. name resolution for master/media/client forward AND reverse between all. I also place entries for each in each of the local hosts file.

2. the very first line of the bp.conf file must contain

SERVER = masterservername

3. timeouts as described above.

I make sure those 3 things are done befor retesting oracle backups.

konrad
Level 4
Partner Accredited

The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.

The solution was:

1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time

2. Do some TCP/IP tuning on master server (RHEL 5.2)

net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216 
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 30000
 
3.  The errors seems to appear rarely. But finely we had to replace a Master Server with a new (powerfull) server box to give more performace

After that the problems ware gone.