Solved: Resolved??

konrad · ‎11-22-2010

Hi All

I have a strange situation. The RMAN backup job ends with this error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 11/21/2010 02:50:37
ORA-00604: error occurred at recursive SQL level 1
ORA-06502: PL/SQL: numeric or value error
ORA-19506: failed to create sequential file, name="<name>", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  client process aborted

but the NetBackup job has status of "0". This happens during the archive log backup in most cases.

The master, media and clients are on RHEL 5.2. The NetBackup is at 6.5.6. CLIENT_READ_TIMEOUT on a client is set to 5400. The backup job is lunched from a client via cron. The backup data are written to a BasicDisk.

In a dbclient log I can see this:

  02:32:06.773 [29038] <2> int_CloseImage: INF - Backup - closing <name>

02:47:12.014 [29038] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/29038.0.1286324889

It is like the image cannot be closed and the timeout appears, but why after 900s whet CLIENT_READ_TIMEOUT is 5400?

Do you have any idea what is wrong?

konrad · ‎03-08-2011

The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.

The solution was:

1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time

2. Do some TCP/IP tuning on master server (RHEL 5.2)

net.core.rmem_max = 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

net.ipv4.tcp_wmem = 4096 65536 16777216

net.ipv4.tcp_no_metrics_save = 1

net.core.netdev_max_backlog = 30000

3. The errors seems to appear rarely. But finely we had to replace a Master Server with a new (powerfull) server box to give more performace

After that the problems ware gone.

View solution in original post

Marianne · ‎11-22-2010

CLIENT_READ_TIMEOUT on the media server is used to determine the timeout.

The timeout error could be a 'red herring'. You will need to look at the entire dbclient log as well as the corresponding period in bprd log on the master server.

Handy NetBackup Links

konrad · ‎11-24-2010

The backup job was created - I found it. I thought that the backup was not even initiated.

But in nbpem it looks like this

 11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm

11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client>  POLICY <policy>  SCHED <sch_name> EXIT STATUS 25 (cannot connect on socket)

(..)

11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket)

I can see some errors in vnetd on a client, but I am not sure they are relevant.

 02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1557: vnet_read_service_file failed: 19 0x00000013

02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1548: service_name: inetd_bpcd

In a dbclient log I can see the following:

 02:44:52.595 [5658] <4> CreateNewImage: INF - backing up File:</DGGFDF_full_12358_1_7762349491>

02:44:53.491 [5658] <2> vnet_async_connect: vnet_vnetd.c.4228: connect: async CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724 fd = 24

02:44:53.491 [5658] <2> logconnections: BPRD CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724

02:44:53.491 [5658] <4> serverResponse: initial client_read_timeout = <5400>

02:44:53.491 [5658] <4> readCommMessages: Entering readCommMessages

02:45:03.491 [5658] <4> serverResponse: read comm file:<02:44:53 Initiating backup>

02:45:03.492 [5658] <4> readCommMessages: Entering readCommMessages

02:50:28.499 [5658] <4> serverResponse: read comm file:<02:50:19 INF - Server status = 50>

02:50:28.499 [5658] <16> serverResponse: ERR - server exited with status 50: client process aborted

02:50:28.499 [5658] <16> CreateNewImage: ERR - serverResponse() failed

Any ideas why is it so?

Marianne · ‎11-24-2010

CreateNewImage: ERR - serverResponse() failed

You need bprd on master to troubleshoot this. Look for incoming request from client's IP.

Check if master is able to resolve IP correctly to hostname that corresponds with name in Policy.

Handy NetBackup Links

konrad · ‎11-30-2010

Hi

Thank you for your sugestions.

I have looked into bprd log on the master and the request was successfully proccessed.

I fond a problem in nbpem, look at this:

 11/21/2010 02:44:56.876 [Application] NB 51216 nbpem 116 PID:16059 TID:4096248720 File ID:116 [jobid=-1 job_group_id=-1 client=<client> type=2 server= task=ID:0xf36d9e6c policy=<policy>] [Info] V-116-82 received start user backup request for client=<client>, policy=<policy>, schedule=<schedule>, stream #=0, client type=Oracle (4), parent jobid=-1
11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm
11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client> POLICY <policy>  SCHED <schedule> EXIT STATUS 25 (cannot connect on socket)
11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket)
11/21/2010 02:50:26.756 [Application] NB 51216 nbpem 116 PID:16059 TID:4108856208 File ID:116 [No context] [Warning] V-116-226 received completion status 50 for jobid=2242935 but know nothing of the job, will finalize the job

Do you have any idea what does it mean? The server is overloaded and cannot handle connections between nbpem and nbjm? The server statistics seems to be fine.

mpatt · ‎03-03-2011

Konrad,

Did you ever this resolved. I have atleast 10 clients which started failing with similar error. My timeout settings on the media server are

CLIENT_READ_TIMEOUT = 4800

CLIENT_CONNECT_TIMEOUT = 4800

SERVER_CONNECT_TIMEOUT = 900

Would be great if u can post the solution.

Yogesh9881 · ‎03-03-2011

I have atleast 10 clients whi --- is this database level backup or file level backup ?

make sure etc-host entry in master / client and database backup scripts.

192.168.100.15 myserver01 myserver01.mydomain.com

Jim-90 · ‎03-04-2011

If server is running Linux - check to see it IPtables is not running.

If running make sure the data owner between source and backup target is set to the owner of the data and rman script (usaully oracle). If its set to root then the backups will fail becasue root does not own the RMAN backup and transmission of backup data.

Stumpr2 · ‎03-05-2011

First 3 things I check when a system level backup is good but the oracle fails:

1. name resolution for master/media/client forward AND reverse between all. I also place entries for each in each of the local hosts file.

2. the very first line of the bp.conf file must contain

SERVER = masterservername

3. timeouts as described above.

I make sure those 3 things are done befor retesting oracle backups.

konrad · ‎03-08-2011