11-22-2010 07:11 AM
Hi All
I have a strange situation. The RMAN backup job ends with this error:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 11/21/2010 02:50:37
ORA-00604: error occurred at recursive SQL level 1
ORA-06502: PL/SQL: numeric or value error
ORA-19506: failed to create sequential file, name="<name>", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: client process aborted
but the NetBackup job has status of "0". This happens during the archive log backup in most cases.
The master, media and clients are on RHEL 5.2. The NetBackup is at 6.5.6. CLIENT_READ_TIMEOUT on a client is set to 5400. The backup job is lunched from a client via cron. The backup data are written to a BasicDisk.
In a dbclient log I can see this:
02:32:06.773 [29038] <2> int_CloseImage: INF - Backup - closing <name> 02:47:12.014 [29038] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/29038.0.1286324889
It is like the image cannot be closed and the timeout appears, but why after 900s whet CLIENT_READ_TIMEOUT is 5400?
Do you have any idea what is wrong?
Solved! Go to Solution.
03-08-2011 04:13 AM
The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.
The solution was:
1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time
2. Do some TCP/IP tuning on master server (RHEL 5.2)
After that the problems ware gone.
11-22-2010 11:37 AM
CLIENT_READ_TIMEOUT on the media server is used to determine the timeout.
The timeout error could be a 'red herring'. You will need to look at the entire dbclient log as well as the corresponding period in bprd log on the master server.
11-24-2010 02:00 PM
The backup job was created - I found it. I thought that the backup was not even initiated.
But in nbpem it looks like this
11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm
11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client> POLICY <policy> SCHED <sch_name> EXIT STATUS 25 (cannot connect on socket)
(..)
11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket)
02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1557: vnet_read_service_file failed: 19 0x00000013 02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1548: service_name: inetd_bpcd
02:44:52.595 [5658] <4> CreateNewImage: INF - backing up File:</DGGFDF_full_12358_1_7762349491> 02:44:53.491 [5658] <2> vnet_async_connect: vnet_vnetd.c.4228: connect: async CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724 fd = 24 02:44:53.491 [5658] <2> logconnections: BPRD CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724 02:44:53.491 [5658] <4> serverResponse: initial client_read_timeout = <5400> 02:44:53.491 [5658] <4> readCommMessages: Entering readCommMessages 02:45:03.491 [5658] <4> serverResponse: read comm file:<02:44:53 Initiating backup> 02:45:03.492 [5658] <4> readCommMessages: Entering readCommMessages 02:50:28.499 [5658] <4> serverResponse: read comm file:<02:50:19 INF - Server status = 50> 02:50:28.499 [5658] <16> serverResponse: ERR - server exited with status 50: client process aborted 02:50:28.499 [5658] <16> CreateNewImage: ERR - serverResponse() failed
11-24-2010 08:38 PM
CreateNewImage: ERR - serverResponse() failed
You need bprd on master to troubleshoot this. Look for incoming request from client's IP.
Check if master is able to resolve IP correctly to hostname that corresponds with name in Policy.
11-30-2010 01:05 AM
Hi
Thank you for your sugestions.
I have looked into bprd log on the master and the request was successfully proccessed.
I fond a problem in nbpem, look at this:
11/21/2010 02:44:56.876 [Application] NB 51216 nbpem 116 PID:16059 TID:4096248720 File ID:116 [jobid=-1 job_group_id=-1 client=<client> type=2 server= task=ID:0xf36d9e6c policy=<policy>] [Info] V-116-82 received start user backup request for client=<client>, policy=<policy>, schedule=<schedule>, stream #=0, client type=Oracle (4), parent jobid=-1 11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm 11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client> POLICY <policy> SCHED <schedule> EXIT STATUS 25 (cannot connect on socket) 11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket) 11/21/2010 02:50:26.756 [Application] NB 51216 nbpem 116 PID:16059 TID:4108856208 File ID:116 [No context] [Warning] V-116-226 received completion status 50 for jobid=2242935 but know nothing of the job, will finalize the job
Do you have any idea what does it mean? The server is overloaded and cannot handle connections between nbpem and nbjm? The server statistics seems to be fine.
03-03-2011 01:01 PM
Konrad,
Did you ever this resolved. I have atleast 10 clients which started failing with similar error. My timeout settings on the media server are
03-03-2011 11:03 PM
I have atleast 10 clients whi --- is this database level backup or file level backup ?
make sure etc-host entry in master / client and database backup scripts.
192.168.100.15 myserver01 myserver01.mydomain.com
03-04-2011 11:10 PM
If server is running Linux - check to see it IPtables is not running.
If running make sure the data owner between source and backup target is set to the owner of the data and rman script (usaully oracle). If its set to root then the backups will fail becasue root does not own the RMAN backup and transmission of backup data.
03-05-2011 10:58 AM
First 3 things I check when a system level backup is good but the oracle fails:
1. name resolution for master/media/client forward AND reverse between all. I also place entries for each in each of the local hosts file.
2. the very first line of the bp.conf file must contain
SERVER = masterservername
3. timeouts as described above.
I make sure those 3 things are done befor retesting oracle backups.
03-08-2011 04:13 AM
The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.
The solution was:
1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time
2. Do some TCP/IP tuning on master server (RHEL 5.2)
After that the problems ware gone.