cancel
Showing results for 
Search instead for 
Did you mean: 

RMAN restore ends with error

dtrostli
Level 4
Partner Accredited

Hello Everyone,

Has anyone come across an error while trying to restore using RMAN. The following detail is showing in the job log for the restore. The main issue is that the restore worked 2 out 12 tries. Any help is very much appreciated.

 

1/14/2014 1:59:57 PM - begin Restore

1/14/2014 1:59:59 PM - 1 images required

1/14/2014 1:59:59 PM - media MGU013 required

1/14/2014 2:00:03 PM - restoring image s5a9oc01_1389428357

1/14/2014 2:00:05 PM - Info bpbrm(pid=4816) s5a9octst01 is the host to restore to      

1/14/2014 2:00:05 PM - Info bpbrm(pid=4816) telling media manager to start restore on client    

1/14/2014 2:00:07 PM - Warning bptm(pid=312) failure logging message to client s5a9octst01 in log /usr/openv/netbackup/logs/user_ops/dbext/logs/17007.0.1389715202:  system error occurred (130)

1/14/2014 2:00:08 PM - requesting resource MGU013

1/14/2014 2:00:08 PM - granted resource MGU013

1/14/2014 2:00:08 PM - granted resource HP.ULTRIUM5-SCSI.000

1/14/2014 2:00:08 PM - mounted

1/14/2014 2:00:08 PM - Info bpbrm(pid=2740) s5a9octst01 is the host to restore to     

1/14/2014 2:00:08 PM - Info bptm(pid=312) MGU013           

1/14/2014 2:00:08 PM - Info bpbrm(pid=2740) listening for client connection        

1/14/2014 2:00:08 PM - Info bptm(pid=312) INF - Waiting for positioning of media id MGU013 on server s5a9nbu01 for reading.

1/14/2014 2:00:08 PM - positioning MGU013 to file 84

1/14/2014 2:00:09 PM - Error bpbrm(pid=2740) cannot become user oracle and group oinstall on s5a9octst01   

1/14/2014 2:00:09 PM - Error bpbrm(pid=2740) listen for client protocol error - couldn't write necessary information on /usr/openv/netbackup/logs/user_ops/dbext/logs/17007.0.1389715202

1/14/2014 2:00:10 PM - Info bpbrm(pid=4816) child done, status 25        

1/14/2014 2:00:10 PM - Info bpbrm(pid=4816) sending message to media manager: STOP RESTORE s5a9oc01_1389428357    

1/14/2014 2:00:41 PM - positioned MGU013; position time: 00:00:33

1/14/2014 2:00:41 PM - begin reading

1/14/2014 2:01:06 PM - Info bpbrm(pid=4816) media manager for backup id s5a9oc01_1389428357 exited with status 150: termination requested by administrator

 

Kind Regards


David

1 ACCEPTED SOLUTION

Accepted Solutions

dtrostli
Level 4
Partner Accredited

Hello All,

The following technote solved the error with RMAN restore on a Linux RedHat Server. I had to follow this technote in order to adjust TCP/IP stack so to fix the status 2850 with the restore.

http://www.symantec.com/business/support/index?page=content&id=TECH166414&profileURL=https%3A%2F%2Fs...

Thanks everyone for all of the help on this

David

View solution in original post

19 REPLIES 19

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Error bpbrm(pid=2740) cannot become user oracle and group oinstall on s5a9octst01

 

Check this note. http://www.symantec.com/docs/TECH20582

dtrostli
Level 4
Partner Accredited

Hello Riaan,

 

Customer calls the restore from the client. It does not have any script with su in it. It is a simple restore using rman commands. The client is running RedHat 6.3.


Regards


David

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Hi,

 

And are the executing with user oracle and group oinstall?

dtrostli
Level 4
Partner Accredited

Yes, They are. I have just seen too many CLOSE_WAIT and TIME_WAIT connections. Would that affects NetBackup communication? I know Symantec has some recommended tuning for it but I can't find the information. Do you by any chance have anything on TCP tuning?

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

The failure occurs to quickly too be network related (IMHO). Has this worked before? Maybe post the script /commans they're using so it can be checked.

EDIT, sorry saw you said it works 2/12. Its a bit weird then.

 

Does seem somehow related to comms, or maybe the comms processes. Check this note (older and for backup but same symptoms)

 

http://www.symantec.com/business/support/index?page=content&id=TECH77533

dtrostli
Level 4
Partner Accredited

Hi,

 

This note is for older versions. This does not apply to 7.x. In 7.x bpcd and vnetd are standalone processes. I have seen in the past issues with too many CLOSE_WAIT and TIME_WAIT connections. But that was too long ago. Symantec has sent me some tunig for RedHat to tune the TCP layer. Hope that works otherwise customer will be in my throat. 

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

Good luck

dtrostli
Level 4
Partner Accredited

Hi,

If you find anything else related to my problem, please let me know.

Thanks

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

Firstly check that user oracle and group oinstall on s5a9octst01 have the same UID and GID as the source system.

Next, ensure all of these log folders exist for further troubleshooting:

On destination client: dbclient (with 777 permissions) and bpcd

On master: bprd (if folder does not exist, create it and restart NBU to enable this log)

On media server: bpbrm and bptm

Copy latest logs to reflect process name (e.g. dbclient.txt) and upload as File attachment.

 

One more thought:

How many NICs/IPs on destination client?
If more than one, is each IP address bound to a different hostname in its own hosts file?

 

dtrostli
Level 4
Partner Accredited

Hi Marianne,

I have asked customer to check the oracle ID and group ID is the same on all of the servers. I have also create the logs with the correct permission. In the meantime I don't see why having a second nic would impact restore. All the communication flows through the production network for the test restore despite the server having a backup network as well. The weird thing is that restore worked twice out of 10 tries and I see in the dbclient a cannot connect on socket error. I am suspecting more of a socket error than anything else.

David

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

 I don't see why having a second nic would impact restore.

If different NICs/IPs are not bound to separate hostnames (i.e all IPs using same hostname) it may cause the client to attempt comms on the wrong NIC. I have seen this many times.

dbclient log is a good start to see source and destination IP addresses when communicating with master and media server.

bprd on the master server will log client's IP address in connection attempt.
 

dtrostli
Level 4
Partner Accredited

I see. Here is a snippet from the bprd. The line in bold is the communication between the master and the client through the backup network.

 

16:33:24.111 [9644.3236] <2> Orb::init: caching EndpointSelectorFactory(Orb.cpp:840)
16:33:24.111 [9644.3236] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1481)
16:33:24.111 [9644.3236] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1490)
16:33:24.173 [9644.3236] <2> get_media_server_version: NBU version of s5a9nbu01 is 750500
16:33:24.173 [9644.3236] <2> start_mpx_group: media server s5a9nbu01 version from db = 0x000b73a4 
16:33:24.173 [9644.3236] <2> start_mpx_group: /usr/openv/netbackup/bin/bpbrm bpbrm -restore -mpx_restore -S s5a9nbu02 -mt 2 -to 0 -mud 180 -mediasvr s5a9nbu01 -masterversion 750000
16:33:24.189 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded
16:33:24.189 [9644.3236] <2> logconnections: BPCD CONNECT FROM 10.21.33.21.55595 TO 10.21.33.20.1556 fd = 3752
16:33:24.204 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded
16:33:24.220 [9644.3236] <8> do_pbx_service: [vnet_connect.c:2108] via PBX VNETD CONNECT FROM 10.21.33.21.55596 TO 10.21.33.20.1556 fd = 3756
16:33:24.220 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:443] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa
16:33:24.423 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:460] ipc_string 60386
16:33:25.062 [9644.3236] <2> start_mpx_group: bpbrm stdout from bpcr = 3752
16:33:25.062 [9644.3236] <2> start_mpx_group: bpbrm stderr from bpcr = 3756
16:33:25.717 [9644.3236] <2> read_bpbrm_msg: (pid 0) PID of bpbrm = 8904
 
16:33:25.717 [9644.3236] <2> start_mpx_group: bpbrm pid = 8904
16:33:25.717 [9644.3236] <2> set_mpx_group_image_brm_pid: Setting brm_pid = 8904 for s5a9oc01_1389348313 restfiles pid 8480
16:33:25.717 [9644.3236] <2> next_image_using_first_blks: returning backup_id = s5a9oc01_1389348313, tape = MGU013, fileno = 21, firstblk = 0
16:33:25.717 [9644.3236] <2> start_image_restore: Sending listen msg to restfiles 8480 for image with backup time 1389348313
16:33:25.827 [9644.3236] <2> process_msgQ_msg: Setting portnum = 0 for s5a9oc01_1389348313
16:33:25.827 [9644.3236] <2> start_image_restore: restfiles pid 8480 brm pid 8904 s5a9oc01_1389348313
16:33:25.827 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded
16:33:25.827 [9644.3236] <2> logconnections: BPCD CONNECT FROM 10.21.35.21.55600 TO 10.21.35.49.1556 fd = 3768
16:33:25.842 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded
16:33:25.842 [9644.3236] <8> do_pbx_service: [vnet_connect.c:2108] via PBX VNETD CONNECT FROM 10.21.35.21.55601 TO 10.21.35.49.1556 fd = 3776
16:33:25.842 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:443] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa
16:33:25.889 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:460] ipc_string /tmp/vnet-60236389378812572587000000000-WAUonv
16:33:26.497 [9644.3236] <16> add_msgs_to_progress_file: Can't become user oracle and groupp oinstall on s5a9octst01
16:33:26.497 [9644.3236] <2> start_indep_grp: active_mpx_groups = 1
16:33:26.497 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL
16:33:26.497 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL
16:34:29.180 [9644.3236] <2> read_bpbrm_msg: (pid 8904) EXIT s5a9oc01_1389348313 25
16:34:29.180 [9644.3236] <2> handle_image_status: s5a9oc01_1389348313 restfiles pid 8480 bpbrm pid 8904 status = 25
16:34:29.180 [9644.3236] <2> remove_completed_restore: unlink record with restfiles pid 8480
16:34:29.180 [9644.3236] <2> remove_completed_restore: no more active restores
16:34:29.180 [9644.3236] <2> done_with_bpbrm: bpbrm pid 8904 has no more images to process
16:34:29.180 [9644.3236] <2> send_bpbrm_msg: (pid 8904) TERMINATE
16:34:29.180 [9644.3236] <2> start_indep_grp: active_mpx_groups = 0
16:34:29.180 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL
16:34:29.180 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL
16:34:29.180 [9644.3236] <2> next_image_using_first_blks: returning NULL
16:34:29.180 [9644.3236] <2> run_restores: brm pid 0 has no active restores and none to start
16:34:29.180 [9644.3236] <2> start_indep_grp: active_mpx_groups = 0
16:34:29.180 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL
16:34:29.180 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL
16:36:40.224 [9644.3236] <2> run_restores: After 120 seconds, no messages on queue
16:36:40.224 [9644.3236] <2> remove_msgQ: ?
16:36:40.224 [9644.3236] <2> remove_msgQ: remove succeeded
17:06:21.318 [8764.2436] <2> bprd: Refresh Az handle cache: 0
 

dtrostli
Level 4
Partner Accredited

Marianne,

One more thing I forgot to mention, backup network is not bound to any hostname. So NetBackup should not use it at all.

Master Server = s5a9nbu02

Media Server = s5a9nbu01

Source Client = s5a9oc01

Dest Client = s5a9octst01

Restore starts on s5a9octst01 using rman block statement.

Run {

allocate channel...

PARMS='ENV=(NB_ORA_CLIENT=s5a9oc01)';

restore controlfile;

release channel....

}

 

Regards

 

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

I thought you said  All the communication flows through the production network for the test restore  ?

Now comms is seen on the backup network for restore request?

Can you see how having multiple IPs can cause a problem?

It seems that it is bpbrm on the media server that is failing with status 25:

16:34:29.180 [9644.3236] <2> handle_image_status: s5a9oc01_1389348313 restfiles pid 8480 bpbrm pid 8904 status = 25

You will need to check ALL of the logs that I have listed above. In particular:

bprd on master (need to see full log or at least all of PID 9644)
bpbrm on media server 
s5a9nbu01
bpcd and dbclient logs on the destination client

Host names specified in restore script and bp.conf on client (and $HOME/bp.conf for oracle user if it exists) are important, as this will determine IP/hostname that is chosen for connection to the master.
At the same time, the same NIC/IP must be used for comms/restore between destination client and the media server that performed the initial backup.

 

dtrostli
Level 4
Partner Accredited

Sorry. This output is from an old log when communication was going through the backup network. I have set it to work on the production instead. So now everything is on production network. I am suspecting of a socket error because I had two success restore. There is no bp.conf in oracle home. What do you mean by same NIC/IP between dest client and media server? They all have different IP address on the same subnet.


Regards

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

If client is initiating comms via its backup NIC to the master server, it will need to receive the data through the same backup NIC from the media server.

Showing us an old, irrelevant log with backup network comms does not help to solve your current restore failure.

You need to have a look at all of the relevant logs for the latest restore attempt.

Good luck!

dtrostli
Level 4
Partner Accredited

I understand it. I am going to perform another test today and I will post communication handshake.

dtrostli
Level 4
Partner Accredited

Marianne,

The weird thing about this problem is that RMAN stays waiting for NetBackup to return something and It never happens so RMAN does not end its processing. The restore job on the console ends with 2850 error code. I have set up a lab environment much similar to the one having the problem and It worked perfectly.

dtrostli
Level 4
Partner Accredited

Hello All,

The following technote solved the error with RMAN restore on a Linux RedHat Server. I had to follow this technote in order to adjust TCP/IP stack so to fix the status 2850 with the restore.

http://www.symantec.com/business/support/index?page=content&id=TECH166414&profileURL=https%3A%2F%2Fs...

Thanks everyone for all of the help on this

David