Solved: RMAN restore ends with error

dtrostli · ‎01-14-2014

Hello Everyone,

Has anyone come across an error while trying to restore using RMAN. The following detail is showing in the job log for the restore. The main issue is that the restore worked 2 out 12 tries. Any help is very much appreciated.

1/14/2014 1:59:57 PM - begin Restore

1/14/2014 1:59:59 PM - 1 images required

1/14/2014 1:59:59 PM - media MGU013 required

1/14/2014 2:00:03 PM - restoring image s5a9oc01_1389428357

1/14/2014 2:00:05 PM - Info bpbrm(pid=4816) s5a9octst01 is the host to restore to

1/14/2014 2:00:05 PM - Info bpbrm(pid=4816) telling media manager to start restore on client

1/14/2014 2:00:07 PM - Warning bptm(pid=312) failure logging message to client s5a9octst01 in log /usr/openv/netbackup/logs/user_ops/dbext/logs/17007.0.1389715202: system error occurred (130)

1/14/2014 2:00:08 PM - requesting resource MGU013

1/14/2014 2:00:08 PM - granted resource MGU013

1/14/2014 2:00:08 PM - granted resource HP.ULTRIUM5-SCSI.000

1/14/2014 2:00:08 PM - mounted

1/14/2014 2:00:08 PM - Info bpbrm(pid=2740) s5a9octst01 is the host to restore to

1/14/2014 2:00:08 PM - Info bptm(pid=312) MGU013

1/14/2014 2:00:08 PM - Info bpbrm(pid=2740) listening for client connection

1/14/2014 2:00:08 PM - Info bptm(pid=312) INF - Waiting for positioning of media id MGU013 on server s5a9nbu01 for reading.

1/14/2014 2:00:08 PM - positioning MGU013 to file 84

1/14/2014 2:00:09 PM - Error bpbrm(pid=2740) cannot become user oracle and group oinstall on s5a9octst01

1/14/2014 2:00:09 PM - Error bpbrm(pid=2740) listen for client protocol error - couldn't write necessary information on /usr/openv/netbackup/logs/user_ops/dbext/logs/17007.0.1389715202

1/14/2014 2:00:10 PM - Info bpbrm(pid=4816) child done, status 25

1/14/2014 2:00:10 PM - Info bpbrm(pid=4816) sending message to media manager: STOP RESTORE s5a9oc01_1389428357

1/14/2014 2:00:41 PM - positioned MGU013; position time: 00:00:33

1/14/2014 2:00:41 PM - begin reading

1/14/2014 2:01:06 PM - Info bpbrm(pid=4816) media manager for backup id s5a9oc01_1389428357 exited with status 150: termination requested by administrator

Kind Regards

David

dtrostli · ‎01-17-2014

Hello All,

The following technote solved the error with RMAN restore on a Linux RedHat Server. I had to follow this technote in order to adjust TCP/IP stack so to fix the status 2850 with the restore.

http://www.symantec.com/business/support/index?page=content&id=TECH166414&profileURL=https%3A%2F%2Fs...

Thanks everyone for all of the help on this

David

View solution in original post

RiaanBadenhorst · ‎01-14-2014

Error bpbrm(pid=2740) cannot become user oracle and group oinstall on s5a9octst01

Check this note. http://www.symantec.com/docs/TECH20582

dtrostli · ‎01-14-2014

Hello Riaan,

Customer calls the restore from the client. It does not have any script with su in it. It is a simple restore using rman commands. The client is running RedHat 6.3.

Regards

David

RiaanBadenhorst · ‎01-14-2014

Hi,

And are the executing with user oracle and group oinstall?

dtrostli · ‎01-14-2014

Yes, They are. I have just seen too many CLOSE_WAIT and TIME_WAIT connections. Would that affects NetBackup communication? I know Symantec has some recommended tuning for it but I can't find the information. Do you by any chance have anything on TCP tuning?

RiaanBadenhorst · ‎01-14-2014

The failure occurs to quickly too be network related (IMHO). Has this worked before? Maybe post the script /commans they're using so it can be checked.

EDIT, sorry saw you said it works 2/12. Its a bit weird then.

Does seem somehow related to comms, or maybe the comms processes. Check this note (older and for backup but same symptoms)

http://www.symantec.com/business/support/index?page=content&id=TECH77533

dtrostli · ‎01-14-2014

Hi,

This note is for older versions. This does not apply to 7.x. In 7.x bpcd and vnetd are standalone processes. I have seen in the past issues with too many CLOSE_WAIT and TIME_WAIT connections. But that was too long ago. Symantec has sent me some tunig for RedHat to tune the TCP layer. Hope that works otherwise customer will be in my throat.

RiaanBadenhorst · ‎01-14-2014

Good luck

dtrostli · ‎01-14-2014

Hi,

If you find anything else related to my problem, please let me know.

Thanks

Marianne · ‎01-14-2014

Firstly check that user oracle and group oinstall on s5a9octst01 have the same UID and GID as the source system.

Next, ensure all of these log folders exist for further troubleshooting:

On destination client: dbclient (with 777 permissions) and bpcd

On master: bprd (if folder does not exist, create it and restart NBU to enable this log)

On media server: bpbrm and bptm

Copy latest logs to reflect process name (e.g. dbclient.txt) and upload as File attachment.

One more thought:

How many NICs/IPs on destination client?
If more than one, is each IP address bound to a different hostname in its own hosts file?

Handy NetBackup Links

dtrostli · ‎01-15-2014

Hi Marianne,

I have asked customer to check the oracle ID and group ID is the same on all of the servers. I have also create the logs with the correct permission. In the meantime I don't see why having a second nic would impact restore. All the communication flows through the production network for the test restore despite the server having a backup network as well. The weird thing is that restore worked twice out of 10 tries and I see in the dbclient a cannot connect on socket error. I am suspecting more of a socket error than anything else.

David

Marianne · ‎01-15-2014

I don't see why having a second nic would impact restore.

If different NICs/IPs are not bound to separate hostnames (i.e all IPs using same hostname) it may cause the client to attempt comms on the wrong NIC. I have seen this many times.

dbclient log is a good start to see source and destination IP addresses when communicating with master and media server.

bprd on the master server will log client's IP address in connection attempt.

Handy NetBackup Links

dtrostli · ‎01-15-2014

I see. Here is a snippet from the bprd. The line in bold is the communication between the master and the client through the backup network.

16:33:24.111 [9644.3236] <2> Orb::init: caching EndpointSelectorFactory(Orb.cpp:840)

16:33:24.111 [9644.3236] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1481)

16:33:24.111 [9644.3236] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1490)

16:33:24.173 [9644.3236] <2> get_media_server_version: NBU version of s5a9nbu01 is 750500

16:33:24.173 [9644.3236] <2> start_mpx_group: media server s5a9nbu01 version from db = 0x000b73a4

16:33:24.173 [9644.3236] <2> start_mpx_group: /usr/openv/netbackup/bin/bpbrm bpbrm -restore -mpx_restore -S s5a9nbu02 -mt 2 -to 0 -mud 180 -mediasvr s5a9nbu01 -masterversion 750000

16:33:24.189 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded

16:33:24.189 [9644.3236] <2> logconnections: BPCD CONNECT FROM 10.21.33.21.55595 TO 10.21.33.20.1556 fd = 3752

16:33:24.204 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded

16:33:24.220 [9644.3236] <8> do_pbx_service: [vnet_connect.c:2108] via PBX VNETD CONNECT FROM 10.21.33.21.55596 TO 10.21.33.20.1556 fd = 3756

16:33:24.220 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:443] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa

16:33:24.423 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:460] ipc_string 60386

16:33:25.062 [9644.3236] <2> start_mpx_group: bpbrm stdout from bpcr = 3752

16:33:25.062 [9644.3236] <2> start_mpx_group: bpbrm stderr from bpcr = 3756

16:33:25.717 [9644.3236] <2> read_bpbrm_msg: (pid 0) PID of bpbrm = 8904

16:33:25.717 [9644.3236] <2> start_mpx_group: bpbrm pid = 8904

16:33:25.717 [9644.3236] <2> set_mpx_group_image_brm_pid: Setting brm_pid = 8904 for s5a9oc01_1389348313 restfiles pid 8480

16:33:25.717 [9644.3236] <2> next_image_using_first_blks: returning backup_id = s5a9oc01_1389348313, tape = MGU013, fileno = 21, firstblk = 0

16:33:25.717 [9644.3236] <2> start_image_restore: Sending listen msg to restfiles 8480 for image with backup time 1389348313

16:33:25.827 [9644.3236] <2> process_msgQ_msg: Setting portnum = 0 for s5a9oc01_1389348313

16:33:25.827 [9644.3236] <2> start_image_restore: restfiles pid 8480 brm pid 8904 s5a9oc01_1389348313

16:33:25.827 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded

16:33:25.827 [9644.3236] <2> logconnections: BPCD CONNECT FROM 10.21.35.21.55600 TO 10.21.35.49.1556 fd = 3768

16:33:25.842 [9644.3236] <2> vnet_pbxConnect: pbxConnectEx Succeeded

16:33:25.842 [9644.3236] <8> do_pbx_service: [vnet_connect.c:2108] via PBX VNETD CONNECT FROM 10.21.35.21.55601 TO 10.21.35.49.1556 fd = 3776

16:33:25.842 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:443] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa

16:33:25.889 [9644.3236] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:460] ipc_string /tmp/vnet-60236389378812572587000000000-WAUonv

16:33:26.497 [9644.3236] <16> add_msgs_to_progress_file: Can't become user oracle and groupp oinstall on s5a9octst01

16:33:26.497 [9644.3236] <2> start_indep_grp: active_mpx_groups = 1

16:33:26.497 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL

16:33:26.497 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL

16:34:29.180 [9644.3236] <2> read_bpbrm_msg: (pid 8904) EXIT s5a9oc01_1389348313 25

16:34:29.180 [9644.3236] <2> handle_image_status: s5a9oc01_1389348313 restfiles pid 8480 bpbrm pid 8904 status = 25

16:34:29.180 [9644.3236] <2> remove_completed_restore: unlink record with restfiles pid 8480

16:34:29.180 [9644.3236] <2> remove_completed_restore: no more active restores

16:34:29.180 [9644.3236] <2> done_with_bpbrm: bpbrm pid 8904 has no more images to process

16:34:29.180 [9644.3236] <2> send_bpbrm_msg: (pid 8904) TERMINATE

16:34:29.180 [9644.3236] <2> start_indep_grp: active_mpx_groups = 0

16:34:29.180 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL

16:34:29.180 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL

16:34:29.180 [9644.3236] <2> next_image_using_first_blks: returning NULL

16:34:29.180 [9644.3236] <2> run_restores: brm pid 0 has no active restores and none to start

16:34:29.180 [9644.3236] <2> start_indep_grp: active_mpx_groups = 0

16:34:29.180 [9644.3236] <2> start_indep_grp: p_active_restore_for_first_grp = NULL

16:34:29.180 [9644.3236] <2> start_indep_grp: p_first_mpx_group_for_restore = NULL

16:36:40.224 [9644.3236] <2> run_restores: After 120 seconds, no messages on queue

16:36:40.224 [9644.3236] <2> remove_msgQ: ?

16:36:40.224 [9644.3236] <2> remove_msgQ: remove succeeded

17:06:21.318 [8764.2436] <2> bprd: Refresh Az handle cache: 0

dtrostli · ‎01-15-2014

Marianne,

One more thing I forgot to mention, backup network is not bound to any hostname. So NetBackup should not use it at all.

Master Server = s5a9nbu02

Media Server = s5a9nbu01

Source Client = s5a9oc01

Dest Client = s5a9octst01

Restore starts on s5a9octst01 using rman block statement.

Run {

allocate channel...

PARMS='ENV=(NB_ORA_CLIENT=s5a9oc01)';

restore controlfile;

release channel....

}

Regards

Marianne · ‎01-15-2014

I thought you said All the communication flows through the production network for the test restore ?

Now comms is seen on the backup network for restore request?

Can you see how having multiple IPs can cause a problem?

It seems that it is bpbrm on the media server that is failing with status 25:

16:34:29.180 [9644.3236] <2> handle_image_status: s5a9oc01_1389348313 restfiles pid 8480 bpbrm pid 8904 status = 25

You will need to check ALL of the logs that I have listed above. In particular:

bprd on master (need to see full log or at least all of PID 9644)
bpbrm on media server s5a9nbu01
bpcd and dbclient logs on the destination client

Host names specified in restore script and bp.conf on client (and $HOME/bp.conf for oracle user if it exists) are important, as this will determine IP/hostname that is chosen for connection to the master.
At the same time, the same NIC/IP must be used for comms/restore between destination client and the media server that performed the initial backup.

Handy NetBackup Links

dtrostli · ‎01-15-2014

Sorry. This output is from an old log when communication was going through the backup network. I have set it to work on the production instead. So now everything is on production network. I am suspecting of a socket error because I had two success restore. There is no bp.conf in oracle home. What do you mean by same NIC/IP between dest client and media server? They all have different IP address on the same subnet.

Regards

Marianne · ‎01-15-2014

If client is initiating comms via its backup NIC to the master server, it will need to receive the data through the same backup NIC from the media server.

Showing us an old, irrelevant log with backup network comms does not help to solve your current restore failure.

You need to have a look at all of the relevant logs for the latest restore attempt.

Good luck!

Handy NetBackup Links

dtrostli · ‎01-15-2014

I understand it. I am going to perform another test today and I will post communication handshake.

dtrostli · ‎01-15-2014

Marianne,

The weird thing about this problem is that RMAN stays waiting for NetBackup to return something and It never happens so RMAN does not end its processing. The restore job on the console ends with 2850 error code. I have set up a lab environment much similar to the one having the problem and It worked perfectly.

dtrostli · ‎01-17-2014

Hello All,

The following technote solved the error with RMAN restore on a Linux RedHat Server. I had to follow this technote in order to adjust TCP/IP stack so to fix the status 2850 with the restore.

http://www.symantec.com/business/support/index?page=content&id=TECH166414&profileURL=https%3A%2F%2Fs...

Thanks everyone for all of the help on this

David

VOX

RMAN restore ends with error