07-08-2015 12:55 PM
Hello,
I have been getting the following errors at random for both standard and Oracle (RMAN) backups on my NetBackup master server.
The server is version 7.5.0.7, running on RHEL 5.11.
The clients are version 7.5.0.4 (SAN Clients).
The media servers are version 7.5.0.4 (SAN Media).
The job log shows the following. Is there a good way to start digging for the cause of this issue?
07/05/2015 00:00:37 - Info nbjm (pid=9601) starting backup job (jobid=1635947) for client seprdu01, policy RMAN_SEPRDU01_BD, schedule Default-Application-Backup
07/05/2015 00:00:37 - Info nbjm (pid=9601) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1635947, request id:{64B718AC-22CA-11E5-8292-439B8376C6B8})
07/05/2015 00:00:37 - requesting resource sesegx2x-hcart3-robot-tld-1
07/05/2015 00:00:37 - requesting resource sesegx10.sagir.qc.NBU_CLIENT.MAXJOBS.seprdu01
07/05/2015 00:00:37 - requesting resource sesegx10.sagir.qc.NBU_POLICY.MAXJOBS.RMAN_SEPRDU01_BD
07/05/2015 00:00:37 - Waiting for scan drive stop HP.ULTRIUM4-SCSI.002, Media server: sesegx27
07/05/2015 00:00:38 - granted resource sesegx10.sagir.qc.NBU_CLIENT.MAXJOBS.seprdu01
07/05/2015 00:00:38 - granted resource sesegx10.sagir.qc.NBU_POLICY.MAXJOBS.RMAN_SEPRDU01_BD
07/05/2015 00:00:38 - granted resource S00830
07/05/2015 00:00:38 - granted resource HP.ULTRIUM4-SCSI.002
07/05/2015 00:00:38 - granted resource sesegx27-hcart3-robot-tld-1
07/05/2015 00:00:38 - granted resource TRANSPORT
07/05/2015 00:00:38 - estimated 0 kbytes needed
07/05/2015 00:00:38 - Info nbjm (pid=9601) started backup (backupid=seprdu01_1436068838) job for client seprdu01, policy RMAN_SEPRDU01_BD, schedule Default-Application-Backup on storage unit sesegx27-hcart3-robot-tld-1
07/05/2015 00:00:39 - started process bpbrm (pid=774)
07/05/2015 00:03:34 - Error bpbrm (pid=774) bpcd on seprdu01 exited with status 21: socket open failed
07/05/2015 00:03:34 - Info bpbkar (pid=0) done. status: 21: socket open failed
07/05/2015 00:03:34 - end writing socket open failed (21)
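One thing worth reading off the job details is how long bpbrm waited before giving up. The sketch below is only an illustration: it takes two entries copied from the log above and computes the gap between them. The helper names (entry_time, seconds_between) are made up for this example. A wait of minutes rather than an immediate failure may point toward something timing out, but that is an interpretation, not something the log states.

```python
from datetime import datetime

# Two entries copied verbatim from the job details above.
LOG_LINES = [
    "07/05/2015 00:00:39 - started process bpbrm (pid=774)",
    "07/05/2015 00:03:34 - Error bpbrm (pid=774) bpcd on seprdu01 "
    "exited with status 21: socket open failed",
]


def entry_time(line):
    """Parse the leading 'MM/DD/YYYY HH:MM:SS' timestamp of a job detail entry."""
    return datetime.strptime(line[:19], "%m/%d/%Y %H:%M:%S")


def seconds_between(first, second):
    """Whole seconds elapsed between two job detail entries."""
    return int((entry_time(second) - entry_time(first)).total_seconds())


gap = seconds_between(LOG_LINES[0], LOG_LINES[1])
print(f"bpbrm reported status 21 after {gap} seconds")
```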
07-08-2015 11:35 PM
Is this only happening when you are using SAN transport? You could try LAN backups to see whether the issue still occurs.
Log in to the client seprdu01 and run:
netstat -a | grep bpcd (I am assuming UNIX here; on Windows, replace grep with findstr)
You should get a result like this:
tcp 0 0 *:bpcd *:* LISTEN
If bpcd is listening, enable the bpcd log on client seprdu01, increase the log verbosity to maximum, and reproduce the problem. Then post the bpcd log as an attachment, along with the Job Detail Status.
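The netstat check confirms that bpcd is listening locally on the client. A complementary check, run from the media server, is whether a TCP connection to the client can be opened at all. The following is a minimal sketch, not a NetBackup tool: port_open is a hypothetical helper, and 13782 is assumed to be the port bpcd listens on (depending on the configuration, connections may instead go through PBX on 1556 or vnetd on 13724).

```python
import socket

# Assumption: bpcd's registered port. Depending on the configuration,
# connections may go through PBX (1556) or vnetd (13724) instead.
BPCD_PORT = 13782


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call from the media server (not executed here):
#   port_open("seprdu01", BPCD_PORT)
```

Note that name resolution happens inside the connection attempt, so a False result can also mean the hostname could not be resolved in time.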
07-09-2015 12:33 AM
bpcd on seprdu01 exited with status 21: socket open failed
I agree - we need to see bpcd log on client seprdu01 after a failure.
I am also curious why there is a reference to the bpbkar process on the client when this is an Oracle backup...
bpcd should tell us which process it is trying to spawn.
07-09-2015 10:09 AM
It is difficult to say whether this happens only with SAN transport, because most clients are SAN clients. I cannot move to LAN even for a test: the backups are huge and we do not have a dedicated backup network.
I have enabled the bpcd log on the client and will post it as soon as the error surfaces again.
07-10-2015 02:34 AM
I have increased CLIENT_CONNECT_TIMEOUT for similar issues on Oracle backups over LAN. I usually increase CLIENT_READ_TIMEOUT to at least the same value as CLIENT_CONNECT_TIMEOUT.
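To check the rule of thumb above (read timeout at least as large as connect timeout), something like the following could be pointed at the contents of bp.conf, which normally lives at /usr/openv/netbackup/bp.conf on UNIX servers. This is a sketch under assumptions: the function names are hypothetical, and the values in SAMPLE are purely illustrative, not a recommendation and not taken from this environment.

```python
# Illustrative bp.conf-style text; the numbers are made up for the example.
SAMPLE = """\
VERBOSE = 5
CLIENT_CONNECT_TIMEOUT = 900
CLIENT_READ_TIMEOUT = 900
"""


def read_timeouts(bp_conf_text):
    """Collect the two client timeout entries from 'KEY = value' text."""
    wanted = {"CLIENT_CONNECT_TIMEOUT", "CLIENT_READ_TIMEOUT"}
    found = {}
    for line in bp_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() in wanted:
            found[key.strip()] = int(value.strip())
    return found


def read_covers_connect(timeouts):
    """True when both entries are set and READ is at least CONNECT."""
    connect = timeouts.get("CLIENT_CONNECT_TIMEOUT")
    read = timeouts.get("CLIENT_READ_TIMEOUT")
    return connect is not None and read is not None and read >= connect


print(read_covers_connect(read_timeouts(SAMPLE)))
```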
07-13-2015 11:21 AM
The problem has not happened again since I added hosts entries to bypass DNS. I think the problem is related to slow DNS responses, and I will investigate in this direction.
I guess that increasing CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT would also have masked the problem.
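For anyone wanting to test the slow-DNS theory before (or after) adding hosts entries, a rough way is to time forward and reverse lookups from the master, media server and client. This is only a sketch: time_lookup and check_host are hypothetical helpers, and the lookups go through the system resolver, so once a hosts entry exists the timing reflects that entry rather than DNS.

```python
import socket
import time


def time_lookup(resolve, name):
    """Call resolve(name) once; return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = resolve(name)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result = exc
    return time.perf_counter() - start, result


def check_host(name):
    """Print how long the forward and reverse lookups for name take."""
    fwd_secs, addr = time_lookup(socket.gethostbyname, name)
    print(f"forward {name}: {fwd_secs:.3f}s -> {addr!r}")
    if isinstance(addr, str):
        rev_secs, rev = time_lookup(socket.gethostbyaddr, addr)
        print(f"reverse {addr}: {rev_secs:.3f}s -> {rev!r}")


# Example call (not executed here):
#   check_host("seprdu01")
```

Running it several times over a day would show whether lookups are occasionally slow, which would fit the random nature of the failures.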