
Randomly getting status: 21: socket open failed for backups

Hello,

 

I have been randomly getting the following errors for both standard and Oracle (RMAN) backups on my NetBackup master server.


The master server is at version 7.5.0.7, running on RHEL 5.11.

The clients are at version 7.5.0.4 (SAN clients).

The media servers are at version 7.5.0.4 (SAN media).

 

The logs show the following; I would like to know if there is any way to start digging into the cause of this issue:

 

07/05/2015 00:00:37 - Info nbjm (pid=9601) starting backup job (jobid=1635947) for client seprdu01, policy RMAN_SEPRDU01_BD, schedule Default-Application-Backup
07/05/2015 00:00:37 - Info nbjm (pid=9601) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1635947, request id:{64B718AC-22CA-11E5-8292-439B8376C6B8})
07/05/2015 00:00:37 - requesting resource sesegx2x-hcart3-robot-tld-1
07/05/2015 00:00:37 - requesting resource sesegx10.sagir.qc.NBU_CLIENT.MAXJOBS.seprdu01
07/05/2015 00:00:37 - requesting resource sesegx10.sagir.qc.NBU_POLICY.MAXJOBS.RMAN_SEPRDU01_BD
07/05/2015 00:00:37 - Waiting for scan drive stop HP.ULTRIUM4-SCSI.002, Media server: sesegx27
07/05/2015 00:00:38 - granted resource  sesegx10.sagir.qc.NBU_CLIENT.MAXJOBS.seprdu01
07/05/2015 00:00:38 - granted resource  sesegx10.sagir.qc.NBU_POLICY.MAXJOBS.RMAN_SEPRDU01_BD
07/05/2015 00:00:38 - granted resource  S00830
07/05/2015 00:00:38 - granted resource  HP.ULTRIUM4-SCSI.002
07/05/2015 00:00:38 - granted resource  sesegx27-hcart3-robot-tld-1
07/05/2015 00:00:38 - granted resource  TRANSPORT
07/05/2015 00:00:38 - estimated 0 kbytes needed
07/05/2015 00:00:38 - Info nbjm (pid=9601) started backup (backupid=seprdu01_1436068838) job for client seprdu01, policy RMAN_SEPRDU01_BD, schedule Default-Application-Backup on storage unit sesegx27-hcart3-robot-tld-1
07/05/2015 00:00:39 - started process bpbrm (pid=774)
07/05/2015 00:03:34 - Error bpbrm (pid=774) bpcd on seprdu01 exited with status 21: socket open failed
07/05/2015 00:03:34 - Info bpbkar (pid=0) done. status: 21: socket open failed
07/05/2015 00:03:34 - end writing
socket open failed  (21)
5 Replies
Is this only happening when you are using SAN transport? You could try LAN backups to see if the issue still occurs.

 

Log in to the client seprdu01 and run:

netstat -a | grep bpcd   (I am assuming UNIX here; if Windows, replace grep with findstr)

You should get a result like this:
tcp        0      0 *:bpcd                  *:*                     LISTEN

 

If bpcd is listening, you will need to enable the bpcd log on client seprdu01, increase the log verbosity to maximum, and reproduce the problem; then post the bpcd log as an attachment along with the job's detailed status.
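For reference, on a default UNIX install bpcd only starts writing its log once the directory /usr/openv/netbackup/logs/bpcd has been created by hand; after that, verbosity is raised in the client's bp.conf. A minimal sketch, assuming the default install path (adjust for your environment):

```
# /usr/openv/netbackup/bp.conf on client seprdu01
# Raise legacy logging verbosity to maximum for troubleshooting (revert afterwards)
VERBOSE = 5
```

Remember to turn the verbosity back down once the failing job has been captured, or the log can grow quickly.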

 

 bpcd on seprdu01 exited with status 21: socket open failed

 

I agree - we need to see bpcd log on client seprdu01 after a failure.

I am also curious why there is a reference to the bpbkar process on the client when this is an Oracle backup...
bpcd should tell us which process it is trying to spawn.

Difficult to say if this happens only for SAN transport, because most clients are SAN clients and I cannot move them to LAN even for a test: these are huge backups and we do not have a dedicated backup network.

I have enabled the bpcd log on the client and will post it as soon as the error surfaces again.

I have increased CLIENT_CONNECT_TIMEOUT for similar issues on Oracle backups over LAN; I usually increase CLIENT_READ_TIMEOUT to at least the same value as CLIENT_CONNECT_TIMEOUT.
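For what it's worth, these are plain bp.conf entries on the master/media server. A sketch, assuming the usual defaults of 300 seconds and using 600 purely as an illustrative value:

```
# /usr/openv/netbackup/bp.conf on the master/media server (values in seconds)
CLIENT_CONNECT_TIMEOUT = 600
CLIENT_READ_TIMEOUT = 600
```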

 

The standard questions: have you checked 1) what has changed, 2) the manual, and 3) whether there are any tech notes or VOX posts regarding the issue?

The problem has not happened again since I added hosts entries to bypass DNS. I think the problem is related to slow DNS responses; I will investigate in that direction.
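In case it helps anyone else, the workaround was a plain /etc/hosts entry on the master server. A sketch with a placeholder address (the IP and FQDN below are illustrative, not from my environment):

```
# /etc/hosts on the master server -- resolves the client locally, bypassing DNS
10.0.0.21   seprdu01.sagir.qc   seprdu01
```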

I guess that increasing CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT would also have masked the problem.