Solved: error code 23 after 10:06 minutes execution

madmax00 · ‎11-03-2014

Hi all,

I have a policy that always fails with error code 23 after exactly 10 minutes and 6 seconds, whether it's a full or an incremental backup. I think it must be a timeout setting, but I can't find which one. I tried changing CLIENT_READ_TIMEOUT on the client from 300 to 3600, but the issue is still the same... any ideas?

Thanks

Michael_G_Ander · ‎11-03-2014

Think you have some kind of connectivity issue

Suggest you run the tests here:

https://www-secure.symantec.com/connect/blogs/general-connectivity-troubleshooting

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Michael_G_Ander · ‎11-03-2014

Could be something on a specific file system/mount point, assuming you are doing a file backup

Please post the full Job Details from the failing backup

madmax00 · ‎11-03-2014

Here it is:

11/03/2014 16:25:38 - Info nbjm (pid=10476) starting backup job (jobid=1737730) for client preside10, policy SIG_TX_PRESIDE10, schedule Full_SIG_TX_PRESIDE10
11/03/2014 16:25:38 - Info nbjm (pid=10476) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1737730, request id:{AA35F3E8-636D-11E4-AB5E-8888AE45A246})
11/03/2014 16:25:38 - requesting resource STU_MSDP_FLO7
11/03/2014 16:25:38 - requesting resource backup00.NBU_CLIENT.MAXJOBS.preside10
11/03/2014 16:25:38 - requesting resource backup00.NBU_POLICY.MAXJOBS.SIG_TX_PRESIDE10
11/03/2014 16:25:38 - granted resource backup00.NBU_CLIENT.MAXJOBS.preside10
11/03/2014 16:25:38 - granted resource backup00.NBU_POLICY.MAXJOBS.SIG_TX_PRESIDE10
11/03/2014 16:25:38 - granted resource MediaID=@aaab7;DiskVolume=PureDiskVolume;DiskPool=pool_msdp_flo7;Path=PureDiskVolume;StorageServer=bckflo07;MediaServer=bckflo07
11/03/2014 16:25:38 - granted resource STU_MSDP_FLO7
11/03/2014 16:25:38 - estimated 12090163 kbytes needed
11/03/2014 16:25:38 - Info nbjm (pid=10476) started backup (backupid=preside10_1415028338) job for client preside10, policy SIG_TX_PRESIDE10, schedule Full_SIG_TX_PRESIDE10 on storage unit STU_MSDP_FLO7
11/03/2014 16:25:40 - started process bpbrm (pid=1732)
11/03/2014 16:30:41 - Error bpbrm (pid=1732) bpcd on preside10 exited with status 23: socket read failed
11/03/2014 16:35:42 - Error bpbrm (pid=1732) cannot send mail because BPCD on preside10 exited with status 23: socket read failed
11/03/2014 16:35:42 - Info bpbkar (pid=0) done. status: 23: socket read failed
11/03/2014 16:35:42 - end writing
socket read failed (23)

Nicolai · ‎11-03-2014

I think I do. You have a directory that the bpbkar process hangs on, likely a directory with millions of small files.

The good news is you can debug it.

http://www.symantec.com/docs/TECH31513

http://www.symantec.com/docs/TECH29475

Follow the instructions in TECH31513 and run a new backup. The bpbkar log file will then show the files and folders it processes, and more often than not it is the last 10 or 20 lines that reveal the culprit. Re-run the backup to double-check.

If you are in doubt, attach the debug log as a file to a post. Do not post the debug text itself - it will be caught by the anti spam robot.

Michael_G_Ander · ‎11-03-2014

Think you have some kind of connectivity issue

Suggest you run the tests here:

https://www-secure.symantec.com/connect/blogs/general-connectivity-troubleshooting

madmax00 · ‎11-03-2014

Hi, I don't have the "Follow NFS" setting, so I don't think TECH29475 applies to this case...

Thanks

madmax00 · ‎11-03-2014

The bptestbpcd and bpclntcmd commands both seem to come back OK, so it doesn't seem to be a connectivity issue.

mph999 · ‎11-03-2014

Suggest bpbrm log (media server) and bpbkar + bpcd (client) at verbose 5

Nicolai · ‎11-03-2014

No - I referenced the tech note because it also mentioned the bpbkar_path_tr touch file.

Nicolai · ‎11-03-2014

I don't think it's a connection issue if the timeout occurs after exactly 10:06 minutes.

Please perform the debug instructions in TECH31513.

In short, do the following on the client (except for step 4):

1. touch /usr/openv/netbackup/bpbkar_path_tr
2. mkdir /usr/openv/netbackup/logs/bpbkar
3. Add VERBOSE to /usr/openv/netbackup/bp.conf
4. Run the backup
5. Inspect /usr/openv/netbackup/logs/bpbkar/{date}.log
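
For the last step, a small helper can pull the tail of the most recent debug log, since the last 10 or 20 lines are usually the interesting ones. This is only a sketch: the function name newest_log_tail is made up for illustration, and it simply picks the most recently modified file in the directory rather than assuming a particular log file name.

```shell
# Print the last N lines (default 20) of the most recently modified file
# in a log directory. newest_log_tail is a hypothetical helper name.
newest_log_tail() {
    dir=$1
    count=${2:-20}
    # ls -t sorts by modification time, newest first
    newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no log files in $dir" >&2
        return 1
    fi
    tail -n "$count" "$dir/$newest"
}

# Example (on the client, after the failed backup):
#   newest_log_tail /usr/openv/netbackup/logs/bpbkar 20
```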

Michael_G_Ander · ‎11-03-2014

Could you satisfy my curiosity and post the output from the bptestbpcd and bpclntcmd commands?

CRZ · ‎11-03-2014

11/03/2014 16:25:40 - started process bpbrm (pid=1732)
11/03/2014 16:30:41 - Error bpbrm (pid=1732) bpcd on preside10 exited with status 23: socket read failed
11/03/2014 16:35:42 - Error bpbrm (pid=1732) cannot send mail because BPCD on preside10 exited with status 23: socket read failed

It's probably not a coincidence that these log entries are each five minutes apart. More verbose logs (as suggested above) will help diagnose the trouble.
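
The spacing can be checked directly from the timestamps in the job details. A minimal sketch, assuming both timestamps fall on the same day; the helper name elapsed is made up for illustration:

```shell
# Seconds between two HH:MM:SS timestamps (same day assumed).
elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        print (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    }'
}

elapsed 16:25:40 16:30:41   # bpbrm started -> first status 23
elapsed 16:30:41 16:35:42   # first status 23 -> "cannot send mail" error
```

Both intervals come out at 301 seconds, which is close to the 300-second CLIENT_READ_TIMEOUT value mentioned in the original post. That is only suggestive; it does not prove which timeout is firing.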

Nicolai · ‎11-04-2014

Add CLIENT_READ_TIMEOUT = 1800 to the master and media servers as well.
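
To confirm what each server actually has configured, the value can be read back from its bp.conf. A rough sketch, assuming the entry is written as NAME = VALUE on one line as above; the function name read_timeout is made up for illustration:

```shell
# Print the CLIENT_READ_TIMEOUT value from a bp.conf-style file, if present.
# Prints nothing when the entry is missing.
read_timeout() {
    awk -F'=' 'toupper($1) ~ /^[ \t]*CLIENT_READ_TIMEOUT[ \t]*$/ {
        gsub(/[ \t]/, "", $2)   # strip whitespace around the value
        print $2
    }' "$1"
}

# Example (run on the master, each media server, and the client):
#   read_timeout /usr/openv/netbackup/bp.conf
```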

madmax00 · ‎11-05-2014

You were right after all! It was a connectivity problem. I had run bptestbpcd from the master server, but when I tried it from the media server I got no answer from the client, even though both machines (media server and client) have correct /etc/hosts entries. The client was in a different VLAN. It is solved now.
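
For anyone hitting the same thing: the test has to be run from every server that talks to the client, not just the master. A small wrapper along these lines makes that easy to repeat on each host. It is only a sketch: check_client is a made-up name, and the -client option is my assumption about how bptestbpcd was invoked, since the exact command line was not posted in this thread.

```shell
# Report whether bpcd on a client answers from the host this runs on.
# Assumes bptestbpcd is in PATH and accepts -client <name>.
check_client() {
    client=$1
    if bptestbpcd -client "$client" >/dev/null 2>&1; then
        echo "$client: bpcd reachable from $(uname -n)"
    else
        echo "$client: bpcd NOT reachable from $(uname -n)"
        return 1
    fi
}

# Example (run on the master AND on each media server):
#   check_client preside10
```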

Thanks a lot!

error code 23 after 10:06 minutes execution