01-11-2020 03:26 AM
Master/Media Server (also the client):
NBU 8.1.1 on RHEL 6.5
Problem:
1) Backup jobs hang, with either no error at all or "socket read failed (23)", after running normally for a few hours.
2) While the jobs are hung, bptestbpcd from the media server to the master also hangs, but bptestbpcd from the master to the media server always works fine.
3) After the backup jobs hang, many bpcd child processes are continually forked by the original parent bpcd process.
Notes: there is no firewall blocking, even during the hang; ports 1556, 13724, and 13782 can always be reached via telnet in both directions.
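For what it's worth, the bi-directional port check above can be scripted rather than telnet-ed interactively. A minimal sketch, assuming a Linux box with bash and coreutils; the master-host/media-host names are placeholders for your own hosts:

```shell
#!/bin/bash
# Probe TCP reachability of the NetBackup ports mentioned above
# (1556 = PBX, 13724 = vnetd, 13782 = legacy bpcd).
check_port() {
    # Returns 0 if a TCP connection to $1:$2 succeeds within 3 seconds.
    # Uses bash's /dev/tcp pseudo-device, so no telnet/nc dependency.
    timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}

# master-host and media-host are placeholders -- substitute real names.
for host in master-host media-host; do
    for port in 1556 13724 13782; do
        if check_port "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
done
```

Running this from both sides during a hang would confirm the "no firewall blocking" observation without relying on an interactive telnet session.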
bpbkar log:
19:41:40.998 [3781] <2> bpbkar SelectFile: cwd=/ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b path=c4.jpg
19:41:40.998 [3781] <2> bpbkar PrintFile: /ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b/c4.jpg
19:41:40.998 [3781] <2> bpbkar SelectFile: cwd=/ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b path=c5.jpg
19:41:40.999 [3781] <2> bpbkar PrintFile: /ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b/c5.jpg
19:41:40.999 [3781] <2> bpbkar SelectFile: cwd=/ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b path=c6.jpg
19:41:41.006 [3781] <2> bpbkar PrintFile: CPR - 1628233600 3781 1576554998 1576582901 1619062 0 0 1 0 0 1 64777 128 0 810000 0 0 0 0 73 /ecmdata/ecmdocdata5/SH120001/content_storage_09/0001d4c1/80/18/9b/c6.jpg
21:41:39.662 [3781] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 23: socket read failed
21:41:39.662 [3781] <4> bpbkar Exit: INF - EXIT STATUS 23: socket read failed
21:41:39.663 [3781] <4> bpbkar Exit: INF - setenv FINISHED=0
bpbrm log:
19:34:17.432 [2163.2163] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
19:34:17.432 [2163.2163] <2> is_connection_INBOUND: non-proxy, localhost connection
19:34:17.432 [2163.2163] <2> is_connection_INBOUND: Connection-Is-INBOUND:[FALSE]
19:38:36.064 [1594.1594] <2> get_exactly_n_bytes_or_eof_abs: socket read timed out: Connection timed out (110)
19:38:36.064 [1594.1594] <2> ts_get_long_abs: error reading long from socket: Connection timed out (110)
19:38:36.064 [1594.1594] <2> ts_get_string: error reading long from socket: Connection timed out (110)
19:38:36.064 [1594.1594] <16> nb_recv_str: ts_get_string() failed: -2 110
01-16-2020 12:00 AM
Is the media server also the client?
Where do you see this happening?
3) After the backup jobs hang, many bpcd child processes are continually forked by the original parent bpcd process.
Do you have level 5 log for bpcd on this server/client?
All of these bpcd processes are certainly not normal - so, this is where you should concentrate troubleshooting.
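To quantify the bpcd build-up before digging into the level 5 logs, a process snapshot can be taken on the affected media server. A sketch, assuming a RHEL host with procps (pgrep/ps); the bpcd process name comes from the logs in this thread:

```shell
#!/bin/bash
# Count running processes whose exact name matches $1.
count_procs() {
    # pgrep prints one PID per line; wc -l yields 0 when nothing matches.
    pgrep -x "$1" | wc -l
}

# Snapshot bpcd processes with their parent PIDs and elapsed time,
# to see which parent is repeatedly forking children and for how long
# the children have been alive.
ps -eo pid,ppid,etime,comm | awk 'NR == 1 || $4 == "bpcd"'
echo "bpcd process count: $(count_procs bpcd)"
```

Capturing this output every few minutes across the hang window would show whether the child count grows without bound and whether the children ever exit.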
01-16-2020 01:27 AM
1) Yes, the media server is also the client.
2) This phenomenon happens only on the problematic media server/client.
3) Yes, the debug level is set to 5 on all servers/clients.
What is also a bit strange is that the bpcd log itself shows no obvious errors.
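As a sanity check that level 5 is really in effect for the legacy logs, the VERBOSE setting can be read straight out of bp.conf. A minimal sketch, assuming a default UNIX install path; the parsing helper is my own, not a NetBackup tool:

```shell
#!/bin/bash
# Extract the legacy-log VERBOSE level from a bp.conf-style file
# (lines of the form "KEY = VALUE").
get_verbose() {
    awk -F' *= *' '$1 == "VERBOSE" { print $2 }' "$1"
}

# Default bp.conf location on a UNIX master/media/client install.
BPCONF=/usr/openv/netbackup/bp.conf
[ -r "$BPCONF" ] && echo "VERBOSE = $(get_verbose "$BPCONF")"
```

If VERBOSE is anything below 5 on the problematic host, the quiet bpcd log would be explained by configuration rather than by bpcd genuinely logging nothing.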
a) this bpcd hung at 07:26:17
07:22:05.145 [5411] <2> is_connection_INBOUND: non-proxy, localhost connection
07:22:05.145 [5411] <2> is_connection_INBOUND: Connection-Is-INBOUND:[FALSE]
07:23:05.134 [22300] <16> resync_host_cache: bpclntcmd fork pid = 5164
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(4), errno = 2, mode = 000021b6
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(5), errno = 2, mode = 00008180
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(6), errno = 2, mode = 0000c1ff
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(7), errno = 2, mode = 0000c1ff
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(8), errno = 2, mode = 0000c1ff
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(9), errno = 2, mode = 0000c1ff
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(10), errno = 2, mode = 0000c1ff
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(11), errno = 2, mode = 000081a4
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(12), errno = 2, mode = 000081a4
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(13), errno = 2, mode = 000081a4
07:23:05.135 [5167] <2> unix_daemonize: 0 = close(14), errno = 2, mode = 000081a4
07:23:05.136 [22300] <2> daemon_check_for_zombies: waited for 1 child processes including: 5164
07:23:05.140 [5167] <16> resync_host_cache: about to execv(/usr/openv/netbackup/bin/bpclntcmd, ...)
07:32:05.115 [22300] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
07:32:05.115 [22300] <2> is_connection_INBOUND: non-proxy, localhost connection
07:32:05.115 [22300] <2> is_connection_INBOUND: Connection-Is-INBOUND:[FALSE]
07:32:05.116 [22300] <2> vnet_pbxAcceptSocket_ex: Accepted sock[15] from 10.131.33.60:38083
07:32:05.116 [22300] <2> daemon_select_and_accept_ex: pbx setsockopt SO_KEEPALIVE succeeded
07:32:05.116 [22300] <2> bpcd main: accept sock = 15
07:32:05.117 [5763] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
07:32:05.117 [5763] <2> daemon_proxy_proto: Preparing to do daemon protocol for (10.131.21.177:1556 <- 10.131.33.60:38083)
07:32:05.117 [5763] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
07:32:05.117 [5763] <2> vnet_proxy_get_peer_sock_names: setting connection_info json as json_null
b) this bpcd hung at 08:14:34
08:12:05.459 [13231] <2> VssIdentityCleanup: ++++ ENTERING ++++
08:12:05.459 [13231] <2> VssIdentityCleanup: ---- EXITING ----
08:12:05.459 [13231] <2> VssAtManager::~freeManager: ++++ ENTERING ++++
08:12:05.459 [13231] <2> VssAtManager::~freeManager: (../../libVnbat/vss_auth.cpp,577): Terminating AT handle 0x176816c8
08:12:05.460 [13231] <2> VssAtManager::~freeManager: ---- EXITING ----
08:12:05.462 [26683] <2> daemon_check_for_zombies: waited for 1 child processes including: 13231
08:20:05.377 [26683] <16> resync_host_cache: bpclntcmd fork pid = 13394
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(4), errno = 2, mode = 000021b6
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(5), errno = 2, mode = 00008180
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(6), errno = 2, mode = 0000c1ff
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(7), errno = 2, mode = 0000c1ff
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(8), errno = 2, mode = 0000c1ff
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(9), errno = 2, mode = 0000c1ff
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(10), errno = 2, mode = 0000c1ff
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(11), errno = 2, mode = 000081a4
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(12), errno = 2, mode = 000081a4
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(13), errno = 2, mode = 000081a4
08:20:05.378 [13396] <2> unix_daemonize: 0 = close(14), errno = 2, mode = 000081a4
08:20:05.379 [26683] <2> daemon_check_for_zombies: waited for 1 child processes including: 13394