Master and Media server connectivity issue
Hi All,
I am facing an issue with my current NetBackup environment. A brief introduction to the infrastructure:
1. We have multiple sites with satellite connectivity.
2. Each site has its own local media servers, but the master server is shared.
The issue: backups run without problems at the site where the master server is located, but at the other sites they fail with either status 40 or status 42 after writing some data, sometimes after 500 GB, sometimes after 50 GB. I think there is a connectivity issue between the master and media servers. Is there any way for a backup to keep running even if connectivity between the master and media server is broken? The connectivity drops last no more than 20-30 seconds.
I have tried several things already. The host entries are fine.
The keep-alive time registry value has also been created on the media server under tcpip, and the server has been rebooted.
Below are the bpbrm logs from the media server:
13:47:39.559 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6873.715 0
13:47:39.980 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 78
13:47:40.448 [4676.6576] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:47:40.448 [4676.6576] <2> logconnections: BPDBM CONNECT FROM 10.25.161.60.23072 TO 10.25.215.64.1556 fd = 620
13:47:40.589 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6891.509 0
13:47:40.589 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6891.509 0
13:47:40.682 [4676.6576] <2> db_end: Need to collect reply
13:47:41.618 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6907.384 0
13:47:41.618 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6907.384 0
13:47:42.133 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6925.116 0
13:47:42.133 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6925.116 0
13:47:43.163 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6939.030 0
13:47:43.163 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6939.030 0
13:50:37.172 [4432.2112] <2> bpbrm read_parent_msg: read from parent
13:57:41.243 [4676.6576] <2> get_long: (1) cannot read (byte 1) from network: Connection timed out.
13:57:41.243 [4676.6576] <2> db_getdata: get_string() failed: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (10060) network read error (-3) WSAGetLastError(): 0
13:57:41.243 [4676.6576] <2> db_end: no DONE from db_getreply(): network read failed
13:57:41.243 [4676.6576] <16> bpbrm handle_backup: db_FLISTsend failed: network read failed (42)
13:57:41.243 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 1
13:57:41.648 [4676.6576] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:57:41.648 [4676.6576] <2> logconnections: BPDBM CONNECT FROM 10.25.161.60.23294 TO 10.25.215.64.1556 fd = 620
13:57:41.851 [4676.6576] <2> db_end: Need to collect reply
13:57:42.054 [4676.6576] <2> inform_client_of_status: INF - Server status = 42
13:57:42.444 [4432.2112] <2> bpbrm brm_child_done: child done, status 42
13:57:42.444 [4432.2112] <2> bpbrm brm_child_done: child 4676 exited with status 42: network read failed
13:57:42.444 [4432.2112] <2> bpbrm send_status_to_parent: bpbrm child is done, but the media manager child is not.
13:57:42.444 [4432.2112] <2> bpbrm tell_mm: sending media manager msg: STOP BACKUP afs470-eng4_1320748231
13:57:41.243 [4676.6576] <2> db_end: no DONE from db_getreply(): network read failed
13:57:41.243 [4676.6576] <16> bpbrm handle_backup: db_FLISTsend failed: network read failed (42)

You are correct, you have a network issue.
Judging by the results you have, the backup will terminate if connectivity is lost.
In these log lines, we see that the media server is unable to update the catalog on the master.
It does acquire a new connection:
13:57:41.243 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 1
... but clearly it is unable to recover from this.
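
If it helps to see where the stall sits, the timestamps in the posted bpbrm log can be scanned for long silences per process. In the excerpt, process 4676 logs "Need to collect reply" at 13:47:40.682 and nothing further until the read times out at 13:57:41.243, roughly ten minutes later. The following is only a rough sketch in Python: the function name find_gaps and the 60-second threshold are my own choices, and it assumes all lines come from the same day (no midnight rollover).

```python
import re
from datetime import datetime

# Matches the leading "HH:MM:SS.mmm [pid.tid]" of a bpbrm log line.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\.(\d+)\]")


def find_gaps(lines, min_gap_seconds=60.0):
    """Return (pid, previous_time, current_time, gap_seconds) for every silence
    of at least min_gap_seconds between consecutive lines of the same process.

    Hypothetical helper, not part of NetBackup. Assumes a single day of logs.
    """
    last_seen = {}  # pid -> datetime of that process's previous log line
    gaps = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a timestamp prefix
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        pid = match.group(2)
        previous = last_seen.get(pid)
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((pid,
                             previous.strftime("%H:%M:%S"),
                             stamp.strftime("%H:%M:%S"),
                             round(gap, 3)))
        last_seen[pid] = stamp
    return gaps


# Four lines taken from the log excerpt in this thread.
sample = [
    "13:47:40.682 [4676.6576] <2> db_end: Need to collect reply",
    "13:47:43.163 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6939.030 0",
    "13:50:37.172 [4432.2112] <2> bpbrm read_parent_msg: read from parent",
    "13:57:41.243 [4676.6576] <2> get_long: (1) cannot read (byte 1) from network: Connection timed out.",
]

for pid, start, end, seconds in find_gaps(sample):
    print(f"pid {pid}: silent from {start} to {end} ({seconds} s)")
```

Running this over the full bpbrm log of a failed job, and comparing the silent periods with the times of known link drops, may show whether the failures line up with them.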
The solution, as you are aware, is to fix your network. Apologies that this comment is not particularly helpful, but there is no fault in NBU.
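
To gather evidence for the network team, one option is to probe the master's PBX port from the remote media server while a backup runs and record when it stops answering. The sketch below is plain Python, not a NetBackup tool: the function names probe and watch are made up for illustration, port 1556 is taken from the BPDBM CONNECT lines in your log, and I am assuming zsm465-bkp is the master. Note that a fresh connect succeeding does not prove an already established, long-lived connection survived the drop, so treat the output as a hint only.

```python
import socket
import time


def probe(host, port=1556, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def watch(host, port=1556, checks=10, interval=1.0, timeout=5.0):
    """Probe host:port `checks` times and return a list of
    (wall_clock_time, reachable) tuples, sleeping `interval` seconds between probes."""
    results = []
    for i in range(checks):
        results.append((time.strftime("%H:%M:%S"), probe(host, port, timeout)))
        if i < checks - 1:
            time.sleep(interval)
    return results


# Example call (host name assumed from the log excerpt):
#   for stamp, ok in watch("zsm465-bkp", checks=60, interval=5.0):
#       print(stamp, "reachable" if ok else "UNREACHABLE")
```

Failed probes whose times match the gaps in the bpbrm log would support the view that the link, not NBU, is at fault.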
Regards,
Martin

