Master and Media server connectivity issue

Anwar_Azad
Level 3
Certified

Hi All,

I am facing an issue with my current NetBackup environment. A brief introduction to the infrastructure:

1. We have multiple sites with satellite connectivity.

2. Each site has its own local media servers, but the master server is shared.

The issue: backups run without problems at the site where the master server is located, but at the other sites they fail with either a status 40 or a status 42 error after writing some data, sometimes after 500 GB, sometimes after 50 GB. I think there is a connectivity issue between the master and media servers. Is there any way for a backup to keep running even if connectivity between the master and media server is broken? The connectivity drops last no more than 20-30 seconds.

I have tried several things. The host entries are fine.

A keep-alive time registry value has also been created on the media server under tcpip, and the server has been rebooted.

 

Below are the bpbrm logs from the media server:

13:47:39.559 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6873.715 0
13:47:39.980 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 78
13:47:40.448 [4676.6576] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:47:40.448 [4676.6576] <2> logconnections: BPDBM CONNECT FROM 10.25.161.60.23072 TO 10.25.215.64.1556 fd = 620
13:47:40.589 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6891.509 0
13:47:40.589 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6891.509 0
13:47:40.682 [4676.6576] <2> db_end: Need to collect reply
13:47:41.618 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6907.384 0
13:47:41.618 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6907.384 0
13:47:42.133 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6925.116 0
13:47:42.133 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6925.116 0
13:47:43.163 [4432.2112] <2> bpbrm read_media_msg: read from media manager: WROTE afs470-eng4_1320748231 20032 0 6939.030 0
13:47:43.163 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6939.030 0
13:50:37.172 [4432.2112] <2> bpbrm read_parent_msg: read from parent

13:57:41.243 [4676.6576] <2> get_long: (1) cannot read (byte 1) from network: Connection timed out.
13:57:41.243 [4676.6576] <2> db_getdata: get_string() failed: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (10060) network read error (-3) WSAGetLastError(): 0
13:57:41.243 [4676.6576] <2> db_end: no DONE from db_getreply(): network read failed
13:57:41.243 [4676.6576] <16> bpbrm handle_backup: db_FLISTsend failed: network read failed (42)
13:57:41.243 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 1
13:57:41.648 [4676.6576] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:57:41.648 [4676.6576] <2> logconnections: BPDBM CONNECT FROM 10.25.161.60.23294 TO 10.25.215.64.1556 fd = 620
13:57:41.851 [4676.6576] <2> db_end: Need to collect reply
13:57:42.054 [4676.6576] <2> inform_client_of_status: INF - Server status = 42
13:57:42.444 [4432.2112] <2> bpbrm brm_child_done: child done, status 42
13:57:42.444 [4432.2112] <2> bpbrm brm_child_done: child 4676 exited with status 42: network read failed
13:57:42.444 [4432.2112] <2> bpbrm send_status_to_parent: bpbrm child is done, but the media manager child is not.
13:57:42.444 [4432.2112] <2> bpbrm tell_mm: sending media manager msg: STOP BACKUP afs470-eng4_1320748231

1 ACCEPTED SOLUTION

mph999
Level 6
Employee Accredited

 

13:57:41.243 [4676.6576] <2> db_end: no DONE from db_getreply(): network read failed
13:57:41.243 [4676.6576] <16> bpbrm handle_backup: db_FLISTsend failed: network read failed (42)

You are correct, you have a network issue.

As your results show, the backup terminates if connectivity is lost.

In these log lines, we see the media server is unable to update the catalog on the master.

It does acquire a new connection:

 

13:57:41.243 [4676.6576] <2> ConnectionCache::connectAndCache: Acquiring new connection for host zsm465-bkp, query type 1

... but clearly it is unable to recover from this.
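To see how long each bpbrm process actually sat waiting before the failure, the timestamps in a log like the one above can be compared per process ID. The following is only a rough sketch: the function name find_gaps and the 60-second threshold are my own choices, it assumes the "HH:MM:SS.mmm [pid.tid]" prefix shown in the excerpt, and it does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Matches the "HH:MM:SS.mmm [pid.tid]" prefix seen in the bpbrm excerpt.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\.(\d+)\]")


def find_gaps(lines, threshold_seconds=60.0):
    """Return (pid, previous_stamp, next_stamp, gap_seconds) for every
    silent period longer than threshold_seconds within one process."""
    last_seen = {}  # pid -> (parsed time, raw timestamp text)
    gaps = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # blank lines or lines without a timestamp prefix
        raw_stamp, pid = match.group(1), match.group(2)
        stamp = datetime.strptime(raw_stamp, "%H:%M:%S.%f")
        if pid in last_seen:
            previous, previous_raw = last_seen[pid]
            gap = (stamp - previous).total_seconds()
            if gap > threshold_seconds:
                gaps.append((pid, previous_raw, raw_stamp, round(gap, 3)))
        last_seen[pid] = (stamp, raw_stamp)
    return gaps


# Four lines copied from the excerpt in the original post.
SAMPLE_LINES = [
    "13:47:40.682 [4676.6576] <2> db_end: Need to collect reply",
    "13:47:43.163 [4432.2112] <2> bpbrm send_parent_msg: WROTE afs470-eng4_1320748231 20032 0 6939.030 0",
    "13:50:37.172 [4432.2112] <2> bpbrm read_parent_msg: read from parent",
    "13:57:41.243 [4676.6576] <2> get_long: (1) cannot read (byte 1) from network: Connection timed out.",
]

if __name__ == "__main__":
    for pid, start, end, seconds in find_gaps(SAMPLE_LINES):
        print(f"pid {pid}: silent from {start} to {end} ({seconds} s)")
```

On those four lines, process 4432 is silent for about 174 seconds and process 4676 for about 600 seconds between "Need to collect reply" and the timeout. That is how long the process waited for a reply, not necessarily how long the link itself was down.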

The solution, as you are aware, is to fix your network - apologies that this comment is not particularly helpful, but there is no fault in NBU.

Regards,

 

Martin

Anwar_Azad
Level 3
Certified

Is there any way for the master server to get the catalog info once the connection is re-established?

Until then, the backup would continue to run.

Mark_Solutions
Level 6
Partner Accredited Certified

All you can really do is use checkpoints in your policies and hope that the job retries happen when the connection is re-established.

Your real problem is that this is Media Server to Master communication, which is likely to make things worse.

You would be better off investing in the DeDuplication option and running client-side dedupe with the Media Servers on the main site. That way it would just be client communication, so a retry is more likely to pick it back up, and there would be far less data to pass in the first place.

Hope this helps

Anwar_Azad
Level 3
Certified

I just want to know whether a backup can continue to run if connectivity between the master server and media server is lost for a short time, say 30 seconds. If yes, then how? Will increasing the polling interval help at all?

Mark_Solutions
Level 6
Partner Accredited Certified

If you set all of your connect and read timeouts at a good level on the Master and Media Server host properties then you should be OK.

If in doubt, go to the timeout section of each server's host properties and set everything to 3600. Make sure that all file system backups use the checkpoint-restart feature on the policy attributes tab, and that the Master Server's host properties have something like 4 tries in 8 hours set.

Hope this helps (client side de-dupe is still the best option though)

Anwar_Azad
Level 3
Certified

I have already increased the timeouts to 7200, but with no luck: the backup continues to fail with a 42 error after writing data. The backup fails because connectivity to the master server is lost when the WAN link breaks, so the master server is not able to update the catalog. Is there any way to control this, please?

I am suggesting client-side dedupe to the customer, but they have not agreed: the servers are already under heavy load and they do not want to add the extra load of the dedupe process.

My question still stands: can a backup continue to run if connectivity between the master and media server is broken? Yes or no, please.

If no, then the straightforward solution is to have a separate master server on each site with no dependency on the WAN link.

Quick comments on this are appreciated.

CRZ
Level 6
Employee Accredited Certified

Martin answered your question two days ago: No.

If the link breaks, so does your backup. 

Sorry for the bad news, but as a best practice, this isn't supposed to happen.  Fix the network!
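Before taking this to the network team, it may help to measure how long the drops really last rather than estimating them. Below is a rough sketch that repeatedly opens a TCP connection to the master on port 1556 (the port the media server connects to in the log above) and summarizes the outages. The function names, the 5-second interval and the sample count are hypothetical choices of mine, not anything from NetBackup. A successful TCP connect also says nothing about packet loss or latency on an already open connection.

```python
import socket
import time


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize_outages(samples):
    """Turn [(timestamp_seconds, reachable), ...] into (start, duration) pairs.

    An outage starts at the first failed probe and ends at the next
    successful one. An outage still open at the end of the samples is
    measured up to the last sample.
    """
    outages = []
    start = None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp
        elif ok and start is not None:
            outages.append((start, stamp - start))
            start = None
    if start is not None and samples:
        outages.append((start, samples[-1][0] - start))
    return outages


def monitor(host, port=1556, interval=5.0, count=720):
    """Probe host:port count times, interval seconds apart (about an hour
    with the defaults), and return the outages that were observed."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), port_reachable(host, port)))
        time.sleep(interval)
    return summarize_outages(samples)


# Example, run from a remote media server (host name taken from the log):
#     for start, duration in monitor("zsm465-bkp"):
#         print(time.ctime(start), f"{duration:.0f} s")
```

Measured durations are only accurate to within the probe interval plus the connect timeout, so a shorter interval gives a tighter estimate at the cost of more connections.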