NetBackup Timeouts
Our NetBackup jobs fail when there is a WAN disconnection (usually due to an ISP issue). Media and client servers are in one office, and the master server is across the WAN. We have increased the timeouts on all servers involved (and ensured their settings match), but WAN disconnections continue to cause jobs to fail, even though the disconnections are often brief (less than 30 seconds) and the timeouts were set to 10 minutes.
Is anyone aware of any other settings (NetBackup, Windows 2008 R2, HP Network Configuration Utility) that may supersede NetBackup timeouts or cause them to be ineffective?
Further Details of issue:
Our Windows 2008 R2/2008/2003 environment consists of one master server, about a dozen media servers, and approx. 70 clients. With the exception of the master server and one media server, the media servers are in different offices, and each has its clients in its own office. The clients in these backups are Windows file servers that do not run SQL or Exchange.
Concerning NetBackup timeouts, we have been testing these to see when they are used and when they are ignored or overridden by other settings or environmental conditions. So far they have been somewhat inconsistent and do not always restart backup jobs when a network disruption occurs (even a brief one). The 3 tests we performed were:
Test 1: Set timeouts on clients, one media server, and one master server. All computers involved in the backup have the same timeout setting (10 minutes). TCP KeepAliveTime is left at the Windows default of 2 hours (I understand that if we lower this below the timeout setting that could cause issues).
During a test backup of a client, I disconnected the media server for less than 10 minutes and reconnected it. The backup resumed after a little while (checkpoints were being used; set to 10 minutes).
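For anyone wanting to repeat the KeepAliveTime check from Test 1, here is a minimal sketch. It assumes the standard Windows registry location for the TCP keepalive value (`KeepAliveTime`, a DWORD in milliseconds under the Tcpip `Parameters` key) and falls back to the 2-hour default when the value is not set or the script is not run on Windows. The function names are hypothetical and nothing here is a NetBackup API; the 600-second figure is just the 10-minute timeout used in Test 1.

```python
# Sketch only: compares the Windows TCP keepalive time with a timeout value.
# Function names are illustrative; 600 s is the 10-minute timeout from Test 1.

DEFAULT_KEEPALIVE_MS = 7_200_000  # Windows default when KeepAliveTime is unset (2 hours)
TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def read_keepalive_ms():
    """Return KeepAliveTime in milliseconds, or the default if it cannot be read."""
    try:
        import winreg  # only available on Windows
    except ImportError:
        return DEFAULT_KEEPALIVE_MS
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
            value, _value_type = winreg.QueryValueEx(key, "KeepAliveTime")
            return int(value)
    except OSError:
        # Key or value missing: Windows then uses the default.
        return DEFAULT_KEEPALIVE_MS


def keepalive_below_timeout(keepalive_ms, timeout_seconds):
    """True if the keepalive time is shorter than the given timeout."""
    return keepalive_ms < timeout_seconds * 1000


if __name__ == "__main__":
    keepalive = read_keepalive_ms()
    print("KeepAliveTime (ms):", keepalive)
    print("Below 10-minute timeout:", keepalive_below_timeout(keepalive, 600))
```

Running this on each client, media server, and master server would at least confirm that the keepalive value is the same everywhere and is not lower than the timeout.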
Test 2: In an unexpected "live test", our WAN connection between two offices (one containing the media server, the other the master server, with timeouts still set at 10 minutes from the previous test) dropped very briefly, probably for less than 5 or 10 seconds. Our monitoring software only reported that a few packets were dropped, and when I checked, everything appeared OK, with one exception. The backup job was still active, but it reported the details below:
.....
08/14/2012 22:51:42 - positioned DS2476; position time: 0:00:11
08/14/2012 22:51:42 - begin writing
08/15/2012 00:00:51 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
.....
08/15/2012 00:37:40 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:37:48 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:37:54 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:38:01 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:38:08 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:57:20 - Error nbjm (pid=7384) nbrb status: RB deallocated orphaned resources
08/15/2012 01:08:14 - Info bpbrm (pid=2404) Starting delete snapshot processing
08/15/2012 01:08:14 - Info bpfis (pid=0) Snapshot will not be deleted
termination requested by administrator (150)
From 00:00:51 to 00:38:08 it reported the "...FILE ADDED message to..." error every 8 or so seconds. After about 38 minutes of this I canceled the job (status 150) and manually ran it again, and the new job finished OK (status 0). The job that I canceled did not show as canceled until about 30 minutes after I canceled it, so the manual job waited until the first was no longer active before starting.
Test 3: In another office, we had planned downtime scheduled to replace that office's WAN router. We decided to test NetBackup timeouts again (with checkpoints set for 10 minutes; timeouts on relevant clients, media & master server set to 2 hours; TCP KeepAliveTime set to 2 hours). During a backup job, they replaced the router, which, as expected, caused a disconnection. The disconnection lasted less than the 2 hours set for the timeouts, but the backup job ended incomplete and had to be manually run again. Job details:
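To put a number on the "every 8 or so seconds" observation, the job details can be parsed with a short script. This is only a sketch: it assumes the timestamp layout shown in the job details above (`MM/DD/YYYY HH:MM:SS - message`), the function names are made up for illustration, and the sample holds just the five consecutive error lines visible in the Test 2 output.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the job details above.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) - (.*)$")
OUTSOCK_MSG = "could not write FILE ADDED message to OUTSOCK"


def outsock_error_times(job_details):
    """Return the timestamps of every OUTSOCK write error in the text."""
    times = []
    for line in job_details.splitlines():
        match = LINE_RE.match(line.strip())
        if match and OUTSOCK_MSG in match.group(2):
            times.append(datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


SAMPLE = """\
08/15/2012 00:37:40 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:37:48 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:37:54 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:38:01 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
08/15/2012 00:38:08 - Error bpbrm (pid=1952) could not write FILE ADDED message to OUTSOCK
"""

if __name__ == "__main__":
    print(gaps_in_seconds(outsock_error_times(SAMPLE)))  # [8, 6, 7, 7]
```

Feeding it the full job details instead of the sample would show whether the errors stay evenly spaced for the whole 38 minutes or arrive in bursts.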
......
08/23/2012 20:45:32 - Info bpbkar32 (pid=3700) change journal NOT enabled for <D:\>
08/23/2012 21:58:37 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 21:58:48 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 21:59:12 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 21:59:24 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 21:59:39 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 21:59:51 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 22:00:05 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 22:00:22 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 22:00:34 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 22:50:48 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 23:01:40 - Info bptm (pid=5772) waited for full buffer 16953 times, delayed 551732 times
08/23/2012 23:01:40 - Error bpbrm (pid=5792) could not write FILE ADDED message to OUTSOCK
08/23/2012 23:01:52 - Info bptm (pid=5772) EXITING with status 0 <----------
08/23/2012 23:01:52 - Info bpbrm (pid=5792) validating image for client APP06
08/23/2012 23:14:02 - Error nbjm (pid=7640) nbrb status: RB deallocated orphaned resources
08/23/2012 23:24:05 - job resume failed 831
08/23/2012 23:24:09 - Info bpbrm (pid=1560) Starting delete snapshot processing
08/23/2012 23:24:09 - Info bpfis (pid=0) Snapshot will not be deleted
08/23/2012 23:36:26 - job 22629 was restarted as job 22662
client process aborted (50)
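As a rough check on Test 3, the time between the first OUTSOCK error and the "job resume failed" entry can be compared with the 2-hour timeout. The two timestamps are taken from the job details above; treating the first error as the start of the window is my assumption, since the disconnection itself may have begun earlier than the first logged error.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%Y %H:%M:%S"  # timestamp layout used in the job details


def elapsed(start, end):
    """Return the time between two job-detail timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


CONFIGURED_TIMEOUT = timedelta(hours=2)   # timeout used in Test 3
FIRST_ERROR = "08/23/2012 21:58:37"       # first OUTSOCK error
RESUME_FAILED = "08/23/2012 23:24:05"     # "job resume failed 831"

window = elapsed(FIRST_ERROR, RESUME_FAILED)

if __name__ == "__main__":
    print(window)                        # 1:25:28
    print(window < CONFIGURED_TIMEOUT)   # True
```

By this measure the job gave up roughly 1 hour 25 minutes after the first error, well inside the 2-hour timeout.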
The two issues we see are this:
1. Timeouts do not always cause a job to restart, and we do not know their limitations or how to resolve or work around this so that jobs restart properly. When a WAN disruption occurs, the timeouts do not seem to work, but when the media server is physically disconnected from the network during a backup and then reconnected, they apply.
2. When a WAN disruption occurs between the media and master servers, a job hangs for a while even if manually canceled, which delays the start of the following backup job. This is obviously minor compared to the first issue, but if the first occurs, we would rather not wait an hour or so for the manual job to start and have it delay the next backup.
Thank you for taking the time to read this. I have been working with Symantec tech support on these issues but after two weeks we have not been able to resolve them yet. Any ideas, comments, or questions would be appreciated.
Matt
You could give the Network Resiliency feature in 7.5 a try.