02-16-2015 11:13 PM
Hello
Does anybody know how to change the AIR restart time from the seemingly default 24 hours?
I ask because we have an AIR replication that often fails due to a poor connection. Unfortunately we cannot improve the connection, so more frequent retries are desirable.
NetBackup 7.6.0.1 on all involved systems
Regards
Michael
02-17-2015 12:10 AM
02-17-2015 12:33 AM
Thanks, we have looked into those, and they are all less than 24 hours except Cleanup Interval, which we have changed to 1 hour.
As Extended image retry interval is 2 hours, you would expect the replication to restart 2 hours after the failure rather than 24 hours later.
02-17-2015 12:47 AM
Perhaps SLP window defined?
When you refer to 'AIR restart time' is this the duplication/replication or import phase?
I cannot see any other SLP Parameters that have an effect on SLP/AIR retry times.
... unless... you have discovered a bug.... (but Support will probably tell you to patch master to latest 7.6.0.x version...)
02-17-2015 01:36 AM
The SLP window is 24x7
It is the duplication/replication phase that is the problem
My guess is that replication startup is governed internally by the deduplication database.
02-17-2015 01:50 AM
My understanding is that replication is governed by SLP. It is simply a duplication job started by SLP.
Only SLP parameters should play a role.
I am curious to know what Support would say if you logged a call...
02-18-2015 04:25 AM
I haven't seen it wait 24 hours. As Marianne stated, AIR or no AIR, it's controlled by the SLP. And they usually just keep on trying, sometimes annoyingly so.
02-18-2015 11:10 PM
In our case it keeps trying, but only once every 24 hours unfortunately
The error we get is
Critical bpdm (pid=XXXX) Storage Server Error: (Storage server: PureDisk:<source storage server) async_get_job_status: Replication started but failed to complete successfully: ___sosend: __crStreamWrite failed: connection reset by peer. Look at the replication logs on the source storage server for more information. V-454-105.
We are also interested in ways to make the replication/optimized duplication connection more resilient.
02-19-2015 04:34 AM
Have found an entry called
# Time length in second for reconnect attempts
# @validate [0-3600], 0 for disabling network resiliency
ReconnectTimeout=0
in agent.cfg, but cannot find any documentation for it in either the current NetBackup Dedupe Guide or the old PureDisk manuals.
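With no documentation to go on, the only guidance is the comment in the file itself. A minimal Python sketch, assuming agent.cfg uses plain key=value lines with # comments as in the excerpt above (the helper name read_reconnect_timeout is made up for illustration and is not part of NetBackup), that reads the value and checks it against the [0-3600] range from that comment:

```python
def read_reconnect_timeout(cfg_text):
    """Return the ReconnectTimeout value found in cfg_text, or None if absent.

    Assumes plain key=value lines with '#' comments, as in the excerpt
    quoted in this thread. Raises ValueError outside the [0-3600] range
    stated in the file's own @validate comment.
    """
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().lower() == "reconnecttimeout":
            seconds = int(value.strip())
            if not 0 <= seconds <= 3600:
                raise ValueError(f"ReconnectTimeout={seconds} is outside 0-3600")
            return seconds
    return None


sample = """\
# Time length in second for reconnect attempts
# @validate [0-3600], 0 for disabling network resiliency
ReconnectTimeout=0
"""

timeout = read_reconnect_timeout(sample)
if timeout == 0:
    print("network resiliency disabled")
else:
    print(f"reconnect window: {timeout}s")
```

Going by the comment alone, 0 means network resiliency is disabled. Whether a non-zero value actually makes MSDP replication reconnect is not something this check can tell you.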
03-07-2015 03:08 AM
03-27-2015 02:13 AM
A little, we have discovered that by changing another backup schedule that includes a replication, we can trigger a retry of the failed replication.
Besides that we have increased the spa timeouts in the agent.cfg file to timeoutconnection=300 and timeoutreponse=600
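For anyone wanting to script the same change, here is a rough Python sketch. It assumes the two keys already exist in the file as plain key=value lines and are spelled as written above; check the exact names and any section layout in your own agent.cfg first. The function set_cfg_values and the starting values are hypothetical.

```python
def set_cfg_values(cfg_text, updates):
    """Return cfg_text with existing keys set to new values.

    Comments and unrelated lines are left untouched. Keys are matched
    case-insensitively. A key that is not already present raises KeyError
    instead of being appended, since its correct position is unknown.
    """
    wanted = {key.lower(): str(value) for key, value in updates.items()}
    seen = set()
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.partition("=")[0].strip()
            if key.lower() in wanted:
                line = f"{key}={wanted[key.lower()]}"
                seen.add(key.lower())
        out.append(line)
    missing = sorted(set(wanted) - seen)
    if missing:
        raise KeyError(f"keys not found in config text: {missing}")
    return "\n".join(out) + "\n"


before = """\
# hypothetical starting values, for illustration only
timeoutconnection=60
timeoutreponse=120
"""

after = set_cfg_values(before, {"timeoutconnection": 300, "timeoutreponse": 600})
print(after)
```

Take a copy of the original file before writing anything back.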
03-28-2015 03:24 AM
Does this mean that your issue is now resolved?
03-30-2015 01:48 AM
Not quite yet, but the timeout changes have helped the replication job to survive some of the network issues.
Will probably make some more changes after Easter.
04-26-2015 11:36 PM
A local Symantec employee has promised to see whether any internal documentation on MSDP <-> MSDP resiliency can be found.
05-13-2015 04:12 AM
Unfortunately the local employee has not been able to find any documentation on this, and has been told that we will probably have difficulty getting AIR to survive our network outages.
05-20-2015 11:40 PM
I have created a support case about these parameters. As expected, I am getting all the WAN resiliency documents for NetBackup, which do not answer my question. Will update when I hopefully get some more relevant answers.
05-21-2015 05:13 PM
Just to share: I once tried to trace when the AIR (SLP) retries happen.
It is quite consistent: 3 retries at first (every 5 min), after which it follows
SLP.IMAGE_EXTENDED_RETRY_PERIOD = 2 hours
for further retries. What I don't understand, though, is that at around 9PM (12 hours after the 1st attempt) it again retried 3 times at 5-minute intervals before going back to the "extended retry period".
First replication: 8:51AM
1st retry: 8:56AM
2nd retry: 9:01AM
3rd retry: 9:06AM
4th retry: 11:07AM
5th retry: 1:07PM
6th retry: 3:07PM
7th retry: 5:07PM
8th retry: 7:08PM
9th retry: 8:53PM (12 hours after 1st attempt)
10th retry: 8:58PM
11th retry: 9:03PM
12th retry: 9:08PM
13th retry: 11:08PM
14th retry: 1:08AM
15th retry: 3:08AM
16th retry: 5:09AM
17th retry: 7:09AM
It was 7.6, probably the GA version, with AIR replication.
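To make the pattern in the list above easier to see, here is a small Python sketch that turns the observed times into gaps between consecutive attempts. It only does arithmetic on the times quoted above and assumes every gap is shorter than 24 hours. The helper names are made up for illustration.

```python
# Attempt times copied from the list above; the first entry is the original replication.
observed = [
    "8:51AM", "8:56AM", "9:01AM", "9:06AM", "11:07AM", "1:07PM",
    "3:07PM", "5:07PM", "7:08PM", "8:53PM", "8:58PM", "9:03PM",
    "9:08PM", "11:08PM", "1:08AM", "3:08AM", "5:09AM", "7:09AM",
]


def to_minutes(clock_time):
    """Convert a 12-hour time such as '8:51AM' to minutes since midnight."""
    clock, suffix = clock_time[:-2], clock_time[-2:].upper()
    hours, minutes = (int(part) for part in clock.split(":"))
    hours = hours % 12 + (12 if suffix == "PM" else 0)
    return hours * 60 + minutes


def gaps_in_minutes(times):
    """Return minutes between consecutive attempts, wrapping past midnight."""
    minutes = [to_minutes(t) for t in times]
    return [(later - earlier) % 1440 for earlier, later in zip(minutes, minutes[1:])]


gaps = gaps_in_minutes(observed)
print(gaps)
# [5, 5, 5, 121, 120, 120, 120, 121, 105, 5, 5, 5, 120, 120, 120, 121, 120]

# Minutes from the first replication to the 9th retry, where the short retries restart.
print(sum(gaps[:9]))
# 722
```

The gaps show two bursts of three 5-minute retries, each followed by roughly 2-hour intervals. The second burst starts 722 minutes (12 hours 2 minutes) after the first replication, cutting the preceding interval short at 105 minutes.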
05-21-2015 11:34 PM
I think the 12 hours could be related to the tlog queue processing.
Or another replication could trigger a new series of retries. We are using that to try to get the problematic image through.
06-08-2015 04:07 AM
Unfortunately the ReconnectTimeout parameter does not seem to be documented or known at Symantec/Veritas either. Or at least that is what I understand from the support case.
Too bad, as AIR is a really great function that could be even greater if it had network resiliency too.