
Change AIR restart time ?

Michael_G_Ander
Level 6
Certified

Hello

Does anybody know how to change the AIR restart time from the seemingly default 24 hours?

I ask because we have an AIR replication that often fails because of a poor connection. Unfortunately we cannot improve the connection, so more frequent retries are desirable.

NetBackup 7.6.0.1 on all involved systems

Regards

Michael

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue
18 REPLIES 18

Marianne
Level 6
Partner    VIP    Accredited Certified
Have a look at all of these settings on your master under Host Properties -> SLP Parameters:

Job submission interval
Extended image retry interval
How long to retry failed AIR import jobs
etc.

Michael_G_Ander
Level 6
Certified

Thanks, we have looked into those and they are all less than 24 hours except Cleanup Interval, which we have changed to 1 hour.

As Extended image retry interval is 2 hours, you would expect the replication to restart 2 hours after the failure rather than after 24 hours.

 


Marianne
Level 6
Partner    VIP    Accredited Certified

Perhaps SLP window defined?

When you refer to 'AIR restart time' is this the duplication/replication or import phase?

I cannot see any other SLP Parameters that have an effect on SLP/AIR retry times.
... unless... you have discovered a bug.... (but Support will probably tell you to patch master to latest 7.6.0.x version...)

Michael_G_Ander
Level 6
Certified

The SLP window is 24x7

It is the duplication/replication phase that is the problem

I guess that replication startup is governed internally by the deduplication database.


Marianne
Level 6
Partner    VIP    Accredited Certified

My understanding is that replication is governed by SLP. It is simply a duplication job started by SLP.

Only SLP parameters should play a role.

I am curious to know what Support would say if you logged a call...

RiaanBadenhorst
Moderator
Moderator
Partner    VIP    Accredited Certified

I haven't seen it wait 24 hours. As Marianne stated, AIR or no AIR, it's controlled by the SLP. And they usually just keep on trying, sometimes annoyingly so.

Michael_G_Ander
Level 6
Certified

In our case it keeps trying, but only once every 24 hours unfortunately

The error we get is

Critical bpdm (pid=XXXX) Storage Server Error: (Storage server: PureDisk:<source storage server>) async_get_job_status: Replication started but failed to complete successfully: ___sosend: __crStreamWrite failed: connection reset by peer. Look at the replication logs on the source storage server for more information. V-454-105.

We are also interested in ways to make the replication/optimized duplication connection more resilient.


Michael_G_Ander
Level 6
Certified

Have found an entry called

# Time length in second for reconnect attempts

# @validate [0-3600], 0 for disabling network resiliency

ReconnectTimeout=0

in agent.cfg, but cannot find any documentation for it in either the current NetBackup Dedupe Guide or the old PureDisk manuals.
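For anyone who wants to check what a given agent.cfg currently contains before experimenting, here is a minimal Python sketch. It only reads key=value lines and compares ReconnectTimeout with the 0-3600 range stated in the file's own comment. The helper names `read_settings` and `describe_reconnect_timeout` are made up for this example, the file layout is assumed to be plain key=value lines with # comments, and nothing here says what the parameter actually does inside MSDP, since that is undocumented.

```python
# Sketch only: assumes agent.cfg uses "key=value" lines and "#" comments,
# as in the fragment quoted above. Helper names are hypothetical.

SAMPLE = """\
# Time length in second for reconnect attempts
# @validate [0-3600], 0 for disabling network resiliency
ReconnectTimeout=0
"""


def read_settings(text):
    """Collect key=value pairs, skipping blank lines, # comments and [section] headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Keys are lower-cased so lookups do not depend on the file's casing.
            settings[key.strip().lower()] = value.strip()
    return settings


def describe_reconnect_timeout(settings):
    """Summarise ReconnectTimeout using only what the file comment states."""
    value = int(settings.get("reconnecttimeout", "0"))
    if not 0 <= value <= 3600:
        return f"{value} is outside the 0-3600 range named in the comment"
    if value == 0:
        return "network resiliency disabled (0)"
    return f"reconnect window of {value} seconds, per the file comment"


if __name__ == "__main__":
    print(describe_reconnect_timeout(read_settings(SAMPLE)))
```

To use it against a real file, pass the file's text to `read_settings` instead of `SAMPLE`.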

 


Marianne
Level 6
Partner    VIP    Accredited Certified

Any progress on this?

Perhaps logged a support call yet?

Michael_G_Ander
Level 6
Certified

A little, we have discovered that by changing another backup schedule that includes a replication, we can trigger a retry of the failed replication.

Besides that we have increased the spa timeouts in the agent.cfg file to timeoutconnection=300 and timeoutreponse=600

 

 

 


Marianne
Level 6
Partner    VIP    Accredited Certified

Does this mean that your issue is now resolved?

Michael_G_Ander
Level 6
Certified

Not quite yet, but the timeout changes have helped the replication job to survive some of the network issues.

Will probably make some more changes after Easter.

 


Michael_G_Ander
Level 6
Certified

A local Symantec employee has promised to see whether it is possible to find some internal documentation on MSDP <-> MSDP resiliency.


Michael_G_Ander
Level 6
Certified

Unfortunately the local employee has not been able to find any documentation regarding this, and has been informed that we will probably have difficulties getting AIR to survive our network outages.


Michael_G_Ander
Level 6
Certified

Have created a support case about these parameters. As expected, I am getting all the WAN resiliency documents for NetBackup, which do not answer my question. Will update when I hopefully get some more relevant answers.


watsons
Level 6

Just to share: I once tried to trace when the AIR (SLP) retries happen.

It is quite consistent: 3 attempts at first (every 5 min), and thereafter it follows

SLP.IMAGE_EXTENDED_RETRY_PERIOD = 2 hours

to retry. What I don't understand, though, is that at around 9PM (12 hours after the 1st attempt) it retried 3 times again, every 5 min, before going back to following the "extended retry period".

First replication:  8:51AM

1st retry: 8:56AM
2nd retry: 9:01AM
3rd retry: 9:06AM
4th retry: 11:07AM
5th retry: 1:07PM
6th retry: 3:07PM
7th retry: 5:07PM
8th retry: 7:08PM
9th retry: 8:53PM (12 hours after 1st attempt)
10th retry: 8:58PM
11th retry: 9:03PM
12th retry: 9:08PM
13th retry: 11:08PM
14th retry: 1:08AM
15th retry: 3:08AM
16th retry: 5:09AM
17th retry: 7:09AM

It was 7.6, probably the GA version, with AIR replication.
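To make the pattern in the list above easier to see, here is a small Python sketch that turns the quoted clock times into gaps in minutes. The times are copied from the list; the calendar date is arbitrary and only used to handle the rollover past midnight, and the 10-minute cut-off between "short" and "extended" gaps is my own choice for display, not a NetBackup setting.

```python
from datetime import datetime, timedelta

# Attempt times copied from the list above (first replication plus 17 retries).
ATTEMPTS = [
    "8:51AM", "8:56AM", "9:01AM", "9:06AM", "11:07AM", "1:07PM",
    "3:07PM", "5:07PM", "7:08PM", "8:53PM", "8:58PM", "9:03PM",
    "9:08PM", "11:08PM", "1:08AM", "3:08AM", "5:09AM", "7:09AM",
]


def to_datetimes(times, start_date=datetime(2000, 1, 1)):
    """Parse clock times, moving to the next day when the clock goes backwards."""
    result = []
    day = start_date
    for text in times:
        parsed = datetime.strptime(text, "%I:%M%p")
        stamp = day.replace(hour=parsed.hour, minute=parsed.minute)
        if result and stamp < result[-1]:
            day += timedelta(days=1)
            stamp = day.replace(hour=parsed.hour, minute=parsed.minute)
        result.append(stamp)
    return result


def gaps_in_minutes(times):
    """Return the gap in whole minutes between each attempt and the one before it."""
    stamps = to_datetimes(times)
    return [int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for label, gap in zip(ATTEMPTS[1:], gaps_in_minutes(ATTEMPTS)):
        kind = "short" if gap <= 10 else "extended"  # display threshold only
        print(f"{label:>8}  +{gap:3d} min  ({kind})")
```

Run against these times, the gaps come out as three 5-minute retries, then gaps of about 120 minutes, one shorter gap of 105 minutes leading into the 8:53PM attempt, and then the same cycle again. The first nine gaps add up to 722 minutes, which matches the "12 hours after 1st attempt" remark.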

Michael_G_Ander
Level 6
Certified

I am thinking that the 12 hours could be related to the tlog queue processing.

Or another replication could trigger a new retry series. We are using that to try to get the problematic image through.

 


Michael_G_Ander
Level 6
Certified

Unfortunately the reconnecttimeout parameter does not seem to be documented/known at Symantec/Veritas either. Or at least that is what I understand from the support case.

Too bad, as AIR is a really great function that could be even greater if it had network resiliency too.
