Forum Discussion

kongzzzz
12 years ago

Why does VCS retry online only after 2 minutes?

 

Hello all
 
I use SF 6.0.1 on SUSE 11. I have an IP resource managed by VCS. In some situations the IP resource can't be brought online on the first attempt, so I set OnlineRetryLimit of IP to 3, hoping that VCS can bring it online via a retry.
 
I ran my test, and the IP can be brought online after a retry, but the interval between the first online attempt and the second one (the retry) is about 2 minutes. Here is engine_A.log:
 
2013/04/03 01:42:31 VCS WARNING V-16-10031-4604 IP:MyappVip:online:Address 2001:1b70:0200:1026:0000:0000:0135:0084 already exists: Res MyappVip will not go online.
2013/04/03 01:44:32 VCS ERROR V-16-2-13066 Agent is calling clean for resource(MyappVip) because the resource is not up even after online completed.
2013/04/03 01:44:33 VCS INFO V-16-2-13716 Resource(MyappVip): Output of the completed operation (clean) 
==============================================
RTNETLINK answers: Cannot assign requested address
==============================================
2013/04/03 01:44:33 VCS INFO V-16-2-13068 Resource(MyappVip) - clean completed successfully.
2013/04/03 01:44:33 VCS INFO V-16-2-13072 Resource(MyappVip): Agent is retrying online (attempt number 1 of 3).
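
To put a number on "about 2 minutes", the gap between the online warning and the clean call in the log above can be worked out from the two timestamps. This is only a small sketch; the helper name `to_seconds` is made up for illustration and assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp to seconds since midnight.
# (Hypothetical helper, not part of VCS.)
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

first_online=$(to_seconds 01:42:31)   # V-16-10031-4604 warning from online
clean_called=$(to_seconds 01:44:32)   # V-16-2-13066, agent calls clean

gap=$(( clean_called - first_online ))
echo "gap between online and clean: ${gap} seconds"
```

For the two timestamps above this gives 121 seconds, which is the delay the question is about.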
 
 
I would like VCS to retry online after 4 seconds, but I cannot find a suitable parameter to decrease the interval. So my question is:
 
How can I make VCS retry online after 4 seconds?
 
  • Hi kongzzzz,

    The 2-minute delay you are seeing is due to the OnlineWaitLimit default value of 2. The agent waits for 2 monitor cycles after online completes. If the resource is still not online after 2 monitor cycles, clean is called and online is retried based on the OnlineRetryLimit value.

    You can change OnlineWaitLimit for the IP agent from the default value of 2 to 0. This results in the following agent behavior:

    If the monitor entry point that runs after the online entry point reports the resource as offline, clean is scheduled immediately (without any delay in between). Once clean completes successfully, online is retried immediately. The only delay in between is the time the entry points take to execute.

    Hope this helps your use case.

    Thanks and Regards,

    Paresh Bafna
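
As a rough sketch of the change described above, the script below sets OnlineWaitLimit to 0 for a single resource by overriding the static attribute at resource level instead of changing it for the whole IP type. The override approach, the wrapper names `run` and `set_online_wait_limit`, and the `DRY_RUN` switch are assumptions added for illustration. By default the script only prints the commands, so review them before running against a live cluster.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1 (the default), execute it otherwise.
# (Hypothetical wrapper for illustration.)
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_online_wait_limit RESOURCE LIMIT
# Overrides the static attribute OnlineWaitLimit for one resource only.
set_online_wait_limit() {
    res="$1"
    limit="$2"
    run haconf -makerw                                   # open config read-write
    run hares -override "$res" OnlineWaitLimit           # allow a per-resource value
    run hares -modify "$res" OnlineWaitLimit "$limit"    # set the new limit
    run hares -display "$res" -attribute OnlineWaitLimit # verify
    run haconf -dump -makero                             # save and close config
}

set_online_wait_limit MyappVip 0
```

Setting `DRY_RUN=0` in the environment would execute the commands instead of printing them. To change the value for every IP resource, `hatype -modify IP OnlineWaitLimit 0` can be used instead of the two `hares` modification steps.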

  • Hello kongzzz,

     

    I guess your OnlineTimeout has been set to 120 seconds.

    VCS will wait until this time has passed and then call the clean function before it retries the online procedure. When the OnlineRetryLimit is set to a non-zero value, the agent framework calls the Clean function before rerunning the Online function.

     

    So your total wait time until the next online attempt is OnlineTimeout + time taken by clean function.

     

    Please also note that setting OnlineTimeout too low might lead to false alarms and/or the resource not being able to come online at all.

    For the IP resource, for example, the IP agent performs an IP online and 2 ARP requests, which take 2-5 seconds each.

    Bringing an IP online using VCS takes roughly 10 seconds. On a heavily loaded system, or when you bring several IPs online at the same time, you need even more time (the ARP requests are currently not sent in parallel; an enhancement for this will be included in a later version of the IP agent).

    I'd suggest you perform some more tests to find the maximum time needed to bring the resources online (especially if you start all service groups at once on a system). Based on the time needed, you can adjust OnlineTimeout using the commands below:

     

    #hatype -display IP | grep OnlineTimeout

    #haconf -makerw

    #hatype -modify IP OnlineTimeout <new timeout>

    #hatype -display IP | grep OnlineTimeout

    #haconf -dump -makero

     

    Thanks,
    Dan

     

     

  • Hello kongzzz,

     

    You can try changing OnlineWaitLimit to 1 to see if that improves things.

     

    #haconf -makerw

    #hatype -modify IP OnlineWaitLimit 1

    #hatype -display IP | grep OnlineWaitLimit

    #haconf -dump -makero

     

     

    Regards

     

  • Hello Daniel and starflyfly

     

    Thanks for your help.

    OnlineTimeout was previously 300. Following your suggestion, I changed it to 30 and ran my test again, but clean was still called 2 minutes later.

    I have not tried OnlineWaitLimit yet, but I checked the current value, which is 2. That is already a small value, so I guess changing it will not help. Anyway, I will try it later.

    Additional info: I also tried OnlineRetryInterval (set to 4), but that did not help either.

    Is there any other parameter that can decrease the 120 seconds?

  • Hello Paresh

    Thank you very much for your professional explanation. I will try OnlineWaitLimit later.

     

    Thanks and Regards

    kongzzzz

  • The default MonitorInterval is 60 seconds, so the delay before retrying is MonitorInterval multiplied by 2 (the OnlineWaitLimit). Changing OnlineWaitLimit to 1 should therefore reduce the delay from 120 to 60 seconds, and you could reduce it further by, say, changing MonitorInterval to 30. I wouldn't set it any lower, as otherwise the Monitor entry point will run too frequently in normal operation. You should also set MonitorTimeout to the same value as MonitorInterval (or less) so you don't get overlapping monitor entry points.

    Mike
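
The arithmetic in this reply can be sketched as a small shell function. It is a simplification that assumes the wait before clean is roughly MonitorInterval multiplied by OnlineWaitLimit and ignores the time the online, monitor, and clean entry points themselves take; the function name `estimate_retry_delay` is made up for illustration.

```shell
#!/bin/sh
# estimate_retry_delay MONITOR_INTERVAL ONLINE_WAIT_LIMIT
# Approximate seconds the agent waits before calling clean and retrying
# online (entry point run time not included).
estimate_retry_delay() {
    echo $(( $1 * $2 ))
}

echo "MonitorInterval=60 OnlineWaitLimit=2: $(estimate_retry_delay 60 2)s"
echo "MonitorInterval=60 OnlineWaitLimit=1: $(estimate_retry_delay 60 1)s"
echo "MonitorInterval=30 OnlineWaitLimit=1: $(estimate_retry_delay 30 1)s"
echo "MonitorInterval=60 OnlineWaitLimit=0: $(estimate_retry_delay 60 0)s"
```

With OnlineWaitLimit at 0 the estimate drops to zero regardless of MonitorInterval, which is why that setting gives a near-immediate retry without making the monitor run more often.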

  • Thanks to all of you for your professional support. VCS triggers the next online attempt of the IP immediately after I changed OnlineWaitLimit to 0. That works fine.