NBU_13
Level 6
11 years ago

Replication has stopped in cluster server

I need help

I have replication configured between clustered servers on Windows. Today replication stopped and the RVG went into DCM mode.

We are seeing this on one of the clusters: replication stops and the DCM log is activated.

Replication restarts after a forced resync command, but stops again after some time.

 

Thanks

  • The SRL may be filling up, which causes replication to stop and switch to DCM mode. On UNIX this is reported in the system log, so I would guess the same is true on Windows; check the Windows logs.

    If that is the case, you may need to enlarge the SRL if it is small. Alternatively, you may not have enough bandwidth, in which case you need to either increase bandwidth or reduce replication traffic, for example by removing a database's temporary tables from replication.

    Mike
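The SRL-size versus bandwidth trade-off above can be put into rough numbers. The following is a minimal sketch, not a VVR tool: the function name and all input figures are hypothetical, and it assumes constant write and send rates, which real workloads rarely have.

```python
def seconds_until_srl_full(srl_size_gb, write_rate_mb_s, send_rate_mb_s, used_gb=0.0):
    """Estimate how long until the SRL overflows.

    All parameters are illustrative assumptions:
      srl_size_gb     - total SRL size in GB
      write_rate_mb_s - application write rate on the primary, MB/s
      send_rate_mb_s  - rate at which updates reach the secondary, MB/s
      used_gb         - SRL space already in use, GB
    Returns None when the link keeps up with the writes.
    """
    backlog_growth = write_rate_mb_s - send_rate_mb_s  # MB/s added to the SRL
    if backlog_growth <= 0:
        return None  # SRL drains or stays level; it will not overflow
    free_mb = (srl_size_gb - used_gb) * 1024
    return free_mb / backlog_growth


# A 10 GB SRL with 50 MB/s of writes and 10 MB/s reaching the secondary
print(seconds_until_srl_full(10, 50, 10))  # 256.0
```

Under these assumptions, a larger SRL only delays the overflow; the backlog stops growing only when the send rate matches or exceeds the write rate.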

  • Hi Mike,

    Thanks for the reply.

    The RVG is in DCM state and network bandwidth is unlimited. A SQL database runs in the cluster and replicates to the secondary site.

    We tried a forced resync, but it failed.

    Let me check whether the SRL is full; if possible, I will increase its size.

  • Hello,

    I have a different opinion: replication is going into DCM mode because you have network issues. You need to find out why the resync is failing.

    Solve the network issues first. Even if you increase the SRL to whatever size, as long as the network issue persists the SRL will sooner or later fill up and replication will fall back to the DCM log.

    When you force a resync, you force replication to start flushing the DCM log to the secondary, but because there is (most probably) a network issue, it fails. Since you mention that it runs for some time and then fails, I am inclined to believe it is a network issue. Or is there a performance issue on the server, for example heavy load?

    Try running a continuous ping test and check that replication stays connected and no packets are dropped. Does the current status show the RLINKs as connected? Were any recent configuration changes made?

     

    G
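The continuous connectivity test suggested above could be sketched as follows. Note the substitution: this uses repeated TCP connection attempts instead of ICMP ping, because that is portable with the Python standard library alone. The function name is hypothetical, and the host and port must be supplied by the reader; nothing here is specific to VVR.

```python
import socket
import time


def tcp_loss_percent(host, port, attempts=10, timeout=2.0, interval=1.0):
    """Attempt `attempts` TCP connections and return the percentage that failed.

    host and port are placeholders for a reachable service on the secondary
    site; they are assumptions, not values taken from the thread.
    """
    failures = 0
    for i in range(attempts):
        try:
            # Open and immediately close a connection to test reachability
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failures += 1
        if i < attempts - 1:
            time.sleep(interval)  # pause between probes
    return 100.0 * failures / attempts
```

Running this over a long window while replication is active, and comparing the failure times with the moments replication drops, would show whether the two coincide.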

  • Hi Gaurav,

    Thanks for the information.

    Yes, it was working fine until a few days ago, when the development team added a new application database to the SQL volume.

    So we are now testing by stopping the new application to see whether the SRL still fills up and the RVG goes into DCM.

  • OK, that is a fair test to see whether the newly added database caused the issue. It is worth monitoring the replication status to double-check whether replication is actually breaking. If the cause is only the additional writes from the newly added database, replication will not break, but the SRL will eventually fill up and replication will land in DCM. If, however, there is a network issue, I would expect a problem on the replication network itself, with writes backing up in the SRL and ultimately causing DCM to kick in.

     

    G
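The distinction drawn above (link breaking versus SRL filling while the link stays up) can be expressed as a small check over monitoring samples. This is only a sketch of the reasoning: the sample format, the function name, and the labels are all hypothetical, and how the samples are collected is left open.

```python
def diagnose(samples):
    """Classify a series of monitoring samples, given in time order.

    Each sample is a (connected, srl_used_percent) pair; both fields are
    assumed to come from whatever status monitoring is in place.
    """
    # Any disconnected sample points at the replication network
    if any(not connected for connected, _ in samples):
        return "link drops"
    usage = [used for _, used in samples]
    # Link stayed up but the SRL kept filling: writes outpace the link
    if len(usage) >= 2 and usage[-1] > usage[0]:
        return "srl growing while connected"
    return "healthy"


print(diagnose([(True, 10.0), (True, 25.0), (True, 40.0)]))
# srl growing while connected
```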

  • Hi Gaurav,

    Yes. Once we stopped the new application and deleted the large file, SRL usage started to decrease. The issue is with the new application: it loads a large amount of data onto the primary node, and the data is sent to the secondary slowly.

    I have asked the network and application teams to work on this.

    Thanks, Gaurav and Mike.