Deploying Latency-sensitive Applications with Info...

pritam_bankar · ‎09-01-2020

Veritas Volume Replication (VVR) is an advanced data replication solution that provides organizations with a foundation for continuous data replication, enabling rapid and reliable recovery of mission-critical applications at remote recovery sites. For businesses that require a Recovery Point Objective of zero, VVR delivers synchronous replication with additional features that help ensure maximum application availability.

Recovery Point Objective (RPO) refers to the maximum acceptable amount of data loss an application can undergo before causing measurable harm to the business. Synchronous replication comes with a performance cost for applications that are running on servers. This cost depends on network latency between two sites in a disaster recovery configuration.
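To make the latency cost concrete, here is a rough back-of-the-envelope sketch in Python. It is not a model of VVR internals. It only relies on the fact that a synchronous write cannot be acknowledged before the remote site responds, so the network round trip sets a floor on write latency. The function names and the sample numbers are illustrative assumptions.

```python
def min_sync_write_latency_ms(local_write_ms: float, round_trip_ms: float) -> float:
    """Lower bound on the latency of one synchronous write.

    Assumption: the write is acknowledged only after both the local write and
    the remote acknowledgment have completed, so it can never be faster than
    the slower of the two.
    """
    return max(local_write_ms, round_trip_ms)


def max_serial_writes_per_second(latency_ms: float) -> float:
    """Upper bound on writes per second for one thread issuing writes one at a time."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical numbers: 0.3 ms local write, three different round-trip times.
    for rtt_ms in (0.2, 1.0, 5.0):
        floor_ms = min_sync_write_latency_ms(local_write_ms=0.3, round_trip_ms=rtt_ms)
        ceiling = max_serial_writes_per_second(floor_ms)
        print(f"RTT {rtt_ms} ms: at least {floor_ms} ms per write, "
              f"at most {ceiling:.0f} serial writes/s")
```

Real workloads issue many writes in parallel, so actual throughput differs, but the per-write floor still grows with the round-trip time between the sites.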

What happens if an application is sensitive to latency? As latencies increase, the application could crash or may not run with the expected outcome. Network congestion or misconfiguration can increase network latencies significantly, thus diminishing synchronous IO performance.

Adaptive Sync

The Veritas Volume Replication (VVR) feature in InfoScale can protect applications from increasing latency in synchronous replication. Its Adaptive Sync feature is designed to handle exactly this problem.

Adaptive Sync enables the configuration of time-outs for IO so that if any particular IO duration exceeds the time-out, an acknowledgment that the write operation has completed is returned immediately without worrying about its completion at the remote site. In other words, VVR replication will automatically switch to asynchronous mode.
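The time-out behavior described above can be sketched as a small decision function. This is an illustration of the idea only, not VVR code: the names are hypothetical, and the use of microseconds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    acked_after_us: int      # when the application received its acknowledgment
    remote_confirmed: bool   # whether the remote site had confirmed by then


def acknowledge_write(remote_ack_us: int, iotimeout_us: int) -> WriteResult:
    """Decide when a write is acknowledged to the application.

    remote_ack_us: time the remote site would take to confirm the write
                   (hypothetical input for this sketch).
    iotimeout_us:  configured IO time-out (unit assumed to be microseconds).
    """
    if remote_ack_us <= iotimeout_us:
        # Remote confirmation arrived in time: the write stays synchronous.
        return WriteResult(acked_after_us=remote_ack_us, remote_confirmed=True)
    # Time-out exceeded: acknowledge now and let the remote copy catch up later.
    return WriteResult(acked_after_us=iotimeout_us, remote_confirmed=False)


if __name__ == "__main__":
    print(acknowledge_write(remote_ack_us=800, iotimeout_us=1500))
    print(acknowledge_write(remote_ack_us=4000, iotimeout_us=1500))
```

In the second call the application waits no longer than the time-out, at the price of a write that is not yet confirmed at the remote site.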

Additionally, there is a configurable option to trigger the switch to asynchronous mode. When network conditions recover, replication is automatically switched back to synchronous mode.

This Adaptive Sync functionality enables applications to withstand higher IO latencies while preserving the ability to maintain remote copies of the data with low RPOs.

A word of caution: During Asynchronous Replication, RPO doesn’t remain at zero. Adaptive Sync will attempt to achieve an RPO of zero for as long as possible, but to protect application latencies, it may result in a non-zero RPO. This tradeoff must be considered per business requirements.

Data Flow

Figure 1 shows the flow of IO in a traditional VVR setup.

Figure 1 IO Flow in Veritas Volume Replication without Adaptive Sync

NOTE:

Numbers repeating in circles denote they are parallel activities.
Network Ack – Used for application IO acknowledgments
Data Ack – Used for data recovery in case of site failures

Figure 2 shows the flow of timed-out IO in adaptive sync replication.

Figure 2 IO Flow in Veritas Volume Replication with Adaptive Sync

The IO time-out is measured by a background process that monitors the total elapsed time of IO operations. When network latency returns to normal, the same process detects the change and switches replication back to synchronous mode. A few parameters are available to tailor the configuration of this feature:

iotimeout: The time-out for write operations. Choose it based on application SLAs. Write operations start timing out if this value is exceeded.
interval: The sampling period over which the threshold percentage is calculated.
threshold: A percentage that determines when replication switches between asynchronous and synchronous mode.
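As a rough illustration of how these three parameters might interact, the Python sketch below picks a replication mode for one sampling interval. The exact semantics are defined in the product documentation. Here it is assumed that the threshold is compared against the percentage of writes in the interval that exceeded iotimeout, and that all times are in microseconds. The function names are hypothetical.

```python
from typing import Iterable


def timed_out_percentage(durations_us: Iterable[int], iotimeout_us: int) -> float:
    """Percentage of writes in one sampling interval that exceeded iotimeout."""
    durations = list(durations_us)
    if not durations:
        return 0.0  # no IO observed in this interval
    timed_out = sum(1 for d in durations if d > iotimeout_us)
    return 100.0 * timed_out / len(durations)


def choose_mode(durations_us: Iterable[int], iotimeout_us: int,
                threshold_pct: float) -> str:
    """Pick the replication mode for the next interval (assumed semantics)."""
    pct = timed_out_percentage(durations_us, iotimeout_us)
    # Assumption: at or above the threshold, replication runs asynchronously;
    # below it, replication returns to (or stays in) synchronous mode.
    return "async" if pct >= threshold_pct else "sync"


if __name__ == "__main__":
    congested = [500, 600, 2000, 3000]  # hypothetical samples from one interval
    healthy = [500, 600, 700, 3000]
    print(choose_mode(congested, iotimeout_us=1500, threshold_pct=50))
    print(choose_mode(healthy, iotimeout_us=1500, threshold_pct=50))
```

Evaluating over an interval rather than per write keeps a single slow IO from flipping the replication mode back and forth.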

This feature is currently supported with InfoScale on Linux platforms.

More information on different parameters can be found at https://www.veritas.com/content/support/en_US/doc/79604030-141543652-1.

Performance Testing

Veritas has qualified the TIBCO EMS application with the Adaptive Sync feature. The following shows the storage IO performance observed with Adaptive Sync enabled.

Setup details

2 nodes at each site, clustered using FSS + VVR replication
10-gigabit private interconnect between the two sites
OS – RHEL 7.5
A fixed TIBCO EMS load is used, which gives around 200 MB/s throughput without replication.
iotimeout is set to 1500 (1.5 ms, which implies the value is given in microseconds)

Results

Figure 3 IO Performance vs Time without Adaptive Sync

The graph above shows performance with synchronous replication. Latency is injected into the network to show the impact on IO performance.

Figure 4 IO Performance vs Time with Adaptive Sync

This graph shows how Adaptive Sync helps latency-sensitive applications. After a higher latency is injected, writes start timing out and the replication service switches to asynchronous mode. The application is thus protected from the higher latency and continues to function normally. The slight drop in storage IO is due to a readback operation on the volume that logs the IO.

To learn more about how InfoScale can improve resiliency and maximize availability with TIBCO Enterprise Message Service (EMS), please see the following whitepaper: https://www.veritas.com/content/dam/Veritas/docs/white-papers/V1023-infoscale-for-tibco-enterprise-m...

For more information on InfoScale, please visit the Technical Library.
