This is the companion web site to the paper “Measuring Latency Variation in the Internet”, which is currently under submission.

Data and scripts

The file below contains the scripts, BigQuery queries and intermediate data files used to produce the graphs in the paper:

Identifying queueing latency in the M-Labs dataset

This is a longer explanation of the algorithm to identify self-induced queueing than the one given in the paper. For context, see the paper. The algorithm is implemented in the file linked above.

Our analysis is based upon the fact that for some flows, we have observed a distinct pattern where the sample RTT increases over the lifetime of the flow until a congestion event [1], then sharply decreases afterwards. An example of this pattern is seen in Figure 1. We believe it is reasonable to assume that when this pattern occurs, the drop in RTT is because the queueing delay induced by the flow dissipates as it slows down. Thus, with this sharp correlation between a congestion event and a subsequent drop in RTT, we can measure the magnitude of the drop and use it as a measure of (the lower bound of) queueing delay.

Figure 1: Example of the self-induced queueing pattern. The red cross marks the congestion event.

We limit the analysis to flows that have exactly one congestion event and spend the larger part of their lifetime limited by the congestion window (rather than by other factors such as the receiver window or sender processing). Additionally, we filter out flows that run for less than 9 seconds, or transfer less than 0.2 MB of data. For the remaining flows, we identify the pattern mentioned above by the following algorithm:

  • Find three values: first_rtt, the first non-zero RTT sample; cong_rtt, the RTT sample at the congestion event; and cong_rtt_next, the first RTT sample after the congestion event that is different from cong_rtt.

  • Compute the differences between first_rtt and cong_rtt and between cong_rtt and cong_rtt_next. If both of these values are above 40 ms [2], return the difference between cong_rtt and cong_rtt_next.
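The two steps above can be sketched in Python. This is a minimal illustration, not the actual implementation from the linked file: the representation of a flow as a list of `(is_congestion_event, rtt_ms)` tuples and the function name `queueing_delay` are assumptions, while `first_rtt`, `cong_rtt`, `cong_rtt_next` and the 40 ms threshold follow the text.

```python
THRESHOLD_MS = 40.0  # empirical threshold from the text

def queueing_delay(samples):
    """Return the estimated queueing delay in ms, or None if the
    self-induced queueing pattern is not detected.

    samples: list of (is_congestion_event, rtt_ms) tuples for one flow
    (an assumed representation, for illustration only)."""
    # first_rtt: the first non-zero RTT sample
    first_rtt = next((rtt for _, rtt in samples if rtt > 0), None)
    if first_rtt is None:
        return None
    # cong_rtt: the RTT sample at the congestion event
    cong_idx = next((i for i, (is_cong, _) in enumerate(samples) if is_cong), None)
    if cong_idx is None:
        return None
    cong_rtt = samples[cong_idx][1]
    # cong_rtt_next: first subsequent RTT sample that differs from cong_rtt
    cong_rtt_next = next(
        (rtt for _, rtt in samples[cong_idx + 1:] if rtt != cong_rtt), None)
    if cong_rtt_next is None:
        return None
    # Require both the rise before and the drop after the congestion
    # event to exceed the 40 ms threshold.
    if cong_rtt - first_rtt > THRESHOLD_MS and cong_rtt - cong_rtt_next > THRESHOLD_MS:
        return cong_rtt - cong_rtt_next
    return None
```

For example, a flow whose RTT rises from 50 ms to 150 ms at the congestion event and then drops to 60 ms would yield an estimated queueing delay of 90 ms.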

This basic algorithm gives good results by itself. However, we found that it could be improved further by the following refinements:

  • When comparing first_rtt and cong_rtt, use the median of cong_rtt and the two previous RTT samples. This weeds out tests where only a single RTT sample (coinciding with the congestion event) is higher than the baseline.

  • When comparing cong_rtt and cong_rtt_next, use the minimum of the five measurements immediately following the first RTT sample that is different from cong_rtt. This makes sure we include cases where the decrease after the congestion event is not instant, but happens over a couple of RTT samples.

  • Compute the maximum span between the largest and smallest RTT sample in a sliding window of 10 data samples over the time period following the point of cong_rtt_next. If this span is higher than the drop in RTT after the congestion event, filter out the flow.
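Under the same illustrative assumptions, the three refinements can be layered on top of the basic check. This sketch takes the flow's RTT samples and the index of the congestion event directly; the exact window boundaries (e.g. whether the "five measurements" include the first differing sample itself) are interpretation choices, marked in the comments.

```python
from statistics import median

THRESHOLD_MS = 40.0  # empirical threshold from the text

def refined_queueing_delay(rtts, cong_idx):
    """Return the estimated queueing delay in ms, or None if the flow
    is rejected. rtts: the flow's RTT samples (ms); cong_idx: index of
    the sample at the congestion event (assumed inputs)."""
    first_rtt = next((r for r in rtts if r > 0), None)
    if first_rtt is None:
        return None
    cong_rtt = rtts[cong_idx]
    # Refinement 1: compare first_rtt against the median of cong_rtt and
    # the two preceding samples, weeding out single-sample spikes.
    cong_med = median(rtts[max(cong_idx - 2, 0):cong_idx + 1])
    if cong_med - first_rtt <= THRESHOLD_MS:
        return None
    # First sample after the event that differs from cong_rtt.
    next_idx = next((i for i in range(cong_idx + 1, len(rtts))
                     if rtts[i] != cong_rtt), None)
    if next_idx is None:
        return None
    # Refinement 2: use the minimum of the five samples immediately
    # following that point, capturing drops spread over a few samples.
    # (Assumption: "following" excludes the differing sample itself.)
    following = rtts[next_idx + 1:next_idx + 6]
    if not following:
        return None
    cong_rtt_next = min(following)
    drop = cong_rtt - cong_rtt_next
    if drop <= THRESHOLD_MS:
        return None
    # Refinement 3: reject the flow if any sliding window of 10 samples
    # after cong_rtt_next spans more than the measured drop.
    tail = rtts[next_idx:]
    for i in range(len(tail) - 9):
        window = tail[i:i + 10]
        if max(window) - min(window) > drop:
            return None
    return drop
```

The sliding-window filter is what rejects flows whose RTT keeps fluctuating by more than the measured drop, since such fluctuation suggests the drop was not caused by the congestion event.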

  1. A congestion event is anything that causes the TCP sender to decrease its sending rate, i.e. it includes both fast retransmits and timeouts.
  2. We found 40 ms empirically to be a suitable threshold.