Order Statistics For the Working Engineer

Order Statistics For the Working Engineer


Imagine you’re writing a program that does one of the following:

Dubious claim. Let me give an example. It’s almost certain your system measures percentiles of some primitive operation (e.g.Β πš πš›πš’πšπšŽ_𝚝𝚘_πš—πš˜πšπšŽ\texttt{write_to_node}) and it’s possible your system measures percentiles of some aggregate operation (e.g.Β πššπšžπš˜πš›πšžπš–_πš πš›πš’πšπšŽ_πš—πš˜πšπšŽπšœ\texttt{quorum_write_nodes}). This post is about napkin-math’ing timing of aggregate operations using only metrics about the primitive.

These operations depend on the timing of the first, third, and last of nn calls respectively. Estimating the timings of operations like these can be challenging when observability systems are built only to measure the performance of single operations. However, if we can assume that response times from $πš‚π™Ύπ™Όπ™΄_πš‚π™΄πšπš…π™Έπ™²π™΄\texttt{SOME_SERVICE} are independent, we can compose an estimate of the timing of the kthk^{th} of nn calls with order statistics.

The order statistics of a random sample are the sample values placed in ascending order. We denote the minimum value from this sample as X(1)X_{\Big(1)}, the second smallest as X(2)X_{\Big(2)}, and so forth (i.e.Β X(1)≀X(2)...≀X(n)X_{\Big(1)} \leq X_{\Big(2)} ... \leq X_{\Big(n)}). Given the CDF (FXF_X) for a single call, the CDF of the kthk^{th} order statistic of nn calls is given as:

FX(k)(x)=βˆ‘j=kn(nj)[FX(x)]j[1βˆ’FX(x)]nβˆ’j\begin{equation} F_X_\left(k\right)\Big(x) = \sum_{j=k}^{n} \binom{n}{j} \Big[F_X\big(x)]^j \Big[1 - F_X\big(x)]^{n-j} \end{equation}

In some cases we can further simplify an estimate. For example, we can show that the π‘šπ‘–π‘›π‘–π‘šπ‘’π‘š\textit{minimum} of nn exponential random variables with parameter Ξ»\lambda is exponentially distributed with parameter nΞ»n\lambda.

FX(k)(x)=1βˆ’[1βˆ’(1βˆ’eβˆ’Ξ»x)]n=1βˆ’eβˆ’nΞ»x\begin{equation} F_X_\left(k\right) \big(x) = 1 - \Big[1 - \big(1 - e^{-\lambda x})]^{n} = 1 - e^{-n\lambda x} \end{equation}

In each term of the CDF we calculate the probability that jj values are smaller and nβˆ’jn-j values are larger than xx. When using k=nk = n or k=1k = 1, we have easier to calculate special cases:

FX(n)(x)=Pr(X(n)≀x)=∏k=1nPr(Xk≀x)=FX(x)nFX(1)(x)=Pr(X(1)≀x)=1βˆ’βˆk=1n[1βˆ’Pr(Xk≀x)]=1βˆ’[1βˆ’Fx(x)]n\begin{equation} \begin{aligned} F_X_{\Big(n)}\Big(x) & = & Pr\Big(X_{\Big(n)} \leq x) & = & \prod_{k=1}^n Pr\Big(X_k \leq x) & = & {F_{X}\Big(x)}^n \\ F_X_{\Big(1)}\Big(x) & = & Pr\Big(X_{\Big(1)} \leq x) & = & 1 - \prod_{k=1}^n \Big[1 - Pr\Big(X_k \leq x)] & = & 1 - \Big[1 - F_x\Big(x)]^{n} \end{aligned} \end{equation}

Reframe our Base CDF. Replace FX:ℝ→(0,1)F_X\colon \mathbb{R} \to \Big(0, 1) with FX:(0,1)β†’(0,1)F_X\colon \Big(0, 1) \to \Big(0, 1) s.t. FX(x)=x,x∈(0,1)F_X\Big(x) = x, \ x \in \Big(0, 1).

However, we don’t often have closed-form distributions for responses from $πš‚π™Ύπ™Όπ™΄_πš‚π™΄πšπš…π™Έπ™²π™΄\texttt{SOME_SERVICE}. Luckily we don’t need them to provide good estimates. Rather than working with a base CDF (FXF_X) that maps response times (in ms) to a percentile (on the unit interval), imagine that our base CDF maps a percentile to itself. Unlike distributions, percentiles of response times are readily available in our observability system.

We can also take the inverse of FX(k)F_X_{\Big(k)} to make more intuitive statements, e.g.Β β€œpXX of the aggregate is pYY of the base”, but we don’t need to belabor this point.

Under this reframing, FX(k)(pb)=paF_X_\left(k\right)\Big(p_b) = p_a indicates that the pbthp_b^{th} percentile of the singular call maps to the pathp_a^{th} percentile of the aggregate distribution. Consider an example using the 1st1^{st} response from 7 nodes. The p95 of our aggregate call roughly corresponds to the p35 of the base distribution.

This β€œtrick” is a general property and can be used without involving any parametric distribution of response times. If we wanted to be (slightly) more formal we could take FXβˆ’1(pa)F^{-1}_X\Big(p_a) to get the p95 π‘‘π‘–π‘šπ‘’\textit{time} of this aggregate call, but in most cases a quick glance at the metrics of the base distribution will suffice.