August 2023
Familiarity with single-variable calculus, probability, and discrete math may be helpful for sections 2, 3, and 4. Familiarity with distributed systems may be helpful for section 5. These are suggestions, not prerequisites.
We say a system is distributed if it runs on multiple servers that communicate by sending messages over a network. Although distributed systems can be challenging to program correctly, they can offer better performance and resiliency than single-node systems. In the background, many distributed systems use gossip algorithms to coordinate sharing information quickly over large networks.
In a gossip algorithm, one node broadcasts an update to all other nodes through multiple rounds of point-to-point messaging. A node spreads an update by sending a message to (or “gossiping with”) a randomly selected neighbor once per round. In subsequent rounds, the selected node is “activated” and begins gossiping with its neighbors. We allow this process to continue until all nodes have received a message. For the remainder of our discussion, I’ll ignore many of the challenges associated with building real distributed systems and represent them as graphs.
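The process just described can be sketched as a short simulation on a complete graph. This is a minimal sketch; the function name, the seeding, and the choice of node 0 as the origin are my own conventions:

```python
import random

def push_gossip_rounds(n, seed=None):
    """Simulate push gossip on an n-node complete graph.

    Returns the number of rounds until every node holds the update.
    """
    rng = random.Random(seed)
    updated = {0}          # node 0 receives the initial update
    rounds = 0
    while len(updated) < n:
        rounds += 1
        newly = set()
        # Every updated node gossips with one uniformly chosen peer.
        for node in updated:
            peer = rng.randrange(n - 1)
            if peer >= node:   # shift to skip gossiping with itself
                peer += 1
            newly.add(peer)
        updated |= newly
    return rounds

print(push_gossip_rounds(64, seed=42))
```

Running this a few thousand times for a fixed n gives an empirical picture of the completion-time distribution we analyze below.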
If you’re familiar with distributed systems, you’ve likely heard of (or even used!) applications that run a variation of gossip. Notably, the distributed databases Riak, Apache Cassandra, and DynamoDB all use gossip algorithms for cluster membership and node failure detection.
Consider a distributed database with nodes storing an identical set S. At any moment, any node may receive an update that modifies its copy of S. Following the update, the node stores a different set S′. Until this change is propagated to all nodes, those storing S and those storing S′ will return different results for identical operations.
To understand the scope of this problem, we must understand the performance of the gossip algorithm. It’s clear that the number of rounds (T) required for gossip to complete is random, but what can we say about its distribution? In the following sections, we’ll calculate bounds on T by studying undirected complete graphs. By the end of this discussion, we’ll have a general answer to the following question:
Given an n node system, what’s the probability gossip fails to complete within x rounds?
Definition: A real number b is an upper bound of a set A if x ≤ b for all x ∈ A. For our purposes, the least upper bound is the smallest demonstrable upper bound. A similar definition applies for the lower bound. For a given system, let A represent the set of all values T may take. We’ll focus on finding the least upper bound (sup A) and greatest lower bound (inf A) for this set.
As you’ll soon see, T is unbounded from above, so a least upper bound doesn’t exist. Instead, we’ll aim to find an alternative to the upper bound: some value that we can prove is larger than a significant portion of A.
We’ll begin with a brief look at arbitrary graphs. Any results from this general case should also apply to our analysis on complete graphs.
The diameter of a graph is the maximum shortest-path distance between any pair of nodes. For a graph with a set of vertices V:

diam(G) = max { d(u, v) : u, v ∈ V }
Where d(u, v) is a distance function defined on the graph. We’ll assume graphs are unweighted, so d(u, v) is simply the count of edges on the shortest path between u and v.
On an arbitrary graph with an update initialized at node v₀, we can see that T ≥ max { d(v₀, u) : u ∈ V }. In the worst case, v₀ is in the pair of nodes that define the diameter. In that event, we have T ≥ diam(G).
This inequality simply asserts that an update cannot travel further from its origin node than the number of rounds that have elapsed. If v₀ is five nodes from its furthest peer, we can be certain the algorithm does not complete in four rounds. This is a fine observation, but we can construct examples where diam(G) is well below the minimum of T. One such example is a wheel graph.
From looking at an n-node wheel graph, we can see T ≥ n − 1 regardless of the location of v₀. There are just two cases to consider:
The central node is the origin node. This node must send at least n − 1 messages to update the n − 1 outer nodes.
An outer node is the origin node. This node sends one message to the central node, and the central node sends at least n − 2 messages to update the remaining nodes.
Question: A wheel graph is among the worst possible configurations for gossip. Do you see why this is the case?
Assume an n-node wheel graph with the central node as the update origin. With k nodes updated, the probability the central node gossips to a non-updated node is (n − k)/(n − 1). If you’ve taken a course on probability, you may recognize this as the coupon collector’s problem. The expected number of rounds to complete the gossip process follows from the solution to this well-studied problem:

E[T] = (n − 1) · (1/1 + 1/2 + ⋯ + 1/(n − 1)) ≈ n ln n
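Assuming the center must collect all n − 1 outer nodes as coupons, the coupon-collector expectation (n − 1)·H_{n−1} can be checked numerically. The helper name is mine:

```python
def expected_rounds_star(n):
    """Expected rounds for the central node to update all n - 1 outer
    nodes, one uniformly random message per round (coupon collector)."""
    m = n - 1
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    return m * harmonic

# Grows like n * ln(n): ~300 rounds for a 64-node wheel.
print(expected_rounds_star(64))
```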
The expectation of T doesn’t give any insight into the worst-case performance of the algorithm. We can approximate the right tail of the distribution by studying a single node’s state after r rounds.
A particular outer node remains non-updated after r rounds with probability (1 − 1/(n − 1))^r. Finally, we union bound over the n − 1 outer nodes and find the probability that T > r is at most (n − 1)(1 − 1/(n − 1))^r. For large n, this bound nicely relates the completion probability and the number of rounds beyond the expectation.
Using specific information about a graph allows us to produce a tighter bound than we could otherwise. In the next section we’ll apply this thinking to produce a lower bound on complete graphs.
Let’s turn our focus to complete graphs. As with wheel graphs, the diameter is not a tight lower bound for T. Let’s try running a simulation to see if it reveals any interesting patterns.
At right, I show the results from 10,000 iterations of the gossip process in a 64 node system. The first plot shows the updated nodes over time, the second shows the distribution of completion times, and the third shows the round-over-round growth rate.
If we had a reliable way to describe this probability distribution as a function of n, our job would be done. Unfortunately, we’re not there yet.
Let’s define I(r) as the number of updated nodes after r rounds. Notice that near the beginning of the process I(r) nearly doubles in each round. Because we send just one message per node-round, doubling is the optimal outcome for any round. As I(r) increases, message efficiency decreases and we expect less round-over-round growth.
Though incredibly unlikely, it’s possible we double in each round through the entire gossip process. If this were to happen, the process would complete in ⌈log₂ n⌉ rounds. This gives us our greatest lower bound on T: gossip will never complete before ⌈log₂ n⌉ rounds! The graph at right simulates a system with eight nodes completing gossip in three rounds.
Consider that it requires at least n − 1 messages to update n nodes. The function f(r) = 2^r − 1 calculates the maximum messages sent through r rounds in an n-node system, and ⌈log₂ n⌉ is the smallest value of r that satisfies f(r) ≥ n − 1 for r ∈ ℕ.
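A sketch of this calculation (the helper name is mine): find the smallest round count under which perfect doubling could reach every node.

```python
def min_rounds(n):
    """Smallest r such that doubling each round, 2**r >= n.

    Equivalent to the smallest r with f(r) = 2**r - 1 >= n - 1.
    """
    r = 0
    while 2 ** r < n:
        r += 1
    return r

print(min_rounds(8))     # a perfect run updates 8 nodes in 3 rounds
print(min_rounds(1000))
```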
With a randomized algorithm, the probability that gossip completes in exactly ⌈log₂ n⌉ rounds is vanishingly small when n is large. For example, an n-node system can complete in log₂ n steps only if n/2 nodes are already updated through the first log₂ n − 1 rounds and the final round is executed perfectly. Conditioning on the former, the probability of the latter event is (n/2)!/(n − 1)^{n/2}. Even at small values of n this event is very improbable (e.g. n = 8 gives a roughly 1% success probability).
We could design a deterministic algorithm that always finishes in ⌈log₂ n⌉ rounds (ignoring unreliable networks, failed nodes, etc.). This choice is probably not ideal in most distributed systems as it trades fault tolerance for a relatively minor performance improvement.
Earlier I claimed we’d aim to find an upper bound on T. This was misleading; no such bound exists. For systems with n ≥ 3, we can show that T is unbounded from above.
Consider a three-node system with two updated nodes. For gossip to complete in the next round, either of the updated nodes must message the third. Because this is a randomized algorithm, there is a small probability the first two nodes gossip between themselves indefinitely.
The probability mass function (pmf) of the geometric distribution is given by:

P(X = k) = (1 − p)^{k − 1} p,  k = 1, 2, 3, …
The probability a round passes without a message sent to the third node is (1/2)(1/2) = 1/4. Because peers are chosen independently each round, the distribution of the number of rounds until the third node is updated is geometric with parameter p = 3/4. Although the pmf of this waiting time decreases quite quickly, the distribution is memoryless and assigns non-zero probability to all k ∈ {1, 2, 3, …}.
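A tiny numeric illustration of that tail (the function name is mine): the chance the third node is still waiting shrinks geometrically but never hits zero.

```python
# In a 3-node system with 2 updated nodes, each updated node picks one
# of its 2 peers uniformly; both avoid the third with prob (1/2)(1/2).
p = 3.0 / 4.0            # per-round chance the third node gets updated

def tail(k):
    """P(third node still not updated after k rounds)."""
    return (1 - p) ** k

print(tail(5))           # (1/4)**5: small, but strictly positive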
Do we stop our analysis here and say we’re limited to “On an undirected complete graph, ⌈log₂ n⌉ ≤ T < ∞ for all n ≥ 3”?
No, surely we can do better. In the following sections we’ll provide an alternative to the upper bound by estimating how often exceeds any given value.
So far, we’ve found a lower bound and shown the non-existence of an upper bound. Before finding an alternative to the upper bound, we’ll find the expectation of T. In some literature, gossip algorithms are referred to as “epidemic algorithms” and analyzed using models from epidemiology. We’ll use a similar model to find the expected number of updated nodes as a function of rounds, I(r).
Let’s model the growth of I(r) as a function of the updated nodes, the non-updated nodes, and the “contact rate” β between any pair of an updated and a non-updated node:

dI/dr = β · I(n − I)

On a complete graph there are I(n − I) edges between these pairs. Instead of a “contact” rate, we’ll treat β as the update probability per edge-round.
To match the literature, we’ll use β = 1/n as the contact rate, but we should have some reservations about this choice. Think about this choice: does it imply a faster or slower gossip process than what we described previously?
A constant β implies any updated node will update any non-updated node with fixed probability (e.g. 1/n). This is an overestimate of the contact rate and implies a faster completion time. This value ignores collisions and the one-message-per-node-round cap we’ve imposed. It’s an OK approximation for the moment, but better estimates are available.
If we solve for I(r) and set I(0) = 1, we get an equation that allows us to draw some nice conclusions about the average performance of the algorithm:

I(r) = n / (1 + (n − 1)e^{−r})
When n is large and r ≥ 2 ln n, I(r) ≈ n. With this, we can say that after 2 ln n rounds almost all nodes have received the update: I(2 ln n) ≈ n − 1.
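We can quickly check the logistic solution numerically. This sketch assumes the β = 1/n model, under which I(r) = n/(1 + (n − 1)e^{−r}):

```python
import math

def I(r, n):
    """Logistic solution of dI/dr = (1/n) * I * (n - I) with I(0) = 1."""
    return n / (1 + (n - 1) * math.exp(-r))

n = 64
r = 2 * math.log(n)      # ~8.32 rounds
print(I(r, n))           # ~63: nearly all 64 nodes are updated
```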
For a moment, we’ll ignore the fact that this model underestimates the completion time for any given n. We can estimate the total messages sent by integrating I(r), since each updated node sends one message per round.
For a 128-node system this gives us an expectation of roughly n ln n ≈ 620 messages. Does this seem lightweight to you?
The choice of β is quite important. We can set β to be dependent on the current state of the system. By considering message collisions, we’ll produce a more conservative estimate. From the perspective of a single non-updated node, the probability of being updated by one of I updated nodes is:

P = 1 − (1 − 1/(n − 1))^I
Where P represents the probability of being updated by at least one node. We divide that value by I to give the average update rate per edge. Using this value as β gives:

dI/dr = (n − I)(1 − (1 − 1/(n − 1))^I)
We can solve this equation as well, but the solution is not quite as nice. If we set n = 64, how much longer do we expect the gossip process to take? Our prior analysis gives 8.32 rounds but the more precise analysis gives 11.8 rounds. From the simulations run earlier, 11.8 seems like a much more reasonable estimate than 8.32. How did this happen?
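Rather than solving the collision-aware equation in closed form, we can iterate it as a discrete recurrence. This is a sketch (my discretization), so the round count is an integer approximation of the continuous result:

```python
def rounds_with_collisions(n):
    """Iterate I <- I + (n - I) * (1 - (1 - 1/(n-1))**I), the
    collision-aware update, until n - 1 of the n nodes are updated."""
    I, r = 1.0, 0
    while I < n - 1:
        I += (n - I) * (1 - (1 - 1 / (n - 1)) ** I)
        r += 1
    return r

print(rounds_with_collisions(64))   # noticeably more than 2*ln(64) ~ 8.3
```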
As it turns out, we have not stumbled into a decades-long error in the theoretical computer science community. These numbers are quite different, but we can show these two functions behave well asymptotically. To do so, we’ll solve both equations for the completion time as functions of n and compare the ratio as n → ∞.
Using the original model that ignores collisions and the message cap, we get r(n) ≈ 2 ln n, as expected.
To solve the equation with collisions and a message cap, we’ll treat n and n − 1 as equivalent, replace (1 − 1/n)^I with e^{−I/n}, and solve the following equation:

dI/dr = (n − I)(1 − e^{−I/n})
When n is large, the ratio of these two solutions is 1.5. The key result from this diversion is that a result obtained using β = 1/n is of the same order (technically Θ(ln n), see: Landau notation) as the result obtained from a more rigorous analysis. Let’s revise our statement from earlier: E[T] = Θ(ln n).
We’re still in search of a way to describe the right tail of the distribution of T. In this section we’ll calculate the distributions of the time the system spends with exactly k nodes updated. This approach will confirm our previous result about the expected completion time and provide the variance for the entire process.
As in the previous section, we’ll assume each of the I(n − I) edges between updated and non-updated nodes sends a message with probability 1/n per round. Because peer selection is randomized, we may treat the time to send a message on any edge as exponentially distributed with a mean of n rounds (i.e. rate 1/n).
Regardless of the number of rounds elapsed, the waiting time until I is incremented is distributed as the minimum of I(n − I) independent exponential variables with parameter 1/n.
Using order statistics we can show that the minimum value from a sample of m exponential random variables with parameter λ is exponentially distributed with parameter mλ. In our case, the time until another node is updated is distributed as Exp(I(n − I)/n).
The probability density function (pdf) and cumulative distribution function (cdf) of the exponential distribution are given by:

f(x) = λe^{−λx},  F(x) = 1 − e^{−λx},  x ≥ 0
An exponential distribution with parameter λ has mean 1/λ and variance 1/λ².
The k-th order statistic is the k-th smallest value from a sample of m random variables. If we know the distribution used to generate a sample, we can determine the distribution of any order statistic of that sample. Generally, the cdf of the k-th order statistic is given as:

F₍k₎(x) = Σ_{j=k}^{m} (m choose j) F(x)^j (1 − F(x))^{m−j}
In each term we calculate the probability that j events are smaller and m − j events are larger than x. When k = 1 we have a special case that allows us to simplify this sum to:

F₍₁₎(x) = 1 − (1 − F(x))^m
And when the sample is exponential, we get:

F₍₁₎(x) = 1 − e^{−mλx}
This result gives us the cumulative distribution of each waiting time between updates.
We’ve completed the challenging work of formulating the problem; now we must compute the expectation and variance on each interval. Let’s define T_k as the first time the system has at least k nodes updated. We’ll then confirm that E[T_n] matches our earlier result. Summing the expected waiting times:

E[T_n] = Σ_{k=1}^{n−1} n/(k(n − k)) = Σ_{k=1}^{n−1} (1/k + 1/(n − k)) = 2H_{n−1} ≈ 2 ln n

Our new framing of the problem preserves our expectation of Θ(ln n).
Excellent, just as expected! Because the waiting times are independent variables, we can sum the variances of all waits to get Var(T):

Var(T) = Σ_{k=1}^{n−1} n²/(k²(n − k)²) ≈ 2 · (π²/6)
The last approximation uses the fact that the sum of inverse squares (see: The Basel Problem) converges to π²/6. When n is sufficiently large we can approximate Var(T) as π²/3 ≈ 3.29. This is a very interesting sum by itself, but it also gives us a very useful result. It suggests variance is bounded!
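We can verify the bounded-variance claim numerically; `gossip_variance` is my name for the summed per-interval variances:

```python
import math

def gossip_variance(n):
    """Var(T) = sum over k of n**2 / (k * (n - k))**2, the summed
    variances of the n - 1 exponential waiting times."""
    return sum(n ** 2 / (k * (n - k)) ** 2 for k in range(1, n))

print(gossip_variance(10_000))   # approaches pi**2 / 3
print(math.pi ** 2 / 3)
```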
Comment: The variance of T is almost entirely comprised of the variance waiting on the first and last few nodes. Consider an n-node system with one updated node. In the discrete case, we’d expect the waiting time to update a second node to be tightly concentrated around one round.
However, the distribution for the first wait is Exp((n − 1)/n). This distribution provides a mean that matches our intuition, but also produces a variance of n²/(n − 1)² ≈ 1. Does this seem reasonable? Counter-intuitively, adding the first additional node is responsible for almost a third of the reported variance.
In the last section we showed the collision-aware β is a more reasonable estimate than the constant 1/n. Can we show the variance is still bounded when we use it?
We can! So what are the implications of having bounded variance? We confirmed that E[T] increases slowly with n and showed T stays near the expectation even for very large n. If we use gossip to disseminate information in massive systems, we don’t need to worry about worst-case performance becoming orders of magnitude worse than the expectation.
We should be a bit careful about this estimate though. We can show that the second-degree Taylor series underestimates the variance when I is near n. It’s possible this would improve with a better approximation, but we’ll proceed for now.
We’re still missing a tidy answer to the question posed in the first section. If we had the distribution of T, we could find the cumulative distribution function and we’d have a neat final answer. Unfortunately, we’ll have to rely on one more approximation before moving on.
Having a probability distribution for gossip time would be a very satisfying result. The sum of exponentials with the same parameter is called the Erlang distribution; it’s a specific case of the gamma distribution. Is there an argument we can make that allows us to just use the gamma?
Chebyshev’s Inequality bounds the probability a random variable deviates from its mean by more than a given amount. It’s an incredibly general result and requires only the variance of the distribution.
By definition, Chebyshev’s inequality provides another upper bound on the probability that gossip takes more than x rounds to finish. Consequently, given an n-node system, the probability that gossip fails to complete within x rounds is no more than:

Var(T)/(x − E[T])² ≈ (π²/3)/(x − 2 ln n)²
For today, this is a satisfactory answer. I’d encourage you to try a few values. What is an upper bound on the 99th percentile of gossip time on a 1M node distributed system?
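To try the exercise numerically, we can rearrange Chebyshev’s inequality for a given percentile. This sketch uses E[T] ≈ 2 ln n and Var(T) ≈ π²/3 from the analysis above; the function name is mine:

```python
import math

def chebyshev_percentile_bound(n, q=0.99):
    """Upper bound on the q-th percentile of gossip completion time:
    P(T >= E[T] + k) <= Var(T) / k**2, with E[T] ~ 2 ln n and
    Var(T) ~ pi**2 / 3."""
    mean = 2 * math.log(n)
    var = math.pi ** 2 / 3
    k = math.sqrt(var / (1 - q))
    return mean + k

print(chebyshev_percentile_bound(1_000_000))   # ~46 rounds
```

For a million-node system the 99th percentile is bounded near 46 rounds, only moderately above the ~27.6-round expectation.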
This is an OK answer. We set out to understand the risk of large divergences when n is very large, but what if we needed a distribution? I suggest two options:
The maximum-entropy distribution given a minimum (⌈log₂ n⌉) and a mean (2 ln n) follows the Pareto distribution. Can you solve for the shape parameter of the Pareto to find a closed-form distribution which will overestimate the weight on the tails?
If we want a more realistic (rather than more conservative) estimate, we might use the method of moments to fit the mean and variance of this distribution to an Erlang distribution. The Erlang distribution is given by the sum of exponentials with a fixed parameter (λ). Although we have mixed parameters, can some approximation using a single k and λ get close?
Finally, I’d like to mention a few variations of the idealized gossip algorithm that can make it more suitable for real-world distributed systems. This section does not build on any previous results, but instead answers a few questions about making gossip more robust.
We haven’t described a mechanism for terminating gossip. In distributed systems we communicate information about remote state by passing messages, so are we meant to send another round of messages to signal that gossip has completed?
Not necessarily! From the previous section, we expect that gossip completes in roughly 2 ln n rounds. If we set c ln n as a cap on the number of rounds gossiped per node, for some c > 2, with high probability all nodes will be updated before the algorithm stops. We can set c even higher to achieve a given success probability. However, there are less message-intensive solutions. Demers demonstrates we can achieve the same completeness guarantees if we deactivate nodes with probability 1/k after each message they send to an already updated node.
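A sketch of this feedback-based deactivation follows. The 1/k deactivation probability is my parameterization of Demers-style rumor mongering, and the function and variable names are mine:

```python
import random

def rumor_mongering(n, k=4, seed=None):
    """Push gossip where a node deactivates with probability 1/k after
    messaging an already-updated peer. Returns (rounds, fraction of
    nodes updated when gossip finished or died out)."""
    rng = random.Random(seed)
    updated = {0}
    active = {0}
    rounds = 0
    while active and len(updated) < n:
        rounds += 1
        deactivate, newly = set(), set()
        for node in active:
            peer = rng.randrange(n - 1)
            if peer >= node:   # shift to skip gossiping with itself
                peer += 1
            if peer in updated:
                # Peer already knew: lose interest with probability 1/k.
                if rng.random() < 1.0 / k:
                    deactivate.add(node)
            else:
                newly.add(peer)
        updated |= newly
        active = (active - deactivate) | newly
    return rounds, len(updated) / n

print(rumor_mongering(64, k=4, seed=7))
```

Sweeping k trades message overhead against the chance the rumor dies out before reaching every node.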
The graph at right illustrates what happens when the deactivation probability is set too high. When set properly the failure percentage is near zero. In this example, the algorithm will fail in ~20% of attempts.
Earlier we assumed β = 1/n and then revised this value to be a bit more realistic. What about redoing those calculations with a larger β? This suggests faster gossip time at the cost of some additional message overhead. Can we simply select multiple peers per round? What about broadcasting a single update to all nodes in the network in a single round?
Gossiping multiple point-to-point messages per round can be an effective way to improve the speed of the gossip algorithm. However, broadcasting an update to all nodes from a single site (“single-round broadcast”) is normally not the best choice in large distributed systems.
We’ve mostly focused on idealized models of distributed systems, but in practice we must account for unreliable networks and incomplete information. Creating a reliable broadcast from point-to-point messages helps us be resilient to both of these failure modes. We construct reliable broadcast from single-round broadcast, but it comes at a cost of significantly more messages transmitted when networks are unreliable.
The graphs above illustrate how reliable broadcast built from point-to-point messaging and single-round broadcast perform when faced with unreliable networks.
Highly correlated traffic is a common occurrence in real systems. Imagine a system that receives twenty updates in quick succession. Should we initialize twenty unique gossip processes or is there a more efficient way to spread this information?
One solution is to use pull-based messaging. With pull-based difference resolution, every node contacts one peer per round, and we mark a node updated if it has contacted an updated node. When there are many active updates, it’s likely that a node contacts a peer with at least one new update.
However, if the system receives updates infrequently and far apart, pulling creates additional messaging overhead with relatively little benefit.
The average performance of pull-based messaging is a bit faster than the standard (push) algorithm. We can approximate the probability p_r that a node has not been updated after r rounds as a recurrence relationship for both push and pull-based gossip:

push: p_{r+1} = p_r · e^{−(1 − p_r)}    pull: p_{r+1} = p_r²
In the earlier rounds, the performance of the two algorithms is close because p_r is near one. When p_r is smaller, the pull-based algorithm decreases to zero quadratically while the push-based method decreases by a constant factor of 1/e.
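The two recurrences can be compared numerically. A sketch, using p·e^{−(1−p)} for push and p² for pull (my formulation of the standard approximations) from an arbitrary starting fraction p₀ = 0.9:

```python
import math

def tail_fractions(p0=0.9, steps=8):
    """Iterate the not-yet-updated fraction p under push and pull.

    push: p <- p * exp(-(1 - p))  (node misses every push it could get)
    pull: p <- p * p              (both caller and chosen peer unaware)
    """
    push = pull = p0
    hist = []
    for _ in range(steps):
        push = push * math.exp(-(1 - push))
        pull = pull * pull
        hist.append((push, pull))
    return hist

for push, pull in tail_fractions():
    print(f"push={push:.2e}  pull={pull:.2e}")
```

After a handful of steps the pull fraction collapses quadratically while push still decays by roughly a constant factor per round.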
It has been demonstrated that by using push in the beginning and switching to pull towards the end, both the total messages sent and the time to finish gossip can be improved compared to either algorithm in isolation.
This concludes our tour of gossip algorithms. Let’s do a brief review:
Gossip is used to spread information in a distributed system and is resilient to incomplete membership information and unreliable networks.
Gossiping with point-to-point messaging starts relatively slowly and has a minimum completion time of ⌈log₂ n⌉ rounds.
The number of rounds needed to complete the gossip process is unbounded from above.
Gossip scales well as the number of nodes in a system increases: E[T] grows as Θ(ln n) and Var(T) is bounded. We can show that the probability the gossip process exceeds x rounds is no more than (π²/3)/(x − 2 ln n)².
Variations can be made on the idealized gossip algorithm to make it more suitable for use in real distributed systems.
Thanks for reading! I’ve enjoyed studying gossip algorithms the past few weeks because they’re easy to understand and easy to implement, but analyzing them involves pulling together bits of mathematics from many areas of study. If you’re an engineer reading this, hopefully it’s gotten you to think of the analysis of algorithms as a bit more than a set of facts about complexity. If you’re a math enthusiast, hopefully you’ve learned a bit more about distributed computer systems and are able to apply techniques mentioned in this post to your own problems.
Course Notes: Ayalvadi Ganesh, Complex Networks
Course Notes: Indranil Gupta, Distributed Systems
Course Notes: Martin Kleppmann, Concurrent and Distributed Systems
Course Notes: Sun He, Gossip Algorithms
Dahlia Malkhi on Randomized Gossip Methods
Patterns of Distributed Systems: Gossip Dissemination
Martin Kleppmann on Distributed Broadcast