Spreading Rumors on The Internet
Gossip Algorithms in Distributed Systems

Section 1: Gossip Algorithms in Distributed Systems

We say a system is \textit{distributed} if it runs on multiple servers that communicate by sending messages over a network. Although distributed systems can be challenging to program correctly, they can offer better performance and resiliency than single-node systems. In the background, many distributed systems use \textbf{gossip algorithms} to coordinate sharing information quickly over large networks.

If you're familiar with distributed systems, you've likely heard of (or even used!) applications that run a variation of gossip. Notably, the distributed databases Riak, Apache Cassandra, and DynamoDB all use gossip algorithms to communicate information about cluster membership and node failure detection.

\textbf{Fig. 1.1}: Randomized Gossip on Complete Graph

In a gossip algorithm, one node broadcasts an update to all other nodes through multiple rounds of point-to-point messaging. A node spreads an update by sending a message to (or "gossiping with") a randomly selected neighbor once per round. In subsequent rounds, the selected node is "activated" and begins gossiping with its neighbors. We allow this process to continue until all nodes have received a message. For the remainder of this post, I'll ignore many of the challenges associated with building real distributed systems and represent them as \textit{undirected graphs}.
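To make the round-by-round process concrete, here is a minimal simulation of this push-style gossip on a complete graph. This is my own sketch (the function name and structure are not from any real system), but it follows the rules above: one message per updated node per round, sent to a uniformly random peer.

```python
import random

def gossip_rounds(n, seed=None):
    """Simulate push gossip on a complete graph of n nodes and return
    the number of rounds until every node holds the update."""
    rng = random.Random(seed)
    updated = {0}  # node 0 originates the update
    rounds = 0
    while len(updated) < n:
        rounds += 1
        # each updated node gossips with one uniformly random peer
        for node in list(updated):
            peer = rng.randrange(n - 1)
            if peer >= node:  # map [0, n-2] onto the n-1 peers != node
                peer += 1
            updated.add(peer)
    return rounds
```

Running `gossip_rounds` repeatedly for a fixed $n$ gives an empirical view of the distribution of $R_g$ that the rest of this post studies analytically.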

Consider a distributed database with nodes storing an identical set $(\textbf{S})$. At any moment, any node may receive an update that modifies its copy of $\textbf{S}$. Following the update, the node stores a different set $(\textbf{S}')$. Until this change is propagated to all nodes, those storing $\textbf{S}$ and those storing $\textbf{S}'$ will return different results for identical operations. To understand the magnitude of this problem, we must understand the performance of the gossip algorithm. It's clear that the number of rounds $(R_g)$ required for gossip to complete is variable, but what can we say about its \textit{distribution}?


Section 2: Lower & Upper Bounds on The Gossip Process

We'll begin with a brief look at arbitrary graphs. Any results from this general case should also apply to our analysis on complete graphs. On an arbitrary graph with an update initialized at $v_o$, we can see that $R_g \geq \max\{\, d(v_o, v_j) \mid v_j \in \textbf{V} \,\}$. In the worst case, $v_o$ is in the pair of nodes that define the diameter. In that event, we have $R_g \geq \text{Diam}(\textbf{G})$.

The \textit{diameter} of a graph is the maximum shortest-path distance between any pair of nodes. For a graph $(\textbf{G})$ with a set of vertices $(\textbf{V})$:

\begin{equation} \text{Diam}(\textbf{G}) = \max\{\, d(v_i, v_j) \mid v_i, v_j \in \textbf{V} \,\} \end{equation}

where $d \colon \textbf{V} \times \textbf{V} \to \mathbb{R}$ is a distance function defined on the graph. We'll assume graphs are unweighted, so $d(v_i, v_j)$ is the number of edges on a shortest path between $v_i$ and $v_j$.
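Since we'll lean on the diameter repeatedly, here's a small helper (my own sketch, with a hypothetical adjacency-list format) that computes it by running a breadth-first search from every vertex of an unweighted, connected graph:

```python
from collections import deque

def diameter(adj):
    """Max over all pairs of the shortest-path distance (in edges),
    for a connected, unweighted graph given as {node: [neighbors]}."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# A 4-node path 0-1-2-3 and a 4-node star centered at node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The path has diameter 3 (its two endpoints), while the star's diameter is 2 (leaf to leaf, through the hub).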

\textbf{Fig. 2.1}: Randomized Gossip on Wheel Graph

I calculate an upper bound for gossiping on a star graph in a separate post.

\textbf{Fig. 2.2}: Deterministic Gossip

We can simulate a system completing gossip in $\lceil \log_2(8) \rceil = 3$ rounds. With randomized gossip, the probability the algorithm completes in $\lceil \log_2(n) \rceil$ rounds becomes \textit{very small} as $n$ increases.

\textbf{Fig. 2.3}: Rounds to Gossip Is Unbounded

This inequality asserts that an update cannot travel further from its origin node than rounds have elapsed. If $v_o$ is five nodes from its furthest peer, we can be certain the algorithm does not complete in four rounds. This is a fine observation, but we can construct examples where $\text{Diam}(\textbf{G})$ is well below the minimum of $R_g$. One such example is a \textbf{star graph}.

From looking at an $n$ node star graph, we can see $R_g \geq n - 1$ regardless of the location of $v_o$. There are just two cases to consider: the update originates at the hub, or at one of the leaves. Either way, every message must pass through the hub, and the hub can inform at most one new leaf per round.

Using specific information about a graph allows us to produce a tighter bound than we could otherwise. Let's turn our focus to \textit{complete graphs}. As with star graphs, the diameter is not a tight lower bound for $R_g$.

Let's define $U_t$ as the number of updated nodes after $t$ rounds. Intuitively, at the beginning of the process we expect $U_t$ to nearly double in each round. Because we send just one message per node/round, doubling $U_t$ is the optimal outcome for any round. As $U_t$ increases, message efficiency decreases and we expect less round-over-round growth.

Though incredibly unlikely, we \textit{could} double $U_t$ in each round through the entire gossip process, and the process would complete in $\lceil \log_2(n) \rceil$ rounds. This gives us a greatest lower bound on $R_g$: gossip will never complete before $\lceil \log_2(n) \rceil$ rounds!

We could design a \textit{deterministic} algorithm that always finishes in $\lceil \log_2(n) \rceil$ rounds (ignoring unreliable networks, failed nodes, etc.), but this choice is likely not ideal in most distributed systems as it trades fault tolerance for a relatively minor performance improvement.
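One classic deterministic schedule (sketched here under the idealized assumptions above; it is not the randomized algorithm this post analyzes) has every updated node $i$ send to node $(i + 2^{t-1}) \bmod n$ in round $t$, so the updated set exactly doubles:

```python
def deterministic_gossip(n):
    """Rounds for a doubling schedule on nodes 0..n-1: in round t every
    updated node i sends to (i + 2**(t-1)) % n, so after t rounds the
    updated set is {0, ..., 2**t - 1} taken mod n."""
    updated = {0}
    rounds = 0
    while len(updated) < n:
        rounds += 1
        step = 2 ** (rounds - 1)
        updated |= {(i + step) % n for i in updated}
    return rounds
```

By induction, after round $t$ the updated set is $\{0, \dots, 2^t - 1\} \bmod n$, so this schedule always finishes in exactly $\lceil \log_2(n) \rceil$ rounds.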

Now let's consider an upper bound for $R_g$. For systems with $n > 2$, $R_g$ is unbounded from above. Consider a three node system with two updated nodes. For gossip to complete in the next round, one of the updated nodes must message the third. Because peers are chosen independently, the number of rounds until the third node is updated is geometrically distributed. Although exceedingly unlikely, this gives us a non-zero probability of the process running longer than $r$ rounds for every $r \in \mathbb{N}$. Although we don't have a strict upper bound, in the following sections we'll provide a probabilistic upper bound on $R_g$ by estimating how often it exceeds any given threshold.
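For the three-node example, the tail probability can be written down explicitly. The round fails to finish only when the two updated nodes message each other, which happens with probability $(1/2)(1/2) = 1/4$; the sketch below (my own illustration, not part of the general analysis) shows the tail is positive for every $r$:

```python
def three_node_tail(r):
    """P(the third node is still not updated after r further rounds).
    Each round, each of the two updated nodes independently picks one
    of its two peers; gossip stalls only when they pick each other."""
    p_stall_per_round = (1 / 2) * (1 / 2)
    return p_stall_per_round ** r
```

Equivalently, the number of additional rounds is geometric with success probability $3/4$: the tail shrinks fast, but never reaches zero.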



Section 3: Calculating Expected Time to Gossip

Before finding a probabilistic upper bound, we'll consider the \textit{expectation} of $R_g$. In some literature, gossip algorithms are referred to as "epidemic algorithms" and analyzed using models from epidemiology. We'll use a similar model to find the expected number of updated nodes as a function of rounds, $u(t)$.

Let's model $du/dt$ as a function of updated nodes $(u)$, non-updated nodes $(n - u)$, and the "contact rate" between any pair of an updated and non-updated node $(\beta)$. On a complete graph there are $(n - u)u$ edges between these pairs. Instead of a "contact" rate, we'll treat $\beta$ as the message probability per edge-round.

\begin{equation} \frac{du}{dt} = \beta(n - u)u \end{equation}

Rather than holding it constant, we can set $\beta$ to depend on the current state of the system. From the perspective of a non-updated node, the probability of being updated by \textit{any} one of $u$ updated nodes is:

\begin{equation} \beta = \big(1 - (1 - 1/n)^u\big)u^{-1} \approx \big(1 - e^{-u/n}\big)u^{-1}. \end{equation}

Here $\big(1 - (1 - 1/n)^u\big)$ is the probability of being updated by at least one node, and dividing by $u$ gives the average update rate per edge. Using a second-degree Taylor expansion of $e^{-u/n}$ gives:

\begin{equation} \frac{du}{dt} \approx \big(1 - e^{-u/n}\big)(n - u) \approx \left(\frac{u}{n} - \frac{u^2}{2n^2}\right)(n - u) \end{equation}

As in the simpler case, we can find $u(t)$ and then solve for $u(t) = n - 1$. I will admit that I needed to poke at Wolfram for a while to resolve this.

\begin{equation} n - 1 = u(t) \implies t \approx \log(n^3/2) \end{equation}

The key result from this exercise is that the estimate using $\beta = 1/n$ has the same order as the result obtained from this more rigorous analysis. When $n$ is large, the ratio of the two is just $\sim 1.5$.

To match the literature, we'll use $\beta = 1/n$ as the contact rate, but we should have some minor reservations about this choice. A constant $\beta$ implies any updated node will update any non-updated node with fixed probability (e.g. $1/n$). This is an \textit{overestimate} of the contact rate and implies a faster completion time. This value necessarily ignores collisions and the one message per node-round cap we've imposed.

\begin{equation} \frac{du}{dt} = \beta(n - u)u, \qquad u(0) = 1 \end{equation}

If we solve for $u(t)$ and set $t = k\log(n)$, we get an equation that allows us to draw some nice conclusions about the average performance of the algorithm.

\begin{equation} \begin{aligned} u(t) &= \frac{ne^{\beta n t}}{e^{\beta n t} + n - 1} \\ &= \frac{ne^{t}}{e^{t} + n - 1} \\ &= \frac{n^{k+1}}{n^{k} + n - 1} \qquad \textit{Set: } t = k\log(n) \end{aligned} \end{equation}

When $n$ is large and $t = 2\log(n)$, $u(t) \approx n - 1$. With this, we can say that after $2\log(n)$ rounds almost all nodes have received the update, so $E(R_g) \approx 2\log(n)$.
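We can sanity-check the closed form numerically. The function below is a direct transcription of the logistic solution with $\beta = 1/n$ (so $\beta n t = t$); it confirms that $u(0) = 1$ and that $u(2\log n)$ lands within a fraction of a node of $n - 1$:

```python
import math

def u(t, n):
    """Logistic solution of du/dt = (1/n)(n - u)u with u(0) = 1."""
    return n * math.exp(t) / (math.exp(t) + n - 1)
```

For $n = 1000$, `u(2 * math.log(1000), 1000)` evaluates to $n^3 / (n^2 + n - 1)$, which is within $0.01$ of $999$.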



Section 4: Calculating Variance of Time To Gossip

We're still in search of a way to describe the right tail of the distribution of $R_g$. In this section we'll calculate the distribution of the \textit{time} the system spends with exactly $u$ nodes updated. This approach will confirm our previous result about the expected completion time and provide the variance for the entire process.

As in the previous section, we'll assume each of the $(n - u)u$ edges sends a message with probability $\beta = 1/n$ per round. Because peer selection is randomized, we may treat the time to send a message on any edge as exponentially distributed with a mean of $n$ rounds (i.e. $\textit{Exp}(1/n)$).

Regardless of the number of rounds elapsed, the waiting time until $u$ is incremented is distributed as the minimum of $(n - u)u$ independent exponential variables with rate $1/n$.

Using order statistics, we can show that the \textit{minimum} value from a sample of $k$ exponential random variables with parameter $\lambda$ is exponentially distributed with parameter $k\lambda$. In our case, the time until another node is updated is distributed as $\textit{Exp}\!\left(\frac{u(n - u)}{n}\right)$.
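A quick Monte Carlo check of this order-statistics fact (a throwaway sketch with arbitrary parameter choices of my own): the empirical mean of the minimum of $k$ draws from $\textit{Exp}(\lambda)$ should sit close to $1/(k\lambda)$.

```python
import random

def mean_min_exp(k, lam, trials, seed=0):
    """Average the minimum of k Exp(lam) samples over many trials;
    order statistics predict a mean of 1 / (k * lam)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(lam) for _ in range(k))
    return total / trials
```

With $k = 10$ and $\lambda = 0.5$, the sample mean converges on $1/(10 \cdot 0.5) = 0.2$.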

Here each \textbf{state} means "1 node updated, n - 1 nodes unchanged", "2 nodes updated, n - 2 nodes unchanged", etc. We can sum the expected waiting times in each of these states up to "n nodes updated" to find an alternative estimate for $R_g$.

We now must compute the expectation and variance (in rounds) of the time spent in each state. Let's define $\tau_u$ as the first time the system has at least $u$ nodes updated. We'll then confirm that $E(\tau_n) \approx E(R_g)$. Our new framing of the problem should preserve our expectation of $R_g$.

\begin{equation} \begin{aligned} E(\tau_n) &= \sum_{u=1}^{n-1} \frac{n}{u(n - u)} \\ &= \sum_{u=1}^{n-1} \frac{1}{u} + \frac{1}{n - u} \qquad \textit{as partial fractions} \\ &= 2H_{n-1} \\ &\approx 2\big(\log(n) + 0.5772\big) \qquad \textit{approx. of harmonic series} \\ &\approx 2\log(n) \end{aligned} \end{equation}
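The finite sum is easy to evaluate directly, which lets us check the harmonic-series approximation (the function name is my own):

```python
def expected_tau(n):
    """E(tau_n): sum of the expected per-state waits n / (u * (n - u))."""
    return sum(n / (u * (n - u)) for u in range(1, n))
```

For $n = 10{,}000$ this returns about $19.6$ against $2\log(10{,}000) \approx 18.4$; the gap is essentially the constant term that the final approximation drops.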

Excellent, just as expected! Because the waiting times are independent variables, we can sum the variances of all waits to get $\textit{Var}(\tau_n)$.

\begin{equation} \begin{aligned} \text{Var}(\tau_n) &= \sum_{u=1}^{n-1} \left(\frac{n}{u(n - u)}\right)^{2} \\ &= \sum_{u=1}^{n-1} \frac{1}{u^{2}} + \frac{1}{(n - u)^{2}} + \frac{2}{u(n - u)} \qquad \textit{as partial fractions} \\ &\approx \sum_{u=1}^{n-1} \frac{1}{u^{2}} + \frac{1}{(n - u)^{2}} \\ &\approx 2\pi^2/6 \end{aligned} \end{equation}

The last approximation uses the fact that the sum of inverse squares (see: The Basel Problem) converges to $\pi^2/6$. When $n$ is sufficiently large, we can approximate $\sum_{u=1}^{n-1}\left(1/u^2 + 1/(n - u)^2\right)$ as $2\pi^2/6$. This is an interesting sum by itself, but it also gives us a quick, useful result: it suggests variance is bounded even with large $n$!
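Evaluating the exact finite sum (my own quick check) shows how fast it approaches $2\pi^2/6 \approx 3.29$:

```python
def var_tau(n):
    """Var(tau_n): sum of per-state variances (n / (u * (n - u)))**2."""
    return sum((n / (u * (n - u))) ** 2 for u in range(1, n))
```

The sum shrinks toward the limit as $n$ grows; by $n = 100{,}000$ it sits within $0.01$ of $\pi^2/3$.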

π‚π¨π¦π¦πžπ§π­:\textbf{Comment:} This is an OK answer. We set out to understand the risk of large divergences when nn is very large, but what if we π‘Ÿπ‘’π‘Žπ‘™π‘™π‘¦\textit{really} needed a distribution? I’d suggest the following:

With any luck (focus, time, etc.), I will have a follow-up post on this soon$^{TM}$.

In a diversion from the last section, I showed $3\log(n/2)$ may be a more reasonable estimate for $E(R_g)$ than $2\log(n)$. We can show the variance is still bounded when $du/dt = \big(1 - e^{-u/n}\big)(n - u)$.

\begin{equation} \begin{aligned} \left((n - u)\left(\frac{u}{n} - \frac{u^2}{2n^2}\right)\right)^{-1} &= \frac{1}{u - 2n} - \frac{2}{u - n} + \frac{1}{u} \\ \left((n - u)\left(\frac{u}{n} - \frac{u^2}{2n^2}\right)\right)^{-2} &= \frac{1}{(u - 2n)^2} + \frac{4}{(u - n)^2} + \frac{3}{nu} + \frac{1}{u^2} - \frac{3}{n(u - 2n)} \end{aligned} \end{equation}

\begin{equation} \text{Var}(\tau_n) = \sum_{u=1}^{n-1} \frac{1}{(u - 2n)^2} + \frac{4}{(u - n)^2} + \frac{3}{nu} + \frac{1}{u^2} - \frac{3}{n(u - 2n)} \approx 5\pi^2/6 \end{equation}
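As with the simpler model, we can evaluate this sum directly (another quick check of my own) and watch it settle near $5\pi^2/6 \approx 8.22$:

```python
def var_tau_refined(n):
    """Variance under the refined rate (n - u)(u/n - u^2/(2n^2)):
    sum the squared reciprocal of the rate over states u = 1..n-1."""
    return sum(((n - u) * (u / n - u * u / (2 * n * n))) ** -2
               for u in range(1, n))
```

The per-state rate is positive for every $u$ in $1, \dots, n-1$, so the sum is well-defined, and for large $n$ it lands within $0.01$ of $5\pi^2/6$.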

So what are the implications of having bounded variance? We confirmed that $E(\tau_n)$ increases slowly with $n$ and showed $\tau_n$ stays near the expectation even for very large $n$. If we use gossip to disseminate information in massive systems, we don't need to worry about worst case performance becoming orders of magnitude worse than the expectation. If we had the distribution of $\tau_n$ we could find the cumulative distribution function and get a neat final answer, but for today the mean and variance will suffice.



Section 5: Gossip Algorithms For the Real World

Finally, I’d like to mention a few variations of the idealized gossip algorithm that can make it more suitable for real-world distributed systems. This section does not build on any previous results, but instead answers a few questions about making gossip more robust.

\textbf{Fig. 5.1}: Stopping The Gossip Process (Deactivate After Message to Updated Node)

The graph above illustrates what happens when the deactivation probability is set \textit{too high}. When set properly, the failure percentage is near zero; in this example, the algorithm fails in ~20% of attempts.

We haven't described a mechanism for \textbf{stopping} gossip. In distributed systems we communicate information about remote state by passing messages, so are we meant to send another round of messages to signal that gossip has completed?

Not necessarily! From the previous section, we expect that gossip completes in $\sim 3\log(n)$ rounds. If we set $3k\log(n)$ as a cap on the number of rounds gossiped per node, we can show that for some $k > 1$, all nodes will be updated (with high probability) \textit{before} the algorithm stops. Luckily, there are solutions which are less message intensive. Demers demonstrates we can achieve the same completeness guarantees if we deactivate nodes with probability $1/n$ after each message sent to an already updated node.
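Here's a sketch of that stopping rule layered onto push gossip (my own illustration; `p_stop` is the deactivation probability, and real implementations handle far more edge cases). A sender goes quiet, with probability `p_stop`, each time it gossips with a peer that already holds the update:

```python
import random

def rumor_mongering(n, p_stop, seed=None):
    """Push gossip where a sender deactivates with probability p_stop
    after messaging an already-updated peer.  Returns True if every
    node holds the update once all senders have gone quiet."""
    rng = random.Random(seed)
    updated, active = {0}, {0}
    while active:
        for node in list(active):
            peer = rng.randrange(n - 1)
            if peer >= node:  # map onto the n-1 peers != node
                peer += 1
            if peer in updated:
                if rng.random() < p_stop:
                    active.discard(node)
            else:
                updated.add(peer)
                active.add(peer)
    return len(updated) == n
```

With `p_stop = 1/n` the process almost always terminates complete; cranking `p_stop` up produces the kind of failure rates illustrated above.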



\textbf{Fig. 5.2}: Broadcast to All Peers Per Round

\textbf{Fig. 5.3}: Spreading Multiple Updates with $k > 1$.

The graphs above illustrate how reliable broadcast built from point-to-point messaging and single-round broadcast perform when faced with unreliable networks. The former succeeds with far fewer wasted messages.

In \textbf{Section 3} and \textbf{Section 4} we assumed $\beta = 1/n$ and then revised this value to be a bit more realistic. What about redoing those calculations with $\beta = k/n$, $k > 1$? We'll try to achieve faster gossip time at the cost of some additional message overhead. Can we simply select multiple peers per round? What about broadcasting a single update to \textit{all} nodes in the network in a single round? Gossiping multiple point-to-point messages per round can be an effective way to improve the speed of the gossip algorithm. However, broadcasting an update to all nodes from a single site ("single-round broadcast") is normally not the best choice in large distributed systems.

In practice we must account for unreliable networks and incomplete information. Creating a reliable broadcast from point-to-point messages helps us be resilient to both of these failure modes. We \textit{can} construct reliable broadcast from single-round broadcast, but it comes at the cost of significantly more messages transmitted when networks are unreliable.





Imagine a system that receives twenty updates in quick succession. Should we initialize twenty unique gossip processes, or is there a more efficient way to spread this information? One solution is to use pull-based messaging. With pull-based difference resolution, every node contacts one peer per round, and we mark a node updated if it has \textit{contacted} an updated node. When there are many active updates, it's very likely that any node contacts a peer with at least one new update. If the system receives updates infrequently and far apart, pulling creates additional messaging overhead with relatively little benefit.

\textbf{Fig. 5.4}: Pull-Based Gossip

The average performance of pull-based messaging is a bit faster than the standard (push) algorithm. We can approximate the probability a node has not been updated after $i$ rounds as a recurrence relation for both push and pull-based gossip.

\begin{equation} \begin{aligned} p_{i+1} &= p_i\big(1 - 1/n\big)^{n(1 - p_i)} \qquad \textit{(push)} \\ p_{i+1} &= p_i^2 \qquad \textit{(pull)} \end{aligned} \end{equation}

In the earlier rounds, the performance of the two algorithms is close because $p_i \approx (1 - 1/n)^{n(1 - p_i)}$. When $p$ is smaller, the pull-based algorithm decreases to zero quadratically, while the push-based method decreases by a constant factor of $\frac{1}{e}$ per round. It has been demonstrated that by using push in the beginning and switching to pull toward the end, both total messages sent and time to finish gossip can be improved compared to either algorithm in isolation.
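Iterating the two recurrences side by side makes the difference visible (a small sketch of my own; `rounds_until` and the $10^{-6}$ cutoff are arbitrary choices, not from the analysis above):

```python
def push_step(p, n):
    """One push round: a node stays non-updated if none of the
    ~n(1 - p) updated nodes selects it, each with probability 1/n."""
    return p * (1 - 1 / n) ** (n * (1 - p))

def pull_step(p, n):
    """One pull round: a node stays non-updated only if the single
    peer it contacts is also non-updated (n is unused but kept for a
    uniform signature)."""
    return p * p

def rounds_until(step, n, eps=1e-6):
    """Rounds until the non-updated fraction drops below eps,
    starting from a single updated node."""
    p, rounds = 1 - 1 / n, 0
    while p > eps:
        p, rounds = step(p, n), rounds + 1
    return rounds
```

For $n = 10{,}000$, pull crosses the threshold several rounds ahead of push, matching the quadratic versus constant-factor behavior described above.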



Section 6: Closing Remarks

This concludes our tour of gossip algorithms. Let's do a brief review: the diameter of a graph lower-bounds $R_g$, and on a complete graph no run can finish before $\lceil \log_2(n) \rceil$ rounds; the expected completion time is roughly $2\log(n)$ rounds, and the variance of $\tau_n$ remains bounded as $n$ grows; finally, practical refinements like probabilistic deactivation, multiple peers per round, and pull-based messaging make gossip robust in real systems.



References & Further Reading