August 2023
Familiarity with single-variable calculus, probability, and discrete math may be helpful for sections 2, 3, and 4. Familiarity with distributed systems may be helpful for section 5. These are suggestions, not prerequisites.
We say a system is distributed if it runs on multiple servers that communicate by sending messages over a network. Although distributed systems can be challenging to program correctly, they can offer better performance and resiliency than single-node systems. In the background, many distributed systems use gossip algorithms to coordinate sharing information quickly over large networks.
In a gossip algorithm, one node broadcasts an update to all other nodes through multiple rounds of point-to-point messaging. A node spreads an update by sending a message to (or “gossiping with”) a randomly selected neighbor once per round. In subsequent rounds, the selected node is “activated” and begins gossiping with its neighbors. We allow this process to continue until all nodes have received a message. For the remainder of our discussion, I’ll ignore many of the challenges associated with building real distributed systems and represent them as graphs.
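The process just described can be sketched as a short simulation on a complete graph. This is a minimal sketch; the function name, the seeding, and the choice of node 0 as the origin are my own conventions:

```python
import random

def push_gossip_rounds(n, seed=None):
    """Simulate push gossip on an n-node complete graph.

    Returns the number of rounds until every node holds the update.
    """
    rng = random.Random(seed)
    updated = {0}          # node 0 receives the initial update
    rounds = 0
    while len(updated) < n:
        rounds += 1
        newly = set()
        # Every updated node gossips with one uniformly chosen peer.
        for node in updated:
            peer = rng.randrange(n - 1)
            if peer >= node:   # shift to skip gossiping with itself
                peer += 1
            newly.add(peer)
        updated |= newly
    return rounds

print(push_gossip_rounds(64, seed=42))
```

Running this a few thousand times for a fixed n gives an empirical picture of the completion-time distribution we analyze below.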
If you’re familiar with distributed systems, you’ve likely heard of (or even used!) applications that run a variation of gossip. Notably, the distributed databases Riak, Apache Cassandra, and DynamoDB all use gossip algorithms for cluster membership and node failure detection.
Consider a distributed database with nodes storing an identical set S. At any moment, any node may receive an update that modifies its copy of S. Following the update, the node stores a different set S′. Until this change is propagated to all nodes, those storing S and those storing S′ will return different results for identical operations.
To understand the scope of this problem, we must understand the performance of the gossip algorithm. It’s clear that the number of rounds (T) required for gossip to complete is random, but what can we say about its distribution? In the following sections, we’ll calculate bounds on T by studying undirected complete graphs. By the end of this discussion, we’ll have a general answer to the following question:
Given an n node system, what’s the probability gossip fails to complete within x rounds?
Definition: A real number b is an upper bound of a set A if x ≤ b for all x ∈ A. For our purposes, the least upper bound is the smallest demonstrable upper bound. A similar definition applies for the lower bound. For a given system, let A represent the set of all values T may take. We’ll focus on finding the least upper bound (sup A) and greatest lower bound (inf A) for this set.
As you’ll soon see, T is unbounded from above, so a least upper bound doesn’t exist. Instead, we’ll aim to find an alternative to the upper bound: some value that we can prove is larger than a significant portion of A.
We’ll begin with a brief look at arbitrary graphs. Any results from this general case should also apply to our analysis on complete graphs.
The diameter of a graph is the maximum shortest-path distance between any pair of nodes. For a graph with a set of vertices V:

diam(G) = max { d(u, v) : u, v ∈ V }
Where d(u, v) is a distance function defined on the graph. We’ll assume graphs are unweighted, so d(u, v) is simply the count of edges on the shortest path between u and v.
On an arbitrary graph with an update initialized at node v₀, we can see that T ≥ max { d(v₀, u) : u ∈ V }. In the worst case, v₀ is in the pair of nodes that define the diameter. In that event, we have T ≥ diam(G).
This inequality simply asserts that an update cannot travel further from its origin node than the number of rounds that have elapsed. If v₀ is five nodes from its furthest peer, we can be certain the algorithm does not complete in four rounds. This is a fine observation, but we can construct examples where diam(G) is well below the minimum of T. One such example is a wheel graph.
From looking at an n-node wheel graph, we can see T ≥ n − 1 regardless of the location of v₀. There are just two cases to consider:
The central node is the origin node. This node must send at least n − 1 messages to update the n − 1 outer nodes.
An outer node is the origin node. This node sends one message to the central node, and the central node sends at least n − 2 messages to update the remaining nodes.
Question: A wheel graph is among the worst possible configurations for gossip. Do you see why this is the case?
Assume an n-node wheel graph with the central node as the update origin. With k nodes updated, the probability the central node gossips to a non-updated node is (n − k)/(n − 1). If you’ve taken a course on probability, you may recognize this as the coupon collector’s problem. The expected number of rounds to complete the gossip process follows from the solution to this well-studied problem:

E[T] = (n − 1) · (1/1 + 1/2 + ⋯ + 1/(n − 1)) ≈ n ln n
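Assuming the center must collect all n − 1 outer nodes as coupons, the coupon-collector expectation (n − 1)·H_{n−1} can be checked numerically. The helper name is mine:

```python
def expected_rounds_star(n):
    """Expected rounds for the central node to update all n - 1 outer
    nodes, one uniformly random message per round (coupon collector)."""
    m = n - 1
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    return m * harmonic

# Grows like n * ln(n): ~300 rounds for a 64-node wheel.
print(expected_rounds_star(64))
```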
The expectation of T doesn’t give any insight into the worst-case performance of the algorithm. We can approximate the right tail of the distribution by studying a single node’s state after r rounds.
A particular outer node remains non-updated after r rounds with probability (1 − 1/(n − 1))^r. Finally, we union bound over the n − 1 outer nodes and find the probability that T > r is at most (n − 1)(1 − 1/(n − 1))^r. For large n, this bound nicely relates the completion probability and the number of rounds beyond the expectation.
Using specific information about a graph allows us to produce a tighter bound than we could otherwise. In the next section we’ll apply this thinking to produce a lower bound on complete graphs.
Let’s turn our focus to complete graphs. As with wheel graphs, the diameter is not a tight lower bound for T. Let’s try running a simulation to see if it reveals any interesting patterns.
At right, I show the results from 10,000 iterations of the gossip process in a 64 node system. The first plot shows the updated nodes over time, the second shows the distribution of completion times, and the third shows the round-over-round growth rate.
If we had a reliable way to describe this probability distribution as a function of n, our job would be done. Unfortunately, we’re not there yet.
Let’s define I(r) as the number of updated nodes after r rounds. Notice that near the beginning of the process I(r) nearly doubles in each round. Because we send just one message per node-round, doubling is the optimal outcome for any round. As I(r) increases, message efficiency decreases and we expect less round-over-round growth.
Though incredibly unlikely, it’s possible we double in each round through the entire gossip process. If this were to happen, the process would complete in ⌈log₂ n⌉ rounds. This gives us our greatest lower bound on T: gossip will never complete before ⌈log₂ n⌉ rounds! The graph at right simulates a system with eight nodes completing gossip in three rounds.
Consider that it requires at least n − 1 messages to update n nodes. The function f(r) = 2^r − 1 calculates the maximum messages sent through r rounds in an n-node system, and ⌈log₂ n⌉ is the smallest value of r that satisfies f(r) ≥ n − 1 for r ∈ ℕ.
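A sketch of this calculation (the helper name is mine): find the smallest round count under which perfect doubling could reach every node.

```python
def min_rounds(n):
    """Smallest r such that doubling each round, 2**r >= n.

    Equivalent to the smallest r with f(r) = 2**r - 1 >= n - 1.
    """
    r = 0
    while 2 ** r < n:
        r += 1
    return r

print(min_rounds(8))     # a perfect run updates 8 nodes in 3 rounds
print(min_rounds(1000))
```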
With a randomized algorithm, the probability that gossip completes in exactly ⌈log₂ n⌉ rounds is vanishingly small when n is large. For example, an n-node system can complete in log₂ n steps only if n/2 nodes are already updated through the first log₂ n − 1 rounds and the final round is executed perfectly. Conditioning on the former, the probability of the latter event is (n/2)!/(n − 1)^{n/2}. Even at small values of n this event is very improbable (e.g. n = 8 gives a roughly 1% success probability).
We could design a deterministic algorithm that always finishes in ⌈log₂ n⌉ rounds (ignoring unreliable networks, failed nodes, etc.). This choice is probably not ideal in most distributed systems as it trades fault tolerance for a relatively minor performance improvement.
Earlier I claimed we’d aim to find an upper bound on T. This was misleading; no such bound exists. For systems with n ≥ 3, we can show that T is unbounded from above.
Consider a three-node system with two updated nodes. For gossip to complete in the next round, either of the updated nodes must message the third. Because this is a randomized algorithm, there is a small probability the first two nodes gossip between themselves indefinitely.
The probability mass function (pmf) of the geometric distribution is given by:

P(X = k) = (1 − p)^{k − 1} p,  k = 1, 2, 3, …
The probability a round passes without a message sent to the third node is (1/2)(1/2) = 1/4. Because peers are chosen independently each round, the distribution of the number of rounds until the third node is updated is geometric with parameter p = 3/4. Although the pmf of this waiting time decreases quite quickly, the distribution is memoryless and assigns non-zero probability to all k ∈ {1, 2, 3, …}.
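A tiny numeric illustration of that tail (the function name is mine): the chance the third node is still waiting shrinks geometrically but never hits zero.

```python
# In a 3-node system with 2 updated nodes, each updated node picks one
# of its 2 peers uniformly; both avoid the third with prob (1/2)(1/2).
p = 3.0 / 4.0            # per-round chance the third node gets updated

def tail(k):
    """P(third node still not updated after k rounds)."""
    return (1 - p) ** k

print(tail(5))           # (1/4)**5: small, but strictly positive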
Do we stop our analysis here and say we’re limited to “On an undirected complete graph, ⌈log₂ n⌉ ≤ T < ∞ for all n ≥ 3”?
No, surely we can do better. In the following sections we’ll provide an alternative to the upper bound by estimating how often exceeds any given value.
So far, we’ve found a lower bound and shown the non-existence of an upper bound. Before finding an alternative to the upper bound, we’ll find the expectation of T. In some literature, gossip algorithms are referred to as “epidemic algorithms” and analyzed using models from epidemiology. We’ll use a similar model to find the expected number of updated nodes as a function of rounds, I(r).
Let’s model the growth of I(r) as a function of the updated nodes, the non-updated nodes, and the “contact rate” β between any pair of an updated and a non-updated node:

dI/dr = β · I(n − I)

On a complete graph there are I(n − I) edges between these pairs. Instead of a “contact” rate, we’ll treat β as the update probability per edge-round.
To match the literature, we’ll use β = 1/n as the contact rate, but we should have some reservations about this choice. Think about this choice: does it imply a faster or slower gossip process than what we described previously?
A constant β implies any updated node will update any non-updated node with fixed probability (e.g. 1/n). This is an overestimate of the contact rate and implies a faster completion time. This value ignores collisions and the one-message-per-node-round cap we’ve imposed. It’s an OK approximation for the moment, but better estimates are available.
If we solve for I(r) and set I(0) = 1, we get an equation that allows us to draw some nice conclusions about the average performance of the algorithm:

I(r) = n / (1 + (n − 1)e^{−r})
When n is large and r ≥ 2 ln n, I(r) ≈ n. With this, we can say that after 2 ln n rounds almost all nodes have received the update: I(2 ln n) ≈ n − 1.
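We can quickly check the logistic solution numerically. This sketch assumes the β = 1/n model, under which I(r) = n/(1 + (n − 1)e^{−r}):

```python
import math

def I(r, n):
    """Logistic solution of dI/dr = (1/n) * I * (n - I) with I(0) = 1."""
    return n / (1 + (n - 1) * math.exp(-r))

n = 64
r = 2 * math.log(n)      # ~8.32 rounds
print(I(r, n))           # ~63: nearly all 64 nodes are updated
```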
For a moment, we’ll ignore the fact that this model underestimates the completion time for any given n. We can estimate the total messages sent by integrating I(r), since each updated node sends one message per round.
For a 128-node system this gives us an expectation of roughly n ln n ≈ 620 messages. Does this seem lightweight to you?
The choice of β is quite important. We can set β to be dependent on the current state of the system. By considering message collisions, we’ll produce a more conservative estimate. From the perspective of a single non-updated node, the probability of being updated by one of I updated nodes is:

P = 1 − (1 − 1/(n − 1))^I
Where P represents the probability of being updated by at least one node. We divide that value by I to give the average update rate per edge. Using this value as β gives:

dI/dr = (n − I)(1 − (1 − 1/(n − 1))^I)
We can solve this equation as well, but the solution is not quite as nice. If we set n = 64, how much longer do we expect the gossip process to take? Our prior analysis gives 8.32 rounds but the more precise analysis gives 11.8 rounds. From the simulations run earlier, 11.8 seems like a much more reasonable estimate than 8.32. How did this happen?
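Rather than solving the collision-aware equation in closed form, we can iterate it as a discrete recurrence. This is a sketch (my discretization), so the round count is an integer approximation of the continuous result:

```python
def rounds_with_collisions(n):
    """Iterate I <- I + (n - I) * (1 - (1 - 1/(n-1))**I), the
    collision-aware update, until n - 1 of the n nodes are updated."""
    I, r = 1.0, 0
    while I < n - 1:
        I += (n - I) * (1 - (1 - 1 / (n - 1)) ** I)
        r += 1
    return r

print(rounds_with_collisions(64))   # noticeably more than 2*ln(64) ~ 8.3
```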
As it turns out, we have not stumbled into a decades-long error in the theoretical computer science community. These numbers are quite different, but we can show these two functions behave well asymptotically. To do so, we’ll solve both equations for the completion time as functions of n and compare the ratio as n → ∞.
Using the original model that ignores collisions and the message cap, we get r(n) ≈ 2 ln n, as expected.
To solve the equation with collisions and a message cap, we’ll treat n and n − 1 as equivalent, replace (1 − 1/n)^I with e^{−I/n}, and solve the following equation:

dI/dr = (n − I)(1 − e^{−I/n})
When n is large, the ratio of these two solutions is 1.5. The key result from this diversion is that a result obtained using β = 1/n is of the same order (technically Θ(ln n), see: Landau notation) as the result obtained from a more rigorous analysis. Let’s revise our statement from earlier: E[T] = Θ(ln n).
We’re still in search of a way to describe the right tail of the distribution of T. In this section we’ll calculate the distributions of the time the system spends with exactly k nodes updated. This approach will confirm our previous result about the expected completion time and provide the variance for the entire process.
As in the previous section, we’ll assume each of the I(n − I) edges between updated and non-updated nodes sends a message with probability 1/n per round. Because peer selection is randomized, we may treat the time to send a message on any edge as exponentially distributed with a mean of n rounds (i.e. rate 1/n).
Regardless of the number of rounds elapsed, the waiting time until I is incremented is distributed as the minimum of I(n − I) independent exponential variables with parameter 1/n.
Using order statistics we can show that the minimum value from a sample of m exponential random variables with parameter λ is exponentially distributed with parameter mλ. In our case, the time until another node is updated is distributed as Exp(I(n − I)/n).
The probability density function (pdf) and cumulative distribution function (cdf) of the exponential distribution are given by:

f(x) = λe^{−λx},  F(x) = 1 − e^{−λx},  x ≥ 0
An exponential distribution with parameter λ has mean 1/λ and variance 1/λ².
The k-th order statistic is the k-th smallest value from a sample of m random variables. If we know the distribution used to generate a sample, we can determine the distribution of any order statistic of that sample. Generally, the cdf of the k-th order statistic is given as:

F₍k₎(x) = Σ_{j=k}^{m} (m choose j) F(x)^j (1 − F(x))^{m−j}
In each term we calculate the probability that j events are smaller and m − j events are larger than x. When k = 1 we have a special case that allows us to simplify this sum to:

F₍₁₎(x) = 1 − (1 − F(x))^m
And when the sample is exponential, we get:

F₍₁₎(x) = 1 − e^{−mλx}
This result gives us the cumulative distribution of each waiting time between updates.
We’ve completed the challenging work of formulating the problem; now we must compute the expectation and variance on each interval. Let’s define T_k as the first time the system has at least k nodes updated. We’ll then confirm that E[T_n] matches our earlier result. Summing the expected waiting times:

E[T_n] = Σ_{k=1}^{n−1} n/(k(n − k)) = Σ_{k=1}^{n−1} (1/k + 1/(n − k)) = 2H_{n−1} ≈ 2 ln n

Our new framing of the problem preserves our expectation of Θ(ln n).
Excellent, just as expected! Because the waiting times are independent variables, we can sum the variances of all waits to get Var(T):

Var(T) = Σ_{k=1}^{n−1} n²/(k²(n − k)²) ≈ 2 · (π²/6)
The last approximation uses the fact that the sum of inverse squares (see: The Basel Problem) converges to π²/6. When n is sufficiently large we can approximate Var(T) as π²/3 ≈ 3.29. This is a very interesting sum by itself, but it also gives us a very useful result. It suggests variance is bounded!
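We can verify the bounded-variance claim numerically; `gossip_variance` is my name for the summed per-interval variances:

```python
import math

def gossip_variance(n):
    """Var(T) = sum over k of n**2 / (k * (n - k))**2, the summed
    variances of the n - 1 exponential waiting times."""
    return sum(n ** 2 / (k * (n - k)) ** 2 for k in range(1, n))

print(gossip_variance(10_000))   # approaches pi**2 / 3
print(math.pi ** 2 / 3)
```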
Comment: The variance of T is almost entirely comprised of the variance waiting on the first and last few nodes. Consider an n-node system with one updated node. In the discrete case, we’d expect the waiting time to update a second node to be tightly concentrated around one round.
However, the distribution for the first wait is Exp((n − 1)/n). This distribution provides a mean that matches our intuition, but also produces a variance of n²/(n − 1)² ≈ 1. Does this seem reasonable? Counter-intuitively, adding the first additional node is responsible for almost a third of the reported variance.
In the last section we showed the collision-aware β is a more reasonable estimate than the constant 1/n. Can we show the variance is still bounded when we use it?
We can! So what are the implications of having bounded variance? We confirmed that E[T] increases slowly with n and showed T stays near the expectation even for very large n. If we use gossip to disseminate information in massive systems, we don’t need to worry about worst-case performance becoming orders of magnitude worse than the expectation.
We should be a bit careful about this estimate though. We can show that the second-degree Taylor series underestimates the variance when I is near n. It’s possible this would improve with a better approximation, but we’ll proceed for now.
We’re still missing a tidy answer to the question posed in the first section. If we had the distribution of T, we could find the cumulative distribution function and we’d have a neat final answer. Unfortunately, we’ll have to rely on one more approximation before moving on.
Having a probability distribution for gossip time would be a very satisfying result. The sum of exponentials with the same parameter is called the Erlang distribution; it’s a specific case of the gamma distribution. Is there an argument we can make that allows us to just use the gamma?
Chebyshev’s Inequality bounds the probability a random variable deviates from its mean by more than a given amount. It’s an incredibly general result and requires only the variance of the distribution.
By definition, Chebyshev’s inequality provides another upper bound on the probability that gossip takes more than x rounds to finish. Consequently, given an n-node system, the probability that gossip fails to complete within x rounds is no more than:

Var(T)/(x − E[T])² ≈ (π²/3)/(x − 2 ln n)²
For today, this is a satisfactory answer. I’d encourage you to try a few values. What is an upper bound on the 99th percentile of gossip time on a 1M node distributed system?
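To try the exercise numerically, we can rearrange Chebyshev’s inequality for a given percentile. This sketch uses E[T] ≈ 2 ln n and Var(T) ≈ π²/3 from the analysis above; the function name is mine:

```python
import math

def chebyshev_percentile_bound(n, q=0.99):
    """Upper bound on the q-th percentile of gossip completion time:
    P(T >= E[T] + k) <= Var(T) / k**2, with E[T] ~ 2 ln n and
    Var(T) ~ pi**2 / 3."""
    mean = 2 * math.log(n)
    var = math.pi ** 2 / 3
    k = math.sqrt(var / (1 - q))
    return mean + k

print(chebyshev_percentile_bound(1_000_000))   # ~46 rounds
```

For a million-node system the 99th percentile is bounded near 46 rounds, only moderately above the ~27.6-round expectation.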
This is an OK answer. We set out to understand the risk of large divergences when n is very large, but what if we needed a distribution? I suggest two options:
The maximum-entropy distribution given a minimum (⌈log₂ n⌉) and a mean (2 ln n) follows the Pareto distribution. Can you solve for the shape parameter of the Pareto to find a closed-form distribution which will overestimate the weight on the tails?
If we want a more realistic (rather than more conservative) estimate, we might use the method of moments to fit the mean and variance of this distribution to an Erlang distribution. The Erlang distribution is given by the sum of exponentials with a fixed parameter (λ). Although we have mixed parameters, can some approximation using a single k and λ get close?
Finally, I’d like to mention a few variations of the idealized gossip algorithm that can make it more suitable for real-world distributed systems. This section does not build on any previous results, but instead answers a few questions about making gossip more robust.
We haven’t described a mechanism for terminating gossip. In distributed systems we communicate information about remote state by passing messages, so are we meant to send another round of messages to signal that gossip has completed?
Not necessarily! From the previous section, we expect that gossip completes in roughly 2 ln n rounds. If we set c ln n as a cap on the number of rounds gossiped per node, for some c > 2, with high probability all nodes will be updated before the algorithm stops. We can set c even higher to achieve a given success probability. However, there are less message-intensive solutions. Demers demonstrates we can achieve the same completeness guarantees if we deactivate nodes with probability 1/k after each message they send to an already updated node.
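A sketch of this feedback-based deactivation follows. The 1/k deactivation probability is my parameterization of Demers-style rumor mongering, and the function and variable names are mine:

```python
import random

def rumor_mongering(n, k=4, seed=None):
    """Push gossip where a node deactivates with probability 1/k after
    messaging an already-updated peer. Returns (rounds, fraction of
    nodes updated when gossip finished or died out)."""
    rng = random.Random(seed)
    updated = {0}
    active = {0}
    rounds = 0
    while active and len(updated) < n:
        rounds += 1
        deactivate, newly = set(), set()
        for node in active:
            peer = rng.randrange(n - 1)
            if peer >= node:   # shift to skip gossiping with itself
                peer += 1
            if peer in updated:
                # Peer already knew: lose interest with probability 1/k.
                if rng.random() < 1.0 / k:
                    deactivate.add(node)
            else:
                newly.add(peer)
        updated |= newly
        active = (active - deactivate) | newly
    return rounds, len(updated) / n

print(rumor_mongering(64, k=4, seed=7))
```

Sweeping k trades message overhead against the chance the rumor dies out before reaching every node.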
The graph at right illustrates what happens when the deactivation probability is set too high. When set properly the failure percentage is near zero. In this example, the algorithm will fail in ~20% of attempts.
Earlier we assumed β = 1/n and then revised this value to be a bit more realistic. What about redoing those calculations with a larger β? This suggests faster gossip time at the cost of some additional message overhead. Can we simply select multiple peers per round? What about broadcasting a single update to all nodes in the network in a single round?
Gossiping multiple point-to-point messages per round can be an effective way to improve the speed of the gossip algorithm. However, broadcasting an update to all nodes from a single site (“single-round broadcast”) is normally not the best choice in large distributed systems.
We’ve mostly focused on idealized models of distributed systems, but in practice we must account for unreliable networks and incomplete information. Creating a reliable broadcast from point-to-point messages helps us be resilient to both of these failure modes. We construct reliable broadcast from single-round broadcast, but it comes at a cost of significantly more messages transmitted when networks are unreliable.
The graphs above illustrate how reliable broadcast built from point-to-point messaging and single-round broadcast perform when faced with unreliable networks.
Highly correlated traffic is a common occurrence in real systems. Imagine a system that receives twenty updates in quick succession. Should we initialize twenty unique gossip processes or is there a more efficient way to spread this information?
One solution is to use pull-based messaging. With pull-based difference resolution, every node contacts one peer per round, and we mark a node updated if it has contacted an updated node. When there are many active updates, it’s likely that a node contacts a peer with at least one new update.
However, if the system receives updates infrequently and far apart, pulling creates additional messaging overhead with relatively little benefit.
The average performance of pull-based messaging is a bit faster than the standard (push) algorithm. We can approximate the probability p_r that a node has not been updated after r rounds as a recurrence relationship for both push and pull-based gossip:

push: p_{r+1} = p_r · e^{−(1 − p_r)}    pull: p_{r+1} = p_r²
In the earlier rounds, the performance of the two algorithms is close because p_r is near one. When p_r is smaller, the pull-based algorithm decreases to zero quadratically while the push-based method decreases by a constant factor of 1/e.
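The two recurrences can be compared numerically. A sketch, using p·e^{−(1−p)} for push and p² for pull (my formulation of the standard approximations) from an arbitrary starting fraction p₀ = 0.9:

```python
import math

def tail_fractions(p0=0.9, steps=8):
    """Iterate the not-yet-updated fraction p under push and pull.

    push: p <- p * exp(-(1 - p))  (node misses every push it could get)
    pull: p <- p * p              (both caller and chosen peer unaware)
    """
    push = pull = p0
    hist = []
    for _ in range(steps):
        push = push * math.exp(-(1 - push))
        pull = pull * pull
        hist.append((push, pull))
    return hist

for push, pull in tail_fractions():
    print(f"push={push:.2e}  pull={pull:.2e}")
```

After a handful of steps the pull fraction collapses quadratically while push still decays by roughly a constant factor per round.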
It has been demonstrated that by using push in the beginning and switching to pull towards the end, both the total messages sent and the time to finish gossip can be improved compared to either algorithm in isolation.
This concludes our tour of gossip algorithms. Let’s do a brief review:
Gossip is used to spread information in a distributed system and is resilient to incomplete membership information and unreliable networks.
Gossiping with point-to-point messaging starts relatively slowly and has a minimum completion time of ⌈log₂ n⌉ rounds.
The number of rounds needed to complete the gossip process is unbounded from above.
Gossip scales well as the number of nodes in a system increases: E[T] grows as Θ(ln n) and Var(T) is bounded. We can show that the probability the gossip process exceeds x rounds is no more than (π²/3)/(x − 2 ln n)².
Variations can be made on the idealized gossip algorithm to make it more suitable for use in real distributed systems.
Thanks for reading! I’ve enjoyed studying gossip algorithms the past few weeks because they’re easy to understand and easy to implement, but analyzing them involves pulling together bits of mathematics from many areas of study. If you’re an engineer reading this, hopefully it’s gotten you to think of the analysis of algorithms as a bit more than a set of facts about complexity. If you’re a math enthusiast, hopefully you’ve learned a bit more about distributed computer systems and are able to apply techniques mentioned in this post to your own problems.
Course Notes: Ayalvadi Ganesh, Complex Networks
Course Notes: Indranil Gupta, Distributed Systems
Course Notes: Martin Kleppmann, Concurrent and Distributed Systems
Course Notes: Sun He, Gossip Algorithms
Dahlia Malkhi on Randomized Gossip Methods
Patterns of Distributed Systems: Gossip Dissemination
Martin Kleppmann on Distributed Broadcast