10000 nodes and beyond with Akka Cluster and Rapid

Click for: original source

At the foundation of clustered systems are so-called membership protocols. The job of a membership protocol is to keep clustered applications up-to-date with the list of nodes that are part of the cluster, allowing all the individual nodes to act as one system. By Manuel Bernhardt.

One of the motivations for Rapid was to be faster at scale than existing consensus protocols. The Rapid paper shows that Rapid can form a 2000 node cluster 2-5.8 times faster than Memberlist (Hashicor’s Lifeguard implementation) and Zookeeper. One of the problems with membership services that rely entirely on random gossip is that random gossip leads to higher tail latencies for convergence.

Rapid is designed around the central idea that a scalable membership service needs to have high confidence in failures before acting on them and that membership change decisions affecting multiple nodes should be grouped as opposed to happen on a per-case basis. Dissemination of membership information in Rapid happens via broadcast.

The article then guides you through:

  • The Rapid membership service
  • Dissemination
  • Consensus
  • Failure detection
  • Joining the cluster
  • What Rapid brings to the table

Another one of the big contrasts that in author’s opinion sets Rapid ahead of other membership protocols is the strong stability provided by the multi-node cut detector. A flaky node can be quite problematic in protocols where the failure detection mechanism doesn’t provide this type of confidence. Super interesting read!

[Read More]

Tags scala java programming akka devops performance software-architecture