Saturday, July 26, 2014

Distributed Systems: Consensus Protocols

Modern software systems scale by partitioning data and distributing it across several machines. Systems are made highly available by replicating data across multiple machines. When multiple machines are involved in managing state, they need to agree when a particular piece of data needs to change.

You are probably familiar with the concept of a transaction in a relational database. A transaction is a unit of work (such as an insert, an update, or some combination of multiple statements) that as a whole is either committed or aborted. What if the work involves updating multiple databases that are on different machines? To ensure consistent state across the system, all the databases must agree on whether to commit or abort the state change.

Modern distributed NoSQL databases have a similar but slightly different problem. If you have a single server and set a value v=8 on it, there is no doubt what the value of v is: any client that connects to the server reads the value as 8. What if you had a cluster of 3 servers? Would a client connecting to one of the servers see the value as 8? Consensus is required to ensure all servers agree on what the value of v is.

Consider systems like Apache Zookeeper or Apache Cassandra. To ensure high availability, clients can connect to any node in the cluster and read or write data. To ensure consistency in the cluster, some consensus is required among the nodes in the cluster when state changes.

In the rest of this post we briefly cover some distributed protocols, starting with two phase commit, which users of relational databases are very familiar with, and its three phase variant. We will then talk about Paxos, ZAB and Raft. Paxos became popular because it was used by Google for its distributed systems. ZAB is used by Zookeeper, which is an important component of the Hadoop ecosystem. These protocols are hard to understand and no attempt is made to go into detail. The purpose is to introduce readers to some of the consensus concepts that are important in distributed systems.

1. Two phase commit

Used in databases to ensure all participants in distributed updates either commit or abort the changes.
One node called the co-ordinator originates the transaction.

1.1 Co-ordinator sends a prepare message to all participants.
1.2 Each participant replies with a yes if it can commit its part of the transaction or a no otherwise.
1.3 If the co-ordinator receives a yes from all participants, it sends a commit message to the participants. Otherwise it sends an abort message.
1.4 If the participant receives a commit message, it commits its change. If it receives an abort message, it aborts the change. In both cases, it sends an acknowledgement back to the co-ordinator.
1.5 The transaction is complete when the co-ordinator receives all acknowledgements.

One limitation of this protocol is that it can block: if the co-ordinator crashes after the participants have voted yes, the participants do not know whether to commit or abort the transaction, as they do not know how the other participants responded.
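
The five steps above can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions: the Participant class, its can_commit flag and the two_phase_commit function are hypothetical names, and messages are modeled as plain method calls with no crashes or timeouts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """Hypothetical participant; `can_commit` stands in for local checks."""
    name: str
    can_commit: bool = True
    state: str = "init"  # init -> prepared -> committed / aborted

    def prepare(self) -> bool:
        # Step 1.2: vote yes only if the local part can be committed.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def finish(self, decision: str) -> bool:
        # Step 1.4: apply the co-ordinator's decision and acknowledge it.
        self.state = "committed" if decision == "commit" else "aborted"
        return True


def two_phase_commit(participants: List[Participant]) -> str:
    # Steps 1.1-1.2: send prepare to everyone and collect the votes.
    votes = [p.prepare() for p in participants]
    # Step 1.3: commit only on a unanimous yes.
    decision = "commit" if all(votes) else "abort"
    # Steps 1.4-1.5: deliver the decision and collect every acknowledgement.
    acks = [p.finish(decision) for p in participants]
    assert all(acks)
    return decision
```

If every participant votes yes, two_phase_commit returns "commit"; a single participant created with can_commit=False makes it return "abort" for everyone. Because the co-ordinator here is just a function call, the crash scenario described above is not modeled.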

2. Three phase commit

The protocol attempts to let the participants make progress even if the co-ordinator fails.

2.1 Co-ordinator sends a prepare message to all participants.
2.2 Each participant replies with a yes if it can commit its part of the transaction or a no otherwise.
2.3 If the co-ordinator receives a yes from all participants, it sends a pre-commit message to all participants. Otherwise it sends an abort message.
2.4 When the co-ordinator receives an acknowledgement of the pre-commit from all participants, it sends a commit message to all participants.

If the co-ordinator fails, the participants can communicate with each other and determine whether to commit or abort.
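
How the participants reach that decision can be sketched as a simple rule over their states. The sketch below is a simplification, assuming that only the co-ordinator has failed and that all participants can reach each other; the State enum and the decide_without_coordinator function are hypothetical names used for illustration.

```python
from enum import Enum
from typing import List


class State(Enum):
    INIT = "init"              # has not voted yet
    READY = "ready"            # voted yes, waiting for pre-commit
    PRE_COMMIT = "pre-commit"  # received the pre-commit message (step 2.3)
    COMMITTED = "committed"
    ABORTED = "aborted"


def decide_without_coordinator(states: List[State]) -> str:
    """Hypothetical recovery rule run by the surviving participants."""
    if State.COMMITTED in states:
        return "commit"  # someone already committed, so everyone must
    if State.ABORTED in states:
        return "abort"
    if State.PRE_COMMIT in states:
        # A pre-commit is only sent after every participant voted yes.
        return "commit"
    # Nobody saw a pre-commit, so the co-ordinator cannot have committed.
    return "abort"
```

The extra pre-commit phase is what makes this possible: a participant in the pre-commit state is proof that all votes were yes, which is information the participants in two phase commit do not have.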


3. Paxos

Paxos was first published in the nineties but it became more popular after Google implemented and used it in its distributed infrastructure. The protocol is notorious for being difficult to understand. Below is a very brief description. See references for more details.

There are nodes that propose values, called proposers, and nodes that accept values, called acceptors.

3.1 A proposer with a value to propose submits a proposal (v,n) with value v and sequence number n.

3.2 When an acceptor receives a proposal (v,n), it compares n with the highest sequence number it has agreed to so far. If n is higher, the acceptor replies agree and sends back the value and sequence number of any proposal it has previously accepted. If the acceptor has already agreed to a higher sequence number, it rejects the current proposal.

3.3 If the proposer receives agree from a majority of acceptors, it must use the value of the highest-numbered proposal sent back by the acceptors. If the acceptors have not sent any value, it can pick its own value. It then sends a commit message with the chosen value to the acceptors. If a majority reject or do not respond, it aborts this proposal and tries again with a higher sequence number.

3.4 When the acceptor receives a commit message, it agrees to commit if the sequence number is the highest it has agreed to or if the value is the same as the last accepted proposal. Otherwise it rejects the commit.

3.5 If a majority accept the commit, the proposal is complete. Otherwise abort and try again.
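
Steps 3.1 to 3.5 can be sketched for a single value as follows. This is a toy, single-process sketch and not a faithful Paxos implementation: the Acceptor class and the propose function are hypothetical names, messages are method calls that never get lost, and the first message carries only the sequence number n. When acceptors report earlier proposals, the sketch adopts the value with the highest sequence number.

```python
from typing import List, Optional, Tuple


class Acceptor:
    def __init__(self) -> None:
        self.promised_n = -1  # highest sequence number agreed to
        self.accepted: Optional[Tuple[int, str]] = None  # last accepted (n, value)

    def on_propose(self, n: int):
        # Step 3.2: agree only to proposals newer than anything seen so far,
        # and report any previously accepted proposal.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted
        return False, None

    def on_commit(self, n: int, value: str) -> bool:
        # Step 3.4: accept unless a higher-numbered proposal was agreed to.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    majority = len(acceptors) // 2 + 1
    replies = [a.on_propose(n) for a in acceptors]
    agreed = [prev for ok, prev in replies if ok]
    if len(agreed) < majority:
        return None  # step 3.3: abort; the caller retries with a larger n
    previous = [p for p in agreed if p is not None]
    if previous:
        # Adopt the value of the highest-numbered accepted proposal.
        value = max(previous, key=lambda p: p[0])[1]
    accepted = sum(a.on_commit(n, value) for a in acceptors)
    # Step 3.5: the proposal is complete only with a majority of accepts.
    return value if accepted >= majority else None
```

With three fresh acceptors, propose(acceptors, 1, "8") settles on "8". A later propose(acceptors, 2, "9") also returns "8", because the proposer has to adopt the value that was already accepted, and a proposal that reuses sequence number 1 is rejected.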

The key takeaway is that majorities are used to accept proposals. If there are multiple proposers competing to set a value, it is possible that no progress is made in accepting values. The solution is to elect a leader that proposes values. Other players in the system can be learners, who learn about accepted values from either the leader or other participants.
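
Majorities work because any two of them overlap in at least one node, so a later majority always includes someone who knows what an earlier majority accepted. The short check below illustrates this for a hypothetical five-node cluster; the node names and the majorities helper are made up for the example.

```python
from itertools import combinations


def majorities(nodes):
    """All smallest-possible majorities of `nodes`."""
    quorum = len(nodes) // 2 + 1
    return [set(c) for c in combinations(nodes, quorum)]


nodes = ["a", "b", "c", "d", "e"]
quorums = majorities(nodes)
# Every pair of majorities shares at least one node.
overlap = all(q1 & q2 for q1, q2 in combinations(quorums, 2))
print(len(quorums), overlap)  # prints: 10 True
```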

4. ZAB (Zookeeper Atomic Broadcast)

ZAB was developed for use in Apache Zookeeper due to limitations in Paxos. In Zookeeper, the order in which changes are applied is important. In Paxos, it is possible that updates get applied by acceptors out of order.

ZAB is similar to Paxos in that a leader proposes values and values are accepted based on majority vote. The key difference is that strict order of updates is maintained. If the leader crashes and a new leader is elected, the updates are applied in the original order.
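
Zookeeper tracks this order with a transaction id (zxid) that combines the leader's epoch with a counter. The sketch below is a toy model of that idea and not Zookeeper's actual code: the Leader and Follower classes are hypothetical, the zxid is represented as an (epoch, counter) tuple, and the majority vote is left out.

```python
from typing import List, Tuple

Zxid = Tuple[int, int]  # (epoch, counter); an illustrative representation


class Leader:
    def __init__(self, epoch: int) -> None:
        self.epoch = epoch
        self.counter = 0

    def next_zxid(self) -> Zxid:
        # Each update gets the next counter within this leader's epoch.
        self.counter += 1
        return (self.epoch, self.counter)


class Follower:
    def __init__(self) -> None:
        self.last_applied: Zxid = (0, 0)
        self.log: List[str] = []

    def apply(self, zxid: Zxid, update: str) -> bool:
        # Refuse anything that is not newer than the last applied update,
        # so the log can only grow in zxid order.
        if zxid <= self.last_applied:
            return False
        self.last_applied = zxid
        self.log.append(update)
        return True
```

Because tuples compare by epoch first, every update from a newly elected leader with a higher epoch sorts after all updates from the previous leader, and a follower in this sketch refuses to apply an older update late.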


5. Raft

Raft is another distributed consensus protocol that claims to be simpler than Paxos or ZAB.

A node can be a leader, a follower or a candidate.

5.1 By default all nodes are followers. When there is no leader, a node can make itself a candidate for leadership and solicit votes.

5.2 The candidate that gets a majority of the votes is elected leader.

5.3 A client submits its updates to the leader. The leader appends the update to its log (uncommitted) and sends the update to the followers.

5.4 When the leader hears from enough followers that the update is stored on a majority of nodes (the leader included), it commits the change and informs the followers of the commit.

5.5 Followers commit the update.

5.6 If a leader terminates for some reason, one of the followers turns itself into a candidate and gets elected as the leader.
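
Steps 5.3 to 5.5 can be sketched as below. This is a minimal sketch under several assumptions: the Node class and the replicate function are hypothetical names, leader election and terms are not modeled, and a crashed node is simulated with an alive flag. The leader counts its own copy of the entry toward the majority, as Raft does.

```python
from typing import List


class Node:
    def __init__(self, name: str, alive: bool = True) -> None:
        self.name = name
        self.alive = alive
        self.log: List[str] = []  # entries received, possibly uncommitted
        self.commit_index = 0     # number of log entries known to be committed

    def append(self, entry: str) -> bool:
        # Step 5.3: a follower stores the entry and acknowledges it.
        if self.alive:
            self.log.append(entry)
        return self.alive


def replicate(leader: Node, followers: List[Node], entry: str) -> bool:
    cluster_size = len(followers) + 1
    leader.log.append(entry)  # uncommitted on the leader
    acks = 1 + sum(f.append(entry) for f in followers)  # leader counts itself
    if acks <= cluster_size // 2:
        return False  # no majority, so the entry stays uncommitted
    # Steps 5.4-5.5: the leader commits, then tells the followers to commit.
    leader.commit_index = len(leader.log)
    for f in followers:
        if f.alive:
            f.commit_index = len(f.log)
    return True
```

In a three-node cluster with one follower down, replicate still returns True because two of the three nodes hold the entry. If the second follower also goes down, the entry stays in the leader's log but is not committed.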

We have given a brief description of some consensus protocols. If you use Hadoop, Cassandra, Kafka or similar distributed systems, you will run into these protocols. For more details, some references are provided below.


1. Database Management Systems by Ramakrishnan and Gehrke
2. Paxos made simple
3. Paxos by example
4. The secret lives of data
5. Apache Zookeeper
6. Paxos paper trail
7. Raft Consensus