Code & Cluster: Distributed Consensus: Raft

In the Paxos blog, we discussed the distributed consensus problem and how the Paxos algorithm describes a method for a cluster of servers to achieve consensus on a decision.

However the Paxos as a protocol is hard to understand and even harder to implement.

Examples of problems that need consensus are :

- Servers needing to agree on a value, such as whether a distributed lock is acquired or not.
- Servers needing to agree on order of events.
- Server need to agree on the state of a configuration value
- Any problem where you need 100% consistency in a distributed environment.

in a highly available environment.

The raft algorithm described https://raft.github.io/ solves the same problem, It is easier to understand and implement.

Overview

The key elements of raft are:

Leader Election
Log Replication
Consistency ( in the spec they refer to this as safety)

A server in the cluster is the leader. All other servers are followers.

The leader is elected by majority vote.

Every consensus decision begins with the leader sending a value to followers.

If a majority of followers respond having received the value, the leader commits the value and then tells all servers to commit the value.

Clients communicate with leader only.

If a leader crashes, another leader is elected. Messages between leaders and followers enable the followers to determine if leader is still alive.

If a follower does not receive messages from the leader for a certain period, it can try to become a leader by soliciting votes.

If multiple leaders try to get elected at the same time, it is possible there is no majority. In such situations, the candidates try to get elected again after a random delay.

In Raft, time is set of sequential terms. Term is time of certain length. Leadership is for a term.

2 Main RPC messages between leader and followers :

Request Vote : sent when a candidate solicits vote.
Append Entry : sent by leader to replicate a log entry

Scenario 1 : leader election cluster start up

let us say 5 servers s1 to s5.

Every server is a follower.

No one is getting messages from a leader.

s1 and s3 decide to become leaders (called candidates) and send message to other servers to solicit vote.
Servers always vote for themselves.
s2 and s4 respond to s1.
s1 is elected leader

Scenario 2 : log replication

A client connects to a leader s1 to set x = 3.

s1 writes x=3 to its log. But its state is unchanged.

At this point the change is uncommitted.

s1 sends appendEntry message to all followers that x =3. Each follower writes that entry to log.

Followers respond to s1 that change is written to log.

When majority of followers respond, s1 commits the change. It applies the change to its state so x is now 3.

Followers are told to commit by piggybacking the last committed entry, in the next appendEntry message. Followers commit the entry by applying the change to its state.

When a change is committed, all previous changes are considered committed.

The next appendEntry message from leader to followers will include the previous committed entry. The servers can they commit any previous entries they have not yet committed.

The cluster of servers has consensus.

Scenario 3 : Leader goes down

When a leader goes down, one or more of the followers can detect that there are no messages from the leader and decide to be come a candidate by soliciting votes.

But the leader that just went down has the accurate record of committed entries, that some of the followers might not have.

If a follower that was behind on committed entries became a leader, it could force other servers with later committed entries to overwrite their entries. That should not be allowed. Committed entries should never change.

Raft prevents this situation by requiring candidate to send with the requestVote message the term and index of the latest message it accepted from the previous leader. Each Follower rejects requestVote with a term/index lower than its highest term/index.

Since a leader only commits entries accepted by a majority of servers and a majority of servers is required to get elected, it follows that a majority or a least half of the remaining servers have accepted that highest committed entry of the leader that went down.Thus a follower that does not have the highest committed entry from the previous leader can never get elected.

Scenario 4 : Catch up for servers that have been down

It is possible for a follower to miss committed entries either because it went down or did not receive the message.

To ensure followers catch up and stay consistent with leaders, RAFT has a consistency check.

Every append entry message also includes term and index of previous message in leaders log. If it does not match in the follower, the follower rejects the new message. When AppendEntry is accepted by a server, it means leader and server have identical entries.

When a follower rejects an appendEntry, the server retries that follower with a previous entry. This continues until the follower accepts an entry. Once an entry is accepted, the leader will again send subsequent entries that will be accepted.

Leader maintains a nextIndex for each follower. This is the index in the log that the leader will send to the follower next.

Scenario 5 : Cluster membership changes

Cluster membership changes refers to a bunch of servers being added or removed from the cluster.
May be even the current leader is no longer in new configuration.

This needs some attention because it is possible we end up with 2 leaders and 2 majorities.

Raft takes a two phase approach

First switch to joint consensus
-- entries committed to servers in both configuration
-- 2 leaders one from each configuration
-- majority from each configuration needs to approve stuff

Second switch to new configuration

Leader receives request to change configuration to new.
Leader uses appendEntry to send old,new config pair to followers
Once (old,new) is committed , we are in joint consensus period.
Leader then sends appendEntry for new configuration.
Once committed, new configuration is in effect.

Summary

Consensus is required when you need 100% consistency in a distributed environment that is also highly available. Raft simplifies distributed consensus by breaking the problem into leader election and log replication. Easier to understand means easier to implement and use to solve real world problems.

A future blog will go into an implementation.

References:

1. In Search of an Understandable Consensus Algorithm by Diego Ongaro and John Ousterhout Stanford University. https://raft.github.io/raft.pdf

Related Blogs:

1. Distributed Consensus : PAXOS

Code & Cluster

Tuesday, July 4, 2017

Distributed Consensus: Raft