How to build reliable highly available distributed systems that are consistent ? Paxos is a protocol that addresses this problem.
Paxos was authored by Leslie Lamport in his paper "Part time parliament" and explained better in his paper "Paxos made simple". It has been implemented and used in many of the modern distributed systems built by Google, Amazon, Microsoft etc.
Consider a banking system with clients c1,c2 and server s.
c1 can issue command to s : add $200 to account A
c2 can issue command: add 2% interest to A
A single server can easily determine the order c1,c2 and execute the commands.
But if S crashes, no client can do any work. The traditional way to solve this problem is to fail over using standby. Another server S' , identical to S is standing around doing nothing. When S crashes, the system detects that S is no longer servicing requests, starts sending requests to S'. For reasons that merit a blog of its own, fail over using standby is hard to implement , to test and more expensive. That will not be discussed here.
A second limitation is that when the number of client requests increase, the server may not be able to keep up and respond in a reasonable time.
The way to solve scalability and fail over issues is to have multiple servers says s1 and s2 servicing clients with replication between s1 ans s2, so the clients are presented with a single system view. Commands that execute on s1 due to its clients are also made to execute on s2 and vice versa, so that both s1 and s2 are in the same state.
In Figure 1 Both S1 and S2 are active and servicing clients.
Figure 1 : Multiple server cluster with replication
However a difference in the order of execution can lead to a consistency issue.
Assume account A has $1000
Say c1 connects to s1 invokes command subtract 200 from account A.
Say c2 connects to s2 and invokes command debit 5% interest to account A.
If the order of execution is c1,c2 then c1 increases A to 1200. c2 increases it to 1260. If the order of execution is c2,c1, then c2 increases A to 1250. c1 increases it to 1250. One server may have a value 1260 , while the other may have a value 1250.
To ensure consistent results, both servers need to agree on the order of execution. In other words, there needs to be consensus among the servers. You can have more then 2 servers and the same is true.
Paxos is a protocol for achieving consensus among a group of servers.
2.0 Paxos assumptions
A group of distributed servers communicating with each other.
non byzantine : no devious unpredictable stuff
Messages can be lost.
A majority of servers are always available. So your system should have an odd number of server 3,5,7... so that a majority can be established. Common logic tells us that with an odd number of servers, any two majorities should have at least one overlapping member. This observation is critical to the correctness of the protocol.
Server can join or leave the system at any time.
3.0 What Paxos achieves
Reliable system with unreliable components.
Only one value may be chosen.
The value is chosen when it chosen by a majority of the servers.
Once a value is chosen, it cannot be changed.
This "one value" concept can be hard to understand for a first time reader. How is choosing just one value useful in solving any real world problems ? In reality , the value is likely to be a command that needs to be executed on the server. It could be command and data or both. Value is a simplification.
For this to be useful, the servers probably need consensus on not just one value but several values. That can be achieved by a minor extension to basic Paxos and will be discussed in a subsequent blog.
4.0 Actors in Paxos
Proposers propose a value to be chosen. Proposers are generally the ones handling client requests.
Acceptors respond to proposers and can be part of the majority that lead to a value being chosen.
Learners learn the chosen value and may put it to some use.
In reality, a single server may function as all 3 and this is what we will assume
5.0 The protocol
(1) Proposer proposes a value (n,v) where n is a proposal number and v is a value.
(2) If an acceptor has not received any other proposal, it sends a response agreeing to not accept
any other proposals with number less than n.
If proposal number is less than what it has accepted or agreed to accept, it can ignore the proposal.
If it has other lower number proposals accepted with value v or any other v', it responds with the accepted proposal number and value v'.
The acceptor continues to do this with any subsequent proposal it receives. It must remember the highest proposal number it has. This is important because as described in step 5, it should never accept any lower numbered proposals.
(3)The proposer examines responses to its proposal.
If majority of acceptors responded with value v', then mean that v' is either chosen or has a good chance of being chosen. Proposer must take v' as value . If majority does not have value, it can stay with original v.
(4) Proposer sends accept message (n,v')
(5) When an acceptor receives an accept message (n,v) or (n,v') . It must accept the value if n is still the highest proposal it has.
Between step 2 and 5, other proposers could have send other proposals with number higher than n. If the acceptor has any such proposals, it cannot accept n.
In either case, it returns to the proposer, the highest proposal number it has.
(6) The proposer can use the proposal number in response from 5 to determine if its accept message is accepted. If the proposal number is the same as n, then it known that n is accepted.
Other wise the returned number is that of the larger proposal number that is around. It has to go back to step 1 and start with new proposal number greater than this.
Proposals have order as indicated by n. Newer proposals override older proposals. If an acceptor has received proposal n. It can ignore all proposals less than n.
Proposal numbers need to be unique across proposers.
Multiple rounds of propose and accept may be necessary before a majority for a chosen value is reached.
Once a value is chosen, future proposers will also have to choose that value. That is the only way we can get to one and only one value chosen.
Proposals are ordered. Older ones are ignored or rejected
In this section we go through some scenarios of how the protocol works.
7.1 Case 1 : Value chosen for the first time
Figure 2 : Value chosen for first time
This is the most basic case of no value yet chosen and a value proposed for the first time.
3 servers s1,s2,s3. 2 is majority
1. s1 sends proposal 1 with value X to s2
2. s2 has no previous proposal or value , so it responds agreeing to not accept any proposals numbers less than 1
3. s1 has agreement from majority. So it sends accept message to s2 which accepts and X is the chose value.
7.2 Case 2 : Value proposed after one already chosen
s1 and s2 have agreed on value X.
s3 does not know of this. s3 send proposal 2 with value Y to s2.
s1 responds that it has accepted proposal 1 with value X.
s2 has to update its value to X. s2 sends an accept message with value X which is accepted.
s1,s2,s3, all have value X.
7.3 Case 3: No value yet chosen two competing proposals 1 wins
Figure 4: Competing proposals
There are 5 servers s1,s2,s3,s4,s5. 3 is majority
s1 sends proposal 1 with value X to s2,s3
s2 agrees to accept 1,X
Before s3 receives accept message for (1,X) s5 sends (2,Y) to s3,s4
Now s3 cannot accept (1,x) because its highest proposal is 2.
s3,s4 respond agreement to (2,Y) to s5
s3 ignores proposal (1,X)
s5 sends accept(2,Y) to s3,s4 which accept
s1 sends a new proposal (3,X).
s3 responds (2,Y)
s1 sends accept(3,Y)
s1 and s2 also agree on Y
7.4 Case 4 : Not making progress or liveness
Figure 5 : Not making progress
s1 proposes values to s2 ,s3. s5 proposes values to s3,s4.
s1 proposes (1,X). s3 agrees to not accept proposal less than 1.
Before (1,X) can be accepted s5 proposes (2,Y). s3 now agrees not to accept less than 2
When s1 tries to get (1,X) accepted, It will not get accepted because there is a proposal 2. It sends out a proposal (3,x).
s2 will not be able to get (2,Y) accepted because there is a (3,X). It sends outs a (4,Y).
s1 will not be able to get (3,X) accepted because there is a (4,Y). It sends out a (5,x)
This may go on and on. One way to avoid this is for each server to introduce a random delay before issuing the next proposal, there by givings the others a chance to get their proposals accepted.
Another solution is to have a leader among the servers and have the leader be the only one that issues proposals.
8.0 Paxos usage in real world
The basic protocol enable servers to arrive at a consensus on one value. How does one value apply to real world system ? To solve real world problems like the one described in the introduction, you have run multiple instances or interactions of Paxos. For a group of servers to agree on the order of a set of commands, think of a list of command 0 .. n. Each Paxos instance would pick a command at each index. This is multi Paxos and merits a blog or discussion on its own.
Some real world usages of Paxos have been to arrive at consensus on locks, configuration changes, counters.
"Part time parliament" by Leslie Lamport
"Paxos made Simple" by Leslie Lamport
"Time, clocks and the ordering of events in a distributed system" by Leslie Lamport