Saturday, March 1, 2025

Multi-Version Concurrency Control (MVCC) in Databases

Introduction

Multi-version concurrency control (MVCC) is a popular optimistic technique for concurrency control in modern databases.

MVCC does not use locking. In that regard it is an optimistic technique, but it is distinct from what is known as optimistic concurrency control (OCC).

MVCC is timestamp based.

MVCC does its concurrency control by keeping multiple copies, or versions, of each data item. Every transaction sees data only as of a specific version, also known as a snapshot. Changes made by a transaction will not be seen by others until the changes are committed.

Concepts

Versioning

Versioning tuples is at the core of how MVCC does its concurrency control.

Consider the table

id, balance
A, 500
B, 1000

Now logically assume that under the hood, the database adds two hidden columns: start time (st) and end time (et).

id, balance, st, et
A, 500, 1, -
B, 1000, 2, -

Let us assume that the increasing numbers 1, 2, ... represent time, and that a - means infinity, that is, no end time. The version number can be a timestamp or something analogous to a timestamp.

In the above example, we are saying that the first tuple (A, 500) is valid from time 1 to infinity. The second tuple (B, 1000) is valid from 2 to infinity.

With versioning, when a tuple is updated, the original one is left unchanged. Instead, a new row is added.

Let us say at time 3, A is updated to 550. Logically, the table would look like

A, 500, 1, 3
A, 550, 3, -
B, 1000, 2, -

The latest committed version has an end time of infinity. When a new update for a tuple is committed, the end timestamp of the previous latest version is changed to the start timestamp of the new latest one.

A transaction reads the latest committed version at the time it starts. A transaction that starts at time 2 and reads A will read 500. But a transaction that starts at 4 and reads A will read 550.

Snapshot isolation

Every transaction only reads values that were committed at the time it started. It sees a snapshot of the database as of that time. This is referred to as snapshot isolation.

However, a transaction will see any changes it makes itself within the scope of the transaction.
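
To make this concrete, here is a minimal sketch of versioned tuples and snapshot reads in Go. The types and names are illustrative assumptions of mine, not any particular database's implementation.

```go
package mvcc

const inf = int64(1) << 62 // stands in for "-" (no end time)

// Version is one row of the hidden-column table: (value, st, et).
type Version struct {
	Value int64
	Start int64 // st: commit time of this version
	End   int64 // et: commit time of the next version, or inf
}

// Store keeps the version chain for each key, oldest first.
type Store struct {
	chains map[string][]Version
}

func NewStore() *Store {
	return &Store{chains: make(map[string][]Version)}
}

// ReadAsOf returns the committed value visible to a transaction whose
// snapshot is at time ts: the version with Start <= ts < End.
func (s *Store) ReadAsOf(key string, ts int64) (int64, bool) {
	for _, v := range s.chains[key] {
		if v.Start <= ts && ts < v.End {
			return v.Value, true
		}
	}
	return 0, false
}

// CommitWrite installs a new latest version at time ts and closes off
// the previous latest, e.g. (A, 500, 1, -) becomes (A, 500, 1, 3)
// followed by (A, 550, 3, -).
func (s *Store) CommitWrite(key string, value, ts int64) {
	chain := s.chains[key]
	if n := len(chain); n > 0 {
		chain[n-1].End = ts
	}
	s.chains[key] = append(chain, Version{Value: value, Start: ts, End: inf})
}
```

With this, ReadAsOf("A", 2) returns 500 while ReadAsOf("A", 4) returns 550, matching the walkthrough above. A transaction would consult its own pending write set before falling back to ReadAsOf, which is how it sees its own uncommitted changes.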

Use cases

It is easier to understand MVCC with some examples.

Lost update problem

Let us start with the same table again.

id, balance, st, et
A, 500, 1, -

At time 2, transaction T1 reads A. The latest version, with timestamps (1, -), has value 500. It reads 500.

At time 3, transaction T2 reads the same row. The latest version is still (1, -). It too reads 500.

At time 3, transaction T2 updates the value to 550 and commits. We now have an additional version:

A, 500, 1, 3
A, 550, 3, -

At time 5, T1 updates the value to 600.
At time 6, T1 tries to commit.

T1's commit is disallowed and it is forced to abort.
The reason is that T1's snapshot is as of time 2. After T1 started, another transaction came along, performed a write on the same record, and committed. If we allowed T1 to also commit, it would overwrite T2's update. This is the lost update problem.

The rule we can deduce here is: the write set of the committing transaction T1 should not intersect with the write set of any transaction that committed after T1 started and before T1 commits.
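
A sketch of how this rule might be checked at commit time, continuing in Go. The committedWriteSetsSince helper, which would return the write sets of transactions that committed after a given time, is an assumption here, not shown.

```go
// intersects reports whether two key sets share any key.
func intersects(a, b map[string]bool) bool {
	for k := range a {
		if b[k] {
			return true
		}
	}
	return false
}

// canCommitRule1 enforces first-committer-wins: T may commit only if
// its write set is disjoint from the write set of every transaction
// that committed after T started.
func canCommitRule1(writeSet map[string]bool, startTs int64,
	committedWriteSetsSince func(int64) []map[string]bool) bool {
	for _, w := range committedWriteSetsSince(startTs) {
		if intersects(writeSet, w) {
			return false // another transaction already wrote our row: abort
		}
	}
	return true
}
```

In the example above, T1's write set {A} intersects T2's committed write set {A}, so T1 is aborted.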

Write Skew

Let us start this time with a table having two tuples.

id, balance, st, et
A, 500, 1, -
B, 1000, 2, -


At time 1, transaction T1 reads A. The latest version, with timestamps (1, -), has value 500. It reads 500.
At time 3, transaction T1 writes A = 600. The write is still uncommitted.

A, 500, 1, -
A, 600, 3, -, uncommitted
B, 1000, 2, -

At time 4, transaction T2 reads A. The latest committed version is still the one at timestamp 1. It will read 500 (not 600).

At time 5, T1 commits. The write set of T1 does not intersect with the write set of any other transaction that committed after it started, so it is allowed to commit. We now have

A, 500, 1, 5
A, 600, 5, -
B, 1000, 2, -

At time 6, T2 updates B to 1200.

B, 1200, 6, -, uncommitted

At time 7, T2 tries to commit. It is disallowed and T2 is forced to abort. The reason is that T2 read A, and a new version of A was committed after T2 started. Even though T2 does not update A, it might have used the value of A it read earlier to compute its update to B, and that update could now be incorrect. This is the write skew problem.

We can thus deduce rule 2: the read set of a committing transaction T should not intersect with the write set of any transaction that committed after T started.
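
Extending the previous sketch with rule 2 (and reusing its intersects helper): the read set must also be disjoint from the write sets of transactions that committed in the meantime.

```go
// canCommit enforces both rules for a committing transaction: its write
// set (rule 1) and its read set (rule 2) must be disjoint from the
// write set of every transaction that committed after it started.
// committedWriteSetsSince is the same assumed helper as before.
func canCommit(readSet, writeSet map[string]bool, startTs int64,
	committedWriteSetsSince func(int64) []map[string]bool) bool {
	for _, w := range committedWriteSetsSince(startTs) {
		if intersects(writeSet, w) { // rule 1: prevents lost updates
			return false
		}
		if intersects(readSet, w) { // rule 2: prevents write skew
			return false
		}
	}
	return true
}
```

Here T2's read set {A} intersects T1's committed write set {A}, so T2 is aborted even though their write sets ({B} and {A}) are disjoint.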

Phantom reads

Let us look at a case where a new record is inserted into the table.

id, balance, st, et
A, 500, 1, -
B, 1000, 2, -

At time 3, transaction T1 runs the query "select sum(balance) from table where balance >= 500". It gets the result 1500.

At time 4, transaction T2 inserts a new row, (C, 600), and commits.

id, balance, st, et
A, 500, 1, -
B, 1000, 2, -
C, 600, 4, -

At time 5, transaction T1 inserts the 1500 value it read into another table.

At time 6, T1 tries to commit. The transaction is not allowed to proceed and aborts.

The reason is that the 1500 value is no longer accurate. At time 4, another transaction committed a new row, and the sum should now be 2100.

Rule 3: if the committing transaction T has predicate queries (WHERE clauses) whose results depend on the number of rows matching a condition and are affected by inserts or deletes, then just before committing, the queries need to be run again. The commit is allowed only if the results are the same as before.

Obviously, running the queries a second time can be expensive. But that is an implementation detail, and databases apply various optimizations.
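
One way rule 3 might look in code, as a sketch: the transaction remembers each predicate query it ran together with the result it first saw, and re-runs the query just before commit. The rerun closure is an illustrative assumption.

```go
// predicateCheck records one predicate query and the result the
// transaction saw when it first ran it.
type predicateCheck struct {
	rerun       func() int64 // re-runs e.g. select sum(balance) ... where balance >= 500
	firstResult int64        // e.g. 1500 in the example above
}

// predicatesUnchanged re-runs every recorded predicate query just
// before commit. Any difference (1500 vs 2100 above) forces an abort.
func predicatesUnchanged(checks []predicateCheck) bool {
	for _, c := range checks {
		if c.rerun() != c.firstResult {
			return false
		}
	}
	return true
}
```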

Garbage collection

Maintaining versions for each and every record can take up a lot of disk space. Databases have a background process or another mechanism to remove old versions that no active transaction depends on.
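
Continuing the Store sketch from the versioning section, garbage collection can be expressed as a watermark: any version whose end time is at or before the start time of the oldest active transaction can never be read again. How oldestActiveStart is derived from the active-transaction list is assumed here.

```go
// GarbageCollect drops versions that no current or future snapshot can
// see. The latest version of each key always survives because its end
// time is inf.
func (s *Store) GarbageCollect(oldestActiveStart int64) {
	for key, chain := range s.chains {
		kept := chain[:0] // filter in place
		for _, v := range chain {
			if v.End > oldestActiveStart {
				kept = append(kept, v) // still visible to some snapshot
			}
		}
		s.chains[key] = kept
	}
}
```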

Algorithm

Every tuple is versioned using a timestamp or something analogous to it.

Every transaction only reads rows that were committed as of its start time. This holds for the entire duration of the transaction: it cannot read any commits that happened after it started. In other words, it sees a "snapshot" of the data as of its start time. This is snapshot isolation.

When a transaction tries to commit, we check whether any other transaction has committed a new version of the rows that we read or wrote. If there is any such transaction, we must abort.

Even if our read/write set does not intersect with any other transaction's, we must still abort if our reads are affected by inserts or deletes.

To make this happen, the database needs to maintain the start time and end time for every tuple, and not just for the latest version but for older versions as well.

For every tuple that is read or written, the database needs to know which transactions are active and which have committed.

Older versions need to be cleaned up when no active transactions depend on them.

For a transaction to be allowed to commit, the rules we discussed apply:

#1: The write set of a committing transaction T should not intersect with the write set of any transaction that committed after T started.

#2: The read set of a committing transaction T should not intersect with the write set of any transaction that committed after T started.

#3: When the committing transaction T has predicate queries (WHERE clauses whose results depend on the number of rows meeting a condition), then before committing, the queries need to be rerun to ensure that they return the same results.

Some additional rules:

#4: The write set of a committing transaction T should not intersect with the read set of any transaction that committed after T started.

#5: If a transaction T updates a row, T should use its own updated version, not the one in the older snapshot it started with.
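
Pulling rules 1, 2, and 4 together in one sketch (rule 3 was sketched in the phantom reads section, and rule 5 is the read-your-own-writes behavior noted with the snapshot read sketch). As before, the bookkeeping types are assumptions, and intersects is the helper from the rule 1 sketch.

```go
// txnSets is the per-transaction bookkeeping the validator needs.
type txnSets struct {
	readSet, writeSet map[string]bool
}

// validateCommit applies rules 1, 2, and 4 for committing transaction t
// against every transaction c that committed after t started.
func validateCommit(t txnSets, committedAfterStart []txnSets) bool {
	for _, c := range committedAfterStart {
		if intersects(t.writeSet, c.writeSet) || // rule 1
			intersects(t.readSet, c.writeSet) || // rule 2
			intersects(t.writeSet, c.readSet) { // rule 4
			return false
		}
	}
	return true
}
```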

Commercial Databases supporting MVCC

Almost all the modern databases we know of support MVCC.

PostgreSQL
MySQL
Oracle
CockroachDB
SQL Server
.... and more

Advantages of MVCC

  • Readers do not block writers. Writers do not block readers.
  • Each transaction works under a consistent snapshot of data.
  • High concurrency is possible.
  • Databases can allow querying of old versions; these are known as time travel queries.

Disadvantages of MVCC

  • Increased storage requirements, because older versions are stored.
  • Garbage collection is needed.
  • When conflicts are detected and transactions are aborted, applications need to retry.

Conclusion

MVCC is a popular concurrency control technique used in many of today's popular databases, such as PostgreSQL.

It does not use two-phase locking. Unlike OCC, it does not use a separate staging area; the staging area is built in through versioning.

Like OCC, it works best for workloads where there are few conflicts.

In conclusion, multi-version concurrency control (MVCC) offers a robust approach to handling concurrent transactions in modern databases by maintaining multiple versions of data.

While it provides significant benefits such as high read throughput, reduced contention, and snapshot isolation, it also introduces complexity in garbage collection and storage management.

Modern distributed databases combine versioning with clock time to provide strict serializability, even across nodes, without the use of locking. But that is a topic for another blog.

References

Prof. Jens Dittrich's YouTube videos
Andy Pavlo's CMU database course on YouTube
Database Internals by Alex Petrov
Database Management Systems by Ramakrishnan & Gehrke

Saturday, February 1, 2025

Optimistic Concurrency Control in Databases

Introduction

Optimistic concurrency control refers to concurrency control in databases without the use of explicit locks. Traditionally, concurrency control is done using locks, but locking reduces throughput and scalability. Modern databases therefore prefer to avoid locking and use optimistic techniques. This is a brief description of the optimistic technique.

Why concurrency control?

Concurrency control in databases refers to the techniques used to execute transactions concurrently while keeping the database consistent.

From an end user's standpoint, every transaction needs to execute correctly and preserve the consistency of the database, irrespective of what other transactions are doing. To the user, it should appear that the transactions execute in some serial order.

When multiple transactions read and/or write the same data, several kinds of errors can happen, such as:

  • Dirty reads - reading uncommitted data
  • Non-repeatable reads - read a second time and you get a different value
  • Phantom reads - read a second time and you get additional rows
  • Lost update - one transaction overwrites another's write
  • Dirty write - a transaction overwrites a value that another transaction has written but not yet committed
  • Write skew - the combination of two transactions breaks some invariant, like a minimum balance
At a user level, databases give users a knob called the isolation level to protect against some of these errors. For an explanation of isolation levels, see the blog I wrote a few years ago.

But under the hood databases implement concurrency control to ensure that users see a consistent view of the data at all times.

There are three main types of concurrency control:
  • Pessimistic concurrency control
  • Optimistic concurrency control (OCC) 
  • Multi version concurrency control (MVCC)

The pessimistic approach involves locking resources (rows, tables, pages, etc.) while they are used by one transaction. Other transactions that need the resource wait until the lock is released. This approach prevents conflicts but comes at the cost of throughput.

OCC and MVCC take the optimistic view that conflicts rarely happen. They let transactions proceed without acquiring locks. If a conflict is detected downstream, one of the transactions is forced to abort.

In this blog we discuss one such technique: optimistic concurrency control. I will discuss the other two in subsequent blogs.

What is optimistic concurrency control?

As the name suggests, the approach is optimistic: it takes the positive view that most transactions do not conflict with each other, so there is no need to slow things down by acquiring locks and holding on to them.

Assumptions are that
  • Transactions are short lived
  • They rarely conflict
The concurrency control here is done in three stages.

In the first stage (read), the transaction executes its operations in a private or staging area. The rows that are read (the read set) and the rows that are updated (the write set) are identified.

In the second stage (validation), the operations are checked for conflicts. Rows being read by one transaction should not be updated by another. If a conflict is detected, the transaction is aborted and the staging area is cleared.

In the third stage (write), if there are no conflicts, the changes are committed.
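
Here is a minimal sketch of these three stages in Go. The types are illustrative assumptions of mine (the shared store is just a map), not a real database's API; validation itself is sketched in the next section.

```go
package occ

import "sync"

// DB holds the committed data.
type DB struct {
	mu   sync.Mutex
	data map[string]int64
}

// Txn buffers all work in a private staging area.
type Txn struct {
	db      *DB
	readSet map[string]bool  // rows we read
	writes  map[string]int64 // staging area: rows we intend to write
}

func (db *DB) Begin() *Txn {
	return &Txn{db: db, readSet: map[string]bool{}, writes: map[string]int64{}}
}

// Read checks the staging area first, then the committed store.
func (t *Txn) Read(key string) int64 {
	if v, ok := t.writes[key]; ok {
		return v
	}
	t.readSet[key] = true
	t.db.mu.Lock()
	defer t.db.mu.Unlock()
	return t.db.data[key]
}

// Write touches only the staging area; shared state is untouched.
func (t *Txn) Write(key string, v int64) { t.writes[key] = v }

// Commit validates and, on success, applies the staged writes. On
// failure the staging area is simply discarded (the abort).
func (t *Txn) Commit(validate func(*Txn) bool) bool {
	t.db.mu.Lock() // one validation/commit at a time
	defer t.db.mu.Unlock()
	if !validate(t) {
		t.writes = nil
		return false
	}
	for k, v := range t.writes {
		t.db.data[k] = v
	}
	return true
}
```

Note that the short critical section around validation and commit is a latch, not a transaction lock; no locks are held while the transaction does its reads and writes.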


Detecting the conflicts


Given two transactions A and B:

A and B are given timestamps (TS) before validation starts.

Let us assume TS(A) < TS(B).

If A committed before B started its read phase, then there are no issues and B is also allowed to commit.

If A committed after B started its read phase but before B started its write phase, then both can commit if the write set of A does not intersect with the read set of B.

If A completed its read phase before B completed its read phase, then both can commit if the write set of A does not intersect with either the read set or the write set of B.

In all three cases, wherever A and B overlap in time, they operate on different records and so do not conflict.

In this example, we are looking "backward" from B. If there is a conflict, B will be asked to abort. Some databases look "forward" instead, but the concept is the same.

While a validation is being done, no other transaction is allowed to commit; only one transaction is validated at a time.
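
The three cases can be sketched as backward validation in Go. The phase timestamps and the per-transaction read/write sets are assumed bookkeeping that the database would keep for recently finished transactions.

```go
// txnRecord is the bookkeeping kept for a finished transaction.
type txnRecord struct {
	readSet, writeSet map[string]bool
	readPhaseEnd      int64 // when its read phase completed
	commitEnd         int64 // when its write (commit) phase completed
}

func overlaps(a, b map[string]bool) bool {
	for k := range a {
		if b[k] {
			return true
		}
	}
	return false
}

// validateBackward decides whether B may commit, checking B against
// every earlier transaction A (TS(A) < TS(B)) that overlapped with B.
func validateBackward(b txnRecord, bReadStart, bWriteStart int64,
	earlier []txnRecord) bool {
	for _, a := range earlier {
		switch {
		case a.commitEnd < bReadStart:
			// Case 1: A finished before B started reading. Nothing to check.
		case a.commitEnd < bWriteStart:
			// Case 2: A committed while B was reading, so A's writes
			// must not intersect B's reads.
			if overlaps(a.writeSet, b.readSet) {
				return false
			}
		default:
			// Case 3: A's writes must not intersect B's reads or writes.
			if overlaps(a.writeSet, b.readSet) || overlaps(a.writeSet, b.writeSet) {
				return false
			}
		}
	}
	return true
}
```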

Timestamp based

In pessimistic concurrency control, transactions are ordered by the order in which locks are acquired.

In optimistic concurrency control, transactions are ordered based on the timestamps assigned to them.

Commercial databases using OCC

Some examples are:

Amazon Aurora DSQL
Yugabyte
CockroachDB
FoundationDB
Fauna
Google Spanner

Advantages of OCC

  • Higher scalability
  • Higher throughput
  • No deadlocks

In general, OCC is suitable for low-conflict data access.

Disadvantages

  • Transactions might need to be aborted.
  • Applications need to handle aborted transactions and retry.
  • It might not be suitable for cases where conflict is high, that is, when multiple transactions need to access the same data.

Conclusion

Optimistic concurrency control is becoming increasingly popular in modern databases that need to scale and may be distributed across multiple nodes. It works well in low-contention scenarios. It is especially attractive for distributed databases, where locking becomes even more expensive because locks must be held across nodes. For these reasons, distributed, cloud native, high concurrency systems are moving towards OCC.

References:

1. Database Management Systems - Ramakrishnan and Gehrke
2. Database Internals - Petrov
3. Andy Pavlo's CMU lectures on YouTube




Saturday, January 4, 2025

Dynago: A highly available distributed key-value store

Introduction

Dynago is my hobby project: a minimal implementation of the distributed system described in the paper "Dynamo: Amazon's Highly Available Key-value Store".

This is the first in a series of blogs in which I will describe how I go about building it and post updates on the progress.

This is my way of "#buildinpublic".

The GitHub repository is at https://github.com/mdkhanga/dynago

Why am I doing it?

The primary goal is learning. Learning not just for me but also for others who may read this blog and review the code.

A secondary goal is to build something useful. Perhaps I will use it somewhere, or someone else might.

It is possible that parts of the code might be reusable in another project.

Why a Dynamo clone?

The Dynamo paper is the first paper that got me interested in distributed systems. It has been my desire to build something like it for a long time.

In the real world, DynamoDB is Amazon's highly available key/value database. Apache Cassandra is also a project inspired by this paper.

Tech stack

I picked Go as the programming language. Why Go? I have already done a couple of complex projects in Java (a B+tree, the Raft protocol), so I was not interested in doing this one in Java. I am not yet that familiar with Rust. And C/C++ felt like going back in time. So this is an opportunity to do a complex project in Go.

For network communication / RPC, I use gRPC instead of programming directly against sockets. My Raft implementation is based on sockets, and I wanted to try gRPC. If it does not work out, I can switch to sockets.

The storage engine is TBD. Early versions will be in-memory. In the future, RocksDB or another embeddable store is a possibility.

Some technologies we might learn here:

  • Go programming
  • Distributed systems 
  • Databases
  • Network programming
  • Programming servers from scratch
  • Concurrency
  • High availability

Implementation plan

I plan to implement in the following tentative order:

1. Cluster membership

Be able to start multiple servers that connect and form a cluster. Use gossip to detect servers joining and leaving the cluster.

2. Client API

Simple Get and Put API

Replicate to all nodes.

3. Partitioning using consistent hashing (see the sketch after this list)

4. Partitioning with replicas

5. Quorum based read / writes

6. Versioning

7. Hinted handoff

...... and so on
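
As a taste of item 3, here is a sketch of a consistent hashing ring in Go. It is illustrative only, written for this post rather than taken from the Dynago repository, and a real implementation would add virtual nodes per server for better balance, as the Dynamo paper describes.

```go
package dynago

import (
	"hash/fnv"
	"sort"
)

// Ring maps keys to nodes via consistent hashing.
type Ring struct {
	hashes []uint32          // sorted hashes of the nodes on the ring
	nodes  map[uint32]string // hash -> node address
}

func NewRing() *Ring {
	return &Ring{nodes: make(map[uint32]string)}
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddNode places a node on the ring. Only the keys between the new node
// and its predecessor move, which is the point of consistent hashing.
func (r *Ring) AddNode(addr string) {
	h := hashOf(addr)
	r.nodes[h] = addr
	r.hashes = append(r.hashes, h)
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// NodeFor returns the first node clockwise from the key's hash; a Get
// or Put for the key is routed to this node (and, later, its replicas).
func (r *Ring) NodeFor(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.hashes[i]]
}
```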

Conclusion

If you are interested in distributed systems and/or databases, I invite you to follow along.

I am open to suggestions and discussion. If you know more, I am happy to learn from you.

If you like what I am working on, please follow me on twitter/X or LinkedIn.