Showing posts with label distributed applications. Show all posts

Sunday, June 30, 2024

CockroachDb Review: Should I use CockroachDb ?


CockroachDb is a modern distributed database that promises linear scalability with strict serializability.

Server-side sharding is automatic, and nodes can be added easily as needed. It claims to provide the SERIALIZABLE isolation level.

Most distributed databases such as Cassandra, MongoDb, HBase etc. sacrifice consistency to achieve high availability. CockroachDb distinguishes itself by claiming to be distributed and at the same time to offer strong consistency that even single node databases do not offer.

This falls into a database category called NewSql or DistributedSQL as opposed to NoSql (Cassandra, MongoDb)

When to choose CockroachDb ?

You should choose CockroachDb if

    Your data is of a global scale.

    As data size increases, you need to scale horizontally to several nodes.

    You need data to be distributed and localized in specific geographical regions. For
    example EU data resides in Europe while US data resides in US.

    You need strong consistency. Serializable isolation level.

    You need to keep the SQL / relational data model.

    You need distributed transactions.

You may want to pass on it if

    Your data size can easily fit on a node for the foreseeable future.

    Your organization is more comfortable with a stable proven database. (CockroachDb is
    still maturing).

    Your data model is heavily normalized and you do a lot of joins in your queries. While this
    database can support joins, it is still not recommended in a highly distributed


Architecture is based on Google's Spanner paper.

It is a key value store with a SQL interface on top of it.

Database is a cluster of nodes. All nodes are equal. Nodes may join and leave the cluster at any time.

The data is a sorted map of key-value pairs in a fully ordered, monolithic key space. All tables and indexes go into the same key space by encoding tablename/indexname/key together.
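A minimal sketch of the idea, assuming a simple string encoding of tablename/indexname/key. The `encode_key` helper and the `OrderedKV` class are hypothetical names for illustration and are not CockroachDb's actual encoding. Because all keys live in one sorted space, the rows of one table sit next to each other and can be read with a prefix scan.

```python
from bisect import bisect_left

def encode_key(table, index, key):
    """Encode table/index/key into one string so all keys share one sorted space."""
    return f"/{table}/{index}/{key}"

class OrderedKV:
    """A toy sorted key-value map (illustrative only)."""
    def __init__(self):
        self._keys = []      # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect_left(self._keys, prefix)
        out = []
        for k in self._keys[start:]:
            if not k.startswith(prefix):
                break
            out.append((k, self._values[k]))
        return out

kv = OrderedKV()
kv.put(encode_key("users", "primary", "0002"), "bob")
kv.put(encode_key("orders", "primary", "0001"), "order-1")
kv.put(encode_key("users", "primary", "0001"), "alice")

# All rows of the users table are contiguous, so one scan returns them in order.
users = kv.scan("/users/primary/")
```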


Key value pairs are broken up into contiguous ranges. 

When a range reaches 512 MiB (2^29 bytes), it is split into 2 ranges.

Each range is assigned to a node and replicated. 

If you have 1 node all the shards are in that node. To scale, you add more nodes and the shards get distributed across nodes. A minimum of 3 nodes is recommended. 

Very easily spin up node(s) and add to cluster anywhere. 

A B-tree-like index structure is used to locate the shard that holds a key.
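The splitting and lookup described above can be sketched roughly as follows. This is a toy model under stated assumptions: a flat sorted list of range start keys stands in for the B-tree-like index, and the `RangeDirectory` class, the node names and the split key are all hypothetical.

```python
from bisect import bisect_right

MAX_RANGE_BYTES = 512 * 1024 * 1024  # 512 MiB

class RangeDirectory:
    """Tracks the start key of each range; ranges are contiguous and sorted."""
    def __init__(self):
        self.starts = [""]       # the first range starts at the empty key
        self.nodes = ["node1"]   # node holding each range (hypothetical names)

    def locate(self, key):
        """Binary search for the range whose span contains key."""
        i = bisect_right(self.starts, key) - 1
        return i, self.nodes[i]

    def split(self, i, split_key, new_node):
        """Split range i at split_key; the right half is placed on new_node."""
        self.starts.insert(i + 1, split_key)
        self.nodes.insert(i + 1, new_node)

def needs_split(size_bytes):
    return size_bytes >= MAX_RANGE_BYTES

d = RangeDirectory()
if needs_split(600 * 1024 * 1024):   # the single range has grown past the limit
    d.split(0, "m", "node2")
```

After the split, keys below "m" are found on node1 and keys from "m" onward on node2.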


Data in each range is replicated using the Raft consensus algorithm.

A minimum replication factor of 3 is needed.

This provides high availability. The data in a range is available as long as a majority of its replicas are available.
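The majority rule is simple arithmetic, sketched below. The function names are made up for illustration.

```python
def quorum(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas):
    """How many replicas can be lost while a majority remains."""
    return replicas - quorum(replicas)

def is_available(replicas, alive):
    """A range stays available while a majority of its replicas are alive."""
    return alive >= quorum(replicas)
```

With a replication factor of 3 one replica can fail, and with 5 two can fail. Going from 3 to 4 replicas does not add any fault tolerance, which is why odd replication factors are the usual choice.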


By adding a country or region to the primary key, you can limit the storage of keys to a particular region. So European data can be made to reside in Europe, US data in the US, and so on. This has 2 benefits:

There is a performance benefit, because data is local to its users.

It can satisfy legal requirements where data is not allowed to leave a country or region.
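A rough sketch of the idea, assuming a hypothetical table keyed by (region, user_id) and a made-up mapping from region to nodes. It only illustrates why a region prefix in the key makes placement by region possible; it is not how CockroachDb is configured.

```python
# Hypothetical mapping from region prefix to the nodes allowed to store it.
REGION_NODES = {
    "eu": ["eu-node1", "eu-node2", "eu-node3"],
    "us": ["us-node1", "us-node2", "us-node3"],
}

def primary_key(region, user_id):
    """Put the region first so that all rows of a region sort together."""
    return (region, user_id)

def nodes_for(key):
    """Rows are placed only on nodes of the region named in the key."""
    region = key[0]
    return REGION_NODES[region]

rows = [primary_key("us", 7), primary_key("eu", 3), primary_key("eu", 1)]
rows.sort()   # region-prefixed keys form one contiguous span per region
```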



Any node can receive a request to read a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader returns the data to the node that requested it. Since leader returns the data, no consensus is required.

Node returns it to the client.


Any node can receive a request to write a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader writes the value to its log and initiates consensus with replicas for the range. When a majority acknowledges, the key/value is considered committed and the leader notifies the requesting node, which notifies the client.
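The write path above can be sketched as a leader that counts acknowledgements. This is a deliberately simplified model, not an implementation of Raft: there are no terms, elections or retries, and the `Replica` and `Leader` classes are hypothetical.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, entry):
        """Return True if the entry was stored, False if the replica is down."""
        if not self.up:
            return False
        self.log.append(entry)
        return True

class Leader(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.committed = {}

    def write(self, key, value):
        entry = (key, value)
        acks = 1 if self.append(entry) else 0          # the leader counts itself
        acks += sum(1 for f in self.followers if f.append(entry))
        total = len(self.followers) + 1
        if acks >= total // 2 + 1:                     # majority acknowledged
            self.committed[key] = value
            return True
        return False

followers = [Replica("n2"), Replica("n3", up=False)]
leader = Leader("n1", followers)
ok = leader.write("k", "v")          # 2 of 3 replicas acknowledge

followers[0].up = False
ok2 = leader.write("k2", "v2")       # only the leader is left, no majority
```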


Supports transactions that span multiple tables and rows.

Transactions can be distributed, that is span multiple nodes.

The supported isolation level is strict serializability which is the highest isolation level. Strict serializability means that not only are transactions ordered, but they are ordered as per wall clock time.
Transaction protocol is an improvement over two phase commit. In parallel, participants acquire locks and create write intents. The transaction is marked staged. When the client commits, if all locks are acquired and writes are replicated, the coordinator immediately returns success to client. In background the transaction is marked committed. This is one round trip between transaction coordinator and each participant - unlike two phase commit - which requires two round trips.

Hybrid logical clocks are used to timestamp each transaction. Timestamp is the version for MVCC.
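For readers unfamiliar with hybrid logical clocks, here is a minimal sketch of the general algorithm. It is not CockroachDb's code. The physical clock is passed in as a function so the example is deterministic, and the fixed clock values are made up.

```python
class HybridLogicalClock:
    def __init__(self, physical_clock):
        self.physical_clock = physical_clock  # callable returning an integer time
        self.l = 0   # highest physical time seen so far
        self.c = 0   # counter that breaks ties within the same l

    def now(self):
        """Timestamp a local event or an outgoing message."""
        pt = self.physical_clock()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote):
        """Merge a timestamp received from another node."""
        rl, rc = remote
        pt = self.physical_clock()
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

a = HybridLogicalClock(lambda: 10)   # node A's wall clock reads 10
b = HybridLogicalClock(lambda: 5)    # node B's wall clock lags behind

t1 = a.now()
t2 = a.now()            # same wall clock reading, the counter breaks the tie
t3 = b.update(t2)       # B receives a message stamped t2
t4 = b.now()
```

Even though node B's wall clock is behind, the timestamps it issues after hearing from A are still ordered after A's, which is what makes them usable as MVCC versions.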

Data Model

Clients see only the SQL row column relation model
Wire protocol is same as Postgresql wire protocol.


Efficient range scan.
Geo partitioning improves performance by locality.
Distributed SQL execution.
Distributed transactions will be slow.
Generally you do not want distributed transactions over large distances. If you build a 3 node CockroachDb cluster with 1 node in NewYork, 1 in London and 1 in San Francisco, the write latencies are going to be very high due to the round trips for RAFT and distributed transactions. The cluster topology needs to be designed appropriately to give you the lowest latency at the desired level of high availability.


Good command line tools and a UI console make administration easy.
Since all nodes are equal, the number of moving parts that need to be administered is low.


If you need a globally distributed database with strict serializability, this is definitely a database to look at. It has good pedigree. However, remember that distributed databases are not a drop-in replacement for your traditional RDBMSs. Distributed queries, especially joins, and distributed transactions can be slow. So some application redesign and some denormalization are always required.

Note: Moved from

Sunday, November 1, 2020

Building Globally Distributed Applications

A globally distributed application is one where the services and data for the application are partitioned and replicated across multiple regions over the globe. Popular distributed applications that everyone is familiar with are Facebook, Gmail, Twitter and Instagram. However, more and more enterprise applications are finding the need to become distributed because their user base is increasingly distributed around the globe. But not every company has the expertise of a Facebook or Amazon or Google. When going distributed, it is not enough to just spin up instances of your service on AWS or Google cloud in various regions. There are issues related to data that must be addressed for the application to work correctly. While consumer centric social media applications can tolerate some correctness issues or lags in data, the same might not be true for enterprise applications. This blog discusses the data and database issues related to a globally distributed application. Lastly, we discuss 2 research papers that have been around since the early part of this decade, but whose relevance is increasing in recent times.

Building globally distributed applications that are scalable, highly available and consistent can be challenging. With a traditional database, sharding has to be managed by the application, and keeping it highly available requires non-database tools. When you have been on a single node database, whether it is Mysql or Postgresql etc, it is tempting to scale by manual sharding or one of the clustering solutions available for those databases. It might appear easy at the beginning, but the cost of managing the system increases steeply with scale. Additionally, sharding and replication lead to consistency issues and bugs that need to be addressed. Scaling with single node databases like Mysql beyond a certain point has extremely high operational overhead.

NoSql databases such as Cassandra, Riak, MongoDB etc offer scalability and high availability but at the expense of data consistency. That might be ok for some social media or consumer applications where the dollar value of an individual transaction is very small. But not in enterprise applications where each transaction can be worth several thousands of dollars and must be correct. In enterprise applications, we need distributed data to behave the same way that we are used to with single node databases.

Let us look at some common correctness issues that crop up with distributed data.

Example 1: A distributed online store with servers in San Francisco, New York and Paris.

Each server has 2 tables products and inventory with the following data.
Inventory: (product, count):

Customer Jose connects to server in San Francisco and buys widget2 at time t1. At time t2, Customer Pierre connects to a server in Paris and also buys widget2. Assume t2 > t1 but t2-t1 is small.

Expected Behavior : Jose successfully completes transaction and gets the product. Since inventory of widget2 is now zero, Pierre’s transaction is aborted.
Actual Behavior (in an eventually consistent system): Both transactions complete. But only one of the customers gets the product. The other customer is later sent an apologetic email that widget2 is out of stock.
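The anomaly can be reproduced in a few lines. The sketch below assumes that the inventory of widget2 starts at 1, which is implied by the expected behavior above. It contrasts sites that each check only their own copy with a single authoritative count guarded by a lock; all names are hypothetical.

```python
import threading

def buy_eventually_consistent(replicas, site, product):
    """Each site checks only its own copy, so two sites can sell the same unit."""
    if replicas[site][product] > 0:
        replicas[site][product] -= 1
        return True
    return False

class SerializedInventory:
    """One authoritative count guarded by a lock, as in a single node database."""
    def __init__(self, counts):
        self.counts = dict(counts)
        self.lock = threading.Lock()

    def buy(self, product):
        with self.lock:
            if self.counts[product] > 0:
                self.counts[product] -= 1
                return True
            return False

# Both sites start with a copy that says one widget2 is in stock.
replicas = {"sf": {"widget2": 1}, "paris": {"widget2": 1}}
jose_ec = buy_eventually_consistent(replicas, "sf", "widget2")
pierre_ec = buy_eventually_consistent(replicas, "paris", "widget2")

inv = SerializedInventory({"widget2": 1})
jose = inv.buy("widget2")
pierre = inv.buy("widget2")
```

In the first case both purchases succeed although only one unit exists. In the second case Pierre's purchase is rejected.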

Example 2: A distributed document sharing system with servers in New York, London, Tokyo

Operation1: In London, User X creates a new empty document marked private.
Operation2: User X makes update 1 to document.
Operation3: User X deletes update 1.
Operation4: User X makes update 2.
Operation5: User X changes the document from private to public.
Due to network issues, only operations 1,2, 5 reach Tokyo. 3 and 4 do not.
In Tokyo, User Y tries to read the shared document.

Expected behavior: The document status is private and Y cannot read the document.
Actual behavior: Y is able to read the document but an incorrect version. The document has update1 which is deleted and is missing update2 which needs to be there.

The problems above are known as consistency issues. Different clients are seeing different views of the data. What is the correct view ?

Consistency here refers to C in the CAP theorem, not the C in ACID. Here Consistency means every thread in a concurrent application correctly reads the most recent write at that point in time.

How do you fix the above issues? In a single node database, Example1 can be fixed by locking the row in the inventory table during update and Example2 is not even an issue because all the data is in one node. But in a distributed application data might be split across shards and shards replicated for high availability. Users of the system might connect to any shard/server and read/write data. With NoSql databases, the application has to handle any inconsistencies.

In traditional RDBMSs, database developers are given a knob called isolation level to control what concurrent threads can read. In this old blog I explain what isolation levels are. The safest isolation level is SERIALIZABLE, where the database behaves as if the transactions were executing in a serial order with no overlap, even though in reality they are executing concurrently. Most developers use the default isolation level, which is generally READ_COMMITTED or REPEATABLE_READ. In reality, these isolation levels are poorly documented and implemented differently by different vendors. The result is that in highly concurrent applications, there are consistency bugs even in traditional single node RDBMSs. In a distributed database with data spread across shards and replicated for read scalability, the problem is compounded further. Most NoSql vendors punt the problem by claiming eventual consistency, meaning if there are no writes for a while, eventually all reads on all nodes will read the last write.

Consistency is often confused with isolation, which describes how the database behaves under concurrent execution of transactions. At the safest isolation level, the database behaves as if the transactions were executing in serial order, even though in reality they are executing concurrently. At the safest consistency level, every thread in a concurrent application correctly reads the most recent write. But most database documentation is not clear on how to achieve this in an application.

The problems in examples 1 and 2 would not occur if those applications/databases had the notion of a global transaction order with respect to real time. In example 1, Pierre’s transaction at t2 should see the inventory as 0 because a transaction at t1 < t2 set it to zero. In example 2, Y should only be able to read up to operation2. It should not be able to read operation5 without operations 3 and 4, which occurred before 5.
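For example 2, the requirement amounts to a replica exposing only a gap-free prefix of the operations. A small sketch, with the five operations modelled as the numbers 1 to 5 and hypothetical helper names:

```python
def visible_prefix(received_ops):
    """Return the ops that may be applied: the longest run 1, 2, 3, ... with no gap."""
    applied = []
    expected = 1
    for op in sorted(received_ops):
        if op != expected:
            break
        applied.append(op)
        expected += 1
    return applied

def document_state(applied):
    """Replay the operations of example 2 on an empty private document."""
    state = {"public": False, "updates": []}
    for op in applied:
        if op == 2:
            state["updates"].append("update1")
        elif op == 3:
            state["updates"].remove("update1")
        elif op == 4:
            state["updates"].append("update2")
        elif op == 5:
            state["public"] = True
    return state

tokyo_ops = visible_prefix({1, 2, 5})     # operations 3 and 4 never arrived
tokyo_doc = document_state(tokyo_ops)
london_doc = document_state(visible_prefix({1, 2, 3, 4, 5}))
```

Tokyo holds back operation 5 until 3 and 4 arrive, so the document stays private there and user Y cannot read a version that never existed.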

In database literature, the term for this requirement is “Strict Serializability” or sometimes “external consistency”. Since these technical definitions can be confusing, it is often referred to as strong consistency.

2 research papers that have been around for a while provide answers on how these problems might be fixed. The papers are the Spanner paper and the Calvin paper.

Their approach to solving the problem can be summarized as follows:
1. Timestamp transactions with something that reflects their occurrence in real time
2. Order transactions based on timestamp
3. Commit transactions in the above order.
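The three steps can be sketched on example 1. This is only an illustration of ordering by timestamp: the integer timestamps and the function name are made up, and the hard part in a real system, obtaining trustworthy timestamps across nodes, is exactly what the two papers address.

```python
def commit_in_timestamp_order(transactions, inventory):
    """transactions: list of (timestamp, customer, product) tuples.

    Transactions are processed strictly in timestamp order, whatever order
    they arrived in. Returns the outcome for each customer.
    """
    outcomes = {}
    for ts, customer, product in sorted(transactions):
        if inventory.get(product, 0) > 0:
            inventory[product] -= 1
            outcomes[customer] = "committed"
        else:
            outcomes[customer] = "aborted"
    return outcomes

inventory = {"widget2": 1}
# Pierre's request happens to arrive first, but it carries the later timestamp.
txns = [(2, "Pierre", "widget2"), (1, "Jose", "widget2")]
result = commit_in_timestamp_order(txns, inventory)
```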

But the details of how they do it are significantly different. Let us look at how they do it.

Spanner paper from Google

Spanner is a database built at Google and the paper describes the motivation and design of Spanner. Spanner's approach involves
1. The use of atomic clocks and GPS to synchronize clocks across hosts in different regions and the true time API to give accurate time across nodes, regions or continents.
2. For a read/write transaction, spanner calls the true time API to get a timestamp. To address overlaps between transactions that are close to each other, the timestamp is assigned after locks are acquired and before they are released. 
3. The commit order equals timestamp order.
4. Read for particular timestamp is sent to any shard/replica that has the data at that timestamp.
5. Read without timestamp (latest read) are serviced by assigning a timestamp.
6. Writes that cross multiple shards use two phase commit.
And of course,
7. It can scale horizontally to 1000s of nodes by sharding.
8. Each shard is replicated.
And most importantly, 
9. Even though it is a key value store, it provides SQL support to make it easy for application programmers.
CockroachDb and Yugabyte are 2 commercial databases based on Spanner.

Calvin Paper

The Calvin paper addresses the above problem using distributed consensus protocols like Raft or Paxos. 
1. Every transaction has to first go through distributed consensus and secure a spot in a linear replication log. 
2. One can view the index in the log as the timestamp. 
3. The committed entries in the replication log are then executed in the exact same serial order by every node in the distributed database. 
4. Since the transaction log is replicated to every shard, it does not need or use two phase commit. In a transaction involving multiple shards, if a shard dies before committing a particular transaction, then on restart it just has to execute the uncommitted transaction from its replication log.
5. No dependency on wall clocks or time API.
6. No two phase commit.
7. No mention of SQL support.

 FaunaDb is an example of a database based on Calvin.
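The core idea, that every replica executes the same agreed log in the same order and therefore reaches the same state, can be sketched as follows. The consensus step itself is not modelled. The log entries and the `apply_log` function are hypothetical.

```python
def apply_log(log, state=None, start=0):
    """Execute log entries from index start onward, in order, and return the new state."""
    state = {} if state is None else dict(state)
    for op, key, amount in log[start:]:
        if op == "set":
            state[key] = amount
        elif op == "add":
            state[key] = state.get(key, 0) + amount
        else:
            raise ValueError(f"unknown operation: {op}")
    return state

# The order of this log is what consensus agreed on; the index acts as the timestamp.
log = [("set", "x", 10), ("add", "x", 5), ("set", "y", 1), ("add", "x", -3)]

replica_a = apply_log(log)

# A replica that died after two entries catches up by replaying the rest on restart.
partial = apply_log(log[:2])
replica_b = apply_log(log, state=partial, start=2)
```

Because execution is deterministic, no further coordination is needed at commit time.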

This class of databases that offer horizontal scalability on a global scale without sacrificing consistency is also called NewSql. 

In summary, if you are building a globally distributed application that needs strong consistency, doing it on your own with a SQL or NoSql database can be non trivial. Consistency is hard enough in a single node database. But on a distributed database, consistency bugs are harder to troubleshoot and even harder to fix. You might want to consider one of the NewSql databases to make life easier. Review the Spanner and Calvin papers to understand the architectural choices that are available. This will help you pick a database that is right for you. The Spanner and Calvin papers have been around for almost a decade. But they have become more relevant now as real databases based on them become more popular. Most importantly, understand what consistency is and apply it, the lack of which can cause severe correctness bugs in your application.


The Spanner paper

The Calvin paper

Consistency and Isolation

Sunday, August 26, 2018

ElasticSearch Tutorial

ElasticSearch is a distributed , scalable, search and analytics engine.

It is similar to Apache Solr, with the difference that it is built to be scalable from the ground up.

Like Solr, ElasticSearch is built on top of Apache Lucene which is a full text search library.

What is difference between a database and a search engine ? Read this blog.

1.0 Key features

Based on very successful search library Apache Lucene.
Provides the ability to store and search documents.
Supports full text search.
Schema free.
Ability to analyze data - count , summarize ,aggregate etc.
Horizontally scalable and distributed architecture.
REST API support.
Easy to install and operate.
API support for several languages.

2.0 Concepts

An elasticsearch server process called a node is a single instance of a java process.

A key differentiator for elasticsearch is that it was built to be horizontally scalable from ground up.

In production environment, you generally run multiple nodes. A cluster is a collection of nodes that store your data.

A document is a unit of data that can be stored in elasticsearch. JSON is the format.

An Index is a collection of documents of a particular type. For example you might have one index for customer documents and another for product information. Index is the data structure that helps the search engine find the document fast. The document being stored is analyzed and broken into tokens based on rules. Each token is indexed - meaning - given the token -there is pointer back to the document - just like the index at the back of the book. Full text search or the ability to search on any token or partial token in the document is what differentiates a search engine from a more traditional database.

Elasticsearch documentation sometimes uses the term inverted index to refer to its indexes. This author believes that the term "inverted index" is just confusing and this is nothing but an index.
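A toy version of such an index, to make the idea concrete. The analyzer here only lowercases and splits on whitespace, which is far simpler than what Lucene does, and the sample documents and function names are made up for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """A very simple analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs, field):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize(doc[field]):
            index[token].add(doc_id)
    return index

docs = {
    1: {"name": "Big Stalk", "interests": "Hiking Cooking Reading"},
    2: {"name": "Kelly Kidney", "interests": "Dancing Cooking Painting"},
    4: {"name": "Missy Ketchat", "interests": "Singing Cooking Dancing"},
}
index = build_index(docs, "interests")

# Given a token, the index points straight back to the documents that contain it.
singing_or_dancing = sorted(index["singing"] | index["dancing"])
```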

In the real world, you never use just one node. You will use an elasticsearch cluster with multiple nodes. To scale horizontally, elasticsearch partitions the index into shards that get assigned to nodes. For redundancy, the shards are also replicated, so that they are available at multiple nodes.
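A rough sketch of how documents might be routed to shards and shards placed on nodes. Both pieces are assumptions for illustration: `zlib.crc32` is a stand-in and not the hash function elasticsearch uses, and the round-robin placement is much simpler than its real shard allocator.

```python
import zlib

def shard_for(doc_id, num_primary_shards):
    """Pick a shard from a stable hash of the document id (stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards

def place_shards(num_primary_shards, nodes):
    """Assign each primary to a node and its replica to the next node, round-robin."""
    placement = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = {"primary": primary, "replica": replica}
    return placement

placement = place_shards(5, ["node1", "node2", "node3"])
shard = shard_for("1", 5)
```

The point of the placement rule is that a primary and its replica never share a node, so losing one node does not lose both copies of a shard.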

3.0 Install ElasticSearch

Download the latest version of elasticsearch. You will download elasticsearch-version.tar.gz.

Untar it to a directory of your choice.

4.0 Start ElasticSearch

For this tutorial we will use just a single node. The rest of the tutorial will use curl to send http requests to an elasticsearch node to demonstrate basic functions. Most of it is self explanatory.

To start elasticsearch type


To confirm it is running

curl -X GET "localhost:9200/_cat/health?v"

5.0 Create an index

Let us create an index person to store person information such as name, sex, age, interests etc.

curl -X PUT "localhost:9200/person"

{"acknowledged":true,"shards_acknowledged":true,"index":"person"}

List the indexes created so far

curl -X GET "localhost:9200/_cat/indices?v"

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   person   AJCSCg0gTXaX6N5g6malnA   5   1          0            0      1.1kb          1.1kb

6.0 Add Documents

Let us add a few documents to the person index.
In the url, _doc is the type of document. It is a way to group documents of a particular type.
In /person/_doc/1, the number 1 is the id of the document we provided. If we do not provide an id, elasticsearch will generate an id.
You will notice that the data elasticsearch accepts is JSON.

curl -X PUT "localhost:9200/person/_doc/1" -H 'Content-Type: application/json' -d'
{
  "name": "Big Stalk",
  "interests":"Hiking Cooking Reading"
}'

curl -X PUT "localhost:9200/person/_doc/2" -H 'Content-Type: application/json' -d'
{
  "name": "Kelly Kidney",
  "interests":"Dancing Cooking Painting"
}'

curl -X PUT "localhost:9200/person/_doc/3" -H 'Content-Type: application/json' -d'
{
  "name": "Marco Dill",
  "interests":"Sports Reading Painting"
}'

curl -X PUT "localhost:9200/person/_doc/4" -H 'Content-Type: application/json' -d'
{
  "name": "Missy Ketchat",
  "interests":"Singing Cooking Dancing"
}'

curl -X PUT "localhost:9200/person/_doc/5" -H 'Content-Type: application/json' -d'
{
  "name": "Hal Spito",
  "interests":"Sports Singing Hiking"
}'


7.0 Search or Query

The query can be provided either as a query parameter or in the body of a GET. Yes, Elasticsearch accepts query data in the body of a GET request. 

7.1 Query string example

To retrieve all documents:

curl -X GET "localhost:9200/person/_search?q=*"

Response is not shown to save space.

Exact match search as query string:

curl -X GET "localhost:9200/person/_search?q=sex:female"

  "name": "Kelly Kidney",
  "interests":"Dancing Cooking Painting"
  "name": "Missy Ketchat",
  "interests":"Singing Cooking Dancing"

7.2 GET body examples

Query syntax when sent as body is much more expressive and rich. It merits a blog of its own.
This query finds persons with singing or dancing in the interests field. This is full text search on a field.

curl -X GET "localhost:9200/person/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "interests": "singing" } },
        { "match": { "interests": "dancing" } }
      ]
    }
  }
}'

  "name": "Missy Ketchat",
  "interests":"Singing Cooking Dancing"
  "name": "Hal Spito",
  "interests":"Sports Singing Hiking"
  "name": "Kelly Kidney",
  "interests":"Dancing Cooking Painting"

Below is a range query on a field.

curl -X GET "localhost:9200/person/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "age": { "gte": 30, "lte": 40 }
    }
  }
}'


  "name": "Hal Spito",
  "interests":"Sports Singing Hiking"
  "name": "Kelly Kidney",
  "interests":"Dancing Cooking Painting"

8.0 Update a document

curl -X POST "localhost:9200/person/_doc/5/_update" -H 'Content-Type: application/json' -d'
{
  "doc": { "name": "Hal Spito Jr" }
}'


After executing the above update, do a search for "Jr". The above document will be returned.

9.0 Delete a document

curl -X DELETE "localhost:9200/person/_doc/1"

This will delete the document with id 1. Searches will not return this document anymore.

10.0 Delete Index

curl -X DELETE "localhost:9200/person"

That deletes the index we created.

11.0 Conclusion

This has been a brief introduction to elasticsearch, just enough to get you started. There are a lot more details in each category of APIs. We will explore them in subsequent blogs.