Overview
CockroachDB is a modern distributed database that promises linear scalability with strict serializability.
Server-side sharding is automatic, and nodes can be added as needed. It claims to provide the SERIALIZABLE isolation level.
Most distributed databases, such as Cassandra, MongoDB and HBase, sacrifice consistency to achieve high availability. CockroachDB distinguishes itself by claiming to be distributed and at the same time to offer strong consistency that even many single-node databases do not offer.
This puts it in a database category called NewSQL or Distributed SQL, as opposed to NoSQL (Cassandra, MongoDB).
When to choose CockroachDB?
You should choose CockroachDB if
Your data is of a global scale.
As data size increases, you need to scale horizontally to several nodes.
You need data to be distributed and localized in specific geographical regions. For example, EU data resides in Europe while US data resides in the US.
You need strong consistency, that is, the serializable isolation level.
You need to keep the SQL / relational data model.
You need distributed transactions.
You may want to pass on it if
Your data size can easily fit on a single node for the foreseeable future.
Your organization is more comfortable with a stable, proven database. (CockroachDB is still maturing.)
Your data model is heavily normalized and you do a lot of joins in your queries. While this database supports joins, they are still not recommended in a highly distributed environment.
Architecture
Architecture is based on Google's Spanner paper.
It is a key value store with a SQL interface on top of it.
Database is a cluster of nodes. All nodes are equal. Nodes may join and leave the cluster at any time.
The store is a sorted map of key-value pairs with a fully ordered, monolithic key space. All tables and indexes go into the same key space by encoding table name, index name and key together.
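To make the single ordered key space concrete, here is a minimal sketch in Python. The `/table/index/key` string encoding and the `SortedKV` class are assumptions made for illustration only; the real on-disk encoding is binary and differs in detail. What the sketch shows is that once table, index and key are encoded into one ordered key, all rows of a table sit next to each other and can be read with a contiguous scan.

```python
from bisect import bisect_left, insort


def encode_key(table: str, index: str, key: str) -> str:
    """Join table, index and row key into one ordered key (illustrative encoding)."""
    return f"/{table}/{index}/{key}"


class SortedKV:
    """A tiny sorted map: keys are kept in order so prefix scans are contiguous."""

    def __init__(self):
        self._keys = []   # all keys, kept sorted
        self._vals = {}   # key -> value

    def put(self, key, value):
        if key not in self._vals:
            insort(self._keys, key)
        self._vals[key] = value

    def scan_prefix(self, prefix):
        """Return all (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append((self._keys[i], self._vals[self._keys[i]]))
            i += 1
        return out


kv = SortedKV()
kv.put(encode_key("users", "primary", "0002"), "bob")
kv.put(encode_key("orders", "primary", "0001"), "order-1")
kv.put(encode_key("users", "primary", "0001"), "alice")

# Rows of the same table and index are adjacent in the key space.
users = kv.scan_prefix("/users/primary/")
```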
Sharding
Key value pairs are broken up into contiguous ranges.
When a range reaches 512 MiB (2^29 bytes), it is split into 2 ranges.
Each range is assigned to a node and replicated.
If you have 1 node, all the shards are on that node. To scale, you add more nodes and the shards get distributed across them. A minimum of 3 nodes is recommended.
Nodes can be spun up and added to the cluster very easily, from anywhere.
A B-tree-like index structure is used to locate the shard that holds a key.
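The splitting and lookup described above can be sketched as follows. This is a simplified model, not the actual implementation: a sorted list of range start keys with binary search stands in for the B-tree-like index, the `RangeIndex` class and its method names are hypothetical, and a split is assumed to divide the bytes evenly.

```python
from bisect import bisect_right

MAX_RANGE_BYTES = 512 * 2**20  # 512 MiB, which is 2**29 bytes


class RangeIndex:
    """Tracks contiguous ranges by their start key (illustrative model)."""

    def __init__(self):
        self.starts = [""]  # sorted start keys; the first range starts at the minimum key
        self.sizes = [0]    # bytes stored in each range

    def locate(self, key):
        """Return the index of the range whose span contains key."""
        return bisect_right(self.starts, key) - 1

    def add_bytes(self, key, nbytes):
        """Account for nbytes written at key and return the affected range index."""
        i = self.locate(key)
        self.sizes[i] += nbytes
        return i

    def needs_split(self, i):
        return self.sizes[i] >= MAX_RANGE_BYTES

    def split(self, i, split_key):
        """Split range i at split_key, assuming the bytes divide evenly."""
        half = self.sizes[i] // 2
        self.sizes[i] -= half
        self.starts.insert(i + 1, split_key)
        self.sizes.insert(i + 1, half)


idx = RangeIndex()
r = idx.add_bytes("m", MAX_RANGE_BYTES)
if idx.needs_split(r):
    idx.split(r, "m")  # keys below "m" stay in range 0, the rest move to range 1
```

After the split, `idx.locate("apple")` resolves to the first range and `idx.locate("zebra")` to the second.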
Replication
Data in each range is replicated using the Raft consensus algorithm.
A replication factor of at least 3 is needed.
This provides high availability. A range remains available as long as a majority of its replicas are available.
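The majority rule is simple arithmetic, and a short sketch shows why 3 is the practical minimum. The function names here are made up for illustration.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still remains."""
    return replicas - quorum(replicas)


def range_available(replicas: int, live: int) -> bool:
    """A range can serve consensus writes only while a majority of replicas is live."""
    return live >= quorum(replicas)
```

With 3 replicas the quorum is 2, so one failure is tolerated. With 2 replicas the quorum is also 2, so no failure is tolerated, which is why a replication factor below 3 buys no availability.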
Geo-partitioning
By adding a country or region to the primary key, you can restrict the storage of keys to a particular region. So European data can be made to reside in Europe, US data in the US, and so on. This has 2 benefits:
Performance improves because data is local to its users.
It can satisfy legal requirements where data is not allowed to leave a country or region.
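The idea of steering rows by a region prefix in the primary key can be modeled in a few lines. This is a conceptual sketch only: the region codes, the locality names and the `place` helper are hypothetical, and in the real database the mapping is declared in the schema and zone configuration, not in application code.

```python
from collections import defaultdict

# Hypothetical mapping from the region part of the primary key to a node locality.
REGION_TO_LOCALITY = {"eu": "europe-west", "us": "us-east"}


def locality_for(primary_key):
    """The first component of the primary key decides where the row is stored."""
    region, _row_id = primary_key
    return REGION_TO_LOCALITY[region]


def place(rows):
    """Group primary keys by the locality that should store them."""
    placement = defaultdict(list)
    for primary_key, _payload in rows:
        placement[locality_for(primary_key)].append(primary_key)
    return dict(placement)


rows = [(("eu", 1), "anna"), (("us", 2), "bob"), (("eu", 3), "carl")]
placement = place(rows)
```

Because the region is the leading part of the key, all rows of one region are contiguous in the sorted key space, so they fall into ranges that can be pinned to nodes in that region.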
Read/Write
Reads
Any node can receive a request to read a key/value.
The request is forwarded to the node that is the Raft leader for the range that holds the key.
The leader returns the data to the node that requested it. Since the leader itself returns the data, no consensus round is required.
That node returns the data to the client.
Writes
Any node can receive a request to write a key/value.
The request is forwarded to the node that is the Raft leader for the range that holds the key.
The leader writes the value to its log and initiates consensus with the replicas for the range. When a majority acknowledges, the key/value is considered committed, and the leader notifies the requesting node, which notifies the client.
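The read and write paths above can be sketched with a toy model. It is heavily simplified and only an illustration: real Raft also has terms, log indexes, elections and retries, none of which appear here, and the `Replica` and `Leader` classes are hypothetical names. The sketch shows only the two points made above: a write commits once a majority of replicas has appended it, and a read is answered by the leader without a consensus round.

```python
class Replica:
    """A node holding one copy of a range's log."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, entry):
        """Append an entry to the log; an unavailable replica does not acknowledge."""
        if not self.up:
            return False
        self.log.append(entry)
        return True


class Leader(Replica):
    """The replica that coordinates writes and serves reads for the range."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.committed = {}

    def write(self, key, value):
        entry = (key, value)
        acks = 1 if self.append(entry) else 0            # the leader counts itself
        acks += sum(f.append(entry) for f in self.followers)
        total = 1 + len(self.followers)
        if acks >= total // 2 + 1:                        # majority reached
            self.committed[key] = value
            return True
        return False                                      # not committed

    def read(self, key):
        """Served from the leader's committed state, with no consensus round."""
        return self.committed.get(key)


followers = [Replica("n2"), Replica("n3", up=False)]
leader = Leader("n1", followers)
ok = leader.write("k", "v")  # n1 and n2 acknowledge, which is 2 of 3
```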
Transactions
Supports transactions that span multiple tables and rows.
Transactions can be distributed, that is, they can span multiple nodes.
The supported isolation level is strict serializability, which is the highest isolation level. Strict serializability means that transactions are not only executed in some serial order, but that this order also matches wall clock time.
The transaction protocol is an improvement over two-phase commit. Participants acquire locks and create write intents in parallel, and the transaction is marked as staged. When the client commits, the coordinator returns success to the client as soon as all locks are acquired and all writes are replicated. The transaction is marked committed in the background. This takes one round trip between the transaction coordinator and each participant, whereas two-phase commit requires two.
Hybrid logical clocks are used to timestamp each transaction. The timestamp serves as the version for MVCC.
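To illustrate how a hybrid logical clock timestamp can double as an MVCC version, here is a minimal sketch. The `HLC` class follows the general hybrid logical clock algorithm (a wall time plus a logical counter), and `MVCCStore` is a hypothetical in-memory store. Neither is taken from the CockroachDB code base, and the injected physical clock is an assumption that makes the example deterministic.

```python
class HLC:
    """Hybrid logical clock: (wall time, logical counter) pairs that never go backwards."""

    def __init__(self, physical_clock):
        self.now_physical = physical_clock  # callable returning the local physical time
        self.wall = 0
        self.logical = 0

    def tick(self):
        """Timestamp a local event or an outgoing message."""
        pt = self.now_physical()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        """Merge a timestamp received from another node."""
        r_wall, r_logical = remote
        pt = self.now_physical()
        new_wall = max(self.wall, r_wall, pt)
        if new_wall == self.wall == r_wall:
            self.logical = max(self.logical, r_logical) + 1
        elif new_wall == self.wall:
            self.logical += 1
        elif new_wall == r_wall:
            self.logical = r_logical + 1
        else:
            self.logical = 0
        self.wall = new_wall
        return (self.wall, self.logical)


class MVCCStore:
    """Keeps every version of a key, tagged with the timestamp that wrote it."""

    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), sorted by timestamp

    def put(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def get(self, key, ts):
        """Return the newest value written at or before ts, or None."""
        result = None
        for version_ts, value in self.versions.get(key, []):
            if version_ts <= ts:
                result = value
        return result


physical = [100]                  # fake physical clock, frozen at 100
clock = HLC(lambda: physical[0])
t1 = clock.tick()
t2 = clock.tick()                 # same physical time, so the logical counter advances
t3 = clock.update((105, 3))       # a remote node is ahead, so the clock jumps forward

store = MVCCStore()
store.put("k", t1, "v1")
store.put("k", t3, "v2")
```

A reader at timestamp `t2` sees the value written at `t1`, while a reader at `t3` sees the later version. Writers do not overwrite each other; they add versions.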
Data Model
Clients see only the SQL row/column relational model.
The wire protocol is the same as the PostgreSQL wire protocol.
Performance
Efficient range scan.
Geo partitioning improves performance by locality.
Distributed SQL execution.
Distributed transactions will be slow.
Generally you do not want distributed transactions over large distances. If you build a 3 node CockroachDB cluster with 1 node in New York, 1 in London and 1 in San Francisco, the write latencies are going to be very high due to the round trips for Raft and distributed transactions. The cluster topology needs to be designed appropriately to give you the lowest latency at the desired level of high availability.
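A back-of-the-envelope estimate shows why topology matters. The sketch below assumes that a write commits once the leader has heard back from enough followers to form a majority, so the commit latency is roughly the round trip time to the slowest follower in that majority. The round trip times used are illustrative placeholders, not measurements, and the function name is made up.

```python
def raft_commit_latency_ms(rtt_to_followers_ms):
    """Estimate commit latency from the leader's round trip times to its followers."""
    replicas = len(rtt_to_followers_ms) + 1      # followers plus the leader
    needed_follower_acks = replicas // 2         # a majority minus the leader itself
    if needed_follower_acks == 0:
        return 0                                 # single replica: nothing to wait for
    # The leader waits for the fastest followers that complete the majority.
    return sorted(rtt_to_followers_ms)[needed_follower_acks - 1]


# Assumed round trip times in milliseconds, for illustration only.
cross_continent = raft_commit_latency_ms([70, 65])  # leader in New York, followers far away
single_region = raft_commit_latency_ms([2, 2])      # all replicas close together
```

Under these assumed numbers every write in the cross-continent cluster pays tens of milliseconds for consensus alone, and a distributed transaction adds the round trips between coordinator and participants on top of that.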
Administration
Good command line tools and a UI console make administration easy.
Since all nodes are equal, the number of moving parts that need to be administered is low.
Summary
If you need a globally distributed database with strict serializability, this is definitely a database to look at. It has a good pedigree. However, remember that distributed databases are not drop-in replacements for your traditional RDBMSs. Distributed queries, especially joins, and distributed transactions can be slow, so some application redesign and some denormalization are usually required.
Note: Moved from heavydutysoftware.com