Code & Cluster: June 2014

A lot has been written about NoSql databases. There is lot of hype surrounding many of these databases. Unfortunately most written material either sings praises about a particular database or trashes it. I also am starting to see people pick databases for the wrong reasons. The purpose of this blog is to highlight the reasons to consider while choosing Cassandra as your database.

1. Scaling by partitioning data

Cassandra is designed to store large quantities of data - several hundreds of terrabytes or petabytes that typically cannot be stored on a single machine. Cassandra solves the problem by partitioning the data across machines in a cluster using a consistent hash. When data is partitioned across several machines, some of the things we are used to in relational databases like consistency and transactions are difficult to implement. Hence those features are weak or in some cases not available. So the ability to scale comes at the expense of other features.

The single biggest mistake people make is using Cassandra when their size of data is not large enough to merit partitioning. If in the foreseeable future, you data size is a few hundred gigabytes , stick to mysql or other relational database of your choice. Even if your data size grows in the the future, you can always port to Cassandra when you reach the stage of few TB. This is especially true if you are building a new application with limited resources. Do not let the complexity of Cassandra slow down the rest of your feature development.

2. High availability

The CAP theorem states that out of consistency, availability and partition tolerance , it is possible for a system to have only 2 of 3. Cassandra is designed for availability and partition tolerance.

If your applications primary requirement is high availability, Cassandra can be a great choice. With its shared nothing architecture, where all nodes are equal, multiple nodes can go down and the database is still available. Clients can connect to any node and that node with get/put the data to the node that is required to handle that data. Replication ensures that if the primary node that handles the data goes down, a replica is able to service the request.

3. Replication

Replication has 2 purposes : One it provides redundancy for data in case of failures. Second, it makes copies of data available closer to where it is consumed or served. In many databases, setting up replication is cumbersome. Not in Cassandra. Replication is core to the architecture. Replication is configured at a keyspace level by specifying a replication strategy and the number of replicas and the data is replicated within the cluster or across data centers as required.

If replication is important, especially across data centers, Cassandra is a great choice.

4. Optimized for writes

Write operations update a in memory data structure called Memtable and returns immediately. Nothing is locked and nothing is written to disk. Writes are very fast. When Memtables reach a certain size, they are flushed to disk to a file called SSTable. Reads may have to go through multiple SStables and aggregate changes to return correct data. For this reason, reads might not be that fast.

If you have a workloads that involves a lot of writes and few reads then Cassandra is a suitable database. A common use case is storing data from log files of high volume production web servers that service several billion requests a day. An analytics application would potentially read the data, but the read volume is low because the reads are done by in house business analysts and not internet users.

5. Compaction

Over time several SSTables get created and reads have to go through multiple SSTables to get to data. Periodically Cassandra will asynchronously merge smaller SSTables into large SSTables. People have complained that during compaction, things slow down and throughput degrades. There are probably ways to tune this, but you should be aware of compaction when using Cassandra.

6. Limited querying capability

Cassandra supports a SQL like language called CQL. It is "SQL like" and not SQL. Many very basic things like aggregation operators are not supported. Joins of tables are not supported. Range queries on partition key are not supported. Range queries are possible within a partition key using the clustered columns, but it requires some additional data modeling.

Bottom line is that Cassandra partitions the data based on consistent hash of the partition key and look ups are possible based only on the look up key. Anything else requires additional modeling that involves what is called clustered columns.

7. Consistency model

Cassandra was inspired by Amazon's Dynamo database where the model was eventual consistency. When a client requested data and there was inconsistency between the values in the nodes of a cluster, the server returned a vector clock to the client and it was the responsibility of the client to resolve any conflict.

Cassandra's model is tunable consistency. For a read or write operation, client can specify a consistency level such as ANY, ALL, QUORUM, ONE , TWO etc. However when there are concurrent writes, the order is determined based on machine time stamps. So it is important that clocks on nodes in the cluster be synchronized. Getting the consistency model to work requires time and effort on the part of the developer. If the kind of strong consistency we are used to in relational databases is important to you, Cassandra will not be a suitable choice.

8. Frequent updates

Based on what is discussed in (4) and (8) Cassandra is not suitable for use cases where you update column values frequently. When concurrent updates happens, Cassandra uses timestamps to determine which update happened first and you could sometimes encounter the lost update problem. To work around the problem, what you have to do is append updates to a collection or wide columns and then aggregate the final value on reads. Again this is additional work in data modeling and programming and you might be better off using another database if frequent updates are an integral part of your use case.

In summary, Cassandra is an excellent choice as a database when your data size is very large and high availability or replication are important. But it is not a general purpose database. Some of the scalability comes at a cost and you give up other features like consistency or querying.

For additional information on Cassandra, please check DataStax documentation. You can also read these blogs:
HBase vs Cassandra
Cassandra Data Model
Cassandra Compaction

Code & Cluster

Friday, June 27, 2014

Apache Cassandra : Things to consider before choosing Cassandra