Saturday, November 19, 2011

What is NoSQL ?

NoSQL is a term used to refer to a class of database systems that differ from the traditional relational database management systems (RDBMS) in many ways. RDBMSs are accessed using SQL. Hence the term NoSQL implies not accessed by SQL. More specifically not RDBMS or more accurately not relational.

Some key characteristics of NqSQL databases are :
  • They are distributed, can scale horizontally and can handle data volumes of the order of several terrabytes or petabytes, with low latency.
  • They have less rigid schemas than a traditional RDBMS.
  • They have weaker transactional guarantees.
  • As suggested by the name, these databases do not support SQL.
  • Many NoSQL databases model data as row with column families, key value pairs or documents

To understand what non relational means, it might be useful to recap what relational means.

Theoretically, relational databases comply with Codds 12 rules of relational model. More simply, in RDBMS, a table is relation and database has a set of such relations. A table has rows and columns. Each table has contraints and the database enforces the constraints to ensure the integrity of data.Each row in a table is identified by a primary key and tables are related using foreign keys. You eliminate duplicate data during the process of normalization, by moving columns into separate tables but keeping the relation using foreign keys. To get data out of multiple tables requires joining the tables using the foreign keys. This relational model has been useful in modeling most real world problems and is in widespread use for the last 20 years.

In addition, RDBMS vendors have gone to great lengths to ensure that RDBMSs do a great job in maintaining ACID (actomic, consistent, integrity, durable) transactional properties for the data stored. Recovery is supported from unexpected failures. This has lead to relational databases becoming the de facto standard for storing enterprise data.

If RDBMSs are so good, Why does any one need NoSQL databases ?

Even the largest enterprises have users only in the order of 1000s and data requirements in the order of few terra bytes. But when your application is on the internet, where you are dealing with millions of users and data in the order of petabytes, things start to slow down with a RDBMS. The basic operations with any database are read and write. Reads can be scaled by replicating data to multiple machines and load balancing read requests. However this does not work for writes because data consistency needs to be maintained. Writes can be scaled only by partitioning the data. But this affects read as distributed joins can be slow and hard to implement. Additionally, to maintain ACID properties, databases need to lock data at the cost of performance.

The Googles, facebooks , Twitters have found that relaxing the constraints of RDBMSs and distributing data gives them better performance for usecases that involve
  • Large datasets of the order of petabytes. Typically this needs to stored using multiple machines.
  • The application does a lot of writes.
  • Reads require low latency.
  • Data is semi structured.
  • You need to be able to scale without hitting a bottleneck.
  • Application knows what it is looking for. Adhoc queries are not required.


What are the NoSQL solutions out there ?

There are a few different types.

1. Key Value Stores

They allow clients to read and write values using a key. Amazon's Dynamo is an example of a key value store.

get(key) returns an object or list of objects
put(key,object) store the object as a blob

Dynamo use hashing to partition data across hosts that store the data. To ensure high availability, each write is replicated across several hosts. Hosts are equal and there is no master. The advantage of Dynamo is that the key value model is simple and it is highly available for writes.

2. Document stores

The key value pairs that make up the data are encapsulated as a document. Apache CouchDB is an example of a document store. In CouchDB , documents have fields. Each field has a key and value. A document could be
"firstname " : " John ",
"lastname " : "Doe" ,
"street " : "1 main st",
"city " : "New york"
In CouchDB, distribution and replication is peer to peer. Client interface is RESTful HTTP, that integrated well with existing HTTP loadbalancing solutions.

3. Column based stores

Read and write is done using columns rather than rows. The best known examples are Google's BigTable and the likes of HBase and Cassandra that were inspired by BigTable. The BigTable paper says that BigTable is a sparse, distributed, persistent, multidimensional sorted Map. While that sentence seems complicated, reading each word individually gives clarity.
sparse - some cells can be empty
distributed - data is partitioned across many hosts
persistent - stored to disk
multidimensional - more than 1 dimension
Map - key and value
sorted - maps are generally not sorted but this one is

This sample might help you visualize a BigTable map
{
row1:{
    user:{
          name: john
          id : 123
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}
row2:{
    user:{
          name: joe
          id : 124
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}
row3:{
    user:{
          name: jill
          id : 125
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}

}
The outermost keys row1,row2, row3 are analogues to rows. user and post are what are called column families. The column family user has columns name and id. post has columns title and text. Columnfamily:column is how you refer to a column. For eg user:id or post:text. In Hbase, when you create the table, the column families need to be specified. But columns can be added on the fly. HBase provides high availability and scalability using a master slave architecture.


Do I needs a NoSQL store ?

You do not need a NoSQL store if

  • All your data fits into 1 machine and does not need to be partitioned.
  • You are doing OLTP which required the ACID transaction properties and data consistency that RDBMSs are good at.
  • You need ad hoc querying using a language like SQL.
  • You have complicated relationships between the entities in your applications.
  • Decoupling data from application is important to you.


You might want to start considering NoSQL stores if

  • Your data has grown so large that it can no longer be handled without partitioning.
  • Your RDBMS can no longer handle the load.
  • You need very high write performance and low latency reads.
  • Your data is not very structured.
  • You can have no single point of failure.
  • You can tolerate some data inconsistency.

Bottomline is that NoSql stores are a new and complex technology. There are many choices and no standards. There are specific use cases for which NoSql is a good fit. But RDBMS does just fine for most vanilla use cases.