Showing posts with label CockRoachDB. Show all posts

Wednesday, July 3, 2024

How modern distributed systems scale by partitioning ?

1.0 Introduction

In the last 20 years, software systems moved to the internet and handled large volumes of data and millions of requests. Most people interact with these systems using a browser or a mobile device. At the back end, is not one powerful computer but generally a network of commodity computers. Both the processing and storage of data is spread across multiple computers. In this blog we discuss how large datasets can be stored using multiple commodity computers.

Partitioning is the process of breaking up a large dataset in parts so that each part can fit easily on the disk on one one node and be efficiently managed by each node. For very large data sets that cannot fit on 1 machine, data needs to be broken up into parts ( partition or shard). Each partition is stored on a different machine. This is just natural horizontal scaling. But most important is that, when it time to read the partitioned data, we need to be able to find (efficiently) which partition and node has the data we want to read.

Storage space is not the only benefit of partitioning. You are also spreading the compute required to read, write and process the data.

Partitioning is generally combined with replication to make the partitions highly available. But we do not discuss replication here. That is a topic for another blog.

2.0 Types of partitioning

There are 2 types of data that need to be considered: unstructured and structured.

Most discussions on partitioning discuss partitioning of data in databases ( structured data ) but not unstructured data which is outside databases in plain files. This blog discusses both unstructured data and structured data.

2.1 Structured data

The problem is more interesting for databases because it is not enough to break up the dataset into smaller parts. During reads you need to be able to find the data. And you need to do it fast. When the database receives a query - "Give me records for Customer X", How does it know which node hosts the data ? Does the database have to send the request to all the nodes ? That would be quite inefficient.

The goal is thus to partition data and query it efficiently. Another goal is to ensure that distribution of data between partitions is even. You do not want a situation where partition 1 has 70% of the data and the other 5 partitions has the remaining 30%. This will overload partition 1 ( known as a hot spot) and you lose the benefits of partitioning.

2 strategies are commonly used for database partitioning.

2.1.1 Range based partitioning

Database records are generally stored sorted based on the primary key.

Initially there is one partition with zero records.

As clients write to the database, the size of the partition increases. When it reaches a certain size say 10MB or 64MB, it is split into two partitions.

Each partition may be assigned to a different node.

This process is repeated as more data is added and partitions grow. If data is deleted and partitions shrink, then small partitions can be merged

To efficiently query data, the database needs to do some book keeping

-- which key range is in which partition

-- which partition is at which node

Starting with 1 partition and 1 node is not efficient for obvious reasons and databases typically start with configured number of partitions or a number of partitions proportional to the number of nodes.

To balance the load on nodes, partitions may need to be moved between nodes.

This is the strategy used by HBase, CockroachDb, MongoDb.

2.1.2 Hash based partitioning

The hash value calculated from the key is used to determine the location where the record can be stored.

The wrong way to determine the node is by using hash mod n, where n is the number of nodes. The problem with this approach is that when nodes are added or removed from the cluster, a very high percentage of the keys need to be removed.

A better approach is to start with a fixed number of partitions , way more that the number of the nodes the cluster will ever have , say for example 1000 or 10000. Partitions are logical. Hash ranges are assigned to partitions. Partitions are assigned to nodes either using numPartitions mod numNodes or other algorithms. This is shown in the top half of Figure 4. The bottom half of figure 4 visualizes the same as a ring as is done in many articles that to refers to this as consistent hashing. Think of partitions being placed on the ring. Each partition owns the key space from the position of the previous position to its position. The cluster needs to maintain a mapping of partitions to nodes. When a new node is added, the cluster can take a few partitions from existing nodes and assign to the new nodes. When a node is removed, the cluster assigns its partitions to other nodes. Looking up a key is a extra level of indirection. The hash of key maps to a partition. The partition node map tell you which node has the partition that has the key. Many studies has shown that this lead to less movement of keys between nodes as the cluster changes.

In popular press, this has the poorly understood name "consistent hashing". It is just hast hash based partitioning. Nothing consistent and nothing to do with consistency.

2.1.3 Secondary Indexes

So far we have been talking about partitioning by the primary key, also know as the primary index.

To speed up retrieval of records, databases also have secondary indexes which can be very large and might need to be partitioned.

One approach can be to keep the secondary indexes local to the node on which the primary index partition is. The advantage of this approach is that since all related rows are on the same node, inserts/updates/deletes are are all local. But queries on the secondary index requires sending queries to all nodes and aggregating the responses.

Another approach is to create a global secondary index and partition it as an independent entity. However since the secondary index partitions might be on a different partition from the primary partition, CUD operations are more expensive. Transactions might be distributed. However range queries on secondary indexes are more efficient since closer records (by sort) are on the same partition.

2.2 Unstructured data

Unstructured data refer to ordinary files that have text or binary data. Of course we are talking about large files or many large files. This is the use case for a distributed file system such as HDFS (hadoop file system) or GFS ( Amazon ). Logically the implementation of distributed file system is similar to say a linux filesystem. You view the file system as a list of blocks of fixed size. On a single node linux file system , all the disk blocks are on one node. In a distributed file system, the blocks are spread across multiple nodes. In HDFS, the name node maintains the metadata for the distributed filesystem -- given a file , which blocks make up the file and which nodes have the blocks. To create a file, the name node may assign a block on a particular node and the client talks directly to service called data node running on the target node to write to block. To read a block, the name node directs the client to the data node that hosts the block and the client reads directly from that block. But the basic algorithm is simple -- break up the file data into blocks and spread them across nodes.

Another example is partitioning of topic logs in Apache Kafka is a messaging system ( they like to call it event streaming) where producers write messages to a topic and consumers read messages from the topic. The storage for a topic is a log file. New messages are appended to the end of log file. They are read from the front. Obviously the logs can grow beyond what can fit on a node. So log is broken into partitions and distributed across multiple nodes. The broker serves producers and consumer. A Kafka cluster has multiple brokers with each broker running on a seprate node managing multiple partitions.

3.0 Rebalancing

Rebalancing is the process of moving partitions between nodes to make the distribution of load even across all the nodes. This is necessary when nodes join or leave the cluster or if the cluster starts receiving more data for certain keys. Either way rebalancing is an expensive operation that needs lot of CPU, memory and network bandwidth. It can have an impact on the performance of regular CRUD processing. In a ideal world, we would like rebalancing to happen automatically behind the scenes, without end users knowing about it. But for performance reasons listed above, that rarely works well in practice. Some databases require an admin to manually start a rebalance, which can be done during a period of low load and monitored.

4.0 Routing

How does a client know which partition to connect to ? The short answer is that the database has to maintain mapping of partition to node. In the case of hash based partitioning, hash maps to a partition which maps to a node. In the case of range based partitioning the key maps to a key range which maps to a partition which maps to a node. The partition node map is available to nodes. If a client can connect to any node, then if the node does not have the partition to handle the clients request, it can redirect the client to the appropriate node or it can get the data from the target node and return to client.

5.0 Use case

There are cases where you might not have a ready made database doing server side partitioning for you and you might need to do it yourself. Or even when the database does it for you, you still need to pick the right partition key for the partitioning to be optimal. Let us look at some large datasets and discuss how they might be logically partitioned.

Let us say you are building a twitter like system.

Say 100 million tweets of 140 character per day

100M * ( 280 bytes + 20 bytes for id, timestamp)

30 GB / day

10 TB / year

Need to store 5 years to data

Need to store and query 50 TB

How does twitter work ?

Users follow other users.

When a user connects, we need to show the most recent tweets from the users he follows.

So we need to store about 50 TB of tweets. Given a user, we need to query say the 50 most recent tweets from the users he follows.

Using commodity hardware, 50TB would need say 25 nodes. What key would you use to partition the data ?

Option 1 : hash based partitioning based on user.

To store tweets, a hash of the user is to used to locate the node where the tweet is stored. To query - for each user that the user follows, use the hash to query the node for that users tweets. A problem with this approach is that some users tweet way more than other users. Their nodes are going to be overloaded while others are idle. Unbalanced load.

Option 2 : hash based on randomly generated tweet id

The problem with this approach is the for every query, you have to query every server and aggregate the results. Inefficient for queries.

Option 3: hash based on timestamp

Timestamp is relevant because for each feed request we want the latest tweets. It would be good if tweets are sorted by timestamp. However with such as hash, at any given point in time , one server is overloaded as all the writes are going to the server.

Option 4:

Given the choices, inefficient query (option 2) is more tolerable that unbalanced load (option 1) which could crash some of the servers making the system unavailable. But we also want queries to return the most recent data (sorted by timestamp). So we can improve querying little bit by combining option 2 and 3. Assume a timestamp in epoch time in increments of 1 sec, the tweetid could be timestamp + auto incremented seq. The sequence gives randomness to the tweet id and will give uniform distribution across node.

So given a epoch 1692547708 you will have tweet ids like

1692547708 1

1692547708 2

1692547708 3

1692547708 n

Of course you are wondering how to generate unique tweet ids in a distributed system. That is a topic for another blog.

6.0 Summary

Partitioning data and spreading it across nodes is fundamental to distributed system. Special thought needs to be given to how the partitioned data can be queried efficiently. Hash based partitioning and range based partitioning are two popular strategies. Nodes can fail or additional nodes may need to added to scale. To ensure that load is even across nodes, partitions may be moved between nodes in a process called rebalancing. For best results design your partition keys so that load is distributed evenly and querying is efficient.

7.0 Related Content

CockroachDB Review

Sunday, June 30, 2024

CockroachDb Review: Should I use CockroachDb ?

Overview

CockroachDb is a modern distributed database that promises linear scalability with strict serializability.

Server side sharding is automatic. Nodes can be added easily as needed. Claims to provide the SERIALIZABLE isolation level.

Most distributed databases such as Cassandra, MongoDb, HBase etc sacrifice consistency to achieve high availability. CockroachDb distinguishes itself by claiming to be distributed and the same time offer strong consistency that even single node databases do not offer.

This falls into a database category called NewSql or DistributedSQL as opposed to NoSql (Cassandra, MongoDb)

When to choose CockroachDb ?

You should choose CockroachDb if

Your data is of a global scale.

As data size increases, you need to scale horizontally to several nodes.

You need data to be distributed and localized in specific geographical regions. For
example EU data resides in Europe while US data resides in US.

You need strong consistency. Serializable isolation level.

You need to keep the SQL / relational data model.

You need distributed transactions.

You may want to pass on it if

   You data size can easily fit on a node for the foreseeable future.
   You organization is more comfortable with a stable proven database. (CockroachDb is
   still maturing).
   You data model is heavily normalized and you do a lot of joins in your queries. While this
   database can support joins, it is still not recommended in a highly distributed
   environment.

Architecture

Architecture is based on Google's Spanner paper.

It is a key value store with a SQL interface on top of it.

Database is a cluster of nodes. All nodes are equal. Nodes may join and leave the cluster at any time.

Sorted map of key values pairs. Fully ordered monolithic key space. All tables/indexes go into the same key space by encoding tablename/indexname/key together.

Sharding

Key value pairs are broken up into contiguous ranges.

When range size reaches 512 Mib (2 power 20) It is split into 2 ranges.

Each range is assigned to a node and replicated.

If you have 1 node all the shards are in that node. To scale, you add more nodes and the shards get distributed across nodes. A minimum of 3 nodes is recommended.

Very easily spin up node(s) and add to cluster anywhere.

Btree like index structure used to locate shard that has a key.

Replication

Data in each range is replicated using the Raft consensus algorithm.

A minimum replication factor of 3 is needed.

This provides the high availability. Data is available as long as a majority of the nodes in the cluster are available.

Geo-partitioning

By adding a country or region to the primary key, you can limit storage to keys to a particular region. So European data can be make to reside in Europe, US data in US and so. This has 2 benefits
There is a performance benefit and data is local to its users.

It can satisfy legal requirements where data is not allowed to leave a country or region.

Read/Write

Reads

Any node can receive a request to read a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader returns the data to the node that requested it. Since leader returns the data, no consensus is required.

Node returns it to the client.

Writes

Any node can receive a request to write a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader writes the value to its log and initiates consensus with replicas for the range. When majority acknowledges, the key/value is considered committed and leader notifies the requesting node which notifies the client

Transactions

Supports transactions that spans multiple tables and rows.

Transactions can be distributed, that is span multiple nodes.

The supported isolation level is strict serializability which is the highest isolation level. Strict serializability means that not only are transactions ordered, but they are ordered as per wall clock time.
Transaction protocol is an improvement over two phase commit. In parallel, participants acquire locks and create write intents. The transaction is marked staged. When the client commits, if all locks are acquired and writes are replicated, the coordinator immediately returns success to client. In background the transaction is marked committed. This is one round trip between transaction coordinator and each participant - unlike two phase commit - which requires two round trips.

Hybrid logical clocks are used to timestamp each transaction. Timestamp is the version for MVCC.

Data Model

Clients see only the SQL row column relation model
Wire protocol is same as Postgresql wire protocol.

Performance

Efficient range scan.
Geo partitioning improves performance by locality.
Distributed SQL execution.
Distributed transactions will be slow.
Generally you do not want distributed transactions over large distances. If you build a 3 node CockroachDb cluster with 1 node in NewYork, 1 in London and 1 in San Francisco, the write latencies are going to be very high due to the round trips for RAFT and distributed transactions. The cluster topology needs to be designed appropriately to give you the lowest latency at the desired level of high availability.

Administration

Good command line tools and UI console make the the administration easy.
Since all nodes are equals, number of moving parts that need to be administered is low.

Summary

If you need a globally distributed database with strict serializability, this is definitely a database to look at. It has good pedigree. However remember that distributed databases are not drop in replacement for your traditional RDBMSs. Distributed queries especially joins and distributed transaction can be slow. So some application redesign, some denormalization is always required.

Note: Moved from heavydutysoftware.com