Showing posts with label System Design. Show all posts

Sunday, November 30, 2025

Review : Facebook Memcache Paper

Introduction

Facebook (Meta) application handles billions of requests per second. The responses are built using thousands of items that needs to retrieved at low latencies to ensure good user experience. Low latency is achieved by retrieving the items from a cache. Once their business grew, a single node cache like Memcached would obviously not be able to handle the load. This paper is about how they solved the problem by building a distributed cache on top of Memcached.

In 2025, some might say the paper is a little dated ( 2013). But I think it is still interesting for several reasons. It is one of the early papers from a time when cloud and distributed systems exploded. I see it in the same category as the Dynamo paper from Amazon. While better technology is available today, this paper teaches important concepts. More importantly the paper shows how to take technology available and make more out it. In this case, they took a single node cache and built a distributed cache on top of it.

This is my summary of the paper Scaling Memcache at Facebook.

Requirements at Facebook

- Allow near real time communication
- aggregate content on the fly
- access and update popular shared content
- scale to millions of user requests per second

The starting point was single node Memcached servers.
The target was a general purpose memory distributed key value store - called Memcache, that would be used by a variety of application use cases.

In the paper and in this blog, Memcached refers to the popular open source in memory key value store and Memcache is the distributed cached that Facebook built.

Observations:

Read volumes are several orders of magnitude higher than write volumes.
Data fetched from multiple sources - HDFS, MySql etc
Memcached supports simple primitives - set, get, delete

Details

The diagram below shows the Memcache architecture.

Memcache is a demand filled look aside cache. There can be thousands of memcached servers within a memcache cluster.

When an application needs to read data, it tries to get it from memcache. If not found in Memcache, it gets the data from the original source and populates Memcache.

When the application needs to write data, it writes to original source and the key in Memcache is invalidated.

Wide fan out: When the front end servers scale, the backend cache needs to scale too.

Items are distributed across Memcached servers using consistent hashing.

To handle a request, front end server might need to get data from many Memcached servers.
Front end servers use a Memcache client to talk to memcached servers. Client is either a library or a proxy called mcrouter. Memcached servers do not communicate with each other.

Invalidation of the Memcached key is done by code running on the storage. It is not done by the client.

Communication

UDP is used for get requests. (Surprised ? clearly an optimization). Direct from client in webserver to Memcached. Dropped requests are treated as a cache miss.
TCP via mcrouter used for set and delete requests. Using mcrouter helps manage connections to storage.
Client implements flow control to limit load on backend components

Leases

Leases were implemented to address stale sets and thundering herd. Stale sets are caused by updating the cache with invalid values. Requiring a lease lets the system check that update is valid.
Thundering herd happens when there is heavy read write activity on the same key at the same time. By handing out leases only every so often say 10 sec, they slow things down.

Memcache pools

This is a general purpose caching layer used by different applications, different workloads with different requirements. To avoid interference, the clusters servers are partitioned into pools. For example, a pool for keys that are accessed frequently and cannot tolerate cache miss. Another pool for infrequently accessed keys. Each pool can scaled separately depending on requirement.

Failures

For small failures, the requests are directed to a set dedicated backup servers called gutters. When a large number of servers in the cluster down, the entire cluster is considered offline and traffic is routed to another cluster.

Topology

A frontend cluster is a group of webservers with Memcache.
A region is frontend cluster plus storage.
This keeps the failure domains small.
Storage is the final source of truth. Use mcsql and mcrouter to invalidate cache.

When going across data centers and across geo regions, the storage master in one region replicates to replicas in other regions. On an update, when the Memcached needs to be invalidated by the storage, it is not a problem if the update is in the master region. But if the update is in the replica region, then the read after write might read a stale data as the replica might not have caught up. In replica regions, markers are used to ensure that only data in sync with the master is read.

Summary

In summary, this paper show how Facebook took a single node Memcached and scaled it to its growing needs. No new fundamental theories are applied or discussed. But this demonstrated innovation and trade offs in engineering to scale and grow in product in production that needs to scale to meet user demand.

Key point is that separation cache and storage allowed each to be scaled separately. Kept focus on monitoring, debugging and operational efficiency. Gradual rollout and rollback of features kept a running system running.

Some might say that Memcache is a hack. But some loss in architectural purity is worth it -- if your users and business stay happy.

Memcache and Facebook were developed together with application and system programmers working together to evolve and scale the system. This does not happen when teams work in isolation.

Saturday, September 6, 2025

CRDT Tutorial: Conflict Free Replication Data Types

Have you ever wondered how Google docs, Figma, Notion provide real time collaborative editing?

The challenge is : What happens when 2 users edit the same part of the document at the same time.

User A at position 5: types X
User B at position 5: types Y

This is a concurrency problem. A traditional implementation would need to lock the document to handle this. But that would destroy real-time responsiveness. There is a need to automatically resolve conflicts so that every one ends up with same document state.

In Google docs, CRDTs are used to handle concurrent text edits, ensuring that if users insert text at the same position, the system is able to resolve the order without conflicts.

What is a CRDT?

CRDT stands for conflict free replication data type.

A CRDT is a specially designed data structure for distributed systems that:

Can be replicated across multiple nodes or regions.
Allows each replica to be updated independently and concurrently (without locks or central coordination).
Guarantees that all replicas will converge to the same state eventually, without conflicts, even if updates are applied in different orders.

Why do we need CRDTs?

In collaborative editing (like Google Docs, Notion, Figma):

Many users may edit the same document concurrently.
Network latency or partitions mean updates may arrive in different orders at different servers.
We can’t just “last-write-wins” — that would lose user edits.
We want low-latency local edits (user sees their change immediately), with eventual consistency across the system.
Typical in distributed systems

CRDTs give us a way to allow users to edit locally first and let the system reconcile changes without central locks.

Types of CRDTs

There are two broad families:

State-based (Convergent CRDTs, CvRDTs)
- Each replica occasionally sends its full state to others.
- Merging = applying a mathematical "join" function (e.g., union, max).
Operation-based (Commutative CRDTs, CmRDTs)

Each replica sends only the operations performed (e.g., "insert X at position 2").
These operations are designed so that applying them in any order yields the same final result.

Examples of CRDTs in Practice

G-Counter (Grow-only counter): Each replica increments a local counter, merge = element-wise max.
PN-Counter (Positive-Negative counter): Like G-counter, but supports increment & decrement.
G-Set (Grow-only set): Only supports adding elements.
OR-Set (Observed-Remove set): Supports add & remove without ambiguity.
RGA (Replicated Growable Array) or WOOT or LSEQ: For collaborative text editing, where inserts/deletes happen at positions in a string.

These are the basis for how real-time editors like Google Docs or Figma handle concurrent text/graphic editing.

Below is a simplistic Java implementation of a CRDT:

https://github.com/mdkhanga/blog-code/tree/master/general/src/main/java/com/mj/crdt

The code above provides a simple implementation of a G-counter that supports insert, update, delete and merges replicas by taking the maximum value for each node. It is a starting point to understand how CRDTs ensure convergence in distributed systems.

CRDT vs. Centralized Coordination

If concurrent editing is rare → a simple centralized lock/version check may be enough (like your first idea).
If concurrent editing is common (e.g., Figma boards with dozens of people) → you want CRDTs to avoid merge conflicts.

In short:

A CRDT is a mathematically designed data structure that ensures all replicas in a distributed system converge to the same state without conflicts — perfect for real-time collaborative editing.

Note that this would be needed only for collaborative editing at scale in distributed systems. For anything else, it could be an overkill.

Saturday, August 30, 2025

Cache in front of a slow database ?

Should You Front a Slow Database with a Cache?

Most of us have been there: a slow database query is dragging down response times, dashboards are red, and someone says, “Let’s put Redis in front of it.”

I have done it myself for an advertising system that needed response times of less than 30 ms. It worked very well.

It’s a tried-and-true trick. Caching can take a query that costs hundreds of milliseconds and make it return in single-digit milliseconds. It reduces load on your database and makes your system feel “snappy.” But caching isn’t free — it introduces its own problems that engineers need to be very deliberate about.

Good Use Cases for Caching

Read-heavy workloads
When the same data is read far more often than it’s written. For example, product catalogs, user profiles, or static metadata.
Expensive computations
Search queries, aggregated analytics, or personalized recommendations where computing results on the fly is costly.
Burst traffic
Handling sudden spikes (sales events, sports highlights, viral posts) where the database alone cannot keep up.
Low latency requirements
Some systems have low latency requirements. Clients need a response is say less than 50 ms or client aborts.

The Catch: Cache Consistency

The hardest part of caching isn’t adding Redis or Memcached — it’s keeping the cache in sync with the database.

Here are the main consistency issues you’ll face:

Stale Data
If the cache isn’t updated when the database changes, users may see outdated results.
Example: A user updates their shipping address, but the checkout flow still shows the old one because it’s cached.
Cache Invalidation
The classic hard problem: When do you expire cache entries? Too soon → database load spikes. Too late → users see stale values.
Race Conditions
Writes may hit the database while another process is still serving old cache data. Without careful ordering, you risk “losing” updates.

Common Strategies

Cache Aside (Lazy Loading)
Application checks cache → if miss, fetch from DB → populate cache.
✅ Simple, common.
❌ Risk of stale data unless you also invalidate on updates.
Write-Through
Writes always go through the cache → cache updates DB.
✅ Consistency is better.
❌ Higher write latency, more complexity.
Write-Behind
Writes update the cache, and DB updates happen asynchronously.
✅ Fast writes.
❌ Risk of data loss if cache fails before DB is updated.
Time-to-Live (TTL)
Expire cache entries after a set period.
✅ Easy safety net.
❌ Not precise; stale reads possible until expiry.

So, Is It Worth It?

If your workload is read-heavy, latency-sensitive, and relatively tolerant of eventual consistency, caching is usually a big win.

But if your workload is write-heavy or requires strict consistency (think payments, inventory, or medical records), caching can create more problems than it solves.

The lesson: don’t add Redis or Memcached just because they’re shiny tools. Add them because you’ve carefully measured your system, know where the bottleneck is, and can live with the consistency trade-offs.

Takeaway:
Caching is like nitrous oxide for your system — it can make things blazing fast, but you need to handle it with care or you’ll blow the engine.