Unique ID generation is a deceptively simple task in application development. While it can seem trivial in a monolithic system, getting it right in a distributed system presents a few challenges. In this blog, I explore the commonly used ID generation techniques, where they fail and how to improve on them.
Why are unique IDs needed in applications?
They are needed to prevent duplicates. Say the system creates users, products, or some other entity. You could use the name as the key for a uniqueness constraint, but there could be two users or products with the same name. For users, email is an option, but that might not be available for other kinds of entities.
Sometimes you are consuming messages from other systems over platforms like Kafka. Processing the same message multiple times can lead to errors, and duplicate delivery can happen for a variety of reasons outside the consumer's control. Senders therefore include a unique ID with the message so that the consumer can ignore IDs it has already processed (idempotency).
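As a minimal sketch of that consumer-side deduplication (the message shape and the in-memory set are my own illustration, not a Kafka API; a real system would persist the seen IDs):

# Idempotent consumer sketch: skip any message whose ID was seen before.
processed_ids = set()

def handle(message: dict) -> None:
    msg_id = message["unique_id"]
    if msg_id in processed_ids:
        return  # duplicate delivery: ignore it
    # ... apply the message's effects here ...
    processed_ids.add(msg_id)

handle({"unique_id": "42", "payload": "create user"})
handle({"unique_id": "42", "payload": "create user"})  # second call is a no-op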
They can also provide ordering. Ordering lets you determine which event happened before another, which is useful if you want to query for the most recent events or ignore older ones.
What should the size and form of an ID be?
Should it be a number or should it be alphanumeric? Should it be 32-bit, 64-bit, or larger?
With 32 bits, the maximum numeric ID you can generate is ~4 billion. That is sufficient for most systems, but not if your product is internet scale. With 64 bits, you can generate on the order of 18 quintillion (~1.8 × 10^19) IDs.
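For a concrete sense of those ranges:

print(2**32 - 1)  # 4,294,967,295 -> ~4.3 billion IDs with 32 bits
print(2**64 - 1)  # 18,446,744,073,709,551,615 -> ~1.8 x 10^19 IDs with 64 bits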
But size is not the only issue. IDs will not be generated in sequence, in one place, one after the other. In any system of even moderate scale, unique IDs will need to be generated from multiple nodes.
On the topic of form, numeric IDs are generally preferred as they take up less storage and can be easily ordered and indexed.
In the rest of the blog, I go over some unique id generation techniques I have come across.
ID Generation Techniques
1. Auto increment feature of the database
If your service uses a database, this becomes an obvious choice, as every database supports auto increment.
With PostgreSQL, you would set up the table as:
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT NOT NULL);
With MySQL:
CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255) NOT NULL);
Every insert into the table generates an id in the id column.
INSERT INTO users(name) VALUES ('JOHN');
INSERT INTO users(name) VALUES ('MARY');
 id | name
----+------
  1 | JOHN
  2 | MARY
While this is simple to set up and use, it is only appropriate for simple CRUD applications with low traffic. The disadvantages are:
- It requires an insert into the DB to get an ID
- If an insert fails for other reasons, you can end up with gaps in the sequence
- For distributed applications, it is a single point of failure
The single point of failure can be addressed by setting up multiple databases in a multi-master arrangement, with each master configured to hand out non-overlapping sequences (in MySQL, via the auto_increment_increment and auto_increment_offset settings). You can also reduce the load on the database by handing out IDs in batches of, say, a hundred at a time, as sketched below.
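Here is a sketch of that batching idea in Python (the class name and batch size are illustrative; I assume a callable that atomically advances a counter in the database by one batch and returns the start of the reserved block):

# Batched ID allocation: reserve a block of IDs from the DB in one
# round trip, then hand them out from memory until the block runs out.
BATCH_SIZE = 100

class BatchedIdAllocator:
    def __init__(self, reserve_block):
        # reserve_block: callable that atomically advances the DB counter
        # by BATCH_SIZE and returns the first ID of the reserved block
        self.reserve_block = reserve_block
        self.next_id = 0
        self.block_end = 0  # exclusive

    def next(self) -> int:
        if self.next_id >= self.block_end:
            start = self.reserve_block()
            self.next_id = start
            self.block_end = start + BATCH_SIZE
        value = self.next_id
        self.next_id += 1
        return value

# Example with a fake in-memory "database" counter standing in for the DB:
counter = {"value": 0}
def reserve_block():
    start = counter["value"]
    counter["value"] += BATCH_SIZE
    return start

alloc = BatchedIdAllocator(reserve_block)
print(alloc.next(), alloc.next())  # 0 1

With this, only one in every hundred IDs costs a database round trip; the tradeoff is that IDs reserved by a node that crashes are lost, leaving gaps.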
2. UUID
These are 128-bit identifiers, usually written as 36-character hexadecimal strings, that are quite easy to generate and can be suitable as IDs because they are unique with overwhelming probability. An example UUID is e3d6d0e9-b99f-4704-b672-b7f3cdaf5618.
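For example, generating a random (version 4) UUID in Python:

import uuid

print(uuid.uuid4())  # e.g. e3d6d0e9-b99f-4704-b672-b7f3cdaf5618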
The advantages of UUIDs are:
- Easy to generate: just add a UUID generator to the service
- Low probability of collision
- Each replica will generate unique IDs independently

The disadvantages are:
- 128 bits for each ID will eventually take up space
- No ordering: IDs are random
3. Flickr ticket server
The Flickr ticket server is an improvement on the auto increment feature. In the example in 1, the ID is tied to the users table: you get a new ID only when you insert into the users table. If you needed unique IDs for another entity, you would need to add an auto increment column to that table. But what if you needed unique IDs, in increasing order, across tables or databases, as would be the case in a distributed system?
We could create a generic table
CREATE TABLE idgenerator (id SERIAL PRIMARY KEY);
This can work, but every request for an ID inserts a new row, so the table keeps accumulating rows of IDs that are also stored elsewhere.
What they did at Flickr was this:
CREATE TABLE ticketserver (id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, stub CHAR(1) NOT NULL, UNIQUE KEY (stub));
When they need an ID, they do:
REPLACE INTO ticketserver(stub) VALUES ('a');
SELECT LAST_INSERT_ID();
This table always has only one row: because stub has a unique key, every REPLACE deletes the previous row and inserts a new one with a fresh auto increment ID. The above code is MySQL specific.
For SQL that will work on PostgreSQL, you can do it a little differently:
CREATE TABLE ticketserver (stub CHAR(1) PRIMARY KEY, id BIGINT NOT NULL DEFAULT 0);

INSERT INTO ticketserver (stub, id)
VALUES ('a', 0)
ON CONFLICT (stub) DO UPDATE
SET id = ticketserver.id + 1
RETURNING id;

The first call inserts the row and returns 0; every subsequent call hits the conflict path, increments the counter, and returns the next ID.
4. Snowflake (the Twitter approach)
This is a Twitter approach that does not rely on the database or any third-party service. IDs can be generated in code just by following the specification.
It is a 64-bit ID.
The leftmost bit is a sign bit, kept at 0 and reserved for future use.
The next 41 bits are the timestamp: the current time in milliseconds since an epoch. Twitter used Nov 4 2010 01:42:54 UTC, but you can use any epoch.
The next 5 bits are the datacenter ID, which gives you 2^5 = 32 datacenters.
The next 5 bits are the machine ID: 32 machines per datacenter.
The last 12 bits are a sequence number, giving you 2^12 = 4096 IDs per millisecond (per machine, per datacenter).
The value of 2^41 - 1 is 2,199,023,255,551 milliseconds, so with 41 bits for the timestamp you get a lot of IDs: the timestamp space lasts almost 70 years from the chosen epoch.
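A quick back-of-the-envelope check of that figure in Python:

ms = 2**41 - 1  # 2,199,023,255,551 milliseconds of timestamp space
years = ms / (1000 * 60 * 60 * 24 * 365.25)
print(round(years, 1))  # ~69.7 years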
You can change things to reduce the size from 64 bits if needed. You may not need a datacenter ID, or you can decide to use fewer bits for the timestamp.
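Below is a minimal sketch of a Snowflake-style generator in Python, assuming the 41/5/5/12 layout and Twitter's epoch described above (the lock and the wait-for-next-millisecond loop are my own simplifications, and the sketch assumes the clock never moves backwards):

import threading
import time

EPOCH_MS = 1288834974657  # Twitter's epoch: Nov 4 2010 01:42:54 UTC

class Snowflake:
    def __init__(self, datacenter_id: int, machine_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ts = ts
            # 41 bits timestamp | 5 bits datacenter | 5 bits machine | 12 bits sequence
            return ((ts - EPOCH_MS) << 22) | (self.datacenter_id << 17) | (self.machine_id << 12) | self.sequence

gen = Snowflake(datacenter_id=1, machine_id=7)
print(gen.next_id())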
The advantages of this approach are:
- Decentralized generation: no dependency on the DB, and lower latency
- Time-based ordering
- Higher throughput
- The datacenter ID and machine ID can help in debugging

The disadvantages are:
- Clocks need to be synchronized
- Datacenters and machines need to be assigned unique IDs
- The epoch needs to be chosen wisely
Summary
ID generation evolves with your system's scale. When you are starting out, it is normal to keep things simple and go with auto increment, but sooner or later you will need to scale (a good problem to have), and the Flickr and Twitter methods are solid. I personally like the Twitter approach as it has no dependency on the database; it offers an excellent balance of decentralization, ordering, and efficiency, but requires clock synchronization. Whatever approach you choose, ensure that it aligns with your system's consistency requirements, scaling needs, and tolerance for operational complexity.
