Consistent Hashing

2 min read

  • One of the most amazing and popular algorithms out there, and the only problem it solves is data ownership: which node owns which key.

  • It will not transfer data for us; it is not a service in itself, only a scheme for deciding ownership.

  • Consistent hashing applies to any scenario where you need to distribute data across a cluster of servers. It is used in Cassandra, DynamoDB, and CDNs.

  • In infra-focused interviews where you’re asked to design a distributed system from scratch, common scenarios include:

    • Design a distributed database
    • Design a distributed cache
    • Design a distributed message broker

Understanding Consistent Hashing

  • Say we have a stateful distributed store
    • nodes store the data
    • a proxy forwards each request to a node
    • the end-user/client talks to the proxy
  • With naive routing the proxy picks a node with hash(key) % N, so when the number of nodes changes the routing function itself changes, e.g. from hash % 3 to hash % 2.
  • Now almost all keys would need to be re-evaluated and moved to the correct node, which involves a lot of data transfer (see the sketch below).
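To see how bad plain modulo routing gets, here is a minimal sketch in Python (key names and counts are made up for illustration); it counts how many keys change owner when a 3-node cluster shrinks to 2 nodes:

```python
import hashlib

def owner(key: str, num_nodes: int) -> int:
    """Naive routing: pick a node with hash(key) % N."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"key-{i}" for i in range(10_000)]

# Count how many keys map to a different node when N goes from 3 to 2.
moved = sum(1 for k in keys if owner(k, 3) != owner(k, 2))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly two thirds of all keys
```

Roughly two thirds of the keys change owner even though only one node left the cluster; this is exactly the mass data transfer consistent hashing avoids.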

Hash Fn (e.g. MD5 or SHA-1)

  • Because the hash function’s output is a fixed-size integer, we can treat its range as wrapping around and visualize it as a ring of integers. Every node occupies one slot on the ring; the slot is calculated by passing the node’s IP to the hash function (a minimal sketch follows).
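Here is a minimal ring sketch in Python, assuming MD5 as the hash function and made-up node IPs; a key is owned by the first node clockwise from its slot, wrapping around the ring.

```python
import hashlib
from bisect import bisect_right

def slot(value: str) -> int:
    """Map any string (a node IP or a key) to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Each node's position on the ring is the hash of its IP (IPs are made up).
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((slot(ip), ip) for ip in nodes)

def owner(key: str) -> str:
    """A key is owned by the first node at or after its slot, wrapping around."""
    positions = [pos for pos, _ in ring]
    i = bisect_right(positions, slot(key)) % len(ring)
    return ring[i][1]

print(owner("user:42"))  # one of the three IPs above
```

Lookups are a binary search over the sorted node positions, so routing stays cheap no matter how many keys there are.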

Scaling Up

  • When we add a new node to the “ring”: say node 3 hashes to slot 1. The keys that hashed between slot 12 and slot 1 will now be “owned” by node 3 instead of node 0.
  • Other keys continue to remain at their respective nodes - Minimal Data Movement
  • Operationally we have to (see the sketch below)
    • snapshot node 0
    • create node 3 from that snapshot
    • delete the unwanted keys from each node
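Continuing the ring sketch above (same made-up IPs), this shows that adding a fourth node only reassigns the keys on the arc it takes over:

```python
# Continuing the ring sketch above: record owners, add a node, compare.
keys = [f"key-{i}" for i in range(10_000)]
before = {k: owner(k) for k in keys}

ring.append((slot("10.0.0.4"), "10.0.0.4"))  # new node; IP is made up
ring.sort()

moved = sum(1 for k in keys if owner(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # on average ~1/4 with 4 nodes
```

Only the keys whose slot falls between the new node and its predecessor move, and they all come from a single existing node, so the operational copy is one contiguous range.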

Scaling Down

  • Say we scale down and remove node 1. All the keys that were owned by node 1 will now be owned by node 2 (next in the ring) -> Minimal Data Transfer
  • Operationally we have to (sketched below)
    • copy everything from node 1 to node 2
    • remove node 1 from the ring
    • delete node 1
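Again continuing the sketch above, removing a node moves only that node’s keys, and they all land on its clockwise successor:

```python
# Continuing the sketch: record owners, drop the node added above, compare.
before = {k: owner(k) for k in keys}
removed = "10.0.0.4"
ring[:] = [(pos, ip) for pos, ip in ring if ip != removed]

for k in keys:
    if before[k] == removed:
        assert owner(k) != removed    # these keys move to the next node clockwise
    else:
        assert owner(k) == before[k]  # every other key stays exactly where it was
```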

Virtual Nodes

  • In the scale-down case, all the events that were stored on one database (say database 2) had to move to the next one on the ring, database 3. Now database 3 has 2x the load of database 1 and database 4.
  • We’d much prefer to spread the load more evenly so database 3 isn’t overloaded.
  • The solution is to use what are called “virtual nodes”
  • Instead of putting each database at just one point on the ring, we put it at multiple points by hashing different variations of the database name.
  • The more virtual nodes you use per database, the more evenly the load is distributed (see the sketch below).
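A minimal sketch of virtual nodes, assuming made-up database names and 100 virtual nodes per database; each database is hashed onto the ring many times, once per name variation:

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def slot(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

VNODES = 100  # virtual nodes per database; more points -> smoother distribution
databases = ["db1", "db2", "db3", "db4"]

# Each database sits at many points on the ring, one per name variation.
ring = sorted((slot(f"{db}-vn{i}"), db) for db in databases for i in range(VNODES))
positions = [pos for pos, _ in ring]

def owner(key: str) -> str:
    return ring[bisect_right(positions, slot(key)) % len(ring)][1]

# Load per database is now close to even, and removing a database spreads its
# keys across all the remaining databases instead of dumping them on one.
print(Counter(owner(f"event-{i}") for i in range(100_000)))
```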