Consistent Hashing

2 min read

  • One of the most amazing and popular algorithms out there, and the only problem it solves is data ownership: which node owns which key.

  • It will not transfer data for us; it is not a service in itself, only a scheme for deciding ownership.

  • Consistent hashing applies to any scenario where you need to distribute data across a cluster of servers. It is used in Cassandra, DynamoDB, and CDNs.

  • In infra-focused interviews where you’re asked to design a distributed system from scratch, common scenarios include:

    • Design a distributed database
    • Design a distributed cache
    • Design a distributed message broker

Understanding Consistent Hashing

  • Say we have a stateful distributed store
    • nodes store the data
    • a proxy forwards each request to a node
    • the end-user/client talks to the proxy
  • With naive routing the proxy picks a node with hash(key) % N, so when the number of nodes changes the routing function itself changes, e.g. from hash % 3 to hash % 2.
  • Now almost all keys would need to be re-evaluated and moved to the correct node, which involves a lot of data transfer (see the sketch below).
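To see how bad plain modulo routing gets, here is a minimal sketch in Python (key names and counts are made up for illustration); it counts how many keys change owner when a 3-node cluster shrinks to 2 nodes:

```python
import hashlib

def owner(key: str, num_nodes: int) -> int:
    """Naive routing: pick a node with hash(key) % N."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"key-{i}" for i in range(10_000)]

# Count how many keys map to a different node when N goes from 3 to 2.
moved = sum(1 for k in keys if owner(k, 3) != owner(k, 2))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly two thirds of all keys
```

Roughly two thirds of the keys change owner even though only one node left the cluster; this is exactly the mass data transfer consistent hashing avoids.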

Hash Fn (e.g. MD5 or SHA-1)

  • Because the hash function’s output is a fixed-size integer, we can treat its range as wrapping around and visualize it as a ring of integers. Every node occupies one slot on the ring; the slot is calculated by passing the node’s IP to the hash function (a minimal sketch follows).
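Here is a minimal ring sketch in Python, assuming MD5 as the hash function and made-up node IPs; a key is owned by the first node clockwise from its slot, wrapping around the ring.

```python
import hashlib
from bisect import bisect_right

def slot(value: str) -> int:
    """Map any string (a node IP or a key) to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Each node's position on the ring is the hash of its IP (IPs are made up).
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((slot(ip), ip) for ip in nodes)

def owner(key: str) -> str:
    """A key is owned by the first node at or after its slot, wrapping around."""
    positions = [pos for pos, _ in ring]
    i = bisect_right(positions, slot(key)) % len(ring)
    return ring[i][1]

print(owner("user:42"))  # one of the three IPs above
```

Lookups are a binary search over the sorted node positions, so routing stays cheap no matter how many keys there are.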

Scaling Up

  • When we add a new node to the “ring”: say node 3 hashes to slot 1. The keys that hashed between slot 12 and slot 1 will now be “owned” by node 3 instead of node 0.
  • Other keys continue to remain at their respective nodes - Minimal Data Movement
  • Operationally we have to (see the sketch below)
    • snapshot node 0
    • create node 3 from that snapshot
    • delete the unwanted keys from each node
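Continuing the ring sketch above (same made-up IPs), this shows that adding a fourth node only reassigns the keys on the arc it takes over:

```python
# Continuing the ring sketch above: record owners, add a node, compare.
keys = [f"key-{i}" for i in range(10_000)]
before = {k: owner(k) for k in keys}

ring.append((slot("10.0.0.4"), "10.0.0.4"))  # new node; IP is made up
ring.sort()

moved = sum(1 for k in keys if owner(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # on average ~1/4 with 4 nodes
```

Only the keys whose slot falls between the new node and its predecessor move, and they all come from a single existing node, so the operational copy is one contiguous range.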

Scaling Down

  • Say we scale down and remove node 1. All the keys that were owned by node 1 will now be owned by node 2 (next in the ring) -> Minimal Data Transfer
  • Operationally we have to (sketched below)
    • copy everything from node 1 to node 2
    • remove node 1 from the ring
    • delete node 1
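Again continuing the sketch above, removing a node moves only that node’s keys, and they all land on its clockwise successor:

```python
# Continuing the sketch: record owners, drop the node added above, compare.
before = {k: owner(k) for k in keys}
removed = "10.0.0.4"
ring[:] = [(pos, ip) for pos, ip in ring if ip != removed]

for k in keys:
    if before[k] == removed:
        assert owner(k) != removed    # these keys move to the next node clockwise
    else:
        assert owner(k) == before[k]  # every other key stays exactly where it was
```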

Virtual Nodes

  • In the scale-down case, all the events that were stored on one database (say database 2) had to move to the next one on the ring, database 3. Now database 3 has 2x the load of database 1 and database 4.
  • We’d much prefer to spread the load more evenly so database 3 isn’t overloaded.
  • The solution is to use what are called “virtual nodes”
  • Instead of putting each database at just one point on the ring, we put it at multiple points by hashing different variations of the database name.
  • The more virtual nodes you use per database, the more evenly the load is distributed (see the sketch below).
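A minimal sketch of virtual nodes, assuming made-up database names and 100 virtual nodes per database; each database is hashed onto the ring many times, once per name variation:

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def slot(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

VNODES = 100  # virtual nodes per database; more points -> smoother distribution
databases = ["db1", "db2", "db3", "db4"]

# Each database sits at many points on the ring, one per name variation.
ring = sorted((slot(f"{db}-vn{i}"), db) for db in databases for i in range(VNODES))
positions = [pos for pos, _ in ring]

def owner(key: str) -> str:
    return ring[bisect_right(positions, slot(key)) % len(ring)][1]

# Load per database is now close to even, and removing a database spreads its
# keys across all the remaining databases instead of dumping them on one.
print(Counter(owner(f"event-{i}") for i in range(100_000)))
```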