Consistent Hashing
Overview
Consistent hashing is a distributed hashing technique in which a key's placement does not depend directly on the number of servers or objects in the system. Only a small share of keys has to move when servers are added or removed, which makes it well suited to distributed caches, databases, and load balancers.
What is Consistent Hashing?
Consistent hashing is a special kind of hashing that minimizes the number of keys that need to be remapped when a hash table is resized. It's particularly useful in distributed systems where we need to scale up or down the number of servers without having to rebuild the entire mapping of keys to servers.
In traditional hashing, where a key is assigned to slot hash(key) mod N, changing the number of slots (servers) N forces most keys to be remapped. This can be catastrophic in a distributed system where data migration is expensive. Consistent hashing solves this by ensuring that when a server is added or removed, only about K/N keys need to be remapped on average, where K is the number of keys and N is the number of servers.
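To make the cost of traditional hashing concrete, the sketch below counts how many keys change slots when a `hash mod N` table grows from 4 to 5 servers. It is an illustration only: the integer keys stand in for already-hashed values, and `movedFraction` is a hypothetical helper, not part of any library.

```typescript
// Fraction of keys whose slot changes when a "hash mod N" table is resized.
// Integer keys stand in for already-hashed values (an assumption for clarity).
function movedFraction(keyCount: number, oldN: number, newN: number): number {
  let moved = 0;
  for (let h = 0; h < keyCount; h++) {
    if (h % oldN !== h % newN) moved++;
  }
  return moved / keyCount;
}

// Growing from 4 to 5 servers over 10,000 evenly spread hash values.
const fraction = movedFraction(10000, 4, 5);
console.log(`modulo hashing moved ${(fraction * 100).toFixed(1)}% of keys`); // 80.0%

// With consistent hashing, roughly K/N keys move; for N = 5 that is about 20%.
console.log(`consistent hashing would move about ${(100 / 5).toFixed(1)}% of keys`);
```

In this setup four out of every five keys move under modulo hashing, compared with roughly one in five under consistent hashing.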
[Figure: Consistent Hashing Ring Structure]
Why Use Consistent Hashing?
- Minimal Key Redistribution: When servers are added or removed, only a small fraction of keys need to be remapped
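- Load Balancing: With a uniform hash function and enough points on the ring, keys spread roughly evenly across servers
- Scalability: Easy to scale up or down without major data movement
- High Availability: Simplifies failover, because a failed server's keys fall through to the next server on the ring while all other keys stay where they are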
- Reduced Hotspots: Virtual nodes help distribute load more evenly
How Consistent Hashing Works
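Consistent hashing maps both servers and keys onto a fixed circular space or "ring" (typically the range 0 to 2^32-1) using the same hash function. Each key is assigned to the first server found at or after its position when walking clockwise around the ring. If no server lies between the key and the end of the range, the walk wraps around to the first server on the ring.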
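The clockwise walk can be sketched as a binary search over sorted ring positions. The example below uses hand-picked positions rather than real hash values so the results are easy to follow; the server names and the `lookup` function are hypothetical.

```typescript
// Ring positions of three hypothetical servers, kept sorted in ascending order.
const ringPositions: number[] = [100, 400, 700];
const ringOwners: string[] = ["server-a", "server-b", "server-c"];

// Binary search for the first server position >= keyPosition.
// If there is none, wrap around to the first server on the ring.
function lookup(keyPosition: number): string {
  let lo = 0;
  let hi = ringPositions.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ringPositions[mid] < keyPosition) lo = mid + 1;
    else hi = mid;
  }
  return ringOwners[lo === ringPositions.length ? 0 : lo];
}

console.log(lookup(50));  // server-a
console.log(lookup(400)); // server-b (a key exactly at a server's position maps to it)
console.log(lookup(401)); // server-c
console.log(lookup(900)); // server-a (wraps around the end of the ring)
```

Binary search keeps each lookup at O(log n) in the number of ring positions, which matters once virtual nodes multiply the number of points.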
[Figure: Adding a New Server]
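When a server joins the ring, it takes over only the keys that fall between its predecessor and its own position; every other key keeps its current owner. The following minimal sketch shows this with hand-picked ring positions instead of real hash values. The `Ring` type, `ownerOf`, and all server names are hypothetical.

```typescript
type Ring = { position: number; server: string }[];

// Owner of a key: the first server at or after the key's position,
// wrapping to the first server on the ring if none is found.
function ownerOf(ring: Ring, keyPosition: number): string {
  const sorted = ring.slice().sort((a, b) => a.position - b.position);
  for (const entry of sorted) {
    if (entry.position >= keyPosition) return entry.server;
  }
  return sorted[0].server;
}

const ringBefore: Ring = [
  { position: 100, server: "server-a" },
  { position: 400, server: "server-b" },
  { position: 700, server: "server-c" },
];
// Hypothetical new server placed between server-b and server-c.
const ringAfter: Ring = ringBefore.concat([{ position: 550, server: "server-d" }]);

const keyPositions = [50, 150, 350, 450, 500, 600, 650, 800, 950];
const movedKeys = keyPositions.filter(
  k => ownerOf(ringBefore, k) !== ownerOf(ringAfter, k)
);

console.log(movedKeys); // [450, 500]: both moved from server-c to server-d
```

Only the keys in the range (400, 550] change owner. Removing a server is the mirror image: its keys move to the next server clockwise and nothing else is touched.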
Virtual Nodes
One challenge with basic consistent hashing is that the distribution of keys can become uneven, especially with a small number of servers. Virtual nodes solve this by having each physical server represent multiple points on the ring.
[Figure: Virtual Nodes Distribution]
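One way to see the effect of virtual nodes is to count how many keys land on each server for different virtual node counts. The sketch below does this under a few assumptions: it uses an FNV-1a style hash over character codes purely as a convenient demo hash, and the `keysPerServer` helper, the `server#index` naming of virtual nodes, and the `key-N` key format are all made up for the example.

```typescript
// FNV-1a style 32-bit hash over char codes, used here only as a demo hash.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // Unsigned 32-bit result
}

// Count how many of keyCount keys land on each server for a given vnode count.
function keysPerServer(serverNames: string[], vnodes: number, keyCount: number): number[] {
  const points: { position: number; owner: number }[] = [];
  serverNames.forEach((serverName, owner) => {
    for (let v = 0; v < vnodes; v++) {
      points.push({ position: fnv1a(`${serverName}#${v}`), owner });
    }
  });
  points.sort((a, b) => a.position - b.position);

  const counts = serverNames.map(() => 0);
  for (let k = 0; k < keyCount; k++) {
    const keyPosition = fnv1a(`key-${k}`);
    // First point at or after the key, wrapping to the first point on the ring.
    const hit = points.find(p => p.position >= keyPosition) ?? points[0];
    counts[hit.owner]++;
  }
  return counts;
}

const demoServers = ["server1", "server2", "server3"];
console.log("1 vnode each:   ", keysPerServer(demoServers, 1, 10000));
console.log("200 vnodes each:", keysPerServer(demoServers, 200, 10000));
```

The exact counts depend on the hash function and the names used, so no output is shown here. In general, the per-server counts move closer together as the number of virtual nodes grows, at the cost of a larger ring to store and search.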
Implementation Example
Here's an implementation of consistent hashing in TypeScript with support for virtual nodes:
```typescript
class ConsistentHash<T> {
  private ring: Map<number, T>;
  private sortedKeys: number[];
  private virtualNodes: number;
  private hashFn: (key: string) => number;

  constructor(nodes: T[] = [], vnodes: number = 100) {
    this.ring = new Map();
    this.sortedKeys = [];
    this.virtualNodes = vnodes;

    // Simple string hash over char codes (illustrative only; production
    // systems should use a hash function with better distribution)
    this.hashFn = (key: string): number => {
      let hash = 0;
      for (let i = 0; i < key.length; i++) {
        hash = ((hash << 5) + hash) + key.charCodeAt(i);
        hash = hash | 0; // Truncate to a 32-bit signed integer
      }
      return hash >>> 0; // Reinterpret as unsigned, giving 0 to 2^32-1
    };

    // Add initial nodes
    nodes.forEach(node => this.addNode(node));
  }

  addNode(node: T): void {
    // Add virtual nodes; `${node}` assumes T has a unique string form
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hashFn(`${node}-${i}`);
      // On a hash collision, keep the existing owner instead of overwriting it
      if (!this.ring.has(hash)) this.ring.set(hash, node);
    }

    // Update sorted keys
    this.sortedKeys = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  removeNode(node: T): void {
    // Remove virtual nodes, but only the positions this node actually owns
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hashFn(`${node}-${i}`);
      if (this.ring.get(hash) === node) this.ring.delete(hash);
    }

    // Update sorted keys
    this.sortedKeys = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  getNode(key: string): T | null {
    if (this.sortedKeys.length === 0) return null;

    const hash = this.hashFn(key);

    // Find the first node at or after the key on the ring, wrapping to the start
    const nodeKey = this.sortedKeys.find(k => k >= hash) ?? this.sortedKeys[0];
    return this.ring.get(nodeKey) ?? null;
  }

  getNodes(key: string, count: number): T[] {
    if (this.sortedKeys.length === 0) return [];

    const hash = this.hashFn(key);
    const nodes: T[] = [];
    const seen = new Set<T>();

    // Find starting position
    let start = this.sortedKeys.findIndex(k => k >= hash);
    if (start === -1) start = 0;

    // Collect unique nodes, walking at most one full lap so the loop
    // terminates even when count exceeds the number of physical nodes
    for (let step = 0; step < this.sortedKeys.length && nodes.length < count; step++) {
      const nodeKey = this.sortedKeys[(start + step) % this.sortedKeys.length];
      const node = this.ring.get(nodeKey)!;

      if (!seen.has(node)) {
        seen.add(node);
        nodes.push(node);
      }
    }

    return nodes;
  }
}

// Usage example
const ch = new ConsistentHash<string>(['server1', 'server2', 'server3'], 100);

// Add a new server
ch.addNode('server4');

// Get server for a key
const server = ch.getNode('user123');
console.log(`Key 'user123' is mapped to ${server}`);

// Get multiple servers for replication
const replicas = ch.getNodes('user123', 3);
console.log(`Replicas for 'user123' are: ${replicas.join(', ')}`);

// Remove a server
ch.removeNode('server2');
```
Real-World Applications
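- Distributed Caches: Memcached client libraries (for example, the ketama algorithm supported by libmemcached) use consistent hashing to distribute keys across nodes; Redis Cluster uses a related scheme based on a fixed set of 16,384 hash slots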
- Content Delivery Networks (CDNs): To determine which edge server should cache specific content
- Load Balancers: For distributing requests across backend servers
- Distributed Databases: Systems like Cassandra and DynamoDB use consistent hashing for data partitioning
- Distributed Object Storage: Systems like OpenStack Swift use a consistent-hashing ring to distribute objects across storage nodes
Best Practices
- Use Virtual Nodes: Implement virtual nodes to achieve better key distribution
- Choose Appropriate Hash Function: Use a hash function with uniform output (for example, MurmurHash or xxHash) so that servers and keys spread evenly around the ring
- Consider Replication: Store data on multiple nodes for fault tolerance
- Monitor Distribution: Keep track of key distribution to detect and address hotspots
- Handle Node Failures: Implement automatic failover and recovery mechanisms
- Cache Node Locations: Cache node lookup results to improve performance, and invalidate the cache whenever a server is added or removed
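As a rough illustration of the "Monitor Distribution" practice, per-server key counts can be reduced to a single imbalance figure. The sketch below is one possible metric, not a standard one: `loadImbalance`, the sample counts, and the 1.25 alert threshold are all assumptions made for the example.

```typescript
// Summarize how evenly keys are spread, given per-server key counts.
// maxOverMean is 1.0 for a perfectly even spread and grows with skew.
function loadImbalance(counts: number[]): { mean: number; maxOverMean: number } {
  if (counts.length === 0) return { mean: 0, maxOverMean: 0 };
  const total = counts.reduce((sum, c) => sum + c, 0);
  const mean = total / counts.length;
  const max = Math.max(...counts);
  return { mean, maxOverMean: mean === 0 ? 0 : max / mean };
}

// Hypothetical key counts for four servers; the last one is a hotspot.
const sample = loadImbalance([900, 1000, 1100, 2000]);
console.log(sample); // { mean: 1250, maxOverMean: 1.6 }

// Hypothetical alert threshold: flag when the busiest server holds
// more than 25% above the average.
const HOTSPOT_THRESHOLD = 1.25;
if (sample.maxOverMean > HOTSPOT_THRESHOLD) {
  console.log("Hotspot detected: consider more virtual nodes or rebalancing");
}
```

A ratio that keeps rising over time suggests either too few virtual nodes or a skewed key population that hashing alone cannot fix.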
Conclusion
Consistent hashing is a fundamental technique in distributed systems that enables efficient scaling and high availability. By minimizing the amount of data that needs to be redistributed when the system topology changes, it provides a robust foundation for building large-scale distributed systems.
While the basic concept is straightforward, careful consideration must be given to implementation details such as virtual nodes, hash function selection, and handling edge cases to ensure optimal performance and reliability in production environments.