How automatic sharding works, or consistent hashing under the hood


Preface

Here we’re going to talk primarily about consistent hashing. This technique touches on adaptive load balancing, routing and partitioning in distributed computing. There are many articles on the internet about it (refer to the list of references for some of them), but I haven’t found information about how and where to keep the ring of hashes, so I’ve decided to describe several options with their pros and cons. To make this post clearer for a wider audience, I will first give a brief introduction to what this is all about and cover the ring storage strategies at the end of the post. If you’re already familiar with the algorithm, you may want to skip the main part and jump straight to the last chapter for the pros and cons of the described approaches.

Distributed cache and straightforward uniform load balancing

Key-value stores are extremely fast at single-key lookups. A very popular variant is the distributed hash table (DHT), which is kept in a fully decentralized manner and is therefore well suited to unstable networks where nodes can leave or join at any moment. Note that a DHT is not suitable for range queries; I will probably write a separate post about the special data structures responsible for that. Now let’s consider a classic case: you have a cluster of cache servers across which you load-balance a huge data set uniformly. To determine on which node a <key, value> pair should be kept, we use a simple hash mapping:

cache machine = hash(o) mod n, where n is the number of machines in the cluster and o is the object to put or look up.

What happens when the number of machines changes at runtime? You might add or remove machines for scalability reasons, because of a failure, or whatever. The change triggers a rehash that moves almost all objects to new locations: going from n to n+1 machines relocates roughly a fraction n/(n+1) of your data. This negates most of the advantages of a distributed hash table, so we need a way to avoid this massive remapping. This is where consistent hashing comes in.
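To make the problem concrete, here is a minimal sketch (the key names and cluster sizes are made up, not from the original post) that counts how many keys change their target machine when the cluster grows from n to n+1 under plain modulo hashing:

import java.util.stream.IntStream;

public class ModuloRemapDemo {

    // Plain modulo placement: hash(o) mod n.
    static int machineFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    public static void main(String[] args) {
        int n = 10;             // hypothetical cluster size
        int keys = 100_000;     // hypothetical number of cached objects

        long moved = IntStream.range(0, keys)
                .filter(i -> machineFor("object-" + i, n) != machineFor("object-" + i, n + 1))
                .count();

        // With modulo hashing roughly n/(n+1) of the keys end up on a different machine.
        System.out.printf("moved %d of %d keys (%.1f%%)%n", moved, keys, 100.0 * moved / keys);
    }
}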

Consistent hashing

The main idea is to hash both data ids and cache machines onto the same numeric range using the same hash function. E.g. in Java the primitive type int covers the range from -2^31 to 2^31-1. Assume the interval is [0, 2^31-1] for simplicity (Java primitives cannot be unsigned). Now let’s join the starting and ending points of this interval so the values wrap around and form a ring. We do not have 2^31-1 available servers; the large size of the ring is merely intended to avoid collisions. A good choice of hash function is e.g. MD5 or SHA-1. As a machine’s identifier we can take its IP address and apply that hash function to it; by taking the first few bytes of the digest we can map the machine onto our ring [0, 2^31-1].

[Figure: the hash ring covering the range [0, 2^31-1]]

Both the nodes and the keys are mapped to the same range on the ring. Now we need to figure out which data ids on this ring belong to which server’s IP. It’s really simple: we move clockwise starting from zero (the starting point on the ring) and apply the main rule of consistent hashing: if IP-n1 and IP-n2 are two adjacent nodes on the ring, all data ids between them belong to IP-n1. That’s it.

[Figure: mapping of data ids to nodes on the ring]

As depicted: { Id1, Id2, Id3} ∈ IP-3; {Id4} ∈ IP-1; ∅ ∈ IP-2.

Conclusion: with consistent hashing we do not need to rehash the whole data set. Instead, a new server takes its place on the ring at a position determined by its hash value, and only part of the objects stored on its successor must be moved to it. The reorganization is local; all the other nodes remain unaffected. If you add a machine to the cluster, only the data that needs to live on that machine is moved there; all the other data stays where it is. Because the hash function itself does not change, the scheme stays consistent across successive changes of the network configuration. Like naive hashing, consistent hashing spreads the distributed dictionary almost evenly across the cluster. One point to mention is what happens when a node goes down due to some disaster: consistent hashing alone doesn’t meet our reliability requirements, because that node’s data is lost. Therefore there should definitely be replication and high availability, which is feasible but out of scope for this introduction. You may want to check the references at the end of this article to find out more.

Problems with pure consistent hashing

In a nutshell, basic consistent hashing has the following problems:

  • With randomly placed nodes, a large amount of data may still have to be rehashed and moved when the ring changes.
  • Since each node owns a single contiguous range of keys, one node may end up carrying a much larger keyspace than others, which still creates imbalance.
  • A node leaving or joining the ring hands its whole range to, or takes it from, a single neighbour, unbalancing the data.
  • A more powerful machine should handle more data than the others, but the scheme has no way to account for that.
  • The fraction of data that has to be moved is unpredictable and can be much higher than the ideal 1/n.

Virtual nodes solve these issues.

Virtual nodes come to the rescue

Virtual nodes minimize the changes to a node’s assigned range by assigning a number of smaller ranges to each physical node. In other words, the amount of data that has to be moved from one physical node to the others is minimized. Let’s split each real node into a number of virtual nodes. The idea is to build equally sized subintervals (partitions) on the ring by dividing the hash space into P evenly sized partitions and assigning P/N partitions per host. When a node joins or leaves, partitions are taken uniformly from all existing servers and assigned to the new server, or handed back to the remaining ones, respectively. The number of virtual nodes is picked once when the ring is built and never changes over the lifetime of the cluster. This ensures that each node takes an equal share of the full data set, namely P/N partitions, so the data is now distributed far more uniformly. It also implies that the number of virtual nodes must be much higher than the number of real ones.
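As an illustration of the partitioning arithmetic described above (not taken from the original post; the host addresses and the values of P and N are made up, and this is the fixed-partition flavour rather than the hashed virtual nodes used in the code further down), here is a small sketch that splits the hash space [0, 2^31-1] into P equal partitions and assigns them round-robin to N hosts:

import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentDemo {

    public static void main(String[] args) {
        final long RING_SIZE = 1L << 31;   // hash space [0, 2^31 - 1]
        final int P = 16;                  // hypothetical number of partitions (virtual nodes)
        List<String> hosts = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        final int N = hosts.size();

        long partitionSize = RING_SIZE / P;

        // Each host receives P/N equally sized partitions in round-robin order.
        List<String> assignments = new ArrayList<>();
        for (int p = 0; p < P; p++) {
            long start = p * partitionSize;
            long end = (p == P - 1) ? RING_SIZE - 1 : start + partitionSize - 1;
            String owner = hosts.get(p % N);
            assignments.add(String.format("partition %2d [%d..%d] -> %s", p, start, end, owner));
        }
        assignments.forEach(System.out::println);
    }
}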

[Figure: a ring with virtual nodes]

Here is some fairly simple Java code for a consistent-hashing ring with virtual nodes.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class Ring<T> {

    // Maps the hash of every virtual node to its physical node.
    private final SortedMap<Long, T> ring = new TreeMap<Long, T>();
    private final MD5Hash hash = new MD5Hash();
    private final int vnodeCount;

    public Ring(int vnodeCount, Collection<T> pNodes) {
        this.vnodeCount = vnodeCount;
        for (T pNode : pNodes) {
            addNode(pNode);
        }
    }

    // Places vnodeCount virtual nodes of the physical node onto the ring.
    public void addNode(T pNode) {
        for (int i = 0; i < vnodeCount; i++) {
            ring.put(hash.hash(pNode.toString() + i), pNode);
        }
    }

    // Removes all virtual nodes of the physical node from the ring.
    public void removeNode(T pNode) {
        for (int i = 0; i < vnodeCount; i++) {
            ring.remove(hash.hash(pNode.toString() + i));
        }
    }

    // Finds the node responsible for the given object id: the first virtual
    // node at or after the object's hash, wrapping around to the start of the ring.
    public T getNodeByObjectId(String objectId) {
        if (ring.isEmpty()) {
            return null;
        }
        long hashValue = hash.hash(objectId);
        if (!ring.containsKey(hashValue)) {
            SortedMap<Long, T> tailMap = ring.tailMap(hashValue);
            hashValue = tailMap.isEmpty() ? ring.firstKey() : tailMap.firstKey();
        }
        return ring.get(hashValue);
    }

    private static class MD5Hash {
        private final MessageDigest instance;

        MD5Hash() {
            try {
                instance = MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }

        // Hashes a key and folds the first 4 bytes of the digest into a long.
        long hash(String key) {
            instance.reset();
            instance.update(key.getBytes());
            byte[] digest = instance.digest();

            long h = 0;
            for (int i = 0; i < 4; i++) {
                h <<= 8;
                h |= ((int) digest[i]) & 0xFF;
            }
            return h;
        }
    }
}
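A quick usage sketch for the Ring class above (the server addresses, object ids and vnode count are hypothetical, not from the original post):

import java.util.Arrays;
import java.util.List;

public class RingDemo {
    public static void main(String[] args) {
        List<String> servers = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        Ring<String> ring = new Ring<String>(64, servers);   // 64 virtual nodes per server

        // Each object id is routed to the first virtual node at or after its hash.
        System.out.println(ring.getNodeByObjectId("user:42"));
        System.out.println(ring.getNodeByObjectId("session:abc"));

        // Removing a server only remaps the keys that were owned by its virtual nodes.
        ring.removeNode("10.0.0.2");
        System.out.println(ring.getNodeByObjectId("user:42"));
    }
}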

Strategies for storing the ring data structure, with their pros and cons

There are a few options for where to keep the ring data structure:

  • Central point of coordination: a dedicated machine keeps the ring and works as a central load balancer that routes requests to the appropriate nodes.
  1. Pros: Very simple implementation. A good fit for a fairly static system with a small number of nodes and/or a small amount of data.
  2. Cons: The big drawbacks of this approach are scalability and reliability. Robust distributed systems don’t have a single point of failure.
  • No central point of coordination – full duplication: each node keeps a full copy of the ring. Applicable to stable networks. This option is used e.g. in Amazon Dynamo.
  1. Pros: Queries are routed in one hop directly to the appropriate cache server.
  2. Cons: A server joining or leaving the ring requires notifying and updating all cache servers in the ring.
  • No central point of coordination – partial duplication: each node keeps a partial copy of the ring. This option is a direct implementation of the CHORD algorithm. In DHT terms, each cache machine has a predecessor and a successor; when it receives a query, it checks whether it holds the key. If it doesn’t, a mapping function determines which of its neighbours (successor or predecessor) has the least distance to that key, and the query is forwarded to that neighbour. The process continues until the current cache machine finds the key and sends it back (a routing sketch follows after this list).
  1. Pros: For highly dynamic membership changes the previous option is not a fit because of the heavy gossiping overhead among nodes, so this option is the choice in that case.
  2. Cons: No direct routing of messages. The complexity of routing a message to the destination node in the ring is O(log N).
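Below is a minimal, self-contained sketch of the greedy forwarding rule described in the last bullet. It is not a full CHORD implementation (there are no finger tables, so routing here is linear rather than O(log N)); the node positions, the key hash and the single-process simulation are purely illustrative, and ownership follows the article’s convention that a key belongs to the nearest node preceding it clockwise.

import java.util.TreeMap;

public class GreedyRingRoutingDemo {

    static final long RING_SIZE = 1L << 31;   // hash space [0, 2^31 - 1]

    // Clockwise distance from one position on the ring to another.
    static long distance(long from, long to) {
        return Math.floorMod(to - from, RING_SIZE);
    }

    public static void main(String[] args) {
        // Hypothetical node positions on the ring.
        TreeMap<Long, String> nodes = new TreeMap<>();
        nodes.put(100_000_000L, "IP-1");
        nodes.put(900_000_000L, "IP-2");
        nodes.put(1_600_000_000L, "IP-3");

        long keyHash = 1_234_567_890L;        // hash of the object we are looking up
        long current = nodes.firstKey();      // start the query at an arbitrary node

        // Forward the query to whichever neighbour (successor or predecessor)
        // is closer to the key, until no neighbour improves on the current node.
        while (true) {
            Long succ = nodes.higherKey(current);
            if (succ == null) succ = nodes.firstKey();
            Long pred = nodes.lowerKey(current);
            if (pred == null) pred = nodes.lastKey();

            long best = current;
            if (distance(succ, keyHash) < distance(best, keyHash)) best = succ;
            if (distance(pred, keyHash) < distance(best, keyHash)) best = pred;

            if (best == current) break;       // this node is responsible for the key
            System.out.println("forwarding from " + nodes.get(current) + " to " + nodes.get(best));
            current = best;
        }
        System.out.println("key served by " + nodes.get(current));
    }
}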

Current trends in consistent hashing

There is currently a boom of new products that implement this technique. Some of them are: Dynamo, Riak, Cassandra, MemCached, Voldemort, CouchDB, Oracle Coherence, trackerless BitTorrent networks, web-caching frameworks and content distribution networks.

References
