I’m in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.
One of the most important and fundamental structures is a hashtable. In fact, a lot of modern programming languages have left hashtables behind, for reasons I’ll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.
A hashtable is a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that’s associated with it. This kind of structure is frequently called either a map, an associative array, or a dictionary.
For an example, think of a phonebook. You’ve got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person’s name, and then use it to get their phone number.
A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of a sort of schema: what basic operations does the structure support, and what performance characteristics does it have for those operations?
In those schematic terms, a hashtable is very simple. It’s a structure that maintains a mapping from keys to values. A hashtable really only needs two operations: put and get:

- put(key, value): add a mapping from key to value to the table. If there’s already a mapping for the key, then replace it.
- get(key): get the value associated with the key.
In a hashtable, both of those operations are extremely fast.
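To make the interface concrete, here’s the phonebook example using Python’s built-in dict (which is itself backed by a hashtable); the names and numbers are obviously made up:

# A phonebook as a key/value map: names are keys, phone numbers are values.
phonebook = {}
phonebook["Alice"] = "555-1234"   # put("Alice", "555-1234")
phonebook["Bob"] = "555-9876"     # put("Bob", "555-9876")
phonebook["Alice"] = "555-0000"   # putting an existing key replaces the old value

print(phonebook.get("Alice"))     # get("Alice") -> "555-0000"
print(phonebook.get("Carol"))     # missing key -> None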
Let’s think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a couple of simple naive ways of implementing it.
We’ve got a list of names and phone numbers. We want to know how long it’ll take to find a particular name. How quickly can we do it?
How long does that take, naively? It depends on how many keys and values there are, and what properties the keys have that we can take advantage of.
In the worst case, there’s nothing to help us: the only thing we can do is take the key we’re looking for, and compare it to every single key. If we have 10 keys, then on average we’ll need about 5 steps before we find the key we’re looking for. If there are 100 keys, it’ll take about 50 steps on average. If there are one million keys, it’ll take an average of half a million steps before we can find the value.
If the keys are ordered – that is, if we can compare one key to another not just for equality, but can also ask which comes first using a “less than or equal to” operator – then we can use binary search. With binary search, we can find an entry in a list of 10 elements in 4 steps, an entry in a list of 1000 keys in 10 steps, or one in a list of one million keys in 20 steps.
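To make those two naive approaches concrete, here’s a rough sketch of each – the helper names are mine, just for illustration:

def linear_lookup(pairs, key):
    # Worst case: compare against every key in the list.
    for k, v in pairs:
        if k == key:
            return v
    return None

def binary_lookup(sorted_pairs, key):
    # Requires the pairs to be sorted by key; halves the search space each step.
    lo, hi = 0, len(sorted_pairs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        k, v = sorted_pairs[mid]
        if k == key:
            return v
        elif k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None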
With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That’s the point of a hashtable. It might be really hard to do something like generate a list of all of the keys – but if all you want to do is look things up, it’s lightning fast.
How can it do that? It’s a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.
A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs that it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) into. When you want to look for the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.
It’s easiest to understand by looking at some actual code. Here’s a simple, not at all realistic implementation of a hashtable in Python:
class Hashtable(object):
    def __init__(self, hashfun, size):
        self._size = size
        self._hashfun = hashfun
        self._table = [[] for i in range(size)]

    def hash(self, key):
        # Map the key's hash value to a bucket index.
        return self._hashfun(key) % self._size

    def put(self, key, value):
        self._table[self.hash(key)].append((key, value))

    def get(self, key):
        # Search only the one bucket that the key hashes to.
        for k, v in self._table[self.hash(key)]:
            if k == key:
                return v
        return None
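Here’s how that toy class might be used, with Python’s built-in hash() standing in for a real hash function:

# Using the toy hashtable above, with Python's built-in hash() as the hash function.
table = Hashtable(hash, 31)
table.put("Alice", "555-1234")
table.put("Bob", "555-9876")

print(table.get("Alice"))   # -> "555-1234"
print(table.get("Carol"))   # not present -> None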
If you’ve got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using its hashcode, and dump it in: one step. If you need to find a value given its key, find the bucket using its hashcode, and return the value.
There are two big problems with hashtables.
First, everything is dependent on the quality of your hash function. If your hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it’s no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does pretty well – but it’s a surprisingly tricky thing to get right.
Second, the table really needs to be big relative to the number of elements that you expect to store in it. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn’t going to be very good. (In fact, it’ll be slightly worse than just using a binary search tree.)
So what makes a good hash function? There are a bunch of things to consider:
- The hash function must be deterministic: calling the hash on the same key value must always produce the same result. If you’re writing a Python program like the one I used as an example above, and you use the values of the key object’s fields to compute the hash, then changing the key object’s fields will change the hashcode – and the table will no longer be able to find the value it stored under the old hashcode!
- The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 – so the last two bits are always zero. If you did something simple like just take the address modulo the table size (with a table size that’s itself a multiple of four), then all of the buckets whose numbers weren’t divisible by four would always be empty. That would be bad.
- The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the values into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106. (A rough sketch of this kind of experiment follows the list.)
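Here’s a rough sketch of that kind of experiment. I’m not claiming these are the exact strings or numbers from my test – the 3125 five-character strings over a five-letter alphabet are just a convenient stand-in – but the contrast between the two hash functions shows up the same way (djb2 itself is defined a bit further down):

import itertools
from collections import Counter

def xor_hash(key):
    # A deliberately bad hash: xor the characters together, so the result
    # stays in a tiny range no matter how long the key is.
    h = 0
    for c in key:
        h ^= ord(c)
    return h

# A stand-in data set: all 3125 five-character strings over a five-letter alphabet.
keys = ["".join(t) for t in itertools.product("abcde", repeat=5)]

buckets = Counter(xor_hash(k) % 31 for k in keys)
print("buckets used:", len(buckets), "out of 31")
print("smallest bucket:", min(buckets.values()), "largest bucket:", max(buckets.values()))
# Swap djb2 (defined below) in for xor_hash and the spread becomes far more even.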
So what does a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It’s an interesting case, because no one is really entirely sure why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, back when that was a big deal. Here’s djb2 in Python:
def djb2(key):
    hash = 5381
    for c in key:
        hash = (hash * 33) + ord(c)
    return hash
What the heck is it doing? Why 5381? Because it’s prime, and it works pretty well. Why 33? No clue.
Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, in the Go language standard library, the map type is implemented using a red-black tree.) Why?
In practice, it’s rarely much faster to use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they’re not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.
Of course, hash tables still get used constantly. The most commonly used data structures in Java code include, without a doubt, HashMap and HashSet, which are both built on hashtables. You usually don’t have to implement them yourself, and system libraries usually provide a good default hash function for strings, so you’re generally safe.
There’s also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that’s a subject for another time.
In regards to the number 33:
“A good property for a hash function is if you keep feeding it a constant character, it performs as if it’s a random number generator. The hash function f(h,c) = 33*h + c is derived from a class of random number generators called linear congruential. At the start of Knuth Vol 2, there is a good discussion of these generators.” …
https://groups.google.com/forum/#!original/comp.lang.c/lSKWXiuNOAk/faLWVammoEsJ
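You can see what the quote is getting at with a quick sketch: feed a constant character through the recurrence and watch where the values land in a power-of-two-sized table (the 1024 here is just an arbitrary choice):

# Feed the constant character 'a' through h -> 33*h + c and look at which
# bucket each successive value would land in, for a 1024-bucket table.
h = 5381
seen = set()
first_ten = []
for i in range(1024):
    h = (h * 33) + ord('a')
    bucket = h % 1024
    seen.add(bucket)
    if i < 10:
        first_ten.append(bucket)
print(first_ten)   # the indices bounce around rather than clustering
print(len(seen))   # with this multiplier and increment, every bucket gets hit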
Hash tables are also ephemeral data structures – if I have a tree, I can construct a new tree that shares large chunks of the old one, and both the new and old trees remain accessible. That kind of sharing is more common in functional languages, because you can share freely without worrying about the structure being updated elsewhere.
Also, it should be noted that hash tables might be unsuitable for use-cases where hostile input can be expected. A bad hash function might allow an attacker to degrade performance to denial-of-service level by artificially causing hash collisions.
I didn’t notice this comment before I left my other comment below. There is a simple way to guard against this: generate a hash function at random every time you start up.
This seems to go against all the advice that you’ve been told about choosing (and testing) hash functions carefully, however, that advice comes from an era where arithmetic was expensive compared to memory accesses. Back then, hash functions used a small number of operations, then did a final remainder-by-a-prime to cover up the fact that the hash function was probably quite poor.
Today, best practice is to use hash tables that are a power of two in size to avoid the division, and spend a few more cycles on good-quality hash functions instead. Hash functions based on look-up tables of randomly generated numbers are popular if you need to guard against hash-based DoS attacks.
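To make that concrete, here’s a rough sketch of what “generate a hash function at random every time you start up” might look like, plugged into the toy Hashtable from the post. This is just the idea – a random multiplier and starting value chosen at process startup – not what any real runtime does, and not a hardened defence on its own:

import random

def make_random_hashfun():
    # Pick a random odd multiplier and a random starting value at startup,
    # so an attacker can't predict in advance which keys will collide.
    multiplier = random.getrandbits(64) | 1
    start = random.getrandbits(64)

    def hashfun(key):
        h = start
        for c in key:
            h = (h * multiplier + ord(c)) & 0xFFFFFFFFFFFFFFFF  # keep it to 64 bits
        return h

    return hashfun

# Each process gets its own, unpredictable hash function
# (and a power-of-two table size, as suggested above):
table = Hashtable(make_random_hashfun(), 1024)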
I wouldn’t say that hash tables are being “left behind”, they’re a vital part of many modern languages. I mean, Python wouldn’t be Python without a dictionary, and that’s a hash table. I would hazard a guess that it is probably the most common data structure for associative arrays (with a few notable exceptions, like map in the C++ STL).
Also, I’d add one thing: one of the reasons you might prefer trees over hash tables is that it’s much easier to write objects that conform to their requirements. Like you said, writing a hash function can be really tricky, but trees require only a comparison function (a “less than” function that imposes a total order on all possible objects) and a test for equality. Much easier to write than a hash function.
Hash tables are still very popular in the world of compressed data structures, because they have nice theoretical and practical properties, especially with modern forms of the technique.
For example, you can get arbitrarily close to an “idealised” hash table using cuckoo hashing. It’s idealised in the sense that you can store k elements in an array of size k and retain constant-time lookup.
But it gets better. Suppose that you have 2^m distinct integers, each of which are n-bits long. There are 2^n choose 2^m possible such sets of integers, so assuming nothing else about the set, this should require log(2^n choose 2^m) bits to store.
Assuming that 2^m is small compared with 2^n, a little arithmetic and Stirling’s approximation shows:
log(2^n choose 2^m) = 2^m (n – m) + O(low order terms)
Here, I’m using the imprecise notation f(n) + O(low order terms) to mean f(n) + o(f(n)). The relative error between the actual function and f(n) gets arbitrarily close to zero as n increases.
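Spelling that out (in LaTeX), using the standard estimate \binom{N}{K} \approx (Ne/K)^K for K \ll N, which follows from Stirling’s formula:

\log_2 \binom{2^n}{2^m}
  \approx 2^m \log_2\!\left(\frac{2^n e}{2^m}\right)
  = 2^m \left(n - m + \log_2 e\right)
  = 2^m (n - m) + O(2^m)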
So an “optimal” representation is to store 2^m integers, each of them (n-m) bits long.
Using cuckoo hashing, we can pack a hash table with arbitrarily small wastage. So we could basically store all 2^m integers in a hash table with 2^m slots. However, using traditional hashing, we would need n bits in each slot.
However, this representation wastes information. For some key x, we compute a hash value h(x), then use m of those bits to determine which slots to put the key in. So the location in the table actually gives us information about the key! We can recover this information if we use a hash function that’s invertible. We use m bits of the hash value to determine the slot, and only store the other (n-m) bits.
So that’s 2^m entries, at (n-m) bits per entry. We’re done!
The only thing remaining is to come up with an invertible hash function. Thankfully, there are easy ways to make them out of non-invertible hash functions by raiding the theory of ciphers. Feistel networks do the trick nicely.
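To make that concrete, here’s a minimal sketch of a Feistel network in Python (assuming an even bit width). The round function here is a throwaway toy – the point is only that the construction is invertible no matter what F is:

def feistel(x, n_bits, rounds=4, inverse=False):
    # Minimal Feistel network over n_bits-wide integers (n_bits assumed even).
    # Whatever the round function F does, the overall mapping is a permutation,
    # and passing inverse=True undoes it.
    half = n_bits // 2
    mask = (1 << half) - 1

    def F(r, value):
        # Toy, non-invertible round function; any deterministic hash works here.
        return (value * 2654435761 + r) & mask

    L, R = (x >> half) & mask, x & mask
    if not inverse:
        for r in range(rounds):
            L, R = R, L ^ F(r, R)
    else:
        for r in reversed(range(rounds)):
            L, R = R ^ F(r, L), L
    return (L << half) | R

# Round-trip check: the permutation is invertible even though F isn't.
y = feistel(0xDEADBEEF, 32)
assert feistel(y, 32, inverse=True) == 0xDEADBEEF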
“The hash function must be idempotent” – I think you mean “deterministic”.
Yup. Stupid word confusion on my part. Thanks for the catch!
One important caveat is that it only has to be deterministic for the lifetime of the data structure.
One vector for a denial of service attack is to send a hash table carefully crafted data which all maps to the same hash table slot. To guard against this, some security-conscious applications compute a new randomised hash function every time they start.
Another operation that is usually needed (and provided) is iterating over the keys, values, and/or key/value pairs.
In a true hash table, this is a very inefficient operation. Many languages provide a built in map datatype based on a binary tree for the keys. Look-ups are O(log n) instead of O(1), but there is an obvious way to get all key/value pairs in a consistent order in O(n). As the post points out, this is frequently a good tradeoff, but not a true hash table.
The operation that is inefficient in a true hash table is iterating over keys/values in a canonical order. If any old order will do (as it frequently does), then it’s efficient with no extra data structures.
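For the toy Hashtable in the post, “any old order” iteration is just a walk over the buckets – something like this hypothetical items() method:

    def items(self):
        # Yield every (key, value) pair, bucket by bucket. The order depends on
        # the hash function and table size, not on insertion order or key order.
        for bucket in self._table:
            for k, v in bucket:
                yield (k, v)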
I’ll just chime in here to mention that I noticed something interesting the other day.
Standard Big-O analysis of algorithm performance is becoming outdated. The standard analysis is based on counting “operations” and assuming that this is at least roughly proportional to the execution cost of the algorithm, then analyzing algorithms to see how the number of “operations” scales. However, these days the performance of many (perhaps most) computer programs is not dominated by the number of operations that they perform, but by the number of times that they access new pages of memory.
Anyway, the interesting thing that I noticed the other day was that Python is changing their implementation of hash tables. They are still using the same hash buckets, but their solution to “what if two things wind up in the same bucket?” (called “collision resolution” in hash-table jargon) has changed. It used to be that they would try what was essentially a series of different hash algorithms, resulting in random probes around the data structure looking for an empty slot. Now, they start by searching a series of nearby slots (which is quick because it is all on the same memory page), and after that they revert to the random probes.
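For illustration, here’s a sketch of the general shape of that kind of probe sequence – a few adjacent slots first, then pseudo-random probing. It’s modelled loosely on the scheme described above, not a copy of CPython’s actual code:

def probe_sequence(hashvalue, table_size, nearby=8):
    # Sketch of the idea: first try a handful of adjacent slots (likely on the
    # same cache line / memory page), then fall back to pseudo-random probing
    # around the table. The caller stops at the first suitable slot.
    slot = hashvalue % table_size
    for i in range(nearby):
        yield (slot + i) % table_size
    perturb = hashvalue
    while True:
        slot = (5 * slot + perturb + 1) % table_size
        perturb >>= 5
        yield slot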
I suspect that lots of other common algorithms and a good deal of established wisdom needs to be reviewed in light of modern computer architecture.
Not really, no. I don’t think there ever was such a thing as a “standard” big-O analysis. Remember, it was always relative to a machine model. Something that’s O(log N) on a pointer machine is typically not O(log N) on a Turing machine. What’s changed is that the machine models have changed as physical hardware has changed.
For example, the word RAM model is very popular these days. In this model, if the problem size is N, then we assume that memory is divided into units (“words”) of size Θ(log N) bits. That is, the size of the problem can be stored in a constant number of words. Any deterministic operation which takes a word and returns a word is assumed to take constant time. Only memory accesses need to be counted for the purpose of determining time complexity.
Another popular model is the idealised cache model, which is an extension of the word RAM model. In this model, RAM is further split into “lines” of L words each. There is a fully-associative cache which holds Ω(L²) words (called the tall cache assumption). The cache complexity is defined as the minimum number of I/O operations (i.e. operations which transfer lines between cache and RAM) required to execute the algorithm (i.e. the cache replacement policy is assumed to be optimal).
I agree with Maik. “Idempotent” would mean that hash(x) = hash(hash(x))… generally not a desired property in a hash function. 🙂
“def get(self, key, value):” should be “def put(self, key, value):” in the code.
“Why 33?” Looking for a good numerical spread over a range, 33=11*3, both prime (and beneficially, not a power of 2). 35 would probably work out pretty good also, being 7*5, and if you’re looking at an equal spread of numbers, upper and lowercase characters, it presents a slightly better distribution (it would never be a truly equal spread, more weight would likely be on the characters). 21 (7*3) should work well also, and would have no gaps given an even distribution of numbers, caps and lowercase. Only a guess, though — I have no idea how he actually arrived at 33 for his number, but I’m sure those factors (pun unintended) worked into it.
I think you missed an important part of hash tables: that the hash function (a) need not map different key values to different hash values; and (b) that, for all practical domains, it *will not* always map different key values to different hash values.
Ad (a), many people somehow believe this. There was even a bug in a commercial .Net GUI library where hash values were used to identify GUI controls because the programmers didn’t understand this. In my job (as a software engineer/architect), I have to explain roughly once a year to someone that hash(x) = 17 is a perfectly valid hash function – and it is even useful, namely to test algorithms using hash functions (the GUI bug mentioned above would have been found quickly with it).
Ad (b), the pigeonhole principle will map at least two different keys to the same hash value if the number of possible keys is larger than the number of possible hash values – e.g., if keys are strings over an alphabet of at least two symbols, with lengths larger than 32, and hash values are 32-bit integers. Maybe this would deserve another of your splendid explanations …
Nice article. Small typo. Shouldn’t
def get(self, key, value):
self._table[self.hash(key)].append((key, value))
be:
def set(self, key, value):
self._table[self.hash(key)].append((key, value))
I don’t believe that maps in Go are implemented using a red-black tree. Every resource I’ve been able to find shows that a map is a hash table.
E.g.
https://blog.golang.org/go-maps-in-action