The hash table belongs to the family of abstract data types known as tables or dictionaries. In a table, information is stored in (key, value) pairs, in which the key is the 'handle' used to get to the useful data stored as the value.
If the keys obey certain criteria, this functionality is very easy to achieve. All that is required is to use the key's value to directly specify a cell in a simple vector.
For example, if you wanted to store the numbers of wheels for different vehicles:
        ----------------------------
keys:   |    2    |    3     |  4  |
        ----------------------------
values: | bicycle | tricycle | car |
        ----------------------------
So that get(3) would return "tricycle" as required. This operation would only require a single lookup into an array, making the entire process O(1).
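As a minimal sketch of this (in Python; the names and the tiny capacity are mine, purely for illustration), direct addressing could look like:

    # Direct addressing: the key itself indexes the vector.
    CAPACITY = 5
    values = [None] * CAPACITY

    def set_value(key, value):
        values[key] = value          # one array write: O(1)

    def get_value(key):
        return values[key]           # one array read: O(1)

    set_value(2, "bicycle")
    set_value(3, "tricycle")
    set_value(4, "car")
    print(get_value(3))              # prints: tricycle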
However, this implementation makes some massive assumptions about the keys that are unreasonable in the real world.
The criteria that need to be met are:
1. The keys must be non-negative integers, so that they can index an array at all.
2. The keys must be unique, so that no two pairs compete for the same cell.
3. The keys must fall within a small, known range, so that the vector stays a sensible size.
These requirements are so restrictive as to make this implementation impractical in all but a very limited set of cases.
An alternative approach keeps the pairs in a vector, sorted according to the keys' values. This means the keys can be any comparable type - far less strict a requirement than those detailed above.
If the inverse of the above table were to be created, the sorted vector would look like this:
        ----------------------------
keys:   | bicycle | car | tricycle |
        ----------------------------
values: |    2    |  4  |    3     |
        ----------------------------
So that get("tricycle") returns the value 3, as required.
This lookup would be performed using a binary chop on the vector, making the operation O(log (n)).
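As a sketch of this approach (Python, using the standard library's bisect module for the binary chop; the two parallel lists are just the table above):

    import bisect

    # (key, value) pairs held in a vector, sorted by key.
    keys   = ["bicycle", "car", "tricycle"]
    values = [2, 4, 3]

    def get(key):
        i = bisect.bisect_left(keys, key)    # binary chop: O(log n)
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)

    print(get("tricycle"))                   # prints: 3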
However, hash tables boast even better performance than this, through a totally different method: where sorted arrays rely on order, a hash table relies on chaos.
The reason hash tables work so well is that the key is converted, simply and neatly, from any sort of object into an integer obeying the three requirements listed before.
To do this, the key is fed into a hash function, which returns an integer that is effectively unique to that key. This integer is then used to index an array directly, as in the example outlined above. The beauty of this system is that the hash function is very cheap to compute, as is the array lookup, so a get(key) or set(key, value) operation is O(1) on average!
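For illustration, here is a toy string hash of my own devising (a simple multiply-and-add scheme; real hash functions are designed far more carefully, but the principle is the same):

    def toy_hash(key):
        # Fold each character into an accumulator: cheap to compute,
        # and small changes to the key give very different integers.
        h = 0
        for ch in key:
            h = h * 31 + ord(ch)
        return h

    print(toy_hash("bicycle"))    # a large integer, always the same for this key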
If a hash table is being used to map the type of vehicle onto its number of wheels spelled out as a word ("two", "three" and so on), this is how it would look:
                 (key, value)
                      |
                      |
                 Hash Function
                      |
                      |
            -------------------
            |        |        |
            V        V        V
        -----------------------------
keys:   | bicycle | tricycle | car  |
        -----------------------------
values: |   two   |  three   | four |
        -----------------------------
When an operation is called, the key is converted by the hash function into a hash value, which is then used to access a cell, or bucket, in the vector - all in O(1) time (on average).
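Putting the pieces together (and ignoring collisions until the next section), a stripped-down sketch reusing the toy_hash above might read:

    CAPACITY = 8
    buckets = [None] * CAPACITY

    def set_item(key, value):
        # Hash, then reduce modulo the table size to pick a bucket.
        buckets[toy_hash(key) % CAPACITY] = (key, value)

    def get_item(key):
        entry = buckets[toy_hash(key) % CAPACITY]
        if entry is not None and entry[0] == key:
            return entry[1]
        raise KeyError(key)

    set_item("bicycle", "two")
    print(get_item("bicycle"))    # prints: two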
However, this is a much simplified overview of the workings of a hash table. There are a few 'gotchas' and glitches which need to be sorted out. I will look at one here.
Collisions
One problem is that of an imperfect hash function. In reality, an infinite number of different keys can map to the same hash value, and something must be done to cope with the situation where two different keys produce the same result from the hash function. This is called a collision.
There are two different methods employed here, giving rise to two types of hash table:
Open hash tables don't simply store the (key, value) pairs in an array; instead, they store an array of pointers, each leading to a linked list. Each element in the list contains a (key, value) pair and a pointer to the next list element.
The hash table's array of pointers:
        -------------------------------
    ... | pointer | pointer | pointer | ...
        -------------------------------
                      |
                      |
                      V
        ----------------------------
        | key1 | value1 | pointer1 |
        ----------------------------
                      |
                      |
                      V
        ----------------------------
        | key2 | value2 | pointer2 |
        ----------------------------
                      |
                      |
                      V
                     null
When an operation is performed, the pointer is followed into the linked list, which is searched linearly for the relevant (key, value) pair.
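A minimal sketch of an open hash table in the same style (Python lists stand in for the hand-rolled linked lists, which changes the memory layout but not the algorithm; toy_hash is the function from earlier):

    CAPACITY = 8
    table = [[] for _ in range(CAPACITY)]    # one chain per bucket

    def set_item(key, value):
        chain = table[toy_hash(key) % CAPACITY]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)      # key already present: overwrite
                return
        chain.append((key, value))           # collision: the chain just grows

    def get_item(key):
        chain = table[toy_hash(key) % CAPACITY]
        for k, v in chain:                   # linear search of the chain
            if k == key:
                return v
        raise KeyError(key)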
This approach has the advantage that the hash table has no maximum capacity. As more and more entries are added, and more and more collisions occur, the linked lists simply grow. Performance degrades, but the hash table won't die.
Closed hash tables, by contrast, have a fixed size, but offer better performance when sparsely filled.
Instead of keeping a list of pairs for each bucket, when a collision occurs some algorithm is employed to work out an alternative bucket for the key. One simple, and quite inefficient, method is to traverse the vector linearly until a free bucket is found and use that one; i.e., if a bucket is already full, the next empty one is used instead. This is known as linear probing.
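A sketch of that linear scheme (again reusing toy_hash; a production version would also need to cope with deletion, which is famously awkward with probing):

    CAPACITY = 8
    slots = [None] * CAPACITY

    def set_item(key, value):
        i = toy_hash(key) % CAPACITY
        for _ in range(CAPACITY):            # probe each slot at most once
            if slots[i] is None or slots[i][0] == key:
                slots[i] = (key, value)
                return
            i = (i + 1) % CAPACITY           # bucket taken: try the next one
        raise RuntimeError("table full")     # fixed size: it can saturate

    def get_item(key):
        i = toy_hash(key) % CAPACITY
        for _ in range(CAPACITY):
            if slots[i] is None:
                raise KeyError(key)          # an empty slot ends the probe
            if slots[i][0] == key:
                return slots[i][1]
            i = (i + 1) % CAPACITY
        raise KeyError(key)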
If a malicious user or cracker learns what your hash function is, the pathological case of both the open and the closed hash table can easily be created by constructing keys that all hash to the same value. If this happens, operations are O(n) instead of O(1): a disastrous performance hit, which could have massive consequences for the effectiveness of the entire system.
As open hash tables offer better performance on average, and are guaranteed not to saturate, they are chosen more often than closed hash tables.
The Uses of Hash Tables
Although the general idea behind hash tables is simple, there are a number of tricky implementation details that must be adequately handled to achieve an acceptable level of robustness. For example: What hash function should be used? How should the size of the table be changed dynamically to strike a balance between size and speed? Open or closed? If closed, which collision recovery algorithm?
Fortunately, implementations now exist for almost all mainstream languages - Java, C++, C, Perl and Python, to name a few - so their use is increasing.
Why use hash tables?
There exists a common, erroneous attitude that because modern computers are so fast, high-performance data structures and algorithms are becoming unnecessary. However, if one program runs 10 times faster than another on an 8086, it will probably run 10 times faster on a modern CPU. The absolute difference is smaller, but the relative speed remains fairly constant. And as modern processing jobs are much larger than the jobs 8086s were expected to do, the relative speed of a bad program compared to a good one is just as relevant now as it was back then.
For this reason, the most efficient data structures and algorithms should be used in all but the simplest cases.
Although hash tables boast O(1) operations, this does not mean they are as fast as plain O(1) array lookups, or even some small O(log(n)) sorted vector or binary tree lookups. The invisible constant factor in big-O notation is increased by the need to compute the hash function every time an operation is performed. For reasons discussed below, when the pattern and type of use of a data structure is well known at design time, it may well be that slightly more exotic data structures offer better performance. That said, a well implemented hash table guarantees excellent scalability, relaxed key requirements and, above all, efficiency. They are a good choice for a general-purpose data structure.
Why not use hash tables?
As mentioned above, if the data structure is going to be used for a very well defined and specific task, choosing a hash table could be the wrong way to go. For example, when storing a large set of words (as an OCR program might), ternary search trees typically offer better performance than hash tables.
Thanks to pfft for this point:
Another problem with hash tables is due to the pattern in which they store data in memory or on disk.
It has been known for a while that, very often, data is accessed in long, contiguous stretches. Think how many times you have looped along the length of an array, performing a task on each element!
For this reason, computer designers now use caching to speed up reads and writes to areas of memory and disk which are close to recent accesses. Caching costs CPU time and bus bandwidth, but in the long run, under normal usage patterns, it vastly improves performance.
With hash tables, too, people will often iterate through an entire table, accessing each element. However, due to the randomizing effect of the hash function, sequential keys will be scattered all over the system - numerous caches, memory, hard disk. As a result, even though sequential keys are being accessed, just as the designers expected, the memory accesses aren't sequential, so not only is caching useless, it is expensive!
To combat this problem, the hash table would have to be combined with something like dynamic hashing to reduce disk reads. Otherwise, some other data structure in which items are kept in order could be used, such as 2-3-4 trees, heaps or splay trees.
NB This writeup describes open and closed hash tables the other way around from some other sources. This reflects the fact that there is no standard terminology for which means what.
Thanks to pfft and isogolem for informed advice.