10.4. Hash Tables

We have seen that balanced binary search trees can be used to implement a table where each operation uses O(log2(n)) time in the worst case, where there are n entries in the table when the operation is performed. On this page we look at an alternative way to implement a table that is not as good as binary search trees in the worst case but that is better in the average case.


Hash functions

The starting point is the idea of a hash function, which takes a key and yields a nonnegative integer. For example, if a key is a string and h is a hash function then you might find that h("frog") = 3144099.

Suppose that you choose a number from 0 to n−1 at random, each value equally likely. Then you do the same thing again, independently of the first choice, yielding a second number. The probability of choosing the same value twice is 1/n: whatever the first number turns out to be, the second one equals it with probability 1/n.

Ideally, hash functions should behave similarly. For any given value n, which you can think of as the number of entries in the table, the probability that two different keys k1 and k2 satisfy h(k1) mod n = h(k2) mod n should be close to 1/n, as if the hash values had been chosen at random.

A hash function is not really random at all, and no choice of hash function can always behave in a perfectly random way. But some hash functions do very well in practice at approximating what a truly random function would do.

Strings are among the most common kinds of keys, so let's look at finding a hash function for strings. One idea is to get the integer values of the characters in the string and to add them up. For example, 'c' = 99, 'a' = 97 and 't' = 116, so this hash function would yield 99 + 97 + 116 = 312 for "cat". That is a simple hash function, but it is not very good. For example, it yields the same value for "act" and "sad" as for "cat".
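
To see the problem concretely, here is a minimal sketch of that additive hash. (The name addhash is just for illustration; it is not part of any library.)

  // A sketch of the additive hash: add up the character codes of a
  // null-terminated string.  It is simple, but anagrams such as
  // "cat" and "act" always collide, and unrelated strings such as
  // "sad" can collide too.
  unsigned int addhash(const char* str)
  {
    unsigned int sum = 0;
    for(const char* p = str; *p != '\0'; p++)
    {
      sum = sum + (unsigned char)(*p);
    }
    return sum;
  }

  // addhash("cat"), addhash("act") and addhash("sad") all yield 312.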

A more sophisticated hash function for strings should try to keep more information about the characters in the string. It should certainly depend on the order of the characters. If a string can only contain 26 letters then you can think of it as representing a number in base 26, each character being a digit. But then the hash values can get very large, and that is impractical. Taking into account upper and lower case letters, digits and special symbols, there might be about 80 possible characters instead of 26, so a string represents a number in base 80, which makes the hash values even larger.
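
As a rough sketch of how quickly those numbers grow, here is the base-26 encoding with 'a' treated as 0 through 'z' as 25. (The name base26 is just for illustration.) A string of just 7 letters can already overflow a 32-bit integer, since 26 to the 7th power is about 8 billion.

  // A sketch of the base-26 idea, treating 'a' as 0, ..., 'z' as 25.
  // Even the 64-bit result used here overflows once the string gets
  // long enough.
  unsigned long long base26(const char* str)
  {
    unsigned long long value = 0;
    for(const char* p = str; *p != '\0'; p++)
    {
      value = value * 26 + (unsigned long long)(*p - 'a');
    }
    return value;
  }

  // base26("cat") = (2*26 + 0)*26 + 19 = 1371.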

One way to reduce the size of the number is only to look at the first or last 4 characters. But then the hash function does not depend on all of the characters in the string, and it is not a good hash function.

The following hash function is a practical one based partly on the idea of treating a string like a number. It yields a nonnegative 32-bit integer result. As it sees each new character, it shifts the result to the left by 4 bits and adds in the character, roughly treating each character as a digit in base 16. But the function does not only look at the low-order 4 bits of a character: it adds in the entire character, so the 'digits' of this base-16 number can range from 0 to 255.

But there is an important modification of the basic idea. Any part of the result that would be shifted off the left end is brought back around to the right end and combined with the result, so that the result depends on all of the characters, not just the last few. This hash function is the one used for symbol tables in ELF binaries.

  //=============================================
  //              strhash
  //=============================================
  // strhash is a hash function for strings.
  // Parameter str is a null-terminated string.
  //=============================================

  unsigned int strhash(const char* str)
  {
    const char*  p;
    unsigned int g;
    unsigned int h = 0;

    for(p = str; *p != '\0'; p++)
    {
      // Shift the result left by 4 bits and add in the next
      // character, treated as an unsigned value from 0 to 255.
      h = (h << 4) + (unsigned char)(*p);

      // If anything would be shifted off the left end, fold it
      // back in near the right end, then clear those top 4 bits
      // so they do not accumulate.
      g = h & 0xF0000000;
      if(g != 0)
      {
        h = h ^ (g >> 24);
      }
      h = h & ~g;
    }
    return h;
  }

Hash tables

The fundamental idea of a hash table is to store entries in an array. If the physical size of the array is n, and h is a hash function, then an entry with key k is stored at index h(k) mod n. Then, looking up key k just involves computing h(k) mod n again and looking at that index in the array. It is very fast.
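
For example, using strhash from above, the index for a key could be computed as follows. (The helper name indexFor is just for illustration.)

  // A sketch of computing the array index for a key, given the
  // physical array size n (assumed to be positive).  Because strhash
  // returns an unsigned value, the result is always in 0 to n-1.
  unsigned int indexFor(const char* key, int n)
  {
    return strhash(key) % (unsigned int) n;
  }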

Unfortunately, there is a catch. It is possible for h(k1) mod n and h(k2) mod n to be the same for two different keys k1 and k2. That is called a collision.

There are several ways of dealing with collisions. Here, we look at just one of those ways, called chaining.


Dealing with collisions by chaining

One way to handle collisions is to store a linked list of values at each index in the array. All keys that are supposed to be stored at index i are put in the linked list at index i.
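
As a minimal sketch of that layout, each array slot can hold the head of a linked list of (key, value) pairs, and an insertion links a new cell in at the front of the list at the key's index. (The type and function names here are hypothetical; the exercises below use similar but not identical types.)

  // A sketch of chaining.  Each slot of the array holds the head
  // of a linked list of (key, value) pairs, or NULL if the list
  // is empty.
  struct Node
  {
    const char* key;
    const char* value;
    Node*       next;
  };

  // buckets is an array of n list heads.  This sketch simply adds
  // the new pair at the front of its list; it does not check
  // whether the key is already present, and it stores the pointers
  // as given rather than copying the strings.
  void chainInsert(Node** buckets, int n, const char* key, const char* value)
  {
    unsigned int i = strhash(key) % (unsigned int) n;
    Node* cell  = new Node;
    cell->key   = key;
    cell->value = value;
    cell->next  = buckets[i];
    buckets[i]  = cell;
  }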

For example, suppose that the physical size of the array is 5. Imagine that we want to store (key,value) pairs for keys "cat", "dog", "frog", "goat" and "horse". Just for illustration, imagine that the hash function yields the following values. (They are unrealistically small, but they keep the arithmetic easy to follow.)

h("cat") = 63
h("dog") = 35
h("frog") = 79
h("goat") = 72
h("horse") = 117

Since the physical size is 5, "cat" goes to index 63 mod 5 = 3, "dog" to index 0, "frog" to index 4, and "goat" and "horse" both go to index 2, so they collide. The hash table stores the keys as follows. (Imagine that the value is stored underneath the key, like two playing cards on top of one another; the value is there but you do not see it.)

       ------      
      |      |      ---------
      |      |     |   ---------\
    0 |  --------->|---------|
      |      |     | "dog"   |
      |      |      ---------
      |------|
      |      |      
      |      |
    1 |  -------\
      |      |      
      |      |
      |------|     
      |      |      ---------         ---------
      |      |     |   ------------->|   -------------\
    2 |  --------->|---------|       |---------|
      |      |     | "horse" |       | "goat"  |
      |      |      ---------         ---------
      |------|      
      |      |      ---------
      |      |     |   ------------\
    3 |  --------->|---------|
      |      |     | "cat"   |
      |      |      ---------
      |------|
      |      |      ---------
      |      |     |   -----------\
    4 |  --------->|---------|
      |      |     | "frog"  |
      |      |      ---------
       ------

To look up a value, just compute the index where it belongs and do a linear search of the linked list found at that index.
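
Continuing the sketch above, a lookup might look like this. (It uses strcmp from <cstring> to compare keys, and returns NULL when the key is not present.)

  // A sketch of lookup with chaining, using the hypothetical Node
  // type from the earlier sketch.
  const char* chainLookup(Node** buckets, int n, const char* key)
  {
    unsigned int i = strhash(key) % (unsigned int) n;
    for(Node* p = buckets[i]; p != NULL; p = p->next)
    {
      if(strcmp(p->key, key) == 0)   // found the key in this cell
      {
        return p->value;
      }
    }
    return NULL;                     // the key is not in the table
  }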


Analysis and managing the physical array size

For hash tables to be efficient, the linked lists need to be short, on the average. If the physical array size is too small, the lists will be long. If the physical size is too large, there are a lot of empty lists, and that wastes memory.

A good compromise is to choose the physical size of the array to be approximately the same as the number of entries in the table, since then the average list length is about 1. (That is a little optimistic because some of the linked lists will be empty, and those lists are less commonly searched than the nonempty lists. The average length of the nonempty lists is a little more than 1, but it is still a small constant.)

So as long as the physical size of the array is kept very roughly the same as the number of entries, insertion, deletion and lookup in a hash table take constant time, on the average. The average cost does not depend on the number of entries! That is extraordinarily fast. As you can imagine, hash tables are very popular, especially for large databases.

We have seen how to reallocate an array. If the number of entries stays within a factor of 2 of the physical size, then hash tables work well. As the hash table grows, you move its entries into a larger array. When it shrinks a lot, you move them into a smaller array.
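
Here is a minimal sketch of that rehashing step, again using the hypothetical Node type from above. Every entry is re-linked into a new array, because its index depends on the new physical size.

  // A sketch of moving a chained hash table into an array of a
  // different physical size.  Each cell's index is recomputed
  // because it depends on the physical size.  The caller is
  // responsible for recording the new size.
  Node** rehash(Node** oldBuckets, int oldSize, int newSize)
  {
    Node** newBuckets = new Node*[newSize];
    for(int i = 0; i < newSize; i++)
    {
      newBuckets[i] = NULL;
    }
    for(int i = 0; i < oldSize; i++)
    {
      Node* p = oldBuckets[i];
      while(p != NULL)
      {
        Node* rest = p->next;          // remember the rest of the old list
        unsigned int j = strhash(p->key) % (unsigned int) newSize;
        p->next       = newBuckets[j]; // link this cell into its new list
        newBuckets[j] = p;
        p = rest;
      }
    }
    delete[] oldBuckets;
    return newBuckets;
  }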


Exercises

  1. How much time does each operation on a hash table take in the worst case? Answer

  2. Suppose the following types are used for a hash table whose keys and values are both strings, of type string. (That is convenient because memory management will be handled automatically via reference counts.)

      //==========================================
      //               ListCell
      //==========================================
    
      struct ListCell
      {
        string    key;
        string    value;
        ListCell* next;
    
        ListCell(string k, string v, ListCell* nx)
        {
          key   = k;
          value = v;
          next  = nx;
        }
      };
    
      //==========================================
      //               HashTable
      //==========================================
    
      struct HashTable
      {
        ListCell** A;        // the array; each slot is the head of a linked list
        int        load;     // number of entries currently in the table
        int        size;     // physical size of the array A

        HashTable(int n)
        {
          load = 0;
          size = n;
          A    = new ListCell*[n];
          for(int i = 0; i < n; i++)
          {
            A[i] = NULL;
          }
        }
      };
    
    Write a definition of lookup(x, T), which returns the value associated with key x in hash table T, or returns an empty string if x does not occur in T. Assume that the hash function is strhash. Answer

  3. Using the same types as in the previous exercise, write a definition of insert(x, v, T), which inserts key x with associated value v into hash table T. If there is already an entry for key x, then the value associated with x should be replaced by v.

    If the load becomes more than twice the physical size, then the array should be reallocated so that its physical size equals the load. Be careful with this step. The index where a given key goes depends on the physical size of the array. If you change the physical size, you need to remember that keys might need to move to different indices.

    Answer