Computer Science 3300
Section 001
Fall 2005
Programming Assignment 2

Due dates:
First solution: Friday, Oct 7, 11:59pm
Second solution: Monday, Oct 17, 11:59pm

This is mostly an exercise in the use of dynamic memory allocation.

START EARLY ON THIS ASSIGNMENT. DO NOT WAIT.

If you experience difficulty, ask for help early. Do not wait until the last minute to ask for help.

Try to turn in a high quality solution for your first version, so that the second version is just touch up. Do not skip the first version.


Hash tables

A table might be used in a program to implement something like a telephone book. The table stores a collection of pairs (key,value) giving a key and an associated value. You can look up a particular key, and get the value that is associated with it. For a telephone book, the key would be a name, and the value would be a telephone number.

From a conceptual viewpoint, a table looks like the following example.

Key Value
"John" "555-3545"
"Mindy" "555-6921"

If you ask for the value associated with key "John", you get value "555-3545". If you ask for the value associated with "Max", you are told (somehow) that "Max" is not in the table.

In order to use tables in a program, you need a way of storing them. Hash tables are a very efficient way of doing that. We start with a function h(k), called a hash function, which takes a key and produces a nonnegative integer. To store the information in the table, we store an array A, of some size s, with the keys and associated values kept in it. A pair with key k is stored (preferably) in the array at index h(k) mod s.

For example, suppose that h("John") = 20 and h("Mindy") = 42. Suppose that we store our table above in a hash table of size 5. Then key "John" would be stored in A[20 mod 5] (that is, in A[0]), and "Mindy" would be stored in A[42 mod 5] (that is, in A[2]). The array would look like this.

Index Key Value
0 "John" "555-3545"
1 - -
2 "Mindy" "555-6921"
3 - -
4 - -

The vacant cells have a NULL pointer (indicated by -) for the key and for the value.

Collisions

As long as all of the keys in the table have different hash values, hash tables are very simple and efficient. To find key k, you look in A[h(k) mod s]. Unfortunately, things are not always this nice. Sometimes, two different keys have the same hash value. In that case, what we do is put a key in the first empty array cell after the one where it prefers to be. If we hit the end of the array, we wrap around back to the beginning.

If you have a decent hash function, and do not try to fill the table too full (as a rule of thumb, don't use more than about 2/3 of the cells), then hash tables are very efficient - more efficient than other approaches, on the average.

A more careful picture of a hash table

In the previous diagrams, hash tables are shown as holding the key and value strings inside them. That is only an approximation. In reality, the strings are stored as pointers to arrays of characters. So here is a more careful picture of the array holding "John" and "Mindy".


A starting point

Accompanying this assignment is an implementation of hash tables. It is not an application, but is a module that might be used in many applications. For this assignment, imagine that this hash table module is already part of a much larger program, and that it is used in several other modules. You are not creating a new tool, but modifying an existing one. When you modify this tool, you do not want to cause unnecessary problems for other people who maintain the other modules that use this tool.

Get the following files. File test1.cpp is a tester that you can use to test your work.

A Hash table uses an array to hold information. In table.cpp, the array is called the cell array, and is pointed to by a pointer called cells. Each cell is a structure holding two values that you can think of as a name and a telephone number. Each value is a pointer to an array of characters (a null-terminated string).

File table.h describes the types for the hash table. A table is represented as a structure, of type HashTableType, that contains a pointer to an array of cells, and that also tells how many of the cells are occupied. Each cell has type HashCellType.


Hash table modification 1

The implementation in table.cpp has the difficulty that it has a fixed size array, and the table can fill up. Your assignment is to modify the table implementation so that the table does not fill up.

It is not acceptable merely to increase the fixed size to some very large value. That would be wasteful of memory when few things are actually put into the table, and would still not solve the problem of the table getting full. Instead, your new implementation must reallocate the array when it runs out of memory. Do this by adding another component, or variable, to the HashTableType type, indicating the physical size of the array - that is, how many slots the cell array currently has. When the cell array becomes nearly full, allocate a new, larger array, and copy everything that was in the old array into the new array. Be sure to update the component that indicates the physical size of the array of cells.

When you allocate a new array, don't get just one more cell. If you do, then a growing hash table becomes very expensive, since each insertion needs to make a copy of the entire table. One idea is to double the size of the table when it is reallocated.

You will need to modify function hashFull. Do not remove it, since it has been advertised in the header file. Remember that this module is already in use in a large program. It is possible, as far as you know, that some other module uses hashFull, and you do not want your modification of this module to require modifications of other modules. The programmers who are responsible for those other modules will be upset with you if you do. Your hash tables never fill up. Think carefully about the correct implementation of hashFull. Read the contract for hashFull to see what its advertised behavior is. Think about how hashFull might be used in some other module. You want that other module to continue to work correctly, without any modification.

Hash table modification 2

A second difficulty is that this implementation allows the table to become almost completely full. A hash table of the kind implemented here that is almost completely full is very inefficient to search. It is better to keep at least 1/3 of the cells empty. Modify the implementation to reallocate the table when it becomes 2/3 full. Be careful: if you use integer arithmetic, you might get incorrect results. For example, in C++, expression 2/3 yields result 0. How can you test whether x > (2/3)y using only integer arithmetic, and no divisions? (Hint: multiply both sides by 3.)


Hints and warnings

Hint. Table.cpp contains a function called hash that is the hash function called h above. Do not try to modify it. Just leave it alone.

Hint. The starting point allocates 100 slots to the array. But if the array can grow, there is no reason to start so high. You might start by allocating 10 slots instead of 100.

Warning. In a hash table, the position of an item is computed based on a hash function, and the position computation depends on the physical size of the hash table. So when the physical size changes, the positions of entries also change. To copy the old values into the new array, you will need to do something like hashInsert calls, so that the positions are set correctly. Think about this. Notice that hashInsert copies the strings. But when you move the table to a larger array, you do not want to copy the strings. How can you handle that?

Pay attention to ownership issues. Arrays that you have allocated are yours, and nobody but you will change them. Arrays that other modules have allocated might be changed by those other modules. Do not make the brash assumption that another module will not change what is in one of its arrays. That is why hashInsert makes copies.

Hint. There is a constant called HASH_TABLE_SIZE in table.h. It is defined by a #define directive as the (fixed) physical size of a hash table. After your modification, HASH_TABLE_SIZE is no longer meaningful, since hash tables do not have a fixed size. So get rid of HASH_TABLE_SIZE. Do not use this name for a different purpose. Instead, just get rid of it. Notice that table.h says that other modules should not use HASH_TABLE_SIZE, so you are free to throw it away.

Hint. You will find it convenient to add at least one new function. (Do not try to make the existing functions a lot larger. Instead, call a new function.) Do not advertise your new function(s) in table.h! The point of table.h is to tell other modules about things that you are providing. Since no new functions are being provided to other modules, do not add them to table.h. You will lose points for advertising functions that should not be advertised.

Hint. Be sure to keep documentation up to date when you make modifications. That means you will need to look at the documentation. Do not ignore it. Incorrect documentation can be worse than no documentation at all.


Testing

Normally, you will be expected to do your own testing. For this assignment, I have written a tester for you. On Unix, you can compile and run the tester and the hash table module together using the following commands.

  g++ -Wall test1.cpp table.cpp 
  a.out
Notice that you do not write table.h.


Turning in your solution

First version

To turn in your assignment, log into one of the Unix machines in the lab. Issue the following commands.

  ~karl/3300/bin/submit 2a table.cpp table.h
You should get confirmation that your files were handed in successfully.

Second version

When the second submission is ready, hand it in using the following command.

  ~karl/3300/bin/submit 2b table.cpp table.h