Computer Science 3510
Spring 2004
Programming Assignment 3

First version due: Saturday, Feb. 28, 11:59pm
Second version due: Saturday, Mar. 6, 11:59pm

Data compression

Data compression is a way of finding economical ways to encode information. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. But the simplest kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with this simple form of data compression.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character.


Choosing a prefix code

There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, the code for b is 01, and the code for a is 10. The code in the preceding example is defined by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

The algorithm to create the tree starts with a table of characters and their frequencies. It sorts the table into increasing order of frequency, so that the least frequently occuring character is first. But it thinks as each character not as a single character but as a tree that contains only that character.

Now the algorithm repeatedly performs the following, as long as there are still two trees left in the table.

  1. Remove the first two trees (those with the lowest frequencies). Suppose that they are trees S and T, and suppose that tree S has frequency FS, and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT.

  3. Insert the new tree into the table in such a way that the table is still sorted by frequency, with lower frequencies first.

When there is only one tree left in the table, the algorithm is done. That tree is the result.


Writing trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a set of parentheses stands for a nonleaf node in the tree. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b (a c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c b) (a d)).


Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

After sorting, you have

b 230
d 250
c 300
a 500
e 700

Combining the first two characters yields a tree with frequency 480. It is inserted into the table, yielding the following table.

c 300
(b d) 480
a 500
e 700

The next step combines c with (b d), yielding tree (c (b d)), with a frequency of 300+480 = 780. The table is now

a 500
e 700
(c (b d)) 780

Now combine a and e, yielding

(c (b d)) 780
(a e) 1200

The last step combines these two trees, yielding

((c (b d)) (a e)) 1980

The final tree is ((c (b d)) (a e)), which is the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
For example, the code for d is 011.


The assignment

Your assignment is to read a table of characters and their frequencies, and then to print a Huffman code for this set of characters. The input consists of lines where each line contains a nonblank character followed by a real number frequency, and ended by the end of a file. The output consists of an echo of the input followed by a table showing the character codes (in any order). For example, if the input is

a 500
b 230
c 300
d 250
e 700
then the output would be as follows.
The character frequencies are:

a 500
b 230
c 300
d 250
e 700

A Huffman code is as follows.

c 00
b 010
d 011
a 10
e 11

The remaining sections explain how to do this. Follow the design that is described here. Your program is required to be organized along these lines.

Your program should get the name of a file to read from the command line. Use the following main program heading.

   int main(int argc, char* argv[])
Then open file argv[1]. All output should be written to the standard output.


Trees

Implement trees using pointers. Each node in a tree is a structure that tells whether the tree is a leaf (containing a character) or a nonleaf (containing two subtrees). Use the following structure for the nodes of the tree.

  enum NodeKind {leaf, nonleaf};
  struct Node {
    NodeKind kind;
    char ch;
    Node* left;
    Node* right;
    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }
    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

When you set up your initial trees (each holding just one letter) use the first constructor. When you build a composite tree, use the second constructor. For example, statement

   Node* t = new Node('a');
builds a tree that contains only the character 'a'. Statement
   Node* t = new Node(r,s);
builds a new tree whose left subtree is r and whose right subtree is s. You can tell what kind of tree you are looking at by checking its kind. For example, if t has type Node*, you might ask
   if(t->kind == nonleaf) 
   {
     ...
   }


Priority queues

A priority queue is an abstract data type of sets of items. Each item in the set has an associated priority. When a priority queue object is created, it initially contains no items. The operations are as follows.

insert(x, p)

Insert item x into the priority queue, with priority p.

remove(x, p)

Remove the item from the priority queue that has the smallest priority. Set variable x to the item that was removed and set p to its priority.

empty()

Return true if the priority queue has no items in it.

Implementation of priority queues

Implement the priority queue as an abstract data type. Use the following for the implementation. Store the set of items as a linked list. Each list cell contains an item (a tree), a priority (a real number) and a pointer to the next item. So the cells of this queue have the following type.

  struct PQCell
  {
    Node* item;
    double priority;
    PQCell* next;
  };
Keep the list in sorted order, by priority, with smaller priorities closer to the beginning of the list. Then removing the item with the smallest priority is easy; just remove the first cell from the list. If the queue is empty, then remove(x, p) should set x = NULL and p = 0.

To do insert(x, p) scan the list until you find a cell with a priority that is not smaller than p, or until you encounter the end of the list, and do the insertion there. You can use either a loop or recursion to do this.

An empty queue is just an empty list, represented by a NULL pointer.


Putting the implementation together

As you read characters and their priorities, insert them into a priority queue. You do not need to worry about sorting, since the priority queue takes care of that for you. After reading all of the input, do a loop where you repeatedly remove two things from the priority queue, combine them into one tree, add their priorities and then put them back into the priority queue. Be careful about ending the loop. Ask yourself how you can know it is time to stop.

You will need a function that prints the character codes. I strongly recommend that you use recursion. Pass two parameters to the printing function, one a tree and the other a string that precedes the codes found in the tree. When you are working at a node that you reached by following edges labeled 0 and 1 (in that order) your string parameter will be "01". A simple way to add a digit to the end of a string is to use the sprintf function. Suppose that X is a null-terminated string (an array of characters terminated by a null character), and suppose that Y is another array of characters. Then

  sprintf(Y, "%s0", X);
will copy X into array Y, but put a 0 after it. For example, if X holds null-terminated string "011", then, after doing the sprintf line, Y will hold null-terminated string "0110". To add a 1 to the end instead of a 0, use
  sprintf(Y, "%s1", X);
Include header file <cstdio> to use sprintf.

Do not use global or static local variables anywhere in this assignment with the possible exception that you can use a global variable to hold a file to which you do debug prints, when debugging is turned on.


Turning in your program

Use the following file names when turning in your program.

priority.h   Header file for the priority queue module
priority.cpp   Implementation file for the priority queue module
huffman.cpp   The Huffman code module. Put the tree manipulation code here.

Use the submit program to turn in your program. The first version is called assignment 3a, and the second version is called 3b.