Computer Science 3300
Spring 2006
Section 001
Programming Assignment 5

Assigned: March 23
Design due: April 4, 11:59pm
First version due: April 12, 11:59pm
Second version due: April 24, 11:59pm


Read the entire assignment before you start working on it.

Table of contents

  1. Background
  2. Requirements
  3. Algorithmic issues
  4. Implementation issues
  5. Design issues
  6. Planning
  7. Submitting your work


Background

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumplestiltskin occurs often, then there will be a very short code for that word. But a simpler kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with this simple form of data compression.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'   0000000001100001
'b'   0000000001100010
'c'   0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.


Requirements

Your assignment is to read a table of characters and their frequencies, and then to print an efficient code for this set of characters (with the given frequencies). The input, read from a file, consists of lines where each line contains a nonblank character followed by an integer frequency, and ended by the end of a file. The output, written to the standard output, consists of an echo of the input followed by a table showing the character codes (in any order). For example, if the input is

a 500
b 230
c 300
d 250
e 700
then the output would be as follows.
The character frequencies are:

a 500
b 230
c 300
d 250
e 700

A Huffman code is as follows.

c 00
b 010
d 011
a 10
e 11

Your program should get the name of a file to read from the command line. Use the following main program heading.

   int main(int argc, char* argv[])
Then open file argv[1] for reading. All output should be written to the standard output. If your compiled program is a.out, you run command
  a.out data.txt
to run it, reading from file data.txt.


Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, in the tree above, the code for 'b' is 01, and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. It attaches the frequency to each tree.

Now the algorithm repeatedly performs the following, as long as there are still at least two trees left in the collection of trees.

  1. Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T, and suppose that tree S has frequency FS, and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

  3. Add the new tree, with its new frequency, to the collection of trees.

When there is only one tree left in the table, the algorithm is done. That tree is the result.

Writing trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a pair of parentheses stands for a nonleaf node in the tree. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b (a c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c b) (a d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

The initial collection of trees (all leaves) with their priorities is as follows.

b 230
d 250
c 300
a 500
e 700

Combining the first two characters yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c 300
(b d) 480
a 500
e 700

The next step combines c with (b d), yielding tree (c (b d)), with a frequency of 300+480 = 780. The collection of trees and frequencies is now

a 500
e 700
(c (b d)) 780

Now combine a and e, yielding

(c (b d)) 780
(a e) 1200

The last step combines these two trees, yielding

((c (b d)) (a e)) 1980

The final tree is ((c (b d)) (a e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
For example, the code for d is 011.


Implementation Issues

Do not use global variables anywhere in this assignment with the possible exception that you can use a global variable to hold a file to which you do debug prints, when debugging is turned on.

Trees

Implement trees using pointers. Each node in a tree is a structure that tells whether the tree is a leaf (containing a character) or a nonleaf (containing two subtrees). Use the following structure for the nodes of the tree.

  enum NodeKind {leaf, nonleaf};

  struct Node {
    NodeKind kind;
    char ch;
    Node* left;
    Node* right;
    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }
    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

When you set up your initial trees (each holding just one letter) use the first constructor. When you build a composite tree, use the second constructor. For example, statement

   Node* t = new Node('a');
builds a tree that contains only the character 'a'. Statement
   Node* t = new Node(r,s);
builds a new tree whose left subtree is r and whose right subtree is s. You can tell what kind of tree you are looking at by checking its kind. For example, if t has type Node*, you might ask
   if(t->kind == nonleaf) 
   {
     ...
   }

Priority Queues

Assignment 4 describes priority queues, which are also useful for this assignment. But this time, the priorities are integers, and the items in the priority queue are trees.

When you read in the characters and their frequencies, insert each character (as a leaf node of a tree) with its frequency as its priority, into a priority queue. Now, to find the two trees with the smallest frequencies, just remove two things from the priority queue.

Be careful. The algorithm is finished when there is only one thing in the priority queue. The priority queue abstract data type does not have an operation that asks whether there is just one thing. Do not add one. Do not let the Huffman code module look directly at the representation of priority queues to decide if there is only one member. Your Huffman code module must continue to work after the priority queue module has been replaced by one that uses a different representation.

Instead, just remove the first tree. Now check whether the priority queue is empty. If so, then there was only one tree. You are done.

Printing the code

You will need a function that prints the character codes. I strongly recommend that you use recursion. Pass two parameters to the printing function, one a tree and the other a string that precedes the codes found in the tree. When you are working at a node that you reached by following edges labeled 0 and 1 (in that order) your string parameter will be "01". A simple way to add a digit to the end of a string is to use the snprintf function. Suppose that X is a null-terminated string (an array of characters terminated by a null character), and suppose that Y is another array of n characters. Then

  snprintf(Y, n, "%s0", X);
will copy string X into array Y, but put a 0 after it. For example, if X holds null-terminated string "011", then, after doing the snprintf line, Y will hold null-terminated string "0110". To add a 1 to the end instead of a 0, use
  snprintf(Y, n, "%s1", X);
Be sure that n is the size of array Y. If there is not enough room, snprintf will not try to store more than n bytes into the array. (It will only store some of the bytes in that case.)

Include header file <cstdio> to use snprintf.


Design Issues

Write this program in three modules.

  1. Module 1 is the priority queue module. Call the files pqueue.h and pqueue.cpp. Use the same implementation as for Assignment 4, but change the item and priority types.

  2. Module 2 implements trees and the Huffman code algorithm. Call its files huffman.h and huffman.cpp. Provide functions that make the Huffman code algorithm easy to write. Make the functions short and simple.

    pqueue.h will need to use type Node*. You can do that by including huffman.h in pqueue.h. But that might cause huffman.h to be included twice in other files. For example, if huffman.cpp includes both huffman.h and pqueue.h, it will end up reading huffman.h twice. That will cause compile errors. You can avoid that by writing huffman.h as follows.

    # ifndef HUFFMAN_H
    # define HUFFMAN_H
      ... (the body of file huffman.h)
    # endif
    
    The first time this file is read, HUFFMAN_H will not have been defined, so the body will be read. (ifndef asks if the name is not defined.) The second line defines HUFFMAN_H. The next time this file is read, HUFFMAN_H has been defined, and the entire body will be skipped.

  3. Module 3 contains the main program. It just reads the input file, runs the Huffman code algorithm, and prints the results.


Planning

Before starting to implement this program, develop a plan, similar to the plan used for prior assignments. Write down the plan. Then follow the plan.


Submitting Your Work

To turn in this program, log into one of the Unix computers in the lab. (You can log in remotely.) Ensure that your files are there. Change to the directory that contains those files. Then issue one of the following commands.

Design

submit 5a huffman.cpp huffman.h pqueue.cpp pqueue.h main.cpp

First version

submit 5b huffman.cpp huffman.h pqueue.cpp pqueue.h main.cpp

Second version

submit 5c huffman.cpp huffman.h pqueue.cpp pqueue.h main.cpp