Computer Science 3300
Spring 2007
Section 001
Programming Assignment 5

Assigned: Thursday, March 22
First version due: Monday, April 9, 11:59pm
Second version due: Friday, April 27, 11:59pm


Read the entire assignment before you start working on it.

Table of contents

  1. Background
  2. Requirements
  3. Algorithmic issues
  4. Implementation issues
  5. Design issues
  6. Planning
  7. Extra credit
  8. Submitting your work


Background

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumplestiltskin occurs often, then there will be a very short code for that word. But a simpler kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'   0000000001100001
'b'   0000000001100010
'c'   0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.


Requirements

Your assignment is to read a table of characters and their frequencies, to build an efficient code for those characters, to print code, and then to encode a file using that code. The first input, read from a file, consists of lines where each line contains a nonblank character followed by an integer frequency, and ended by the end of a file. The output of the code, written to the standard output, consists of an echo of the input followed by a table showing the character codes (in any order). For example, if the frequency input is

a 500
b 230
c 300
d 250
e 700
then the output of the code would be as follows.
The character frequencies are:

a 500
b 230
c 300
d 250
e 700

A Huffman code is as follows.

c 00
b 010
d 011
a 10
e 11

Next, your program reads a file that only contains the characters whose frequencies are given, and writes an encoded version of the file. Write only 0 and 1 for characters. Note that the file that your program writes will be larger than the original file, not smaller, because you will write the 0 and 1 characters as Ascii characters, of 8 bits each. An actual compression program needs to pack the bits together.

Particulars

Your program should get the names of a files to read from the command line. Use the following main program heading.

   int main(int argc, char* argv[])
Then argv[1] will be the name of the file that contains the character frequency information, and argv[2] is the name of the file to encode, and argv[3] is the name to give to the encoded file. If your compiled program is called a.out, you run command
  a.out freq.txt info.txt encinfo.txt
to run the program, reading character frequency information from file freq.txt, then to encode file info.txt, writing the encoding into file encinfo.txt.


Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, in the tree above, the code for 'b' is 01, and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency. For example, if the frequency input is

a 500
b 230
c 300
d 250
e 700
then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.

Now the algorithm repeatedly performs the following, as long as there are still at least two trees left in the collection of trees.

  1. Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T, and suppose that tree S has frequency FS, and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

  3. Add the new tree, with its new frequency, to the collection of trees.

When there is only one tree left in the table, the algorithm is done. That tree is the result.

Writing trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a pair of parentheses stands for a nonleaf node in the tree. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b (a c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c b) (a d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

The initial collection of trees (all leaves) with their priorities is as follows.

b 230
d 250
c 300
a 500
e 700

Combining the first two characters yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c 300
(b d) 480
a 500
e 700

The next step combines c with (b d), yielding tree (c (b d)), with a frequency of 300+480 = 780. The collection of trees and frequencies is now

a 500
e 700
(c (b d)) 780

Now combine a and e, yielding

(c (b d)) 780
(a e) 1200

The last step combines these two trees, yielding

((c (b d)) (a e)) 1980

The final tree is ((c (b d)) (a e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
For example, the code for d is 011.


Implementation Issues

Do not use global variables anywhere in this assignment with the possible exception that you can use a global variable to hold a file to which you do debug prints, when debugging is turned on.

Trees

Implement trees using pointers. Each node in a tree is a structure that tells whether the tree is a leaf (containing a character) or a nonleaf (containing two subtrees). Use the following structure for the nodes of the tree.

  enum NodeKind {leaf, nonleaf};

  struct Node {
    NodeKind kind;
    char ch;
    Node* left;
    Node* right;
    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }
    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

When you set up your initial trees (each holding just one letter) use the first constructor. When you build a composite tree, use the second constructor. For example, statement

   Node* t = new Node('a');
builds a tree that contains only the character 'a'. Statement
   Node* t = new Node(r,s);
builds a new tree whose left subtree is r and whose right subtree is s. You can tell what kind of tree you are looking at by checking its kind. For example, if t has type Node*, you might ask
   if(t->kind == nonleaf) 
   {
     ...
   }

Priority Queues

Assignment 4 describes priority queues, which are also useful for this assignment. But this time, the priorities are integers, and the items in the priority queue are trees.

When you read in the characters and their frequencies, insert each character (as a leaf node of a tree) with its frequency as its priority, into a priority queue. Now, to find the two trees with the smallest frequencies, just remove two things from the priority queue.

Be careful. The algorithm is finished when there is only one thing in the priority queue. The priority queue abstract data type does not have an operation that asks whether there is just one thing. Do not add one. Do not let the Huffman code module look directly at the representation of priority queues to decide if there is only one member. Your Huffman code module must continue to work after the priority queue module has been replaced by one that uses a different representation.

Instead, just remove the first tree. Now check whether the priority queue is empty. If so, then there was only one tree. You are done.

Printing the code

You will need a function that prints the character codes. I strongly recommend that you use recursion. Pass two parameters to the printing function, one a tree and the other a string that precedes the codes found in the tree. When you are working at a node that you reached by following edges labeled 0 and 1 (in that order) your string parameter will be "01". A simple way to add a digit to the end of a string is to use the snprintf function. Suppose that X is a null-terminated string (an array of characters terminated by a null character), and suppose that Y is another array of n characters. Then

  snprintf(Y, n, "%s0", X);
will copy string X into array Y, but put a 0 after it. For example, if X holds null-terminated string "011", then, after doing the snprintf line, Y will hold null-terminated string "0110". To add a 1 to the end instead of a 0, use
  snprintf(Y, n, "%s1", X);
Be sure that n is the size of array Y. If there is not enough room, snprintf will not try to store more than n bytes into the array. (It will only store some of the bytes in that case.)

Include header file <cstdio> to use snprintf.

Writing the encoded file

Write another function that is similar to the printing function. But instead of printing the code, it should produce an array of structures, where each structure gives a character and its encoding. Also write a function that finds the encoding (a string) of a given character in the array.

Now open the file to encode for reading, and open the result file for writing. Read one character at a time from the input file, and write its encoding to the output file, as a sequence of 0's and 1's. Do not write spaces between the character encodings. If you encounter a character that does not have an encoding, just skip it. When you are done, close both the input and the output file.

You can open a file for reading using the iostream library. Include <fstream> to use that library. Write

  ifstream in(argv[2]);
to open file argv[2] for reading. Now use in just the way you would use cin. For example, write
  in.get(c);
to read a character from in, and store that character into variable c. When you are done,
  in.close();
will close the file.

To open file argv[3] for writing, use

  ofstream out(arg[3]);
Now treat out just like cout. For example, to write a string s into the file, you say
  out << s;


Design Issues

Write this program in three modules.

  1. Module 1 is the priority queue module. Call the files pqueue.h and pqueue.cpp. Use the same implementation as for Assignment 4, but change the item and priority types. The item is now a pointer to a Node, and the priority is an integer.

  2. Module 2 implements trees and the Huffman code algorithm. Call its files huffman.h and huffman.cpp. Provide functions that make the Huffman code algorithm easy to write. Make the functions short and simple.

    pqueue.h will need to use type Node*. You can do that by including huffman.h in pqueue.h. But that might cause huffman.h to be included twice in other files. For example, if huffman.cpp includes both huffman.h and pqueue.h, it will end up reading huffman.h twice. That will cause compile errors. You can avoid that by writing huffman.h as follows.

    # ifndef HUFFMAN_H
    # define HUFFMAN_H
      ... (the body of file huffman.h)
    # endif
    
    The first time this file is read, HUFFMAN_H will not have been defined, so the body will be read. (ifndef asks if the name is not defined.) The second line defines HUFFMAN_H. The next time this file is read, HUFFMAN_H has been defined, and the entire body will be skipped.

  3. Module 3 contains the main program. It just reads the input file, runs the Huffman code algorithm, prints the code (on the standard output) and writes the encoded file.


Planning

Before starting to implement this program, develop a plan, similar to the plan used for prior assignments. Write down the plan. Then follow the plan. Do not just try to write the whole program, then test it. Instead, write a little bit, then test that, then write a little more, continuing until the entire program is done.


Extra credit

For extra credit, add another function to decode an encoded file. Write your program so that, after writing (and closing) the encoded file, it reads that file again and decodes it. Your program should call the decoded file by the same name as the encoded file, with ".decoded" after it. For example, if the compiled program is called a.out, then

  a.out freq.txt myfile.txt encoded.txt
should create files encoded.txt and encoded.txt.decoded. The decoded file should look like the original.


Submitting Your Work

To turn in this program, log into one of the Unix computers in the lab. (You can log in remotely.) Ensure that your files are there. Change to the directory that contains those files. Then issue one of the following commands.

First version

~karl/3300/bin/submit 5a huffman.cpp huffman.h pqueue.cpp pqueue.h main.cpp

Second version

~karl/3300/bin/submit 5b huffman.cpp huffman.h pqueue.cpp pqueue.h main.cpp