Computer Science 3300
Spring 2015
Section 001
Programming Assignment 6

Assigned: Wednesday, March 25
Intermediate version due: Thursday, April 2, 11:59pm
Version (a) due: Thursday, April 9, 11:59pm
Version (b) due: Saturday, April 25, 11:59pm


Table of contents

  1. Background
  2. Requirements
  3. Algorithmic issues
  4. Tracing
  5. Implementation issues and a development plan
  6. Intermediate version
  7. Submitting your work
  8. Late submissions


Background

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Data compression is critical for transmitting videos over the internet. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumpelstiltskin occurs often, then there will be a very short code for that word.

A very simple kind of data compression for text files works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'   0000000001100001
'b'   0000000001100010
'c'   0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.


Requirements

For this assignment you will write two programs.

The first program, called huffman, gets two file names, which I will refer to as A and B, from the command line. It should do the following.

  1. Count how many occurrences of each character are in file A.

  2. If requested, show the character frequencies on the standard output, showing only characters that occur at least once.

  3. Construct a Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).

  4. If requested, write the code on the standard output.

  5. Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

For example, command

  ./huffman data.txt data.cmp
reads file data.txt and writes a compressed version into a file called data.cmp.

The second program, called unhuffman, will also be given two file names, A and B, on the command line. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. The decoded file should look identical to the file that you encoded. For example, command

  ./unhuffman data.cmp newdata.txt
reads data.cmp and writes newdata.txt.


Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes, called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each leaf contains a character. Each nonleaf has two children, the left child and the right child. We also think of the left child as the '0' child and the right child as the '1' child. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
A tree defines a code as follows. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see along the path. For example, in the tree above, the code for 'b' is 01 and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency that was counted. For example, if the frequency counts are

a 500
b 230
c 300
d 250
e 700
then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following steps, as long as there are still at least two trees left in the collection of trees.

  1. Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency FS and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

  3. Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. A leaf is shown as the character at that leaf. A nonleaf is shown as the two subtrees, surrounded by parenthese and separated by a comma. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e,c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b,(a,c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c,b),(a,d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

The initial collection of trees (all leaves) with their frequencies is as follows.

b 230
d 250
c 300
a 500
e 700

Removing the two with the smallest frequencies and combining them yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c 300
(b,d) 480
a 500
e 700

The next step combines c with (b,d), yielding tree (c,(b,d)), with a frequency of 300 + 480 = 780. The collection of trees and frequencies is now

a 500
e 700
(c,(b,d)) 780

Now combine a and e, yielding

(c,(b,d)) 780
(a,e) 1200

The last step combines the last two trees, yielding

((c,(b,d)),(a,e)) 1980

The final tree is ((c,(b,d)),(a,e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
The code for d is 011, the code for a is 10, etc.


Tracing for Huffman

Create a file trace.cpp that will contain functions for tracing. Also create file trace.h to contain prototypes for functions exported by trace.cpp.

Add a function to trace.cpp that takes a null-terminated string opt and a reference parameter v of type int. If opt is -t1, it should set v to 1. If opt is -t2, it should set v to 2. Otherwise, it should do nothing.

Write main in huffman.cpp so that it sets the trace level to 1 if -t1 occurs in the command line before the first file name, and to level 2 if the command line contains -t2. (If there is no tracing option in the command line, the tracing level should be set to 0.)

Your huffman program is required to contain tracing as follows.

  1. For trace level 1, the program should write information about character frequencies, show the tree that it produces, and show the code that it gets. For example, the output for a particular file might be as follows.

    The character frequencies are:
    
    a 500
    b 230
    c 300
    d 250
    e 700
    
    The Huffman tree is as follows: ((a,e),(c,(b,d)))
    
    A Huffman code is as follows.
    
    a 10
    b 010
    c 00
    d 011
    e 11
    

  2. If the tracing level is 2, the program should do level 1 tracing and it should also show the entire collections of trees, with their frequencies, at the beginning of Huffman's algorithm and after each step.


Implementation Issues and a Development Plan

Start by writing file huffman.cpp, the source code for the encoding program huffman. Do not try to write it all and then test it. Write a little at a time and test that part before moving on.

The main function

Make your main function set the trace level from the command line. As you write more of the program, modify main to do more.

Getting the character frequencies

Write a function to get the character frequencies. Assume the characters in the file are 8-bits (one byte), so there are 256 possible characters. See writing files using the stdio library and reading files using the stdio library.

Create an array of 256 integers to store the frequency counts, one for each possible character that can be stored in one byte. Initially, all of the counts are 0. Now read each character of the file, adding 1 to its count.

Important. You will be using characters as array indices. Do not use type char for an array index. Characters are assumed to be from −128 to 127. Read each character as an integer. If you need a character variable, use type unsigned char.

Modify the main function in huffman.cpp so that it reads the input file, counts the character frequencies and, if tracing is requested, shows the frequencies on the standard output.

Showing the frequencies

Add a function to trace.cpp that shows the character frequencies on the standard output. It needs to be passed the array of frequencies. Make the function show a newline character as (newline), a tab character as (tab) and a space as (space).

Also in trace.cpp, write a function that takes a character and writes the character's name to the standard output. For example, the name of a newline character is "(newline)". The name of 'a' is just "a". Use it for writing the code.

Trees

Create a file, tree.h, holding the following type definition.

  enum NodeKind {leaf, nonleaf};

  struct Node {
    NodeKind kind;
    char     ch;
    Node*    left;
    Node*    right;

    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }
    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

The idea is that a leaf has kind leaf and a nonleaf has kind nonleaf. The ch field should only be used in a leaf and the left and right fields should only be used for a nonleaf. You will lose points if you use the ch field of a nonleaf or the left or right fields of a leaf.

When you set up your initial trees (each holding just one letter) use the first constructor. When you build a composite tree, use the second constructor. For example, statement

   Node* t = new Node('a');
builds a leaf that contains only the character 'a'. Statement
   Node* t = new Node(r,s);
builds a new tree whose left subtree is r and whose right subtree is s. You can tell what kind of tree you are looking at by checking its kind. For example, if t has type Node*, you might ask
   if(t->kind == nonleaf) 
   {
     ...
   }

Showing trees

Add a function to trace.cpp that prints a tree to the standard output, using the notation described above. Use it for tracing, as described above.

Priority Queues

For Assignment 4 you implemented the priority queue abstract data type. Copy that implementation into the directory for assignment 6 and modify it by defining

  struct Node;
  typedef Node* PQItemType;
  typedef int   PQPriorityType;

After you have counted character frequencies, insert each character (as a leaf node of a tree) that has a nonzero count, with its frequency as its priority, into a priority queue. Now repeatedly remove two trees from the priority queue, combine them into a single tree, and insert the new tree back into the priority queue. The priority of the new tree is the sum of the priorities of the two trees that were combined.

Ensure that your algorithm only uses features of the priority queue module that are part of its interface. Do not use features that should remain hidden inside the priority queue module! You will lose a lot of points if you use non-exported features of the priority queue module.

Test what you have by building and, if tracing is requested, printing the huffman tree.

Intermediate version

Once the above functions are working, submit them as assignment 6i. Include a main function that tests them. This version should count the frequencies, show the frequencies (if requested), build the Huffman tree and show the Huffman tree (if requested).

Getting the code from the tree

Now build an array of 256 strings to hold the character codes. Use the C++ string type to make your job easy. Write a function that traverses the tree and stores the code for each character into the array. That function will need to have three parameters: the code array (so that it knows where to store the codes), the tree t (a subtree of the full huffman tree) and a string that is the path from the root of the full huffman tree to subtree t. Now ask yourself how to handle the case where t is a leaf and how to handle the case where t is a nonleaf.

Echoing the code

Add a function to trace.cpp that shows the code from the code array. If tracing is turned on, main should show the code.

(Don't show the code while building it. If you do, then any mistake in building the array will not show up when you print the code. Fill in the array first, then show what is in it.)

As above, show a newline character as (newline), a tab character as (tab) and a space as (space).

Writing the encoded file

The goal is to pack 8 bits into each byte. But before doing that, take a simpler approach where you write each bit as a character so that you can read it. After everything seems to work, you change the interface so that it writes out a binary file instead.

Get files binary.h, binary1.cpp and binary2.cpp. They provide the following tools for writing files.

Type BFILE

BFILE* openBinaryFileWrite(const char* filename)

This function opens binary file filename for writing and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened.

void writeBit(BFILE* f, int b)

This function writes bit b (either 0 or 1) into open file f. Make sure that b is either 0 or 1. It should not be '0' or '1'.

void writeByte(BFILE* f, int b)

This function writes byte or character b. into open file f.

void closeBinaryFileWrite(BFILE* f)

This function closes file f. You must close the file when you are done writing.

Both files binary1.cpp and binary2.cpp have the same interface, described by binary.h. But binary1.cpp writes a text file, writing character '0' to represent bit 0 and character '1' to represent bit 1. That makes the output file readable. Module binary2.cpp writes the file in binary format. If you link with binary1.cpp you will be able to read the "compressed" file using a text editor, but you will not get real compression. If you link with binary2.cpp you will get compression but you will not be able to read the compressed file using a text editor.

Add two functions to huffman.cpp. The first writes the huffman tree into the binary file. (You will need to do this so that you can get the tree back in order to decode the file.)

  1. To write a leaf, write a 1 bit followed by the character stored at the leaf (8 bits). So, if t is a leaf, you write t as follows.

      writeBit(f, 1);
      writeByte(f, t->ch);
    

  2. To write a nonleaf, write a 0 bit followed by the left subtree followed by the right subtree. So, if t is not a leaf, write t as follows, assuming your function is called writeTreeBinary.

      writeBit(f, 0);
      writeTreeBinary(f, t->left);
      writeTreeBinary(f, t->right);
    

Now write a function that reads file A and writes the character codes for the characters that are read into an open binary file.

Finally, write a function that writes the entire compressed file B. It should open B as a binary file and A as an ordinary (text) file. Then it should write the Huffman tree at the beginning of the file, write the codes of the characters in file A and finally close the two open files.


Tracing for Unhuffman

The next step is to write unhuffman.cpp, the source code for unhuffman. Start by writing a main function in unhuffman.cpp that sets the trace level as described earlier. You will modify it as you add more to unhuffman.cpp.

Uncompression

Use the following for reading the binary file, also described by binary.h.

BFILE* openBinaryFileRead(const char* filename)

This function opens binary file filename for reading and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened.

int readBit(BFILE* f)

This function reads one bit from file f and returns the bit (either 0 or 1). At the end of the file, readBit returns EOF.

int readByte(BFILE* f)

This function reads one byte (8 bits) from file f and returns the byte. Use it to get a byte that you wrote using writeByte. At the end of the file, readByte returns EOF.

void closeBinaryFileRead(BFILE* f)

This function closes file f. Once it is closed, you cannot read from it again.

When you link unhuffman.cpp, it is critical that you use the same implementation of binary.h as for huffman.cpp. Either link both with binary1.cpp or both with binary2.cpp.

Write a function that reads a tree. You will need to use it at the beginning of decoding to get the tree back. To read a tree, start by reading a bit. If the bit is a 1, you are looking at a leaf. Read the character (a byte) and build a leaf holding that character. But if the bit is a 0, then you are looking at a nonleaf. Just call your readTree function twice, once to build the left subtree and once to build the right subtree. Then build the whole tree by building a root with those two subtrees. You should have recovered the original tree.

Modify main to read a tree and, if requested, to show it.

Now write a function to do the uncompression. Use the tree that describes the code. As you read each bit, move down the tree in the appropriate direction. When you hit a leaf, write out the character at that leaf then start again at the root of the tree to get the next character. Keep going until you hit the end of the file.

Important note. The code for a character only has a bit in it for each nonleaf. There is no bit for a leaf. So when reading the code, only read a bit when you are at a nonleaf, to decide whether to move to the left or right. Do not read a bit when you are at a leaf.


Submitting Your Work

To turn in this program, log into Linux. Change to the directory that contains those files. Then issue one of the following commands.

Intermediate version

~abrahamsonk/3300/bin/submit 6i huffman.cpp tree.h trace.h trace.cpp pqueue.h pqueue.cpp

First version

~abrahamsonk/3300/bin/submit 6a huffman.cpp unhuffman.cpp tree.h trace.h trace.cpp pqueue.h pqueue.cpp

Second version

~abrahamsonk/3300/bin/submit 6b huffman.cpp unhuffman.cpp tree.h trace.h trace.cpp pqueue.h pqueue.cpp


Late submissions

Late submissions on version (a) will be accepted for one day after the due date. Late submissions for version (b) will be accepted for three days after the due date. If you miss a late submission deadline by a microsecond, your work will not be accepted.