Computer Science 3300
Spring 2016
Programming Assignment 8

Assigned: Wednesday, April 6
Due: Friday, April 22, 11:59pm

Table of contents

  1. Purpose of this assignment
  2. Plagiarism
  3. Background
  4. Huffman's algorithm
  5. Assignment
  6. Tracing
  7. Implementation issues and a development plan
  8. Things to watch out for
  9. Submitting your work


Purpose of This Assignment

This assignment has you use trees. It is the largest assignment that you are assigned, and gives you experience in implementing a multi-module program.

It is crucial to get started early. Use the methods that we have studied in this course. Do not cut corners. Do not revert to novice methods.

Read the entire assignment before you start on it.


Plagiarism

When faced with a large assignment, many students turn to plagiarism, especially if they have waited too long to start on the assignment.

Do not fall into that trap and submit a plagiarized assignment. Doing so will cost you points. Keep in mind that you must get an overall score of 50% or better on the programming assignments in order to pass this course, and that a plagiarised assignment gets a negative score.


Background

Data compression is concerned with economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Data compression is critical for transmitting videos over the internet and storing high-definition videos.

Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumpelstiltskin occurs often, then there will be a very short code for that word.

A very simple kind of data compression for text files works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'   0000000001100001
'b'   0000000001100010
'c'   0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.


Huffman's Algorithm

There is an algorithm that is due to Huffman for finding efficient prefix codes, called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each leaf contains a character. Each nonleaf has two children, the left child and the right child. We also think of the left child as the '0' child and the right child as the '1' child. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see along the path. For example, in the tree above, the code for 'b' is 01 and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency that was counted. For example, if the frequency counts are

a 500
b 230
c 300
d 250
e 700
then the algorithm creates five leaves, one (holding a) with frequency of 500, one (holding b) with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following steps, as long as there are still at least two trees left in the collection of trees.

  1. Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency FS and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

  3. Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. A leaf is shown as the character at that leaf. A nonleaf is shown as the two subtrees, surrounded by parenthese and separated by a comma. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e,c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b,(a,c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c,b),(a,d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

The initial collection of trees (all leaves) with their frequencies is as follows.

b 230
d 250
c 300
a 500
e 700

Removing the two with the smallest frequencies and combining them yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c 300
(b,d) 480
a 500
e 700

The next step combines c with (b,d), yielding tree (c,(b,d)), with a frequency of 300 + 480 = 780. The collection of trees and frequencies is now

a 500
e 700
(c,(b,d)) 780

Now combine a and e, yielding

(c,(b,d)) 780
(a,e) 1200

The last step combines the last two trees, yielding

((c,(b,d)),(a,e)) 1980

The final tree is ((c,(b,d)),(a,e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
The code for d is 011, the code for a is 10, etc.


Assignment

For this assignment you will write two programs.

The first program, called huffman, gets two file names, which I will refer to as A and B, from the command line. It should do the following.

  1. Count how many occurrences of each character are in file A.

  2. If requested, show the character frequencies on the standard output, showing only characters that occur at least once.

  3. Construct a Huffman tree and Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).

  4. If requested, show the tree and the code on the standard output.

  5. Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

For example, command

  ./huffman data.txt data.cmp
reads file data.txt and writes a compressed version into a file called data.cmp.

The second program, called unhuffman, will also be given two file names, A and B, on the command line. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. For example, command

  ./unhuffman data.cmp newdata.txt
reads data.cmp and writes newdata.txt. The decoded file should look identical to the file that you encoded. It will need to read the huffman tree and, if requested, show the tree.


Tracing

You are required to provide for tracing. If the command line contains -t, then tracing is turned on. Otherwise, tracing is turned off.

In Linux, it is conventional for options such as -t to occur before file names on commands. So assume that, if present, -t will come immediately after the command.

Your huffman/unhuffman programs are required to contain tracing as follows, if tracing is turned on.

  1. The huffman program should write information about character frequencies, show the tree that it produces, and show the code that it gets. For example, the output of huffman with tracing turned on might look as follows.

    The character frequencies are:
    
    a 500
    b 230
    c 300
    d 250
    e 700
    
    The Huffman tree is as follows: ((a,e),(c,(b,d)))
    
    A Huffman code is as follows.
    
    a 10
    b 010
    c 00
    d 011
    e 11
    

  2. The unhuffman program should show the tree that it reads from the binary file, in a clear and readable format.


Implementation Issues and a Development Plan

Development plan

1. Create files trace.cpp and trace.h. You can use a global variable to indicate the trace level, which can be 0, or 1. Create the variable in trace.cpp. If your variable is called tracelevel, then trace.cpp will say

  int tracelevel = 0;
and trace.h will say
  extern int tracelevel;
Any file that includes trace.h will be able to use the tracelevel variable that is created in trace.cpp.

2. Create a Makefile for this assignment. Look at Makefiles for prior assignments as guides.

3. Create file huffman.cpp, the source code for the encoding program huffman. Add a main function to it that just sets the trace-level. Use the following heading for main (in huffman.cpp).

  int main(int argc, char** argv)
Parameter argv is an array of null-terminated strings having argc members. The names of the files to work on are argv[argc−2] and argv[argc−1].

Try it. Does it set tracelevel correctly?

4. Write a function in huffman.cpp to get the character frequencies. Assume the characters in the file are 8-bits (one byte), so there are 256 possible characters. See reading files using the stdio library.

Create an array of 256 integers in main to store the frequency counts, one for each possible character that can be stored in one byte.

Write a function that computes the character counts. It should start by setting all of the character frequencies to 0. Then it should open the appropriate file, read the characters, and increment the count for a character each time it is seen. When it is done, it should close the file.

Make the frequency counting function return true if it was successful in opening the file and false if not.

Now make main call the frequency counter. If it returns false, then there should be a meaningful message to the user, and the program should stop. The program must not keep going when the file to read does not exist.

Important. You will be using characters as array indices. Do not use type char for an array index. Characters are assumed to be from −128 to 127. Read each character as an integer, where it will be a number from 0 to 255. If you need a character variable, use type unsigned char.

If you receive a warning that an array index has type char, heed the warning!

5. Add a function to trace.cpp that takes appropriate parameters and, if tracing is turned on, shows the frequencies of all characters that occur at least once in a clear and readable format.

You will want another function, also in trace.cpp, that shows a character's name. The name of 'a' is just a. But the name of '\n' (the newline character) should be \n, or some other desriptive name, such as newline. Use descriptive names for the tab and space characters as well.

For an unprintable character whose code is n, use \n, where you show the character code. (Include <cctype>. Then isprint(c) is true if character c is printable. You are best off passing isprint an integer code, not a character.)

Now make main trace the frequency counts if tracing is turned on. Test your program. Does it seem to work?

6. Create a file, tree.h, holding the following type definition.

#ifndef TREE_H
#define TREE_H

  enum NodeKind {leaf, nonleaf};

  struct Node {
    NodeKind kind;
    char     ch;
    Node*    left;
    Node*    right;

    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }

    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

#endif

The idea is that a leaf has kind leaf and a nonleaf has kind nonleaf. The ch field should only be used in a leaf and the left and right fields should only be used for a nonleaf. You will lose points if you use the ch field of a nonleaf or the left or right fields of a leaf.

You should include tree.h in any file that needs to use the tree type.

7. Add a function to trace.cpp that prints a tree to the standard output, using the notation described above. To show a leaf in the tree, show the character's name.

Modify main so that it builds some test trees and prints them. Does the tree printer appear to work? After it works, remove the tests of the tree printer from main.

8. Write a function in huffman.cpp that takes the array of character frequencies and returns a huffman tree based on the character frequencies. The huffman tree should have type Node*.

For Assignment 5 you implemented the priority queue abstract data type. Copy pqueue.h and pqueue.cpp into the directory for assignment 8 and modify it by defining

  struct Node;
  typedef Node* PQItemType;
  typedef int   PQPriorityType;

To build the huffman tree, create a priority queue. Insert a leaf into the priority queue for each character whose frequency is not 0, with its frequency as its priority. Notice that

   Node* t = new Node('a');
creates a leaf holding character 'a'.

Now repeatedly remove two trees from the priority queue. Combine them into a single tree.

   Node* t = new Node(r,s);
creates a new tree holding subtrees r and s.

After taking the first tree out of the priority queue, check whether the priority queue is empty. If so, then the tree that you just got is the answer. So return it. Ask yourself whether you need to destroy the priority queue.

Important. Ensure that your function only uses features of the priority queue module that are part of its interface. Do not use features that should remain hidden inside the priority queue module! Do not add any new features. This is very important. When I compile your program, I will link in my own priority queue module that implements the interface described in assignment 5. If you use anything that is not part of that interface, your program will not compile.

9. Modify main in huffman.cpp so that it builds the huffman tree and, if tracing is turned on, shows the resulting tree.

Test your program. Does it seem to create a sensible Huffman tree?

10. Now, in huffman.cpp, write a function buildCode that takes the Huffman tree and returns an array of 256 strings that holds the character codes of all characters that occur at least once. (Only those characters are in the tree.)

Your function will actually need to have the following parameters.

  • A tree t (a subtree of the full huffman tree)

  • The array Code to fill with the code.

  • A null-terminated string pref of 0's and 1's that is the sequence of edge labels from the root of the huffman tree to the root of t.

If t is a leaf holding character c, then add pref into Code as the code for character c.

If t is not a leaf, then buildCode should call itself on the left and right subtrees of t. The pref parameter for the call on the left subtree should be pref with '0' added to the end. The pref parameter for the call on the right subtree should be pref with '1' added to the end. (Create local arrays in buildCode and and copy pref into them using strcpy. Then concatenate "0" to the end of one and "1" to the end of the other using strcat.)

If you think about this, you can see that buildCode stores the codes of characters in t with pref added as a prefix to each.

11. Add a function to trace.cpp that shows the code from the code array. When showing a character, use your function that shows the character's name.

Make main build the code and, if tracing is turned on, show the code.

Test what you have so far.

12. The goal is to pack 8 bits into each byte. But before doing that, take a simpler approach where you write each bit as a character so that you can read it. After everything seems to work you can change it so that it writes a binary file instead.

Get files binary.h, binary1.cpp and binary2.cpp. They provide the following tools for writing files.

Type BFILE

BFILE* openBinaryFileWrite(const char* filename)

This function opens binary file filename for writing and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened.

void writeBit(BFILE* f, int b)

This function writes bit b (either 0 or 1) into open file f. Make sure that b is either 0 or 1. It should not be '0' or '1'.

void writeByte(BFILE* f, int b)

This function writes byte or character b. into open file f.

void closeBinaryFileWrite(BFILE* f)

This function closes file f. You must close the file when you are done writing.

Both files binary1.cpp and binary2.cpp have the same interface, described by binary.h. File binary1.cpp writes a text file, writing character '0' to represent bit 0 and character '1' to represent bit 1, allowing you to read the output with a text editor, but not actually compressing. File binary2.cpp writes a binary file.

13. Add a function to huffman.cpp that writes a description of a given tree into a given open binary file (type BFILE*).

  1. To write a leaf, write a 1 bit followed by the character stored at the leaf (8 bits). So, if t is a leaf, you write t as follows.

      writeBit(f, 1);
      writeByte(f, t->ch);
    

  2. To write a nonleaf, write a 0 bit followed by the left subtree followed by the right subtree. So, if t is not a leaf, write t as follows, assuming your function is called writeTreeBinary.

      writeBit(f, 0);
      writeTreeBinary(f, t->left);
      writeTreeBinary(f, t->right);
    

Modify main so that it opens the second file name in the command line as a binary file for writing. Then it writes the Huffman tree to the binary file and closes the file.

14. Create file unhuffman.cpp, which will be the program that uncompresses a file.

Write a function in unhuffman.cpp that reads the description of a tree from a given open BFILE* file, and returns the tree. Keep it simple. To read a tree, start by reading a bit. If the bit is a 1, you are looking at a leaf. Read the character (a byte) and build a leaf holding that character. But if the bit is a 0, then you are looking at a nonleaf. Read the left subtree then the right subtree.

Be careful about parameters. When you encounter a nonleaf, it is important that you read the left subtree then the right subtree. When a function call f(A, B) is evaluated, expressions A and B can be evaluated in any order. If order of evaluation is important (which it is for reading a tree), do the evaluations in separate statements before doing the function call.

Write a main function in unhuffman.cpp. It should start by turning tracing on or off. Then it opens the binary file whose name is given in the command line, reads a tree from it, and, if tracing is turned on, writes the tree to the standard output for debugging.

Use the following for reading the binary file, also described by binary.h.

BFILE* openBinaryFileRead(const char* filename)

This function opens binary file filename for reading and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened.

int readBit(BFILE* f)

This function reads one bit from file f and returns the bit (either 0 or 1). At the end of the file, readBit returns EOF.

int readByte(BFILE* f)

This function reads one byte (8 bits) from file f and returns the byte. Use it to get a byte that you wrote using writeByte. At the end of the file, readByte returns EOF.

void closeBinaryFileRead(BFILE* f)

This function closes file f. Once it is closed, you cannot read from it again.

When you link unhuffman.cpp, it is critical that you use the same implementation of binary.h as for huffman.cpp. Either link both with binary1.cpp or both with binary2.cpp.

15. Test what you have so far. Your huffman program should write a description of the tree into the binary file. Read it back with unhuffman. Is the tree correct?

16. Write a function in huffman.cpp that takes the array of character codes, an open binary file and the name of the file to read. It should reread the input file and write the code for each character that it reads into the binary file.

Modify main so that, after writing the tree to the binary file, it calls that new function to write then encoded characters.

17. Return to unhuffman.cpp. Write a function that reads the code for a single character and returns the character, as an integer. If it encounters and end-of-file, it should return EOF. This function will need take the Huffman tree and an open binary file for reading.

As you read each bit, move down the tree in the appropriate direction. When you hit a leaf, that is the answer.

Important note. The code for a character only has a bit in it for each nonleaf. There is no bit for a leaf. So when reading the code, only read a bit when you are at a nonleaf in the tree, to decide whether to move to the left or right. Do not read a bit when you are at a leaf.

18. Now write a function to do the uncompression. It should keep reading character codes and writing the corresponding character into the decoded file until the end-of-file is reached.

19. Test both programs. If you compress a file and then uncompress it, do you get back exactly what was in the original file? Try a file that contains tabs, spaces and has more than one line. Try some non-letter characters too.

20. Once everything is working, submit your work.


Things to Watch Out for

  1. Huffman and unhuffman each get the names to open from the command line. Be sure not to read fixed files. If you do, your program will not work when I test it.

  2. Only do tracing when it is turned on by option -t on the command line. Remember that -t occurs before the file names.

  3. Start each file from the standard template.

  4. Follow the coding standards. Do not become lax on those.


Submitting Your Work

To turn in your work, log into the Linux system, change your directory for the one for assignment 8, use the following command.

  ~abrahamsonk/3300/bin/submit 8 huffman.cpp unhuffman.cpp tree.h trace.cpp trace.h pqueue.h pqueue.cpp
After submitting, you should receive confirmation that the submission was successful. If you do not receive confirmation, assume that the submission did not work.

Command

  ~abrahamsonk/3300/bin/submit 8
will show you what you have submitted for assignment 8.

You can do repeated submissions. New submissions will replace old ones.

Late Submissions

Late submissions will be accepted for one day after the due date. If you miss a late submission deadline by a microsecond, your work will not be accepted.