Computer Science 3300
Fall 2013
Section 001
Programming Assignment 4

Assigned: Tuesday, October 22
First version due: Friday, Nov. 8, 11:59pm
Second version due: Friday, Nov. 22, 11:59pm


Read the entire assignment before you start working on it.

Table of contents

  1. Background
  2. Requirements
  3. Algorithmic issues
  4. Implementation issues
  5. Writing binary files
  6. Design issues
  7. Planning
  8. Submitting your work


Background

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumplestiltskin occurs often, then there will be a very short code for that word. But a simpler kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'   01100001
'b'   01100010
'c'   01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'   0000000001100001
'b'   0000000001100010
'c'   0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'   10
'b'   0
'c'   11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.


Requirements

For this assignment you will write two programs. The first program will be given two file names, which I will refer to as A and B. It should do the following.

  1. Count how many occurrences of each character are in file A.

  2. Show the character frequencies on the standard output, showing only characters that occur at least once.

  3. Construct a Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).

  4. Write the code on the standard output.

  5. Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

The second program will also be given two file names, A and B. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. The decoded file should look identical to the file that you encoded.

The source file for the encoder will be called huffman.cpp and the source file for the decoder will be called unhuffman.cpp. I will assume that the executable files are called huffman and unhuffman.

Getting the file names

Your programs should get the file names called A and B above from the command line. Use the following heading for main.

   int main(int argc, char* argv[])
Then argv[1] will be A and argv[2] will be B. For example, if your executable program is called huffman, then command
  huffman info.txt encinfo.txt
runs huffman, encoding file "info.txt" into file "encinfo.txt". Similarly, command
  unhuffman encinfo.txt encinfo.decoded
decodes encinfo.txt and writes the decoded file into encinfo.decoded.

The standard output

If the file being encoded contains only characters a, b, c, d and e, the huffman program might show the following information on the standard output, in addition to encoding and decoding files.

The character frequencies are:

a 500
b 230
c 300
d 250
e 700

A Huffman code is as follows.

a 10
b 010
c 00
d 011
e 11


Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, in the tree above, the code for 'b' is 01 and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency that was counted. For example, if the frequency input is

a 500
b 230
c 300
d 250
e 700
then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following, as long as there are still at least two trees left in the collection of trees.

  1. Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency FS and tree T has frequency FT.

  2. Build a tree that looks as follows.

                  .
                 / \
               0/   \1
               /     \
              S       T
    
    Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

  3. Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a pair of parentheses stands for a nonleaf node in the tree. For example, tree

                   .
                  / \
                0/   \1
                /     \
               e       c
is written (e c). Tree
                    .
                   / \
                 0/   \1
                 /     \
                b       .
                       / \
                     0/   \1
                     /     \
                    a       c
is written (b (a c)). Tree
               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d                         
is written ((c b) (a d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a 500
b 230
c 300
d 250
e 700

The initial collection of trees (all leaves) with their frequencies is as follows.

b 230
d 250
c 300
a 500
e 700

Removing the two with the smallest frequencies and combining them yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c 300
(b d) 480
a 500
e 700

The next step combines c with (b d), yielding tree (c (b d)), with a frequency of 300 + 480 = 780. The collection of trees and frequencies is now

a 500
e 700
(c (b d)) 780

Now combine a and e, yielding

(c (b d)) 780
(a e) 1200

The last step combines those two trees, yielding

((c (b d)) (a e)) 1980

The final tree is ((c (b d)) (a e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d
For example, the code for d is 011.


Implementation Issues

Do not use global variables anywhere in this assignment, with the possible exception that you can use a global variable to hold a flag for turning debugging on or off, and you can also have a global variable that holds a file to which you do debug prints, when debugging is turned on.

Getting the character frequencies

Write a function to get the character frequencies. Assume the characters in the file are 8-bits (one byte), so there are 256 possible characters.

Create an array of 256 integers to store the frequency counts, one for each possible character that can be stored in one byte. Initially, all of the counts are 0. Now read each character of the file, adding 1 to its count.

You can open a file for reading using the <fstream> library. Write

  ifstream in(fname);
to open file fname for reading. Now use in just the way you would use cin. When you are done,
  in.close();
will close the file.

Do not skip over blanks. If variable c has type int, then

  c = in.get();
stores the integer code of the next character into variable c, regardless of what the next character is. If there are no more characters in the file, then it sets c to EOF (which is defined to be -1). So you would typically say
  c = in.get();
  if(c != EOF) ...
Be sure to make variable c have type int, not char.

Trees

Implement trees using pointers. Each node in a tree is a structure that tells whether the tree is a leaf (containing a character) or a nonleaf (containing two subtrees). Use the following structure for the nodes of the tree.

  enum NodeKind {leaf, nonleaf};

  struct Node {
    NodeKind kind;
    char     ch;
    Node*    left;
    Node*    right;

    Node(char c)
    {
      kind = leaf;
      ch   = c;
    }
    Node(Node* L, Node *R)
    {
      kind  = nonleaf;
      left  = L;
      right = R;
    }      
  };

When you set up your initial trees (each holding just one letter) use the first constructor. When you build a composite tree, use the second constructor. For example, statement

   Node* t = new Node('a');
builds a leaf that contains only the character 'a'. Statement
   Node* t = new Node(r,s);
builds a new tree whose left subtree is r and whose right subtree is s. You can tell what kind of tree you are looking at by checking its kind. For example, if t has type Node*, you might ask
   if(t->kind == nonleaf) 
   {
     ...
   }

Priority Queues

Assignment 3 describes priority queues, which are also useful for this assignment. But this time, the priorities are integers, and the items in the priority queue are trees. Make a copy of the priority queue module, and change the definitions of PQItemType and PQPriorityType appropriately.

After you have counted character frequencies, insert each character (as a leaf node of a tree) that has a nonzero count, with its frequency as its priority, into a priority queue. Now, to find the two trees with the smallest frequencies, just remove two things from the priority queue. Remember that removing something from a priority queue automatically gets the one with the smallest priority.

Be careful. The algorithm is finished when there is only one thing in the priority queue. The priority queue abstract data type does not have an operation that asks whether there is just one thing in the priority queue. Do not add such an operation. You should use the priority queue module unchanged from the preceding assignment except for the definitions of PQItemType and PQPriorityType. Do not let the Huffman code module look directly at the representation of priority queues to decide if there is only one member. Your Huffman code module must continue to work after the priority queue module has been replaced by one that uses a different representation.

Instead, just remove the first tree. Now check whether the priority queue is empty. If so, then there was only one tree. You are done. Otherwise, remove a second tree and continue (combining the two trees into one and inserting the combined tree back into the priority queue).

Getting the code from the tree

Now build an array of 256 strings to hold the character codes. Write a function that traverses the tree and stores the code into the array.

Pass three parameters to that function: a tree, the array where the codes are being stored and a string that precedes the codes found in the tree. The string parameter is the path that you took to get to the root of this subtree. When you are working at a node that you reached by following edges labeled 0 and 1 (in that order) your string parameter will be "01".

For this assignment, you can use the C++ string type. Include <string> to use it. Statement

  string s = "";
makes s be an empty string. Operator + concatenates strings. For example,
  string s = "00";
  string r = s + "1";
makes r hold string "001". You can convert s from type string to type const char* (a null-terminated string that you cannot change) using s.c_str().

But be careful. The memory occupied by s.c_str() only lives as long as s does. So if string s is a local variable, stored in the run-time stack, then s.c_str() is also a local variable, and disappears as soon as s does.

To store a code in the array, you will want to ensure that it is stored in the heap, so that you do not lose the memory. Here is a simple function that returns a copy of a null-terminated string, with the copy allocated in the heap.

  char* dupstr(const char* s)
  {
    char* space = new char[strlen(s) + 1];
    strcpy(space, s);
    return space;
  }

Echoing the code

Write a function to show the code, from the array. (Don't show the code while building the array. If you do, then any mistake in building the array will not show up when you print the code. Build the array first, then show it.)

Note that some of the characters can be blanks and newlines. For newline, write \n. So your output might be

  a 100
    111
  \n 00
  ...
where the second line shows the code for a space.

Writing the encoded file

The goal is to pack 8 bits into each byte. But before doing that, take a simpler approach, where you write each bit into a byte. That way, the file will be a text file that you can look at for testing your program. After everything seems to work, you change the interface so that it writes out a binary file instead.

Get files binary1.h and binary1.cpp. That module provides the following for writing files.

Type BFILE

BFILE* openBinaryFileWrite(const char* filename)

This function opens binary file filename for writing and returns a structure that describes the open file. It returns NULL if the file cannot be opened.

void writeBit(BFILE* f, int b)

This function writes bit b (either 0 or 1) into open file f.

void writeByte(BFILE* f, int b)

This function writes byte or character b. into open file f.

void closeBinaryFileWrite(BFILE* f)

This function closes file f. You must close the file when you are done writing. Once it is closed, you cannot write to it again.

Using those functions, write two functions. First write a function that writes the huffman tree into the binary file. (You will need to get the tree back in order to decode the file.)

  1. To write a leaf, write a 1 bit followed by the character stored at the leaf. So, if t is a leaf, you write t as follows.

      writeBit(f, 1);
      writeByte(f, t->ch);
    

  2. To write a nonleaf, write a 0 bit followed by the left subtree followed by the right subtree. So, if t is not a leaf, write t as follows, assuming your function is called writeTreeBinary.

      writeBit(f, 0);
      writeTreeBinary(f, t->left);
      writeTreeBinary(f, t->right);
    

Now write a function that reads a file A and writes out an encoded file B. It should write the Huffman tree at the beginning of the file. Then it just reads each character, finds its code in the code array and writes out the code, one bit at a time.

Decoding

The next step is to write the decoder. Use the following for reading the binary file, also provided by binary1.cpp.

BFILE* openBinaryFileRead(const char* filename)

This function opens binary file filename for reading and returns a structure that describes the open file. It returns NULL if the file cannot be opened.

int readBit(BFILE* f)

This function reads one bit from file f and returns the bit (either 0 or 1). At the end of the file, readBit returns EOF.

int readByte(BFILE* f)

This function reads one byte (8 bits) from file f and returns the byte. Use it to get a byte that you wrote using writeByte. At the end of the file, readByte returns EOF.

void closeBinaryFileRead(BFILE* f)

This function closes file f. Once it is closed, you cannot write to it again.

Write a function that reads a tree. You will need to use it at the beginning of decoding to get the tree back. To read a tree, start by reading a bit. If the bit is a 1, you are looking at a leaf. Read the character (a byte) and build a leaf holding that character. But if the bit is a 0, then you are looking at a nonleaf. Just call your readTree function twice, once to build the left subtree and once to build the right subtree. Then build the whole tree by building a root with those two subtrees. You should have recovered the original tree.

Now write a function to do the decoding. Use the tree that describes the code. As you read each bit, move down the tree in the appropriate direction. When you hit a leaf, write out the character at that leaf then start again at the root of the tree to get the next character. Keep going until you hit the end of the file.

Use openBinaryFileRead, etc. for reading the encoded file. But just use the fstream library to write the decoded file. If you have included <fstream>, then

  ofstream out(filename);
opens a file whose name is filename. (Really, filename is an array of characters that holds the name, as a null-terminated string.) The new object out works like cout, except that everything you write goes into that file. For example,
  out << c;
will write c into the file.


Writing Binary Files

After you have tested the functions, replace files binary1.h and binary1.cpp by files binary2.h and binary2.cpp. Change your include line to include binary2.h. Those functions will pack bits into bytes. Make sure that your program works! You should see some compression.


Design Issues

Write this program in five modules. (Don't be intimidated by that. You have already written one of those modules (priority queues), I provide another (reading and writing binary files), and the third, tree.h, is mostly a matter of cutting and pasting from this assignment. So you really only have two modules to write.)

  1. Module 1 is the priority queue module. Call the files pqueue.h and pqueue.cpp. Use the same implementation as for Assignment 3, but change the item and priority types. The item is now a pointer to a Node, and the priority is an integer.

    File pqueue.h will need to use type Node*. The simplest way to do that is to write

      struct Node;
    
    in pqueue.h. That will allow you to create a variable of type Node*, but not one of type Node.

  2. Module 2 consists initially of files binary1.h and binary1.cpp. Later, you will replace it by binary2.h and binary2.cpp.

  3. Module 3 consists of file tree.h that describes the tree type. (I already that for you. Just copy and paste.) Module 3 might also include tree.cpp, holding shared functions for working with trees, if you decide to put those functions into a separate file. You can also choose to put all tree functions into modules 4 and 5 instead.

  4. Module 4 implements trees and the Huffman code algorithm, and is responsible for encoding files. Call it huffman.cpp. Of course, it will need to include tree.h. Provide functions that make the Huffman code algorithm easy to write. Make the functions short and simple. Include a main function that performs the encoding.

  5. Module 5 contains the decoding program. Call it unhuffman.cpp. It will also need to include tree.h.

If you have any functions on trees that are needed by both huffman.cpp and unhuffman.cpp, put them into a file tree.cpp, so that they only need to be written once, and put prototypes for them into tree.h.


Planning

Here is a suggested plan for writing this program. It will look like a lot of steps. On the good side, each step is relatively short, if you approach it the right way. On the other side, if you wait until the last minute to start, you will have no chance of finishing. Always leave yourself time for snags.

  1. Create tree.h, describing trees.

  2. Get pqueue.h and pqueue.cpp. Modify the item and priority types.

  3. Be sure that you have files binary1.h, binary1.cpp, binary2.h and binary2.cpp.

  4. Build a Makefile for compiling your programs. Use binary1.cpp, not binary2.cpp, in your link. Use prior Makefiles that I have provided as examples.

  5. Begin writing huffman.cpp. Write a function that counts the character frequencies and a function that shows the character frequencies. Test them.

  6. Write a function that builds a Huffman tree, using the character frequencies. Test it. You will find it helpful to have a function that shows a tree for debugging. Just show the tree using the notation that was used above.

  7. Write a function that builds the array holding the Huffman code from the tree. Test it.

  8. Write a function that writes a tree in binary. Test it using binary1.cpp.

  9. Write a function that reads the input file and encodes it, still using binary1.cpp. Test it.

  10. You should now have a working huffman.cpp program, including a main program, that writes one bit per byte.

  11. Now turn to unhuffman.cpp. Write a function that reads a tree in binary form, so that you can get the tree back from the file. Continue to use binary1.cpp. Test it.

  12. Write a function to decode the file. It should read the tree and then read the encoded file bit by bit, walking down the tree until it hits a leaf. Test this, using binary1.cpp.

  13. Now you should have a working unhuffman.cpp program that works on files that use an entire byte for each bit. You are almost done.

  14. Modify huffman.cpp and unhuffman.cpp to include binary2.h. Modify your makefile to use binary2.cpp. Now you have programs that encode in binary. Test them.


Notes on Grading

Here are some things to do and mistakes to avoid, with a range of points that you might lose (out of 100) for each one.

  1. Write your name, the course and the assignment number in a comment at the top of the program, along with the tab stop distance. (up to 2 points)

  2. Failure to compile (100 points). Your program is expected to compile. A program that has fatal compile errors will receive a grade of 0.

  3. Compiler warnings (up to 5 points). Your program should not have any compiler warnings.

  4. Failure to follow the design (up to 100 points). The design requires you two write certain functions. It also says what those functions must do. It is not acceptable to substitute a different design.

    The design does not call for global variables, with the possible exception of a file for debugging. Do not use global variables. Using even one global variable means that you are not following the design, and can lead to a very high loss of points.

  5. Partial program. If only part of the program is written, you will receive some credit, depending on how much is written, and how well that part is written. Do not count on a great deal of credit for a partial program that cannot be run, or that only reads and echoes the input.

  6. Incorrect results (up to 30 points). The program should produce correct results.

  7. Memory faults (up to 30 points). If your program encounters a memory fault then it does not do what it is supposed to do. Ensure that you do not use an uninitialized or dangling pointer. Do not use an index that is outside the bounds of an array.

  8. Incorrect modules (up to 15 points). Break the program into modules as described in the design.

  9. Violating the abstraction of the PriorityQueue abstract data type (up to 20 points) If huffman.cpp or unhuffman.cpp contain anything that is supposed to be private to the pqueue.cpp module, then you have violated the priority queue abstraction. You will lose substantial points for that.

  10. Incorrect use of PQItemType and PQPriorityType (up to 5 points). Your priority queue module (pqueue.h and pqueue.cpp) should be written so that changing the types of items involves changing only the definition of type PQItemType, and changing the priority type requires only changing the definition of PQPriorityType.

    Types PQItemType and PQPriorityType have just one purpose: to make it easy to modify the priority queue module. So they should not occur at all in any other module. If your graph.cpp module mentions PQItemType or PQPriorityType, it is wrong. In graph,cpp, you know that the priorities are real numbers and the items are integers.

  11. Private functions advertised in header file (up to 3 points). File pqueue.h should say as little as possible about things that should only be used in pqueue.cpp. It should contain a prototype for insert, since that is something that other modules need to use. But it should not contain a prototype for functions, such as insertCell and removeItem, that are only intended to be used in pqueue.cpp to help implement the other functions.

  12. Violating the abstraction of the binary read/write module (up to 10 points). Use the functions provided. Do not peek at the implementation and try to use private features in it.

  13. Poor or missing contracts (up to 15 points). Every function must have a clear and precise contract. A contract tells what a function does, not how it works. A contract tells precisely how every one of its parameters affects what the function does, and refers to parameters by name.

  14. Functions do not do what they are supposed to do, or return incorrect results according to their contracts (up to 7 points per function).

  15. Crippled functions (up to 8 points). A crippled function is one that does not really do its job, but requires its caller to do part of the job.

  16. Memory leaks (up to 5 points). When you remove something from a priority queue, be sure to delete memory that becomes inacessible. Ensure that you do not have a memory leak when you read in a graph.

    Do not worry about deleting everything just before your program finishes. That is done automatically anyway.

  17. Poor indentation (up to 10 points). You are expected to indent your function definitions correctly. Minor problems will not lose much, but seriously bad indentation can lose 10 points.

  18. Failure to explain tabs (2 points). Be sure that you either do not use tabs in your program or that you tell where your tab stops are at the top of the program. Different editors are set up with different tab stops. For example, write

      // tab stops: every 4 characters
    
    If you do not know where the tab stops are, do not use tabs.

  19. Excessively complicated algorithms (up to 5 points per algorithm). Use recursion where it is appropriate. Keep algorithms simple.

  20. Excessive inefficiency (up to 5 points). For decoding, use the tree, not the code array. It is simpler and more efficient.

  21. Long lines, margin comments (up to 5 points). Avoid long lines. As a rule of thumb, keep lines at most 80 characters long. Avoid margin comments (comments written in the right margin telling what the code to the left of them does). Those comments rarely say anything useful, and they make lines long and clutter the program. If you need to write a comment inside a function, write a paragraph comment telling what the next paragraph of code does.

  22. Packing functions together (up to 1 point). Put a blank line between function definitions. There is no need to pack them together. That makes the program more readable.

    But also do not put huge blank sections in your program consisting of several blank lines in a row. A long space is two blank lines. Longer than that is excessive.

  23. system("pause") (up to 5 points). Some programmers write system("pause") at the end of main to compensate for a flaw in the design of a Windows development environment. Since that is a nonstandard thing, I must edit your program to get rid of it before I can test your program. I should not have to fix your program for any reason.


Submitting Your Work

To turn in this program, log into one of the Unix computers in the lab. (You can log in remotely.) Ensure that your files are there. Change to the directory that contains those files. Then issue one of the following commands.

First version

~abrahamsonk/3300/bin/submit 4a huffman.cpp unhuffman.cpp tree.h pqueue.cpp pqueue.h

Second version

~abrahamsonk/3300/bin/submit 4b huffman.cpp unhuffman.cpp tree.h pqueue.cpp pqueue.h