Computer Science 3300
Fall 2015
Section 001
Programming Assignment 8

Assigned:	Monday, November 23
Due:	Monday, December 7, 11:59pm

Background
Assignment
Algorithmic issues
Warnings
Tracing
Implementation issues and a development plan
Submitting your work

Background

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Data compression is critical for transmitting videos over the internet. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumpelstiltskin occurs often, then there will be a very short code for that word.

A very simple kind of data compression for text files works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'		01100001
'b'		01100010
'c'		01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'		0000000001100001
'b'		0000000001100010
'c'		0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'		10
'b'		0
'c'		11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.

Assignment

For this assignment you will write two programs.

The first program, called huffman, gets two file names, which I will refer to as A and B, from the command line. It should do the following.

Count how many occurrences of each character are in file A.
If requested, show the character frequencies on the standard output, showing only characters that occur at least once.
Construct a Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).
If requested, write the code on the standard output.
Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

For example, command

  ./huffman data.txt data.cmp

reads file data.txt and writes a compressed version into a file called data.cmp.

The second program, called unhuffman, will also be given two file names, A and B, on the command line. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. The decoded file should look identical to the file that you encoded. For example, command

  ./unhuffman data.cmp newdata.txt

reads data.cmp and writes newdata.txt.

Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes, called Huffman codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each leaf contains a character. Each nonleaf has two children, the left child and the right child. We also think of the left child as the '0' child and the right child as the '1' child. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d

A tree defines a code as follows. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see along the path. For example, in the tree above, the code for 'b' is 01 and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.

where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency that was counted. For example, if the frequency counts are

then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following steps, as long as there are still at least two trees left in the collection of trees.

Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency F_S and tree T has frequency F_T.

Build a tree that looks as follows.

Say that this tree has frequency F_S + F_T, the sum of the frequencies of S and T.

Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. A leaf is shown as the character at that leaf. A nonleaf is shown as the two subtrees, surrounded by parenthese and separated by a comma. For example, tree

is written (e,c). Tree

is written (b,(a,c)). Tree

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d

is written ((c,b),(a,d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a	500
b	230
c	300
d	250
e	700

The initial collection of trees (all leaves) with their frequencies is as follows.

b	230
d	250
c	300
a	500
e	700

Removing the two with the smallest frequencies and combining them yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c	300
(b,d)	480
a	500
e	700

The next step combines c with (b,d), yielding tree (c,(b,d)), with a frequency of 300 + 480 = 780. The collection of trees and frequencies is now

a	500
e	700
(c,(b,d))	780

Now combine a and e, yielding

(c,(b,d))	780
(a,e)	1200

The last step combines the last two trees, yielding

((c,(b,d)),(a,e))

1980

The final tree is ((c,(b,d)),(a,e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d

The code for d is 011, the code for a is 10, etc.

Warnings

Start each file from the standard template.

Tracing

Create a file trace.cpp that will contain functions for tracing. Also create file trace.h to contain prototypes for functions exported by trace.cpp.

You can use a global variable to indicate the trace level, which can be 0, 1 or 2. Create the variable in trace.cpp. If your variable is called tracelevel, then trace.cpp will say

int tracelevel = 0;

Put

extern int tracelevel;

in trace.h. Any file that includes trace.h will be able to use the tracelevel variable that is created in trace.cpp.

Add a function to trace.cpp that takes a null-terminated string opt. If opt is "-t1", your function should set the trace level to 1. If opt is "-t2", it should set the trace level to 2. Otherwise, it should do nothing (leaving the trace level at its initial value, 0, indicating no tracing).

Use the following heading for main (in huffman.cpp).

  int main(int argc, char** argv)

Parameter argv is an array of null-terminated strings having argc members. The main function should call your trace-setting function with each command-line argument after the one at index 0 in argv. So just look at argv at indices from 1 to argc−1.

Your huffman/unhuffman programs are required to contain tracing as follows.

For trace level 1, the huffman program should write information about character frequencies, show the tree that it produces, and show the code that it gets. For example, the output for a particular file at trace level 1 might be as follows.
```
The character frequencies are:

a 500
b 230
c 300
d 250
e 700

The Huffman tree is as follows: ((a,e),(c,(b,d)))

A Huffman code is as follows.

a 10
b 010
c 00
d 011
e 11
```
If the tracing level is at least 2, the program should do level 1 tracing and it should also show the entire collections of trees, with their frequencies, at the beginning of Huffman's algorithm and after each step.
If the trace level is at least 1 then the unhuffman program should show the tree that it reads from the binary file, in a clear and readable format.

Implementation issues and a development plan

Development plan
1. Create files trace.cpp and trace.h. Create the trace-level variable and a function to set the trace-level, as discussed above under tracing.

2. Create file huffman.cpp, the source code for the encoding program huffman. Add a main function to it that just sets the trace-level. Try it. Does it set the trace-level correctly?

3. Write a function in huffman.cpp to get the character frequencies. Assume the characters in the file are 8-bits (one byte), so there are 256 possible characters. See writing files using the stdio library and reading files using the stdio library. Create an array of 256 integers in main to store the frequency counts, one for each possible character that can be stored in one byte. Write a function that computes the character counts. It should start by setting all of the character frequencies to 0. Then it should open the appropriate file, read the characters, and increment the count for a character each time it is seen. When it is done, it should close the file. Make the frequency counter return true if it was successful in opening the file and false if not. Make main call the frequency counter. If it returns false, then there should be a meaningful message to the user, and the program should stop. Important. You will be using characters as array indices. Do not use type char for an array index. Characters are assumed to be from −128 to 127. Read each character as an integer, where it will be a number from 0 to 255. If you need a character variable, use type unsigned char.

4. Add a function to trace.cpp that takes appropriate parameters and, if the trace level is at least 1, shows the frequencies of all characters that occur at least once. You will want another function, also in trace.cpp, that shows a character's name. The name of 'a' is just a. But the name of '\n' (the newline character) should be \n, or some other desriptive name, such as newline. Use descriptive names for the tab and space characters as well. For an unprintable character whose code is n, use \n, where you show the character code. (Include <cctype>. Then isprint(c) is true if character c is printable. You are best off passing isprint an integer code, not a character.) Now test the frequency counter. Does it seem to work?

5. Create a file, tree.h, holding the following type definition. enum NodeKind {leaf, nonleaf}; struct Node { NodeKind kind; char ch; Node* left; Node* right; Node(char c) { kind = leaf; ch = c; } Node(Node* L, Node R) { kind = nonleaf; left = L; right = R; } }; The idea is that a leaf has kind leaf* and a nonleaf has kind nonleaf. The ch field should only be used in a leaf and the left and right fields should only be used for a nonleaf. You will lose points if you use the ch field of a nonleaf or the left or right fields of a leaf.

6. Add a function to trace.cpp that prints a tree to the standard output, using the notation described above. Modify main so that it builds some trees and prints them. Does the tree printer appear to work? After it works, remove the tests of the tree printer from main.

7. Write a function in huffman.cpp that takes the array of character frequencies and returns a huffman tree based on the character frequencies. The huffman tree should have type Node. For Assignment 5 you implemented the priority queue abstract data type. Copy that implementation into the directory for assignment 8 and modify it by defining struct Node; typedef Node PQItemType; typedef int PQPriorityType; To build the huffman tree, create a priority queue. Insert a leaf into the priority queue for each character whose frequency is not 0, with its frequency as its priority. Notice that Node* t = new Node('a'); creates a leaf holding character 'a'. Now repeatedly remove two trees from the priority queue. Combine them into a single tree. Node* t = new Node(r,s); creates a new tree holding subtrees r and s. After taking the first tree out of the priority queue, check whether the priority queue is empty. If so, then the tree that you just got is the answer. So return it. Ask yourself whether you need to destroy the priority queue. Ensure that your function only uses features of the priority queue module that are part of its interface. Do not use features that should remain hidden inside the priority queue module! For example, do not make use of the information that a priority queue is represented by a linked list. Do not make direct use of type PQCell or any value of type PQCell*. You will lose a lot of points if you use non-exported features of the priority queue module.

8. Modify main so that it builds the huffman tree and, if the trace level is at least 1, shows the resulting tree. Modify your function so that, if the trace level is at least 2, it shows the entire contents of the priority queue after inserting all of the leaves and after each step of combining two trees. Use printPriorityQueue. What parameters should you pass it? Test your program. Does it seem to create a sensible Huffman tree?

9. Now, in huffman.cpp, write a function that takes the Huffman tree and returns an array of 256 strings that holds the character codes of all characters that occur at least once. (Only those characters are in the tree.) Use the C++ string type to make your job easy. Your function will actually need to have two parameters: A tree t (a subtree of the full huffman tree) and a string s that is the path from the root of the full huffman tree to the root of subtree t. Now ask yourself how to handle the case where t is a leaf and how to handle the case where t is a nonleaf. At a nonleaf, what string parameter should you pass to recursive calls?

10. Add a function to trace.cpp that shows the code from the code array. When showing a character, use your function that shows the character's name. Make main build the code and, if the trace level is at least 1, show the code. Test what you have so far.

11. The goal is to pack 8 bits into each byte. But before doing that, take a simpler approach where you write each bit as a character so that you can read it. After everything seems to work you can change it so that it writes a binary file instead. Get files binary.h, binary1.cpp and binary2.cpp. They provide the following tools for writing files. Type BFILE BFILE* openBinaryFileWrite(const char* filename) This function opens binary file filename for writing and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened. void writeBit(BFILE* f, int b) This function writes bit b (either 0 or 1) into open file f. Make sure that b is either 0 or 1. It should not be '0' or '1'. void writeByte(BFILE* f, int b) This function writes byte or character b. into open file f. void closeBinaryFileWrite(BFILE* f) This function closes file f. You must close the file when you are done writing. Both files binary1.cpp and binary2.cpp have the same interface, described by binary.h. File binary1.cpp writes a text file, writing character '0' to represent bit 0 and character '1' to represent bit 1, allowing you to read the output with a text editor, but not actually compressing. File binary2.cpp writes a binary file.

12. Add a function to huffman.cpp that writes the huffman tree into a given open binary file (type BFILE*). To write a leaf, write a 1 bit followed by the character stored at the leaf (8 bits). So, if t is a leaf, you write t as follows. writeBit(f, 1); writeByte(f, t->ch); To write a nonleaf, write a 0 bit followed by the left subtree followed by the right subtree. So, if t is not a leaf, write t as follows, assuming your function is called writeTreeBinary. writeBit(f, 0); writeTreeBinary(f, t->left); writeTreeBinary(f, t->right); Modify main so that it opens the second file name in the command line as a binary file for writing. Then it writes the Huffman tree to the binary file and closes the file.

13. Create file unhuffman.cpp, which will be the program that uncompresses a file. Write a function in unhuffman.cpp that reads the description of a tree from a given open BFILE* file, and returns the tree. Keep it simple. To read a tree, start by reading a bit. If the bit is a 1, you are looking at a leaf. Read the character (a byte) and build a leaf holding that character. But if the bit is a 0, then you are looking at a nonleaf. Read the left subtree then the right subtree. Be careful about parameters. When you encounter a nonleaf, it is important that you read the left subtree then the right subtree. When a function call f(A, B) is evaluated, expressions A and B can be evaluated in any order. If order of evaluation is important (which it is for reading a tree), do the evaluations in separate statements before doing the function call. Write a main function that opens the binary file whose name is given in the command line, reads a tree from it, and, if the trace level is at least 1, writes the tree to the standard output for debugging. Use the following for reading the binary file, also described by binary.h. BFILE* openBinaryFileRead(const char* filename) This function opens binary file filename for reading and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened. int readBit(BFILE* f) This function reads one bit from file f and returns the bit (either 0 or 1). At the end of the file, readBit returns EOF. int readByte(BFILE* f) This function reads one byte (8 bits) from file f and returns the byte. Use it to get a byte that you wrote using writeByte. At the end of the file, readByte returns EOF. void closeBinaryFileRead(BFILE* f) This function closes file f. Once it is closed, you cannot read from it again. When you link unhuffman.cpp, it is critical that you use the same implementation of binary.h as for huffman.cpp. Either link both with binary1.cpp or both with binary2.cpp.

14. Test what you have so far. Your huffman program should write a description of the tree into the binary file. Read it back with unhuffman. Is the tree correct?

15. Write a function in huffman.cpp that takes the array of character codes, an open binary file and the name of the file to read. It should reread the input file and write the codes for the characters that it reads into the binary file. Modify main so that, after writing the tree to the binary file, it also writes all of the character codes.

16. Return to unhuffman.cpp. Write a function that reads the code for a single character and writes that character. It should take the Huffman tree, an open binary file for reading and an open file for writing. As you read each bit, move down the tree in the appropriate direction. When you hit a leaf, write out the character at that leaf then start again at the root of the tree to get the next character. Keep going until you hit the end of the file. It is possible that this function encounters the end of the file. You will find it convenient for this function to return an indication of whether it was successful in reading a character code or encountered the end-of-file. Important note. The code for a character only has a bit in it for each nonleaf. There is no bit for a leaf. So when reading the code, only read a bit when you are at a nonleaf in the tree, to decide whether to move to the left or right. Do not read a bit when you are at a leaf.

17. Now write a function to do the uncompression. It should keep reading character codes and writing the corresponding character until the end-of-file is reached.

18. Test both programs. If you compress a file and then uncompress it, do you get back exactly what was in the original file? Try a file that contains tabs, spaces and has more than one line. Try some non-letter characters too.

19. Once everything is working, submit your work.

Submitting Your Work

To turn in your work, log into the Linux system, change your directory for the one for assignment 8, use the following command.

  ~abrahamsonk/3300/bin/submit 8 huffman.cpp unhuffman.cpp tree.h trace.cpp trace.h pqueue.cpp pqueue.h

After submitting, you should receive confirmation that the submission was successful. If you do not receive confirmation, assume that the submission did not work.

Command

  ~abrahamsonk/3300/bin/submit 8

will show you what you have submitted for assignment 8.

You can do repeated submissions. New submissions will replace old ones.

Late submissions

Late submissions will be accepted for one day after the due date. If you miss a late submission deadline by a microsecond, your work will not be accepted.

Computer Science 3300 Fall 2015 Section 001 Programming Assignment 8

Table of contents