Computer Science 2530
Fall 2016
Programming Assignment 8

Assigned:	Wednesday, November 16
Due:	Monday, December 5, 11:59pm

Purpose of this assignment
Plagiarism
Background
Huffman's algorithm
Assignment
Tracing
Implementation issues and a development plan
Things to watch out for
Submitting your work

Purpose of This Assignment

This assignment has you use trees. It is the largest assignment that you are assigned, and gives you experience in implementing a multi-module program.

It is crucial to get started early. Use the methods that we have studied in this course. Do not cut corners. Do not revert to novice methods!

Read the entire assignment before you start on it.

Plagiarism

When faced with a large assignment, many students turn to plagiarism, especially if they have waited too long to start on the assignment.

Do not fall into that trap and submit a plagiarized assignment. Doing so will cost you points. Keep in mind that you must get an overall score of 50% or better on the programming assignments in order to pass this course, and that a plagiarised assignment gets a negative score.

Background

Data compression is concerned with economical ways to encode information, for storing files using limited disk space, or for transfering files over a network. Data compression is critical for transmitting videos over the internet and storing high-definition videos.

Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumpelstiltskin occurs often, then there will be a very short code for that word.

A much simpler kind of data compression for text files works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'		01100001
'b'		01100010
'c'		01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'		0000000001100001
'b'		0000000001100010
'c'		0000000001100011

Some characters are used more than others. In fact, you very likely don't use some characters at all. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently, and to have no code at all for a character that is not used.

For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'		10
'b'		0
'c'		11

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes, without having any character boundaries stored. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.

Huffman's Algorithm

There is an algorithm that is due to Huffman for finding efficient prefix codes. The input to the algorithm is a table of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each leaf contains a character. Each nonleaf has two children, the left child and the right child. We think of the left child as the '0' child and the right child as the '1' child. An example is the following tree.

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d

You find the code for a character by following the path from the root of the tree to that character, writing down the 0's and 1's that you see along the path. For example, in the tree above, the code for 'b' is 01 and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.

where 'b' has code 0, 'a' has code 10 and 'c' has code 11.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the number of occurrences of the character in the file. For example, if the frequency counts are

(so 'a' occurs 500 times, etc.) then the algorithm creates five leaves, one (holding a) with frequency of 500, one (holding b) with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following steps, as long as there are still at least two trees left in the collection of trees.

Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency F_S and tree T has frequency F_T.

Build a tree that looks as follows.

Say that this tree has frequency F_S + F_T, the sum of the frequencies of S and T.

Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. A leaf is shown as the character at that leaf. A nonleaf is shown as the two subtrees, surrounded by parenthese and separated by a comma. For example, tree

is written (e,c). Tree

is written (b,(a,c)). Tree

               .
              / \
            0/   \1
            /     \
           /       \
          .         .
         / \       / \
       0/   \1   0/   \1
       /     \   /     \
      c      b   a      d

is written ((c,b),(a,d)). We use this notation to demonstrate the algorithm.

Example

Suppose there are five letters, a, b, c d and e. After counting the characters in a document, you get the following counts.

a	500
b	230
c	300
d	250
e	700

The initial collection of trees (all leaves) with their frequencies is as follows.

b	230
d	250
c	300
a	500
e	700

Removing the two with the smallest frequencies and combining them yields a tree with frequency 480. It is inserted into the collection, yielding the following.

c	300
(b,d)	480
a	500
e	700

The next step combines c with (b,d), yielding tree (c,(b,d)), with a frequency of 300 + 480 = 780. The collection of trees and frequencies is now

a	500
e	700
(c,(b,d))	780

Now combine a and e, yielding

(c,(b,d))	780
(a,e)	1200

The last step combines the last two trees, yielding

((c,(b,d)),(a,e))

1980

The final tree is ((c,(b,d)),(a,e)), which is the following tree.

                    .
                   / \
                 0/   \1
                 /     \
                /       \
               .         .
              / \       / \
            0/   \1   0/   \1
            /     \   /     \
           c       .  a      e 
                  / \
                0/   \1
                /     \
               b       d

The code for d is 011, the code for a is 10, etc.

Assignment

For this assignment you will write two programs.

The first program, called huffman, gets two file names, which I will refer to as A and B, from the command line. It should do the following.

Count how many occurrences of each character are in file A.
If requested, show the character frequencies on the standard output, showing only characters that occur at least once.
Construct a Huffman tree for those characters.
If requested, show the tree on the standard output, using the notation described above.
Construct a Huffman code from the Huffman tree (ignoring characters that do not occur at all).
If requested, the code on the standard output.
Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

For example, command

  ./huffman data.txt data.cmp

reads file data.txt and writes a compressed version into a file called data.cmp.

The second program, called unhuffman, will also be given two file names, A and B, on the command line. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. For example, command

  ./unhuffman data.cmp newdata.txt

reads data.cmp and writes newdata.txt. The decoded file should look identical to the file that you encoded.

If requested, Unhuffman should show the tree that it gets from file A.

Tracing

You are required to provide for tracing, as outlined above (at if requested lines). If the command line contains -t, then tracing is turned on. Otherwise, tracing is turned off.

In Linux, it is conventional for options such as -t to occur before file names on commands. Command

  ./huffman -t data.txt data.cmp

request compression with tracing.

Your huffman/unhuffman programs are required to contain tracing as follows, if tracing is turned on.

The huffman program should show the character frequencies, show the Huffman tree that it produces, and show the code that it gets. For example, the output of huffman with tracing turned on might look as follows.
```
The character frequencies are:

a 500
b 230
c 300
d 250
e 700

The Huffman tree is as follows: ((c,(b,d)),(a,e))

A Huffman code is as follows.

a 10
b 010
c 00
d 011
e 11
```
The unhuffman program should show the tree that it reads from the binary file. If it is not identical to the tree that was produced by huffman, there is trouble.

Implementation Issues and a Development Plan

Development plan
1. Create files trace.cpp and trace.h. You can use a global variable to indicate the trace level, which can be 0 or 1. Create the variable in trace.cpp. If your variable is called tracelevel, then trace.cpp will say int tracelevel = 0; and trace.h will say extern int tracelevel; Any file that includes trace.h will be able to use the tracelevel variable that is created in trace.cpp. Add a function that takes argc and argv and sets the trace level to 1 if the command line contains -t.

2. A Makefile is provided. Look at it to see what the targets are and what they do.

3. Create file huffman.cpp, the source code for the compression program huffman. Add a main function to it that just sets the trace-level. Use the following heading for main (in huffman.cpp). int main(int argc, char** argv) Parameter argv is an array of null-terminated strings having argc members. The names of the files to work on are argv[argc−2] and argv[argc−1]. Try it. Does it set tracelevel correctly?

4. Write a function in huffman.cpp to get the character frequencies. Assume the characters in the file are 8-bits (one byte), so there are 256 possible characters. See reading files using the stdio library. Create an array of 256 integers in main to store the frequency counts, one for each possible character that can be stored in one byte. Write a function that computes the character counts. It should start by setting all of the character frequencies to 0. Then it should open the appropriate file, read the characters, and increment the count for a character each time the character is seen. When it is done, your function should close the file. Make the frequency counting function return true if it was successful in opening the file and false if not. What are appropriate parameters to pass to the frequency counter? Remember that it is a specialist. It counts character frequencies in a given file. Don't make it more aware of the context in which it sits than it needs to be. Now make main call the frequency counter. If it returns false, then there should be a meaningful message to the user, and the program should stop. The program must not keep going when the file to read does not exist. Important. You will be using characters as array indices. Do not use type char for an array index. Values of type char are assumed to be from −128 to 127. That won't work for array indices. Read each character as an integer, where it will be a number from 0 to 255. If you need a character variable, use type unsigned char. If you receive a warning that an array index has type char, heed the warning!

5. Add a function to trace.cpp that takes appropriate parameters and, if tracing is turned on, shows the frequencies of all characters that occur at least once in a clear and readable format. You will want another function, also in trace.cpp, that prints a character's name. The name of 'a' is just a. But the name of '\n' (the newline character) should be \n, or some other desriptive name, such as newline. Use descriptive names for the tab and space characters as well. For an unprintable character whose code is n, use \n, where you show the character code. (Include <cctype>. Then isprint(c) is true if character c is printable (and is not a white-space character). Pass isprint an integer code, not a character.) Now make main trace the frequency counts if tracing is turned on. Test your program. Does it seem to work?

6. Create a file, tree.h, holding the following type definition. #ifndef TREE_H #define TREE_H enum NodeKind {leaf, nonleaf}; struct Node { NodeKind kind; char ch; Node* left; Node* right; Node(char c) { kind = leaf; ch = c; } Node(Node* L, Node R) { kind = nonleaf; left = L; right = R; } }; typedef Node Tree; #endif The idea is that a leaf has kind leaf and a nonleaf has kind nonleaf. The ch field must only be used in a leaf and the left and right fields mus only be used for a nonleaf. You will lose points if you use the ch field of a nonleaf or the left or right fields of a leaf. You should include tree.h in any file that needs to use the tree type.

7. Add a function to trace.cpp that prints a tree to the standard output, using the notation described above. To show a leaf in the tree, show the character's name, not the raw character. Modify main so that it builds some test trees and prints them. Does the tree printer appear to work? After it works, remove the tests of the tree printer from main.

8. Write a function in huffman.cpp that takes the array of character frequencies and returns a huffman tree based on the character frequencies. The huffman tree should have type Tree. For Assignment 5 you implemented the priority queue abstract data type. Copy pqueue.h and pqueue.cpp into the directory for assignment 8 and modify it by defining #include "tree.h" typedef Tree PQItemType; typedef int PQPriorityType; To build the huffman tree, create a priority queue. Insert a leaf into the priority queue for each character whose frequency is not 0, with its frequency as its priority. Notice that Tree t = new Node('a'); creates a leaf holding character 'a'. Now repeatedly remove two trees from the priority queue. Combine them into a single tree. Tree t = new Node(r,s); creates a new tree with left subtree r and right subtree s. After taking the first tree out of the priority queue, check whether the priority queue is empty. If so, then the tree that you just got is the answer. So return it. Should you destroy the priority queue? Important. Ensure that your function only uses features of the priority queue module that are part of its interface. Do not use features that should remain hidden inside the priority queue module! Do not add any new features to pqueue.cpp. This is very important. When I compile your program, I will link in my own priority queue module that implements the interface described in assignment 5. *If you use anything that is not part of that interface, your program will not compile.* Important. The function that builds the Huffman tree is an expert in building Huffman trees. Do not make other functions (except those that are helpers for the tree builder) know anything about how a Huffman tree is built from an array of frequencies. In particular, no other function should know that a priority queue is involved. The tree builder should not know more about the rest of the program than it needs to know in order to do its job. Choose its interface in a sensible way.

9. Modify main in huffman.cpp so that it builds the huffman tree and, if tracing is turned on, shows the resulting tree. Test your program. Does it seem to create a sensible Huffman tree?

10. Now, in huffman.cpp, write a function buildCode that takes the Huffman tree t and an array Code that can hold 256 strings. This function should fill in the Code array so that, if character c occurs at a leaf in t then Code[c] holds the code of c, as a null-terminated string. If character c does not occur in tree t, then Code[c] should be NULL. Your function will need to have the following parameters. A tree t (a subtree of the full huffman tree) The array Code to fill with the code. A null-terminated string pref of 0's and 1's that is the sequence of edge labels from the root of the huffman tree to the root of t. If your function is called buildCode, then, for each character c in t with code r (according to t), buildCode(t, Code, pref) adds code pref+t for character c to array Code. If t is (e, g) then buildCode(t, Code, "01") stores code "010" for e and code "011" for g. Now you see why that parameter is called pref. If t is a leaf holding character c, then add pref into Code> as the code for character c. If t is not a leaf, then buildCode should call itself on the left and right subtrees of t. The pref parameter for the call on the left subtree should be pref with '0' added to the end. The pref parameter for the call on the right subtree should be pref with '1' added to the end. (Create local arrays in buildCode and and copy pref into them using strcpy. Then concatenate "0" to the end of one and "1" to the end of the other using strcat.)

11. Add a function to trace.cpp that shows the code from a code array. When showing a character, use your function that shows the character's name. Make main build the code and, if tracing is turned on, show the code. Test what you have so far.

12. The goal is to pack 8 bits into each byte. But before doing that, take a simpler approach where you write each bit as a character so that you can read it. After everything seems to work you can change it so that it writes a binary file instead. Get files binary.h, binary1.cpp and binary2.cpp. They provide the following tools for writing files. Type BFILE BFILE* openBinaryFileWrite(const char* filename) This function opens binary file filename for writing and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened. void writeBit(BFILE* f, int b) This function writes bit b (either 0 or 1) into open file f. Make sure that b is either 0 or 1. It should not be '0' or '1'. void writeByte(BFILE* f, int b) This function writes byte or character b. into open file f. void closeBinaryFileWrite(BFILE* f) This function closes file f. You must close the file when you are done writing. Both files binary1.cpp and binary2.cpp have the same interface, described by binary.h. File binary1.cpp writes a text file, writing character '0' to represent bit 0 and character '1' to represent bit 1, allowing you to read the output with a text editor, but not actually compressing. File binary2.cpp writes a binary file, packing 8 bits into each byte.

13. Add a function to huffman.cpp that writes a description of a given tree into a given open binary file (type BFILE). Note that this is not* the same as your trace function that writes a tree for debugging. To write a leaf, write a 1 bit followed by the character stored at the leaf (8 bits). So, if t is a leaf, you write t as follows. writeBit(f, 1); writeByte(f, t->ch); To write a nonleaf, write a 0 bit followed by the left subtree followed by the right subtree. So, if t is not a leaf, write t as follows, assuming your function is called writeTreeBinary. writeBit(f, 0); writeTreeBinary(f, t->left); writeTreeBinary(f, t->right); Modify main so that it opens the second file name in the command line as a binary file for writing. Then it writes the Huffman tree to the binary file and closes the file.

14. Create file unhuffman.cpp, which will be the program that uncompresses a file. You will want a file written by huffman, holding a description of a tree. Write a function in unhuffman.cpp that reads a tree whose description is in a given open BFILE* file, and returns the tree. (Open the file in main.) Keep it simple. To read a tree, start by reading a bit. If the bit is a 1, you are looking at a leaf. Read the character (a byte) and build a leaf holding that character. But if the bit is a 0, then you are looking at a nonleaf. Read the left subtree then the right subtree. When you encounter a nonleaf, it is important that you read the left subtree then the right subtree, in that order. According to the definition of C++, when a function call f(A, B) is evaluated, expressions A and B can be evaluated in any order. To avoid that, do the recursive calls in separate statements. Use the following for reading the binary file, also described by binary.h. BFILE* openBinaryFileRead(const char* filename) This function opens binary file filename for reading and returns a pointer to a structure that describes the open file. It returns NULL if the file cannot be opened. int readBit(BFILE* f) This function reads one bit from file f and returns the bit (either 0 or 1). At the end of the file, readBit returns EOF. int readByte(BFILE* f) This function reads one byte (8 bits) from file f and returns the byte. Use it to get a byte that you wrote using writeByte. At the end of the file, readByte returns EOF. void closeBinaryFileRead(BFILE* f) This function closes file f. Once it is closed, you cannot read from it again. Write a main function in unhuffman.cpp. It should start by turning tracing on or off. Then it opens the binary file whose name is given in the command line, reads a tree from it, and, if tracing is turned on, writes the tree to the standard output for debugging. Make the trace readable. When you link unhuffman.cpp, it is critical that you use the same implementation of binary.h as for huffman.cpp. Either link both with binary1.cpp or both with binary2.cpp.

15. Test what you have so far. Your huffman program should write a description of the tree into the binary file and your unhuffman program should read it back correctly.

16. Write a function in huffman.cpp that takes the array of character codes, a binary file that is opened for writing, and the name of the file to read. It should reread the input file and write the code for each character that it reads into the open binary file. Modify main so that, after writing the tree to the binary file, it calls that new function to write then encoded characters. Note. Do not close the binary file between writing the tree and writing the character codes. Only close it after doing both steps.

17. Return to unhuffman.cpp. Write a function that reads the code for a single character and returns the character, as an integer. If it encounters and end-of-file, it should return EOF. This function will need take the Huffman tree and an open binary file for reading. As you read each bit, move down the tree in the appropriate direction. When you hit a leaf, that is the answer. Important note. The code for a character only has a bit in it for each nonleaf. There is no bit for a leaf. So when reading the code, only read a bit when you are at a nonleaf in the tree, to decide whether to move to the left or right. Do not read a bit when you are at a leaf.

18. Now write a function to do the uncompression. It should keep reading character codes and writing the corresponding character into the decoded file until the end-of-file is reached.

19. Test both programs. If you compress a file and then uncompress it, do you get back exactly what was in the original file? Try a file that contains tabs, spaces and has more than one line. Try some non-letter characters too.

20. Once everything is working, submit your work.

Things to Watch Out for

Huffman and unhuffman each get the names of files to open from the command line. Be sure not to read or write fixed files. If you do, your program will not work when I test it.
Only do tracing when it is turned on by option -t on the command line. Remember that -t occurs before the file names.
Start each file from the standard template.
Follow the coding standards. Do not become lax on those.

Submitting Your Work

To turn in your work, log into the Linux system, change your directory for the one for assignment 8, use the following command.

  ~abrahamsonk/2530/bin/submit 8 huffman.cpp unhuffman.cpp tree.h trace.cpp trace.h pqueue.h pqueue.cpp

If you created a file tree.cpp, add that to the list.

After submitting, you should receive confirmation that the submission was successful. If you do not receive confirmation, assume that the submission did not work.

Command

  ~abrahamsonk/2530/bin/submit 8

will show you what you have submitted for assignment 8.

You can do repeated submissions. New submissions will replace old ones.

Late Submissions

Late submissions will be accepted for 24 hours after the due date. If you miss a late submission deadline by a microsecond, your work will not be accepted.

Computer Science 2530 Fall 2016 Programming Assignment 8

Table of contents