CS 261 Homework 7 - Huffman Encoding and Decoding

Due Mon Apr 14 at 11:59pm

Overview

In this assignment, you will pratice working with tree data structures to make one of the most useful kinds of tree: a Huffman coding tree. Huffman coding is a technique for performing data compression--that is, for taking data (such as txt files) and modifying it so that they take up less space. Huffman coding is particularly useful for text, though it can be applied to other forms of media (such as images) as well.

Huffman encoding uses the idea that all data in computers is written as a sequence of bits--that is, of 0s and 1s. With Huffman encoding, rather than using the same number of 0s and 1s for each piece of data (e.g., for each character in a text file), you instead use a smaller number of bits to store more common characters (like the letter 'e'), and a greater number of bits to store less common characters (like the letter 'x'). For example, instead of using 8 bits for every 'e' and 8 bits for every 'x', use 3 bits for 'e' and 12 bits for 'x'. Because 'e' shows up much more frequently in (English) text, encoding a large text file this way can significantly reduce its size.

Huffman trees are a type of binary tree used in performing this kind of conversion. A Huffman tree includes all the elements (characters) your data, with more common elements found closer to the root. The distance and path away from the root is used to determine what bits (what 0s and 1s) are used to represent each character. This tree can be used both for encoding and for decoding--you can look up the Huffman coding for particular characters, and see what characters are associated with particular Huffman codes

More details on Huffman trees can be found in the textbook in Section 6.1 and 6.6 (pg 345). You should look over this section carefully; it has details about the data structure and example code that may be informative for this assignment.

For this assignment, you will be implementing a Huffman Tree. You will also be completing classes that use this tree to perform encoding and decoding--a system that can encode and decode data is called a codec. You will be creating codecs for processing simple Strings in and out of their bitwise representation (the 0s and 1s, what is called a bit string), and for encoding and decoding full text files.

This is a large assignment, involving a number of moving pieces and some new concepts (trees, bit strings, etc). However, most of the tasks should be straightforward--I have identifed almost exactly what methods you will need to write. Nevertheless, be sure to start early, and ask for help if you get stuck!

This assignment should be completed individually. You are welcome to ask for help from me, or from your classmates (but remember to follow the Gilligan's Island rule)!

Objectives

To practice implementing and working with trees
To learn about and understand bitwise representation and bit strings
To practice utilizing previous data structures, such as queues and lists
To practice with other methods for file processing

Necessary Files

You will need a copy of the Hwk7.zip file. This file contains a lot of classes and files you will need to import into Eclipse. (I am providing you with lots of "starter" code that handles most of the dirty details of working with files and such, so you can focus on dealing with the Huffman tree. That said, you will need to write some code to utilize your tree data structure).

The zip file contains the following classes. You should read through them carefully; the code is well commented, with TODO comments marking methods that you will need to complete. If there are any questions, let me know.

HuffmanTree261 is an interface that your HuffmanTree will need to implement. It contains a list of public methods your HuffmanTree should have. Note that these are the only public methods your HuffmanTree should have, though you are welcome to make as many private helper methods as you like. You should not modify this class.
- More details about implementing the individual methods can be found below.
Frequency is a simple wrapper class for an element and a "count" of how frequent it is. This ADT will help with building your HuffmanTree. You should not modify this class.
- Note that the instance variables of this ADT are public, so you can easily access them.
HuffmanStringCodec is a class that provides functionalityy to encode and decode basic Strings using a HuffmanTree. This class will instantiate and call methods on your HuffmanTree. Note that I have provided some of the functionality for you--all you will need to implement are the encode() and decode() methods. You will need to modify this class.
HuffmanFileCodec is a class that provides functionality to encode and decode text files using a HuffmanTree. This class will instantiate and call methods on your HuffmanTree. I have provided most of the functionality for you--all that's left is to fill in is the decode() method so you can read an encoded file!
- The encode() function is the most complicated part of this class, but will act as a template (particularly in terms of file IO methods) for writing your decode() method. Be sure and look through this method and its comments carefully!
- Note also the static helper methods bitStringToByte() and byteToBitString() which will let you work with bit strings even with files (that are a lot easier to debug than actual bytes!). You can see how these methods are used in the encode() method.
HuffmanTester is a runner class for the codecs. This includes code to demonstrate that String encoding/decoding works, and some code to enable you to easily work with files. Uncomment the appropriate code based on what you are testing. You may also want to write your own tester to check indivdiual methods.

The zip file also contains a couple of files you can use to test your program: hello.txt is a tiny file which is great for testing (because it only has a few characters); gettysburg.txt is the text of Lincoln's Gettysburg address, which adds a few more complications; alice.txt is the complete text of Lewis Carroll's Alice's Adventures in Wonderland--this is the kind of large text file you can compress using your program!

The zip file also contains a secret message ("secret.encoded") that I encoded using this method; in the end, you should be able to decode this file and see what it says! (Note that this has not be robustly tested across systems; if you have problems (but everything else works), let me know).

And last but not least, the zip file also contains the README.txt for this assignment, which you will need to fill out.

A Short Introduction to Bit Strings

Adapted from http://www.cs.duke.edu/csed/poop/huff/info/

In a computer, all data is encoded a sequence of bits--that is, as a list of 0s and 1s. These sequences, called bit strings, are binary numbers--numbers in base 2. So in effect, all data can be seen as a sequence of numbers.

You may recall from the first Hwk that it is possible to cast a char into an int, or to talk about the int representation of a char. For example, the int value of 'a' was 97. This is called the Unicode value or ASCII value (Unicode is like an extension of ASCII--supporting more strange characters). Since 'a' is represented by 97, that means an 'a' in a String can also be represented by the binary value of 97: or 1100001. This list of 0s and 1s would be the bit string version of 'a'.

Note that there are seven (7) 0s and 1s used to represent the character. This is because it takes 7 bits to represent a letter. (Really it takes 8 bits--a full byte--but in ASCII we don't use the "high" bit. So 'a' has a leading 0 that we're ignoring).
Every character has a unique number and bit string that represents it (you can even see a full list in a nice browsable interface here).

Of course, we don't necessarily need the full 7 bits to represent some text. For example, the String "go go loggers!" only has 8 different characters (including the space and exclamation point), so we could encode these characters using only 3 bits each:

ASCII coding
char	ASCII	binary
g	103	1100111
o	111	1101111
l	108	1101100
e	101	1100101
r	114	1110010
s	115	1110011
!	33	0100001
space	32	0100000

3-bit coding
char	code	binary
g	0	000
o	1	001
l	2	010
e	3	011
r	4	100
s	5	101
!	6	110
space	7	111

The string "go go loggers!" would be written (coded numerically) as 103 111 32 103 111 32 108 111 103 103 101 114 115 32. Although not easily readable by humans, this would be written as the following bit string (the spaces would not be written, just the 0's and 1's):

1100111 1101111 0100000 1100111 1101111 0100000 1101100 1101111 1100111 1100111 1100101 1110010 1110011 0100001

This is how the computer sees the String! Note that this takes 14*7=98 bits (98 0s or 1s).

If we use the 3-bit encodings, we can write the same String (coded numerically) as 0 1 7 0 1 7 2 1 0 0 3 4 5 6, or as a bit string:

000 001 111 000 001 111 010 001 000 000 011 100 101 110

Again, the spaces wouldn't be included in the bit String. This only takes 14*3 = 42 bits (42 0s and 1s), which is less than half the size!

More bits can be saved if we a variable number of bits. For example, if we only used 2 bits to encode the g, o, and space (which are used frequently), but used 4 bits to encode the l, e, r, s, and !, our bit string would only have (9*2)+(5*4)=38 bits. This is the essence of Huffman Encoding--an algorithm for determining an "optimal" number of bits based on the frequency of characters used.

Huffman Trees

A Huffman Tree is a binary tree (really a binary trie) that encodes bit strings for character in the "path" from the root to that character. If you trace the branch from the root to the character, writing down a 0 whenever you go left and a 1 whenever you go write, you'll have written down the bit string for that character.

Huffman Trees are then constructed so that more frequent elements are at a higher level of the tree, so have a shorter path from the root. This means that they have shorter bit strings (in terms of the number of 0s and 1s), so take less bits to encode.

For example, we might produce the following Huffman tree that provides the following encoding (for the letters in "go go loggers!"):

Huffman coding
char	bit string
g	11
o	00
l	0110
e	0111
r	010
s	1010
!	1011
space	100

You can decode a particular bit string using a Huffman Tree by starting at the root of the tree, and then traveling down the branches by following the bits in the bit string. Every time you come to a leaf (to a character), write down that character and go back to the root--and keep going through the bit string. This lets you convert bit strings back into letters!

For practice, try decoding the following bit string using the Huffman table above:

010001010011110011001111011001111010

Pro Tip: this table and bit string can provide good "known" values for you to test with!

Again, more details and examples of Huffman Trees can be found in the textbook, as well as at http://www.cs.duke.edu/csed/poop/huff/info/.

Assignment Details

Your assignment is to create a HuffmanTree and complete the provided codecs, thereby allowing you to encode and decode both Strings and text files.

This assignment can be seen as being made up of three parts: implementing the HuffmanTree, completing the HuffmanStringCodec, and completing the HuffmanFileCodec. You will need to implement the HuffmanTree first; however, you may wish to work on the HuffmanStringCodec simultanously, in order to help test your tree's functionality.

First, have you carefully read and followed the above discussion of bit strings? You should have!

HuffmanTree

The main class you will be creating will be the HuffmanTree class. This class should be an ADT representing a Huffman Tree.

Your HuffmanTree class should be a generic class (being able to encode a variable data type--not just Characters!). It should also implement the HuffmanTree261<E> interface.
A HuffmanTree is a kind of BinaryTree. This suggests a few different ways to implement the ADT. One strategy would be to make it a subclass of the BinaryTree class we've discussed during lecture. Another strategy (the one demonstrated in the textbook) would be to have it include a BinaryTree as an instance variable, thus acting as a kind of "wrapper" class.

However, the easiest and cleanest implementation for this assignment will be to make the HuffmanTree a brand new class that has the same structure (and some of the same methods) as a BinaryTree.
- The HuffmanTree's inner Node<E> class will different from the "normal" BinaryTree node in two ways:
  1. It should have an instance variable that is the "weight" of the Node. This will be used when building the HuffmanTree from a set of element frequencies.
  2. It should have an instance variable that is the bit string of the item that Node contains. While this information duplicates that established by the tree's structure (i.e., you can figure out the Node's bit string by traversing too it), storing the information will make it much easier and faster to fetch the codes for all the elements in the tree.
  3. It might also be handy to give the Nodes a method that returns whether or not it is a leaf.
- Since the HuffmanTree will be a BinaryTree, you will likely want to implement/copy-in some of the same methods (such as getData(), getLeftSubTree(), getRightSubTree(), and a toString() traversal methods). You'll also probably want an isLeaf() method that returns if the HuffmanTree is just a leaf (that is, if it has no subtrees).
  - You can also use the BinaryTree as an example for constructors that may be handy!

Once you have the inner class, instance variables, and constructors put together, you can start implementing the methods required by the HuffmanTree261 interface.

void add(E element, String bitString)
I recommend you implement this method first. This will allow you to create a HuffmanTree with an explicit set of elements and codes--for example, you could all this method multiple times to create the example tree and encoding table for "go go loggers!" described above. Such direct creation is very handy for testing!
- For this method, you will need to iterate through the characters in the provided bitString. Use these character to move a pointer to the "current node" down the tree, until you find the correct spot to add the element (this is a process somewhat similar to adding an element into an ordered linked-list!) If the character is a '0', go to the left. If the character is a '1', go to the right.
  - Note that if the current node has no child where you want to move (e.g., you have a '0' so want to go left, but there is no left child), then you will need to create a new Node with no element inside it in order to fill out the tree. This is expected behavior; only the leaves of Huffman Trees have data--the "inner" nodes will be empty. Not having a path to go down tells you that you need to create that empty node!
  - You should iterate through all but the last character of the bitString--this should take you to the parent of the Node you're going to add. Then you can use the final character to determine whether to add your new Node (with the given element) to the left or to the right.
    - Remember to also assign the bitString to the node's instance variable!
- You can test this element by printing out the tree using one of the tree traversal printers, or by inspecting the nodes inside the debugger. The tree should have the same structure as the one above!
String getBitString(E element)
A good next step is to implement the getBitString() method. This method should search the tree for the Node containing the given element, returning that Node's bitString when you find the node (see why we stored the bitString?)
- It is possible to perform this search recursively, using a depth-first search (similar to the traversals we've done in class, and that you did to print the elements of the tree). However, the whole point of the HuffmanTree is that we're more likely to encode elements that are near the top of the tree (that's why they are near the top). Thus instead you should use a breadth-first search for efficiency. This will let you check the top nodes (those with the shortest bitStrings and the most common) first, decreasing the amount of Nodes you actually need to look at.
- Recall that a breadth-first search uses a queue to iterate through the nodes: push the root onto the queue. Then while the queue isn't empty, pop an element off the queue, check if that's what you're looking for, and if not push it's children onto the queue.
  - You can use the Queue class that we implemented in lecture, or you can just use a LinkedList, with the addLast() method acting as push and the removeFirst() method acting as pop.
- Again, you can test this method with the example table above. Build your HuffmanTree using the previous add() method, and then test that getting each element gives you the bitString you expect!
String getCodeList()
Once you've written the getBitString() method, it's easy to figure out the getCodeList() method.
- This method should also use a breadth-first traversal (with a queue)--but in this case you'll be visiting all the nodes, and building a String of the elements and their codes.
- It is very important that you get the format of the returned String correct. Look carefully at the provided JavaDoc for what this String should look like. (File encoding and decoding will depend on this!)
- It should be easy to test this method--just print out the returned String and make sure it looks correct! You can keep using your same testing table.
  - In fact, once you have this method, it will become an invaluable tool for testing the rest of your program!
void createFromFrequencies(List<Frequency<E>> frequencies)
The last HuffmanTree method is the most involved. It fills the tree from the provided list of frequencies.
- The algorithm for building the tree from a set of frequencies is described on page 348 of the textbook (along with a detailed example, and even sample code!). The basic idea is to use a Priority Queue of "subtrees" (where each subtree initially has a single element and frequency or "weight"), and then combine the trees with the smallest weights to make a single tree with a bigger weight. Put that bigger tree back in the priority queue and keep going until you've combined everything into a single tree. That remaining tree is your completed HuffmanTree. To restate the textbook:
```
make new HuffmanTree for each item
push all the trees into a PriorityQueue
while the priority queue has more than 1 item
    pop the first two trees
    set them as the subtrees of a new HuffmanTree whose weight is the combined weight of the children
    push the new tree back into the queue
```
  - The Java Collections framework has a built-in PriorityQueue class that you can use for this purpose. This PriorityQueue will order items by their "natural ordering"
    - Note that the PriorityQueue class calls push offer() and pop poll().
    - In order to give the HuffmanTree class a "natural ordering," you'll have to make it implement the Comparable<HuffmanTree<E>> interface. The compareTo() method should return a number based on the difference in weights of the root nodes (so if this.root.weight < other.root.weight, return -1, etc).
    - Remember a class can implement multiple interfaces by separating them with a comma
  - Once you've collapsed everything down to a single HuffmanTree object, you can have the current tree "take over" that by having this.root point to that tree's root! The wonders of pointers!
- There's actually one more piece to this, which requires another method. After you've built the HuffmanTree, you'll need to go through and make sure that each of the leaf nodes have stored their proper bitString (for use with the getBitString() method).
  - You might call this method assignCodes() or something similar. Make it private, since it is just a helper method.
  - The easiest way to do this is to recurse through the tree, building up each bitString as you go, and then assigning that bitString whenever you hit a leaf (which is also when you stop recursing). Note that your method will actually have two recursive calls--one that goes down the left branch and one that goes down the right.
  - The textbook has an example of this recursive algorithm in the "printCode()" method described on page 351. Use that as a starting point!
  - Remember to make a launcher for this method so it has a nice interface, and to call the method at the very end of your createFromFrequencies() method!
- This method is a bit harder to test. Try calling it with the frequencies listed in the book's examples, or with the frequencies from the "go go loggers!" example (in fact, you can use some of the HuffmanStringCodec methods to do this easily).
  - You can make sure your HuffmanTree looks accurate by inspecting the bitstring codes via the previously implemented methods. Do the most common letters have shorter bitStrings?

Be sure and test and comment your class thoroughly. Once this is tested and working, you can put it aside and just finish up the codecs!

HuffmanStringCodec

The HuffmanStringCodec and the HuffmanFileCodec have 3 primary methods: createTree() which sets up a HuffmanTree, encode() which encodes a message using the HuffmanTree, and decode() which takes an encoded message and converts it back to normal.

With the HuffmanStringCodec, the createTree() method has been provided for you, (since the real work in setting up the tree occurs in your createFromFrequencies() method).
- And since this method already exists, you can use it to test your createFromFrequencies() method!
String encode(String input)
This is a straightforward method to write: simply iterate through the provided String, and convert each character into a bitString using the HuffmanTree. Combine all those bitStrings and return them.
- This method only makes sense if the HuffmanTree has been created. If it has not (i.e., if the variable is null), you should have the method throw an IllegalStateException. See the getEncoding() method at the bottom of the class for an example.
  - Be sure that your Exception includes an informative method about what the exception is!
- You can test this method with the same sample HuffmanTree and inputted Strings; see if you get the bitString you expect!
String decode(String bitString)
This method will require you to interact closely with the HuffmanTree. In order to decode a bit string, you'll need to iterate through each bit, and use whether that is a 0 or a 1 to "traverse" down a branch of the HuffmanTree. Once you hit a leaf, you can add the letter contained in that leaf to your output String.
- This is the very algorithm that was described in the section above, and is the core of the HuffmanTree usage. Note that this algorithm is also explained and demonstrated in the textbook on page 352. In fact, your code will look very similar to that!
- This method only makes sense if the HuffmanTree has been created. If it has not (i.e., if the variable is null), you should have the method throw an IllegalStateException. See the getEncoding() method at the bottom of the class for an example.
- This should be easy to test--simply check that the bitString you've generated gives back the original message

HuffmanFileCodec

For the HuffmanFileCodec, the createTree() and encode() methods are provided for you, in order to help handle all the file processing. There are also a few additional helper methods. Read through the provided methods carefully, and make sure you understand them. The encode() method in particular has a few components that you will need to reuse in your decode() method. More information about the provided code is discussed in the "Necessary Files" section above.

void decode(File encodedFile, PrintStream out)
The one method you will need to fill in is the decode() method. It's not too complicated, but there are a number of pieces you'll need to handle.
- You'll need to read through the file byte by byte. You will want to use a BufferedReader to do this--the process will be very similar to that used in the encode() method.
  - Remember to close the reader when you're done with it!
- The first thing you should do is read the first line from the file using the readLine() method (this works exactly like Scanner's nextLine()). This first line is the list of HuffmanCodes used to encode the file. You should take this line of codes and use them to create an appropriate HuffmanTree by calling the relevant createTree() method.
  - Because we've replace the "newline" and "return" characters in the encoded String, you should "re-replace" the Strings with the appropriate characters (<newline> with "\n" and <return> with "\r").
- The remainder of this method will work very similarly to the encode() method from the HuffmanStringCodec--and in fact you may be able to reuse some of that code.
  - Rather than iterate through the bits as characters in a String, you'll need to read a "chunk" of bits (a byte) from the file using the read() method. This will give you a byte (an int), which you can convert to a bitString using the appropriate helper method.
  - When you run out of bits in the byte you've read, you can read another byte in order to keep navigating down the tree!
  - After you've identified a character from some bits, you will want to output that character. Note that this method takes in a PrintStream as a parameter--a PrintStream is basically a place we can output characters. System.out is one such PrintStream; so you can imagine that someone has passed the System.out object as a parameter! A file can also have a PrintStream, so that we can "print" to a file the same way you would to System.out.
    - In other words: take the character you've decoded and print() it to the PrintStream!
  - There is one more important piece: files are encoded with a special character EOT which marks the end of the file (this is necessary because there are a few extra 0s used as padding, and we don't want to mistake those for an extra letter!). So if you're navigating down the tree and you get to the EOT character, you will need to stop reading through the bitString and the file (basically quit out of your loops).
- This method is the hardest to test (that is, the hardest to debug); your best bet is to encode a file (using the provided method) and then decode it to see if you get the right output. If you don't, you can try following the individual bits (0s and 1s) and see if they are letting you navigate your tree correctly.

And that should do it! Make sure all your methods and classes are fully functionally and documented. If there are any problems, let me know.

Timeline

This is a long, involved assignment. There are a numerous pieces, but you should be able to get them all done in the time allotted. A slightly aggressive timeline is as follows:

Start by reading through the assignment carefully, particularly checking that you understand the idea of a bit string and how the HuffmanTree will work. Do this over the weekend of Mar 29-30.
Start by creating the HuffmanTree class. Fill in the inner Node, the BinaryTree methods it will use (you can adapt the code from the lectures, found on Moodle), and filling in the skeleton of any required methods. Do this by Mon 03/31.
Implement the add() method next; this helps with testing. Do this by Wed 04/02
Implement the getBitString() method by Thurs 04/03 and the getCodeList() method by Fri 04/04.
You should implement the createFromFrequencies method and otherwise finish up your HuffmanTree by Mon 04/07.
Next implement the two methods for the HuffmanStringCodec. You should be able to do this by Wed 04/09 (they're pretty straightforward).
Finally, implement the decode method of the HuffmanFileCodec. This method is harder, but if you should be able to finish it by Fri 04/11, leaving you the weekend to catch up if you fall behind and to study for the exam!.

Extensions

There are a couple of places you could improve or extend this assignment:

One of the flaws with this approach is that we need to store the codelist in the file; this takes up extra room. There are other ways to store the codelist such that it takes up less room. For example, rather than writing out the whole bitString, you might write out a number representation. For example:"g:3,o:0,l:6,e:7,r:2,s:10,!:11, :4" You'll need to handle leading 0s...
Huffman encoding can be applied to more than just txt--as I mentioned in class, it is part of the common JPG compression scheme. You can encode an image by applying the encoding algorithm to the colors of pixels, rather than to the characters of a String. Then simply store a list of the "encoded" pixel colors, and decode by reading them out to refill the pixel array (e.g., using a BufferedImage).
- You can use your same (generic) HuffmanTree, just write a new codec!
- You will get better compression with black and white images, since they are more likely to have "frequent" colors that give you the compression.
- Indeed, a naive approach to this often won't provide much space savings. Independent research can find the ways to do more effective Huffman encoding on images--it involves taking a "difference" between an image and an "average" of the nearby pixels.

Submitting

BEFORE YOU SUBMIT: make sure your code is fully documented and functional! If your code doesn't run, I can't give you credit! Your name should be in the class comment at the top of each class.

Submit all your classes to the Hwk7 submission folder on hedwig. Make sure you upload your work to the correct folder!. You should upload your entire src folder with all your classes. Be sure to complete the provided README.txt file with details about your program.

The homework is due at midnight on Monday, Apr 14. Note that we also have the second exam that day; getting the homework done earlier will give you more chance to study!

Grading

This assignment will be graded on approximately the following criteria:

You have implemented the HuffmanTree class as a BinaryTree [10%]
You have implemented the add() method [10%]
You have implemented the getBitString() and getCodeList() methods [15%]
You have implemented the createFromFrequencies() method and its helpers [15%]
You have completed the HuffmanStringCodec class [20%]
You have implemented the decode() method of the HuffmanFileCodec class [10%]
Your program demonstrates good programming and object-oriented style (e.g., keep non-public methods and variables private!) [10%]
Your program is fully documented with complete JavaDoc comments [5%]
You have completed the included README.txt file [5%]