CS 261 Homework 6 - Huffman Encoding and Decoding

Due Tues Nov 18 at 11:59pm

Overview

In this assignment, you will pratice working with tree data structures to make one of the most useful kinds of tree: a Huffman coding tree. Huffman coding is a technique for performing data compression--that is, for taking data (such as txt files) and modifying it so that they take up less space. Huffman coding is particularly useful for text, though it can be applied to other forms of media (such as images) as well.

Huffman encoding uses the idea that all data in computers is written as a sequence of bits--that is, of 0s and 1s. With Huffman encoding, rather than using the same number of 0s and 1s for each piece of data (e.g., for each character in a text file), you instead use a smaller number of bits to store more common characters (like the letter 'e'), and a greater number of bits to store less common characters (like the letter 'x'). For example, instead of using 8 bits for every 'e' and 8 bits for every 'x', use 3 bits for 'e' and 12 bits for 'x'. Because 'e' shows up much more frequently in (English) text, encoding a large text file this way can significantly reduce its size.

Huffman trees are a type of binary tree used in performing this kind of conversion. A Huffman tree includes all the elements (characters) of your data, with more common elements found closer to the root. The distance and path away from the root is used to determine what bits (what 0s and 1s) are used to represent each character. This tree can be used both for encoding and for decoding--you can look up the Huffman coding for particular characters, and see what characters are associated with particular Huffman codes. In effect, the tree lets us map between characters and bits!

More details on Huffman trees can be found in the textbook in Section 6.1 and 6.6 (pg 345). You should look over this section carefully; it has details about the data structure and example code that may be informative for this assignment.

For this assignment, you will be implementing a Huffman Tree. You will also be implementing classes that use this tree to perform encoding and decoding--a system that can encode and decode data is called a codec. You will be creating codecs for processing simple Strings in and out of their bitwise representation (the 0s and 1s, what is called a bit string), and for encoding and decoding full text files.

This is a large assignment, involving a number of moving pieces and some new concepts (trees, bit strings, etc). However, most of the tasks should be straightforward--I have identifed almost exactly what methods you will need to write. Nevertheless, be sure to start early, and ask for help if you get stuck!

This assignment should be completed individually. You are welcome to ask for help from me, or from your classmates (but remember to follow the Gilligan's Island rule)!

Objectives

Necessary Files

You will need a copy of the Hwk6.zip file. This file contains a lot of classes and files you will need to import into Eclipse. (I am providing you with lots of "starter" code that handles some of the dirty details of working with files and such, so you can focus on dealing with the Huffman tree. That said, you will need to write some code to utilize your tree data structure).

The zip file contains the following classes. You should read through them carefully; the code is well commented, with TODO comments marking methods that you will need to complete. If there are any questions, let me know.

The zip file also contains a couple of files you can use to test your program: hello.txt is a tiny file which is great for testing (because it only has a few characters); gettysburg.txt is the text of Lincoln's Gettysburg address, which adds a few more complications; alice.txt is the complete text of Lewis Carroll's Alice's Adventures in Wonderland--this is the kind of large text file you can compress using your program!

The zip file also contains a secret message ("secret.encoded") that I encoded using this program; in the end, you should be able to decode this file and see what it says! (Note that this has not be robustly tested across systems; if you have problems (but everything else works), let me know).

And last but not least, the zip file also contains the README.txt for this assignment, which you will need to fill out.

Whew!

A Short Introduction to Bit Strings

Adapted from http://www.cs.duke.edu/csed/poop/huff/info/

In a computer, all data is encoded a sequence of bits--that is, as a list of 0s and 1s. These sequences, called bit strings, are binary numbers--numbers in base 2. So in effect, all data can be seen as a sequence of numbers.

You may recall from the LetterSet lab that it is possible to cast a char into an int, or to talk about the int representation of a char. For example, the int value of 'a' was 97. This is called the Unicode value or ASCII value (Unicode is like an extension of ASCII--supporting more strange characters). Since 'a' is represented by 97, that means an 'a' in a String can also be represented by the binary value of 97: or 1100001. This list of 0s and 1s would be the bit string version of 'a'.

Of course, we don't necessarily need the full 7 bits to represent some text. For example, the String "go go loggers!" only has 8 different characters (including the space and exclamation point), so we could encode these characters using only 3 bits each:

ASCII coding
char ASCII binary
g 103 1100111
o 111 1101111
l 108 1101100
e 101 1100101
r 114 1110010
s 115 1110011
! 33 0100001
space 32 0100000
3-bit coding
char code binary
g 0 000
o 1 001
l 2 010
e 3 011
r 4 100
s 5 101
! 6 110
space 7 111

The string "go go loggers!" would be written (coded numerically) as 103 111 32 103 111 32 108 111 103 103 101 114 115 32. Although not easily readable by humans, this would be written as the following bit string (the spaces would not be written, just the 0's and 1's):

1100111 1101111 0100000 1100111 1101111 0100000 1101100 1101111 1100111 1100111 1100101 1110010 1110011 0100001

This is how the computer sees the String! Note that this takes 14*7=98 bits (98 0s or 1s).

If we use the 3-bit encodings, we can write the same String (coded numerically) as 0 1 7 0 1 7 2 1 0 0 3 4 5 6, or as a bit string:

000 001 111 000 001 111 010 001 000 000 011 100 101 110

Again, the spaces wouldn't be included in the bit String. This only takes 14*3 = 42 bits (42 0s and 1s), which is less than half the size!

More space can be saved if we a variable number of bits. For example, if we only used 2 bits to encode the g, o, and space (which are used frequently), but used 4 bits to encode the l, e, r, s, and !, our bit string would only have (9*2)+(5*4)=38 bits. This is the essence of Huffman Encoding--an algorithm for determining an "optimal" number of bits based on the frequency of characters used.

Huffman Trees

A Huffman Tree is a binary tree (really a binary trie) that encodes bit strings for characters in the "path" from the root to that character. If you trace the branch from the root to the character, writing down a 0 whenever you go left and a 1 whenever you go right, you'll have written down the bit string for that character!

Huffman Trees are then constructed so that more frequent elements are at a higher level of the tree, so have a shorter path from the root. This means that they have shorter bit strings (in terms of the number of 0s and 1s), so take less bits to encode.

For example, we might produce the following Huffman tree that provides the following encoding (for the letters in "go go loggers!"):

Huffman coding
char bit string
g 11
o 00
l 0110
e 0111
r 010
s 1010
! 1011
space 100

You can decode a particular bit string using a Huffman Tree by starting at the root of the tree, and then traveling down the branches by following the bits in the bit string. Every time you come to a leaf (to a character), write down that character and go back to the root--and keep going through the bit string. This lets you convert bit strings back into letters!

For practice, try decoding the following bit string using the Huffman table above:

010001010011110011001111011001111010

Pro Tip: this table and bit string can provide good "known" values for you to test your code with!

Again, more details and examples of Huffman Trees can be found in the textbook, as well as at http://www.cs.duke.edu/csed/poop/huff/info/.

Assignment Details

Your assignment is to implement a HuffmanTree and complete the provided codecs, thereby allowing you to encode and decode both Strings and text files.

This assignment can be seen as being made up of three parts: implementing the HuffmanTree, completing the HuffmanStringCodec, and completing the HuffmanFileCodec. You will need to implement the HuffmanTree first; however, you may wish to work on the HuffmanStringCodec simultanously, in order to help test your tree's functionality.

First, have you carefully read and followed the above discussion of bit strings? You should have!

HuffmanTree

The main class you will be creating will be the HuffmanTree class. This class should be an ADT representing a Huffman Tree.

Once you have the inner Node class, instance variables, and constructors put together, you can start implementing the methods required by the AbstractHuffmanTree interface. Details on the HuffmanTree specific methods are below:

Be sure and test and comment your class thoroughly. Once this is tested and working, you can put it aside and just finish up the codecs!

HuffmanStringCodec

The HuffmanStringCodec and the HuffmanFileCodec have 3 primary methods: createTree() which sets up a HuffmanTree, encode() which encodes a message using the HuffmanTree, and decode() which takes an encoded message and converts it back to normal.

HuffmanFileCodec

For the HuffmanFileCodec, the createTree() methods (there are two versions!) are provided for you, in order to help handle some of the file processing.

And that should do it! Make sure all your methods and classes are fully functionally and documented. If there are any problems, let me know.

Timeline

This is a long, involved assignment. There are a numerous pieces, but you should be able to get them all done in the time allotted if you work diligently. An (admittedly) aggressive timeline is as follows:

Extensions

There are a couple of places you could improve or extend this assignment. These alterations can be worth a few points of extra credit.

Submitting

BEFORE YOU SUBMIT: make sure your code is fully documented and functional! If your code doesn't run, I can't give you credit! Your name should be in the class comment at the top of each class.

Submit all your classes to the Hwk6 submission folder on vhedwig. Make sure you upload your work to the correct folder!. You should upload your entire src folder with all your classes. Be sure to complete the provided README.txt file with details about your program.

The homework is due at midnight on Sun, Nov 16. Note that we also have the second exam the following Monday; getting the homework done earlier will give you more chance to study!

Grading

This assignment will be graded out of 38 points.