CS 261 Homework 7 - Heroes and Hashes

Due Wed Dec 10 at 11:59pm

Overview

For this assignment you will practice implementing a HashMap, programmatically testing the efficiencies of different load factors and hashing algorithms.

In particular, you will be implementing an ADT that represent a HashSet--a Set (an unordered list of unique items) built on top of a HashMap. Note that a HashSet is basically a HashMap but only deals with single values instead of key-value pairs, making the implementation a bit more straightforward. Specifically, you will be creating a HashSet to store a "roster" of Super Heroes, testing the efficiency of your program using a large database of superheroes! You will run a series of "benchmark" tests on your implementation to measure the impact of different load factors, comparing these values with the theoretical ones discussed in class and in the textbook.

This assignment should be completed individually. You are welcome to ask for help from me, or from your classmates (but remember to follow the Gilligan's Island rule)! Note that much of the code for this assignment can be adapted from chapter 7 of the textbook or from class lecture code, so it should be straightforward to write (and doable in the relatively short time provided).

Objectives

To practice implementing a HashMap
To practice with open addressing and linear probing
To practice testing the correctness of an implemented ADT
To practice measuring the efficiency of a data structure's implementation

Necessary Files

You will need a copy of the Hwk7.zip file. This file contains two interfaces that your classes will need to implement:

HeroSet This public interface gives the main Set functions that we care about for this ADT (basically everything but the iterator). Your HeroRoster class will need to implement this interface.
HeroHasher A interface for an object that has one purpose: to calculate hash codes for hero names. By creating different implementions of this interface, it becomes easy to swap out the hashing algorithm used by different ports without needing to modify the ADT's source code! This is an example of what is called the strategy design pattern (a concept you'll learn about more in CS240).

The zip file also contains a large sample data file you can use: marvelheroes.txt is a list of Marvel super heroes and their appearances. Each line in the file gives the name of the comic character, followed by a semicolon (;), followed by a code indicating a comic the hero appeared in. This file contains about 6500 characters with almost 100,000 appearances between them (it's a big file).

This file is adapted (cleaned up) from data found here; a more complete data file can also be found here.

Lastly, the zip contains the README for the assignment, which you will need to fill out. You will need to include detailed results of your benchmark tests, so don't forget that!

Assignment Details

You will be creating a couple of new classes for this assignment: the HeroRoster ADT, a CorrectnessTester testing class, and a RosterBenchmarker testing class.

HeroRoster

Your primary class will be your HeroRoster: an ADT representing a set of heroes. This class will be implemented directly as a HashSet using linear probing (the same way that the DecomposableShape was a LinkedList, and the HuffmanTree was a BinaryTree). You'll need to develop this class from scratch--except it won't really be from scratch; the textbook has lots of example code (see Chapter 7) that you can adapt, and we will be going over much of the implementation in class (again, which you can adapt). You are welcome to use this code in your program---just be sure and to credit the source of your code, and to add copious comments to demonstrate you understand how it works!

Your HeroRoster will need to implement the provided HeroSet interface. However, since constructors can't be specified in an interface, and an abstract class wouldn't be appropriate (why not?), the required constructor is detailed below:
```
  /**
   * Creates a new HeroRoster object, utilizing the given hasher
   * @param loadThreshold The maximum load factor for the roster
   * @param hasher A HeroHasher used to generate hashcodes for the heroes.
   */
  public HeroRoster(double loadThreshold, HeroHasher hasher)
  {
    ...
  }
```
- The first parameter is the maximum load factor--once this percentage of slots has been filled, your underlying hashtable should increase in size. 0.75 is generally a good default and useful for testing.
- The second parameter is the HeroHasher, described below:
- As an optional component, you can also have the constructor take in an initial number of slots. However, for this assignment you should run your benchmarks with a starting value of 5 for this assignment--after all, that's a good size for a hero team right?
The second parameter to for your constructor is for a Herohasher. This is a simple object that is able to take in a hero's name and "hash" it into a value that can be used by the roster.
- By specifying an object that does key hashing, you can easily swap out different methods---including custom methods---for calculate hashcodes. Thus rather than calling .hashCode() on the key in order to get a hashcode, you pass that key to the Hasher's hash() method in order to get the hashcode.
- You will need to provide at least two (2) different HeroHasher classes with different hashing algorithms. One of these should be the built-in Java hashCode() method. The other should be a method of your devising.
  - Free free to get creative with different ways to convert a license String into a (positive) integer. You might also look at using pieces similar to other, cryptographic hashing functions
  - You'll need to look for any impacts on your Roster's benchmarks when using different hashing functions!

The other methods your HeroRoster needs to implement are explained in the interface. I've included a few scattered notes below, but at this point you should be capable of writing an ADT to match an interface.

Start with the add() and contains() methods; these method will need to perform linear probing as we discussed in class. You'll need to watch out for items that have previously been removed; if you find a spot that used to have a removed item, you should add the new hero there.
- You might get good milage out of using a private probe() helper method that does the actual linear probing search; this will help avoid code duplication.
- If your hashtable ever hits the specified load threshold when adding, you will need to rehash the table into a bigger array. Go through all the items in the old array, calculate their new hashcode, and then add them to that table (using linear probing).
- Remember to update the size of the hash table after adding a hero!
The remove() method is the trickiest. The main point is that when we remove an item, we need to leave a marker saying that there used to be something there--some String that marks the spot as DELETED. You should use a constant to represent this marker so that you can easily change it.
- Remember to update the size of the hash table after removing a hero!
The clear() method should also reset the size of your hashtable back to its initial value
The getLoadFactor() is the one non-standard method you'll need to include. This should return the current load factor of the table (the percentage of how many "spots" are used).
- Note that deleting an item doesn't change the load factor of the table--that "spot" is still used when probing, it's just open for future additions. You'll also want to keep this in mind when calculating whether you should rehash or not.

As always, if there are any questions please ask (and sooner rather than later)!

CorrectnessTester

As you are implementing a new ADT from scratch, you will also need to test that it works! This class should include a set of unit tests (like we've used before) that demonstrates that your HeroRoster works as expected.

IMPORTANT: each "test" should print out what you're testing (e.g., "testing multiple adds"), what the expected result is (e.g., "expected size: 10") and what the actual result is (e.g., "actual size: 10"). Your final submission should include all of the given tests being run: in other words, I should be able to execute this class's main method and see that everything works.
I recommend initially testing with a small set of known values that you can also calculate by hand: e.g., hero identities for the 8 friends we discussed in class. Check that each method does what you expect!
You can also come up with "known" results by running your tests on the built-in java.util.HashSet class. We know that this hashset works, so your HeroRoster should produce the same result!
Your final CorrectnessTester might also include the following tests (e.g., I might use them for grading...):
1. Add the first 1000 entries from the hero database, checking how many additions are successful and the reported size of the roster
2. And and then remove the first 1000 entries from the hero database, checking the reported size of the roster (it should be empty!)
3. Add the first 1000 entries and then search for the first 1000 entries, checking how many searches are successful (all should be!)
4. Add the first 1000 entries and then search for the next 1000 entries, checking how many searches are successful
5. Add the first 1000 entries, then remove items based on the next 1000 entries, checking how many removals are successful and the reported size of the roster

RosterBenchmarker

Once you have your implemented working (and you're sure that it works, because you tested it!) you will need to run a number of efficiency tests (or "benchmarks") on your data structure, experimenting with different load thresholds and your different HeroHashers.

You should run your benchmarks on the following configurations:
1. A HeroRoster with the Java hasher and a load threshold of 0.25
2. A HeroRoster with the Java hasher and a load threshold of 0.5
3. A HeroRoster with the Java hasher and a load threshold of 0.75
4. A HeroRoster with the Java hasher and a load threshold of 0.9
5. A HeroRoster with the Java hasher and a load threshold of 0.95
6. A HeroRoster with your custom hasher and a load threshold of 0.25
7. A HeroRoster with your custom hasher and a load threshold of 0.75
8. The built-in Java HashSet class with a load threshold of 0.75. Note you will not need to measure average load factor for this configuration.
There are a lot of these configuration: I recommend you make a separate helper method (likely more than 1) to run the actual Benchmarks; then you can simply pass an appropriate object to your benchmark() method.
You will evaluate the performance of your ADT using the following tests. Each test should be treated as "stand alone".
1. Add the entire hero database to the Roster. You should measure:
  - The total wall-clock time it takes to add all the items
  - The average time required for each add() call (divide the total time by the number of method calls)
  - The average load factor of the table (calculate this value after each method call and then divide by the number of method calls)
  - The total memory space used by your hashtable in the end. You can calculate this (in kilobytes) with the following code:
    long memoryUsedInKb = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory())/1024;
    Note that this will tell you the memory used by the entire program (including the file you've loaded in, any other temporary variables, etc), and can be easily influenced by the garbage collector. Nevertheless, it should give you a rough sense of how much space is being used by different configurations.
2. Add the entire hero database, then search for the heroes in all of the entries in the database. You should measure:
  - The average time required for each contains() call
  - The load factor of the table (this will be constant since the table is not changing!)
3. Add 1000 random entries from the database. The easiest way to do this is to load all of the heroes into an ArrayList, and then select random items from that ArrayList. You should use a seed value of 616 for your benchmark, so that you load the same random heroes each time. You should measure:
  - The total wall-clock time it takes to add all the items
  - The average time required for each add() call
  - The average load factor of the table after each add() call
4. After having added 1000 random entries, search for a different 1000 random entries (randomly chosen using the same seeded random number generator you used to add the entries). You should measure:
  - The average time required for each contains() call
5. Run the same benchmark as step 3, but with a seed value of your choosing.
6. Run the same benchmark as step 4, but with a seed value of your choosing.
Yes, there are a lot of tests to run--that's the nature of testing! However, you should be able to write helper methods to let you re-use a lot of the code, making it trivial to run these benchmarks repeatedly on different configurations.
Finally, you will need to write up the results of your benchmarks in a report. You should write this in a program like Word, creating tables of your data (in addition to allowing me to reproduce your results by running your Benchmarker). Additionally, your report should answer the following questions:
1. How does average add() time for your HeroRoster change with load factor? What would account for this difference?
2. How does the space required by your HeroRoster change with load factor? What would account for this difference/
3. Does the average contains() time match the theoretical number of collisions based on load factor? Note that you will need to interpret the rate of change, since you'd be comparing wall-clock time to # of collisions.
4. How does the performance of your implementation compare with that of the built-in Java HashSet?
5. What impact did your custom hashing algorithms have on the results? Did it slow down the add() and contains() methods? By how much? Why do you think this is the case?
Your report should submitted in .pdf format. Note that you can Print or Export to pdf from most word processors (and let me know if you have any problems).

Timeline

There are a lot of tasks to complete in this assignment--particularly in the benchmarking (though most of that is just adjusting variables and calling methods multiple times). But most of the work (implementing the HeroRoster as a HashSet) can be adapted from the textbook and the lecture code. But since we'll be continuing to talk about HashMaps, you may consider completing this assignment in the following order:

Start by reading through the assignment and making sure you understand the task. You should be able to quickly implement your HeroHasher classes; this involved writing hash functions similar to the one(s) we talked about in class. You should be able to do this by Tues 12/02.
You should then write the CorrectnessTester, running it on the java.util.HashMap class to make sure things work. This will basically give you a "tester" so that you can check your code as you write it! Try to do this by Wed or Thurs. Writting the Benchmarker would also be possible.
By this point, we will have discussed linear probing, and so you can begin implementing your HeroRoster class. This is the hardest part. Work on one method at a time; do two methods a day and you should be done by Mon 12/08.
Finally, you can finish up the Benchmarker and run your performance tests. Documenting the results in your report to turn it in by Wed 12/10.

Extensions

As an extension, you can implement a ChainedHeroRoster that uses chaining to resolve collisions, a QuadraticHeroRoster that uses quadratic probing to resolve collisions, or a DoubleHeroRoster that uses double hashing to resolve collisions. You should run your benchmarks on this new configuration and include the results in your report.

You can earn up to TBD points of extra credit implementing and benchmarking these other Rosters.

Submitting

BEFORE YOU SUBMIT: make sure your code is fully documented and functional! If your code doesn't run, I can't give you credit! Your name should be in the class comment at the top of each class.

Submit all your classes to the Hwk7 submission folder on vhedwig. Make sure you upload your work to the correct folder!. You should upload your entire src folder with all your classes.

Additionally, upload your written report in .pdf format. Your report should include tables of data as well as a brief analysis of that data (including answers to the above questions).

Finally, be sure to complete the provided README.txt file with details about your program.

The homework is due at midnight on Wednesday, Dec 10.

Grading

This assignment will be graded out of 37 points.

Functionality (31pt)
[1pt] You have implemented two different HeroHashers
[1pt] One HeroHasher uses an appropriate hashing algorithm of your own devising
[2pt] Your HeroRoster implements the required public interface
[1pt] Your HeroRoster utilizes a given HeroHasher
[1pt] You have implemented the HeroRoster class as a HashSet
[2pt] Your HeroRoster uses linear probing to resolve collisions
[2pt] Your HeroRoster rehashes based on the given load threshold
[1pt] Your HeroRoster appropriately handles deleted items
[1pt] Your getLoadFactor() method returns an appropriate value
[1pt] You have implemented a CorrectnessTester that tests your HeroRoster
[4pt] Your CorrectnessTester includes sufficient unit tests for your HeroRoster
[2pt] Your unit tests describe the test, the expected output, and the actual output
[1pt] Your have implemented a RosterBenchmarker that benchmarks your Roster
[2pt] Your Benchmarker tests the appropriate configurations
[3pt] Your Benchmarker runs the appropriate benchmarks
[1pt] You have written up the results of your benchmarks in a report
[5pt] You have answered the required questions regarding your system
Style & Documentation (6pt)
[2pt] Your code demonstrates good style (descriptive method & variable names, good coupling, etc)
[1pt] All methods include JavaDoc comments with appropriate JavaDoc tags (@param and @return).
[1pt] You have included sufficient inline comments explaining your code (particularly the linear probing!)
[1pt] Your report is submitted in .pdf format
[1pt] The README is completed