CS 161 Lab I - Word Frequency Counting

Due Fri Nov 08 at 9:00am

Overview

In this lab, you will practice searching and counting a large amount of data--specifically, all of the words in a full-length novel! You will create a program that "reads" through the novel, and then prints out details about the frequency of words used in the novel. Your output might look something like:

Total words: 121875
Unique words: 6569
Most frequent word: 
  the (4331)
Name occurrences:
  Elizabeth (635)
  Jane (292)
  Darcy (417)
  Bingley (306)
  Fredrick (0)
Words that occur more than 500 times:
  AND (3578)
  By (636)
  It (1535)
  is (857)
  a (1947)
  ...

These are the results for the provided full text of Jane Austen's Pride and Prejudice. As an extension, you may also run your analysis on other books.

This lab will be completed in pairs. Be sure to review the pair programming guidelines. You should also try to work with a different partner than you have before!

Objectives

Necessary Files

You will need to download and extract the BlueJ project from the LabI.zip file. This project will supply you with the following files:

Details

In this lab, you will need to make a couple of new classes, but each one is fairly simple.

  1. You will need somewhere to store the information you're gathering--namely, the counts of all the words in the novel. Since we'll have a lot of words, an ArrayList would be a good choice (think: why not an array?).

    However, you'll need to store both the words and the count of each word--what words there are and how many of each. You could use two ArrayLists, but that is a bad programming practice called parallel arrays (see the note on page 318 of the textbook).

    Instead, you'll want to create a separate class that can model an object which represents both the word and its number of appearances, and then make an ArrayList of those objects.

  2. Make a class called WordCount that tracks both a word and a count of how many times that word has appeared. Each instance of this class will represent a word and a number of appearances.
    • Think about what instance variables this class will need (hint: how many things are you keeping track of?)
    • What methods (if any) should this class have?
    • Normally you make instance variables private, but since this class is acting like a wrapper class, you might consider making them public. Public instance variables can be accessed directly using dot notation. It is also perfectly acceptable to just make getters and setters instead.
    • Defining a toString() method will be helpful as well for printing out the word and its count.
  3. Now that you have a data type that can store your information, you'll want to make a class to organize and access that data. Make a new WordFrequencyList class that represents a list of words and their frequencies.
    • You can think of this class as being similar to the CardGame or GuitarString; it will have an ArrayList of items as an instance variable. Think: what type of object will you be putting in this list? How does this influence the way you declare the variable?
    • You will need to make an add() method that lets you add a word to your internal ArrayList. You will need to search the list to see if it already contains an "entry" for that word. If the word is already in the list, simply increase its current count by 1. If the word isn't already in the list, you should add it (with a count of 1).
      • "The" and "the" are the same word. What will you have to do to make sure that they are not counted separately?
    • You should also make a get() method that returns the WordCount at a particular index. This will let you "access" the WordFrequencyList so you can do things like print out items from it.
  4. At this point it is a good idea to make your "main" class (e.g., WordTester) that you will use to actually read in the book. Note that you will be creating and looping through the WordScanner in this main class.
    • Try printing out the count of a single word (e.g., "the") as you read each word from the file, to make sure that your counts are increasing appropriately.
    • Note you shouldn't have any input or output in the WordFrequencyList class, just in the main class!
  5. Once your data set is all set up, it is time to do something useful with it. You will need to create a number of useful functions (methods). These methods are detailed in the JavaDoc excerpt at the bottom of the page. Your functions should be able to answer the following questions:
    1. What is the most frequent word in the book?
    2. How often do the following names occur in the book: Elizabeth, Jane, Darcy, Bingley, and Fredrick?
    3. What words occur more than 500 times?
    4. How many total words are in the book?
    5. How many different words are in the book?

    Your tester should call each of these functions in turn, printing out the results.

    • Note: this is a large enough data set that your program may take several seconds to run. If it is taking tens of seconds, you probably have a bug (or a rather slow computer).
  6. As always, be sure to test your code to make sure your answers make sense! Note that you can use the included correct.txt file to check your answers.
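
Steps 1 through 3 above can be sketched roughly as follows. This is only one possible shape, not the required solution. The public fields, the toString() format, and the AddDemo class are assumptions made for illustration. Comparing with equalsIgnoreCase() and keeping the first spelling seen is also an assumption; you may prefer to normalize case some other way.

```java
import java.util.ArrayList;

// One entry: a word plus how many times it has been seen.
class WordCount {
    public String word;  // public fields, which the handout allows for a wrapper-like class
    public int count;

    public WordCount(String word) {
        this.word = word;
        this.count = 1;  // an entry is created the first time its word is seen
    }

    public String toString() {
        return word + " (" + count + ")";
    }
}

// A list of WordCount entries, one per distinct word.
class WordFrequencyList {
    private ArrayList<WordCount> entries = new ArrayList<WordCount>();

    // Increments the matching entry if one exists; otherwise appends a new one.
    public void add(String word) {
        for (WordCount entry : entries) {
            if (entry.word.equalsIgnoreCase(word)) {  // "The" and "the" match
                entry.count++;
                return;
            }
        }
        entries.add(new WordCount(word));
    }

    public WordCount get(int index) {
        return entries.get(index);
    }
}

// Hypothetical demo class; your own main class will read words from the book instead.
class AddDemo {
    public static void main(String[] args) {
        WordFrequencyList list = new WordFrequencyList();
        list.add("The");
        list.add("cat");
        list.add("the");
        System.out.println(list.get(0)); // The (2)
        System.out.println(list.get(1)); // cat (1)
    }
}
```

Note that add() searches the whole list for every word it receives, which is why the full novel can take a few seconds to process.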

Extensions

Submitting

Once you are sure that your program works (remember to test your code), make sure that both your and your partner's names are in the class comment at the top of all the classes you created. Then upload the entire project directory to the LabI submission folder on hedwig. Only one partner needs to upload the code. Make sure you upload your work to the correct folder! The lab is due at the start of class the morning after lab.

After you have submitted your solution, log onto Moodle and submit the Lab I Partner Evaluation. Both partners need to submit evaluations!

Grading

This assignment will be graded on approximately the following criteria:

Documentation

Below is an example Javadoc for your WordFrequencyList class.



Class WordFrequencyList

public class WordFrequencyList


Constructor Summary
WordFrequencyList()
          Constructs a new WordFrequencyList object, with an empty list of WordCounts
 
Method Summary
 void add(String word)
          Adds the specified word into the list.
 WordCount get(int index)
          Returns WordCount at the specified position in the list.
 WordCount mostFrequent()
          Returns the most frequent word in the list.
 int numberOfWords()
          Returns the number of unique words in the list.
 WordFrequencyList occursMoreThan(int n)
          Returns a list of all words that occur more than n times.
 int totalWords()
          Returns the total number of words in the list.
 int wordCount(String word)
          Returns the number of times the specified word occurs
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordFrequencyList

public WordFrequencyList()
Constructs a new WordFrequencyList object, with an empty list of WordCounts

Method Detail

add

public void add(String word)
Adds the specified word into the list.

Parameters:
word - The word to be added.

get

public WordCount get(int index)
Returns WordCount at the specified position in the list.

Parameters:
index - the index of the element to return.
Returns:
the WordCount at the specified position.

mostFrequent

public WordCount mostFrequent()
Returns the most frequent word in the list.

Returns:
the most frequent word in the list
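
A minimal sketch of the linear scan that mostFrequent() could perform is shown here. It is written as a static helper over a plain ArrayList so that it stands alone. The two-argument WordCount constructor, the MostFrequentSketch class, and returning null for an empty list are all assumptions, not part of the specification above.

```java
import java.util.ArrayList;

// Minimal stand-in for the WordCount class described in the lab.
class WordCount {
    public String word;
    public int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }
}

class MostFrequentSketch {
    // Returns the entry with the largest count, or null if the list is empty (an assumption).
    static WordCount mostFrequent(ArrayList<WordCount> entries) {
        WordCount best = null;
        for (WordCount entry : entries) {
            if (best == null || entry.count > best.count) {
                best = entry;  // on ties, the earliest entry is kept
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ArrayList<WordCount> entries = new ArrayList<WordCount>();
        entries.add(new WordCount("a", 1947));
        entries.add(new WordCount("the", 4331));
        entries.add(new WordCount("is", 857));
        WordCount top = mostFrequent(entries);
        System.out.println(top.word + " (" + top.count + ")"); // the (4331)
    }
}
```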

numberOfWords

public int numberOfWords()
Returns the number of unique words in the list.

Returns:
number of unique words in the list

occursMoreThan

public WordFrequencyList occursMoreThan(int n)
Returns a list of all words that occur more than n times.

Parameters:
n - Threshold; all words that occur more than n times will be returned.
Returns:
a list of all words that occur more than n times.
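
One way occursMoreThan() might be written is to build a second WordFrequencyList holding only the qualifying entries. The sketch below is illustrative only. The addEntry() helper and the two-argument WordCount constructor are hypothetical and do not appear in the Javadoc; they exist only so that a count can be copied into the result list.

```java
import java.util.ArrayList;

// Minimal stand-in for the WordCount class described in the lab.
class WordCount {
    public String word;
    public int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }
}

class WordFrequencyList {
    private ArrayList<WordCount> entries = new ArrayList<WordCount>();

    // Hypothetical helper (not in the Javadoc): appends an entry with a known count.
    void addEntry(String word, int count) {
        entries.add(new WordCount(word, count));
    }

    public WordCount get(int index) {
        return entries.get(index);
    }

    public int numberOfWords() {
        return entries.size();
    }

    // Returns a new list containing only the words whose count is strictly greater than n.
    public WordFrequencyList occursMoreThan(int n) {
        WordFrequencyList result = new WordFrequencyList();
        for (WordCount entry : entries) {
            if (entry.count > n) {
                result.addEntry(entry.word, entry.count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        WordFrequencyList names = new WordFrequencyList();
        names.addEntry("Elizabeth", 635);
        names.addEntry("Jane", 292);
        names.addEntry("Darcy", 417);
        WordFrequencyList frequent = names.occursMoreThan(500);
        System.out.println(frequent.numberOfWords()); // 1
        System.out.println(frequent.get(0).word);     // Elizabeth
    }
}
```

Because the comparison is strict, a word that occurs exactly n times is left out of the result.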

totalWords

public int totalWords()
Returns the total number of words in the list. This is the sum of all word counts.

Returns:
total number of words in the list

wordCount

public int wordCount(String word)
Returns the number of times the specified word occurs

Parameters:
word - the word whose count is desired
Returns:
the number of times the specified word occurs
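
The totalWords() and wordCount() methods described above both reduce to a single pass over the entries. The following is a hedged sketch using static helpers over a plain ArrayList. The CountingSketch class and the two-argument WordCount constructor are assumptions, as are the case-insensitive lookup and the return value of 0 for a missing word (the latter is consistent with the "Fredrick (0)" line in the sample output).

```java
import java.util.ArrayList;

// Minimal stand-in for the WordCount class described in the lab.
class WordCount {
    public String word;
    public int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }
}

class CountingSketch {
    // Sum of all counts, i.e. the number of words read in total.
    static int totalWords(ArrayList<WordCount> entries) {
        int total = 0;
        for (WordCount entry : entries) {
            total += entry.count;
        }
        return total;
    }

    // Count for one word, ignoring case; 0 if the word never appeared.
    static int wordCount(ArrayList<WordCount> entries, String word) {
        for (WordCount entry : entries) {
            if (entry.word.equalsIgnoreCase(word)) {
                return entry.count;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        ArrayList<WordCount> entries = new ArrayList<WordCount>();
        entries.add(new WordCount("Elizabeth", 635));
        entries.add(new WordCount("Jane", 292));
        System.out.println(totalWords(entries));            // 927
        System.out.println(wordCount(entries, "jane"));     // 292
        System.out.println(wordCount(entries, "Fredrick")); // 0
    }
}
```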