CS 161 Lab I - Word Frequency Counting
Due Fri Nov 08 at 9:00am
Overview
In this lab, you will practice searching and counting a large amount of data--specifically, all of the words in a full-length novel! You will create a program that "reads" through the novel, and then prints out details about the frequency of words used in the novel. Your output might look something like:
    Total words: 121875
    Unique words: 6569
    Most frequent word: the (4331)
    Name occurrences:
    Elizabeth (635)
    Jane (292)
    Darcy (417)
    Bingley (306)
    Fredrick (0)
    Words that occur more than 500 times:
    AND (3578)
    By (636)
    It (1535)
    is (857)
    a (1947)
    ...
These are the results for the provided full text of Jane Austen's Pride and Prejudice; as an extension, you may also run your analysis on other books.
This lab will be completed in pairs. Be sure to review the pair programming guidelines. You should also try to work with a different partner than you have before!
Objectives
- To practice creating and using classes to store data
- To practice working with ArrayLists
- To practice writing searches
Necessary Files
You will need to download and extract the BlueJ project from the LabI.zip file. This project will supply you with the following files:
- A WordScanner class that you can use to read each word of a book file. This class works similarly to the java.util.Scanner we've used before, but it can read words regardless of punctuation.
  - The class's constructor takes a String containing the name of the file you wish to scan, for example "pride.txt".
  - The class has hasNextWord() and nextWord() methods, which work similarly to the methods of java.util.Scanner (and follow the same pattern as the keyboard input from the GuitarPlayer!). You can use these methods and a loop to read through the file one word at a time.
- The file 'pride.txt', which contains the full text of Pride and Prejudice. Note that there is also a 'license.txt' file that gives information on the license that was granted to use the text of the book.
- Finally, there is a file 'correct.txt', which contains the output from my version of the program. You can use this to check your work.
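The hasNextWord()/nextWord() reading pattern can be sketched as below. Because WordScanner ships only with the lab project, this sketch substitutes java.util.Scanner reading from a String so that it runs on its own. The class name ReadLoopSketch and the method countWords are illustrative assumptions, not part of the lab files.

```java
import java.util.Scanner;

// Hypothetical stand-in: java.util.Scanner replaces the lab's WordScanner here.
class ReadLoopSketch {
    // Counts how many whitespace-separated tokens the text contains.
    static int countWords(String text) {
        Scanner scanner = new Scanner(text);
        int count = 0;
        while (scanner.hasNext()) {        // with WordScanner: hasNextWord()
            String word = scanner.next();  // with WordScanner: nextWord()
            if (!word.isEmpty()) {         // this is where you would hand the word to your list
                count++;
            }
        }
        scanner.close();
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("It is a truth universally acknowledged"));
    }
}
```

With the real WordScanner you would pass a file name such as "pride.txt" to the constructor instead of the text itself.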
Details
In this lab, you will need to make a couple of new classes, but each one is fairly simple.
- You will need somewhere to store the information you're gathering--namely, the counts of all the words in the novel. Since we'll have a lot of words, an ArrayList would be a good choice (think: why not an array?).
  However, you'll need to store both the words and the count of each word--what words there are and how many of each. You could use two ArrayLists, but that is a bad programming practice called parallel arrays (see the note on page 318 of the textbook).
  Instead, you'll want to create a separate class that models an object representing both the word and its number of appearances, and then make an ArrayList of those objects.
- Make a class called WordCount that tracks both a word and a count of how many times that word has appeared. Each instance of this class will represent a word and a number of appearances.
  - Think about what instance variables this class will need (hint: how many things are you keeping track of?).
  - What methods (if any) should this class have?
  - Normally you make instance variables private, but since this class is acting like a wrapper class, you might consider making them public. Public instance variables can be accessed directly using dot notation. It is also perfectly acceptable to just make getters and setters instead.
  - Defining a toString() method will be helpful as well for printing out the word and its count.
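One possible shape for such a wrapper class is sketched below. The field names word and count, the choice to start the count at 1, and the "word (count)" output format are all assumptions made for illustration; your design may reasonably differ (for example, private fields with getters and setters).

```java
// A minimal sketch of a word/count pair; names and format are assumptions.
class WordCount {
    public String word;  // the word being tracked
    public int count;    // how many times it has been seen so far

    public WordCount(String word) {
        this.word = word;
        this.count = 1;  // a word is first recorded when it has been seen once
    }

    @Override
    public String toString() {
        return word + " (" + count + ")";
    }
}
```

Because the fields are public, a caller can write entry.count++ directly, which is the dot-notation access mentioned above.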
- Now that you have a data type that can store your information, you'll want to make a class to organize and access that data. Make a new WordFrequencyList class that represents a list of words and their frequencies.
  - You can think of this class as being similar to the CardGame or GuitarString; it will have an ArrayList of items as an instance variable. Think: what type of object will you be putting in this list? How does this influence the way you declare the variable?
  - You will need to make an add() method that lets you add a word to your internal ArrayList. You will need to search the list to see if it already contains an "entry" for that word. If the word is already in the list, simply increase its current count by 1. If the word isn't already in the list, you should add it (with a count of 1).
    - "The" and "the" are the same word. What will you have to do to make sure that they are not counted separately?
  - You should also make a get() method that returns the WordCount at a particular index. This will let you "access" the WordFrequencyList so you can do things like print out items from it.
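The search-then-update logic of add() might look like the sketch below. It is not the reference solution: the class name FrequencyListSketch and the field name counts are hypothetical, and the small WordCount is nested inside only so the example fits in one file. It compares words with String.equalsIgnoreCase, so the spelling stored is whichever form appears first; converting every word with toLowerCase() before searching would be another reasonable choice.

```java
import java.util.ArrayList;

class FrequencyListSketch {
    // Nested only to keep the example in one file; stands in for your WordCount.
    static class WordCount {
        public String word;
        public int count;
        WordCount(String word) { this.word = word; this.count = 1; }
    }

    private ArrayList<WordCount> counts = new ArrayList<WordCount>();

    // Linear search: bump the count if the word is present, otherwise append it.
    public void add(String word) {
        for (WordCount entry : counts) {
            if (entry.word.equalsIgnoreCase(word)) { // "The" matches "the"
                entry.count++;
                return;
            }
        }
        counts.add(new WordCount(word)); // first appearance, count starts at 1
    }

    // Returns the entry stored at the given position in the list.
    public WordCount get(int index) {
        return counts.get(index);
    }
}
```

Each call to add() scans the list from the front, which is why reading a whole novel can take a few seconds.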
- At this point it is a good idea to make your "main" class (e.g., WordTester) that you will use to actually read in the book. Note that you will be creating and looping through the WordScanner in this main class.
  - Try printing out the count of a single word (e.g., "the") as you read each word from the file, to make sure that your counts are increasing appropriately.
  - Note that you shouldn't have any input or output in the WordFrequencyList class, just in the main class!
- Once your data set is all set up, it is time to do something useful with it. You will need to create a number of useful functions (methods). These methods are detailed in the Javadoc excerpt at the bottom of the page. Your functions should be able to answer the following questions:
  - What is the most frequent word in the book?
  - How often do the following names occur in the book: Elizabeth, Jane, Darcy, Bingley, and Fredrick?
  - What words occur more than 500 times?
  - How many total words are in the book?
  - How many different words are in the book?
  Your tester should call each of these functions in turn, printing out the results.
  - Note: this is a large enough data set that your program may take several seconds to run. If it is taking tens of seconds, you probably have a bug (or a rather slow computer).
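Most of these questions come down to three loop patterns: a maximum search, a running sum, and a filter. The sketch below shows each pattern as a static method over a plain ArrayList. This is a simplification for illustration: in the lab these would be instance methods of WordFrequencyList, and occursMoreThan() is documented as returning a WordFrequencyList rather than an ArrayList. The class name QuerySketch and the nested WordCount are assumptions.

```java
import java.util.ArrayList;

class QuerySketch {
    // Nested only to keep the example in one file; stands in for your WordCount.
    static class WordCount {
        String word;
        int count;
        WordCount(String word, int count) { this.word = word; this.count = count; }
    }

    // Maximum search: remember the best entry seen so far.
    static WordCount mostFrequent(ArrayList<WordCount> list) {
        WordCount best = null;
        for (WordCount entry : list) {
            if (best == null || entry.count > best.count) {
                best = entry;
            }
        }
        return best; // null when the list is empty
    }

    // Running sum: adding up every count gives the total number of words read.
    static int totalWords(ArrayList<WordCount> list) {
        int total = 0;
        for (WordCount entry : list) {
            total += entry.count;
        }
        return total;
    }

    // Filter: collect every entry whose count is strictly greater than n.
    static ArrayList<WordCount> occursMoreThan(ArrayList<WordCount> list, int n) {
        ArrayList<WordCount> result = new ArrayList<WordCount>();
        for (WordCount entry : list) {
            if (entry.count > n) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

The number of different words needs no loop at all, since it is simply the size of the list.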
- As always, be sure to test your code to make sure your answers make sense! Note that you can use the included correct.txt file to check your answers.
Extensions
- Could you use one of your search methods to find the top 10 most used words? You might try calling occursMoreThan() multiple times... would this be an efficient implementation, or might there be a better way?
- You can also try running these searches (or related ones) on other books. You can find lots of free books in .txt format at Project Gutenberg. Be sure to create a separate tester for other books, so that I can easily check your base code!
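For the top-10 question, one alternative to calling occursMoreThan() repeatedly is to sort a copy of the list by count and keep the first few entries. The sketch below is only one possible approach, using Collections.sort with a Comparator from the standard library; the class name TopWordsSketch, the method top, and the nested WordCount are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;

class TopWordsSketch {
    // Nested only to keep the example in one file; stands in for your WordCount.
    static class WordCount {
        String word;
        int count;
        WordCount(String word, int count) { this.word = word; this.count = count; }
    }

    // Returns the k entries with the highest counts, most frequent first.
    static ArrayList<WordCount> top(ArrayList<WordCount> list, int k) {
        // Sort a copy so the caller's list keeps its original order.
        ArrayList<WordCount> sorted = new ArrayList<WordCount>(list);
        Collections.sort(sorted, new Comparator<WordCount>() {
            public int compare(WordCount a, WordCount b) {
                return Integer.compare(b.count, a.count); // descending by count
            }
        });
        int limit = Math.min(k, sorted.size()); // the list may hold fewer than k words
        return new ArrayList<WordCount>(sorted.subList(0, limit));
    }
}
```

Sorting does the work in a single pass over the data, whereas guessing thresholds for occursMoreThan() may rescan the whole list many times.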
Submitting
Once you are sure that your program works (remember to test your code), make sure that both your and your partner's names are in the class comment at the top of all the classes you created. Then upload the entire project directory to the LabI submission folder on hedwig. Only one partner needs to upload the code. Make sure you upload your work to the correct folder! The lab is due at the start of class the morning after lab.
After you have submitted your solution, log onto Moodle and submit the Lab I Partner Evaluation. Both partners need to submit evaluations!
Grading
This assignment will be graded on approximately the following criteria:
- You have created a WordCount class to hold words and counts [10%]
- You have created a WordFrequencyList class to organize the list [5%]
- WordFrequencyList has an add() method that adds another count of a word [10%]
- You have implemented the get() and wordCount() methods [10%]
- You have implemented the mostFrequent() method [10%]
- You have implemented the occursMoreThan() method [10%]
- You have implemented the totalWords() method [10%]
- You have implemented the numberOfWords() method [10%]
- You have a tester that prints the results to the 5 listed questions [10%]
- You have used good programming style and documentation [10%]
- You completed your lab partner evaluation [5%]
Documentation
Below is an example Javadoc for your WordFrequencyList class.
Class WordFrequencyList
public class WordFrequencyList

Constructor Summary
WordFrequencyList()
- Constructs a new WordFrequencyList object, with an empty list of WordCounts

Method Summary
Return type | Method | Description
---|---|---
void | add(String word) | Adds the specified word into the list.
WordCount | get(int index) | Returns WordCount at the specified position in the list.
WordCount | mostFrequent() | Returns the most frequent word in the list.
int | numberOfWords() | Returns the number of unique words in the list.
WordFrequencyList | occursMoreThan(int n) | Returns a list of all words that occur more than N times.
int | totalWords() | Returns the total number of words in the list.
int | wordCount(String word) | Returns the number of times the specified word occurs

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail
WordFrequencyList
public WordFrequencyList()
- Constructs a new WordFrequencyList object, with an empty list of WordCounts

Method Detail
add
public void add(String word)
- Adds the specified word into the list.
- Parameters:
  - word - The word to be added.

get
public WordCount get(int index)
- Returns WordCount at the specified position in the list.
- Parameters:
  - index - the index of the element to return.
- Returns:
  - the Word at the specified position.

mostFrequent
public WordCount mostFrequent()
- Returns the most frequent word in the list.
- Returns:
  - the most frequent word in the list

numberOfWords
public int numberOfWords()
- Returns the number of unique words in the list.
- Returns:
  - number of unique words in the list

occursMoreThan
public WordFrequencyList occursMoreThan(int n)
- Returns a list of all words that occur more than N times.
- Parameters:
  - n - Threshold; all words that occur more than n times will be returned.
- Returns:
  - a list of all words that occur more than N times.

totalWords
public int totalWords()
- Returns the total number of words in the list. This is the sum of all word counts.
- Returns:
  - total number of words in the list

wordCount
public int wordCount(String word)
- Returns the number of times the specified word occurs
- Parameters:
  - word - the word whose count is desired
- Returns:
  - the number of times the specified word occurs