CS 261 Homework 5 - Random Sentence Generator (RSG)

Due Fri Apr 05 at 11:59pm extended to Tues Apr 09 at 11:59pm

Overview

For this assignment, you will create a program that reads a user-specified grammar and generates a user-specified number of strings ("sentences", but they may be longer than a single sentence!) from that grammar. You will also be creating at least one grammar of your own. This grammar can be for anything: extension requests, Star Trek episode summaries, haikus, political soundbites--be creative!

Beyond simply being entertaining, this program can be very helpful for generating data sets to test programs with. We only need to create a grammar to descibe the test data, and then can produce as much of it as we want: as a simple example, imagine using this program to create random movies to help test your Movie Library!

Objectives

To work with formal grammars
Beocome grand master file parser!
To utilize some of the data structures we've talked about in class
To continue using recursion in developing systems
To get even more practic working with classes, generics, and interfaces

Necessary Files

You will want a copy of the Hwk5.zip file which contains the README for this assignment. It also has an example grammar file (the cat-&-dog example from class) in the format we will be using. This can be helpful for testing your code.

A Few Notes on Grammars

A grammar is a collection of substitution rules that describe a set of strings or sentences. Each sentence is a sequence of symbols called terminals. Different kinds of sentence fragments are represented by nonterminal symbols called variables. A set of rules for each variable specify how it can be replaced by a sequence of variables and terminals. One of the variables is designated as the start variable, which means that it represents an entire sentence.

An example grammar (in BNF format) is below, to supplement the ones described in class. The variables are A and B, while the terminals are 0, 1, and #. The start variable is A.

<A> ::= 0<A>1<A> | <B>
<B> ::= #

This grammar says that the variable A can be replaced either with the sequence 0A1A or the variable B, while the variable B can only be replaced with #.

From a conceptual point of view, a grammar can be used to generate strings of terminals in the following manner. (Note that this isn't precisely how your program will work, but will have an equivalent effect)

Begin with the start variable.
So long as there are still variables that have not been substituted, take a variable and pick a rule for that variable. Replace the variable with the right-hand side of the rule that you chose.

For example, the string 00#1#1# can be generated by the grammar above. The following derivation---which begins with the start variable, with one substitution made to the leftmost variable at each step (appropriately called a leftmost derivation)---proves that it can be done; the new part of each step is shown underlined.

A => 0A1A => 00A1A1A => 00B1A1A => 00#1A1A => 00#1B1A => 00#1#1A => 00#1#1B => 00#1#1#

Since 00#1#1# can be generated by the grammar, we would say that the string 00#1#1# is in the language of the grammar. In other words, the language of a grammar is the set of all strings that can be generated from it. (Many grammars, including this one, have an infinite number of strings in their languages. This grammar generates an infinite number of strings since the first rule can be used an arbitrary number of times in a derivation.)

In your random sentence generator, a grammar will describe a set of "sentences" (which may indeed be infinite). Each sentence you generate will be a sequence of characters, with the characters being the terminals in the grammar. The variables in the grammar will describe sentence fragments, with the start variable describing an entire sentence.

Details

There are a number of pieces ot this program. Rather than walking you through the classes and methods to make, the following sections will give details about the individual components of your program. It is up to you to figure out the design of the code!

READ THESE INSTRUCTIONS CAREFULLY! Specifications for the project may be scattered throughout the description. And if you have any questions, please ask!

Input and Output

Your program should take a grammar file as input and write a specified number of randomly-generated "sentences", separated by newlines, into an output file. The name of the input file, the number of sentences, the name of the start variable, and the name of the output file should be able to be passed as command-line arguments to the program. This means that the arguments will be help in the args String[] that is the main method's parameter.
You can tell Eclipse what command-line arguments to use by clicking the drop-down arrow next to the green Run button, selecting "Run Configurations", and then selecting the "Arguments" tab for the configuration of the class you want to run, and then entering the arguments list into the "Program Arguments box":
Remember that your Tester should still have as little logic as possible--that means you should pass these arguments into the constructor or methods of your program for processing.
If the wrong number of arguments are specified, or the arguments are otherwise invalid, your program should print an error message and terminate. Why not throw an exception?

Reading Grammar Files

Your program will need to read grammars with rules in the format described below. The Hwk5.zip file also has an example (cat_and_dog.txt)
1. Each set of rules starts with a left curly brace ("{") on its own line and ends with a right curly brace ("}") on its own line.
2. After the opening brace, the first line of the rule is its variable (what goes on the left-hand side). Its name is delimited by angle brackets ("<" and ">"), which you should remove before storing the variable's (the substring() method of the String class might be handy for this).
3. Subsequent lines of the rule set are alternative right-hand sides for rules associated with that variable (i.e., different ways of rewriting the variable). Each rule consists of a sequence of terminal and non-terminal symbols, with non-terminals' names enclosed in angle brackets. Note that because any character can be part of a symbol, there is no "delimiter" between symbols. Variables can be identified by their brackets, but otherwise characters can be treated as part of a nonterminal symbol. In effect, white space is preserved.
4. Any symbol, including angle brackets, can be included in terminal symbols if they are escaped (e.g., "\<").
5. There may be lines of text outside the curly braces. These lines are intended to be comments in the grammar and should be ignored by your program.
You may assume that the grammar files will always be correctly formatted; you do not have to anticipate or correct any errors in the grammar file. At the same time, do not make assumptions about the format of the grammar file BEYOND this specification--how many variables or rules there are, what the name of the start variable is, etc. In general, any grammar file that meets the requirements above should be parsed successfully by your program.
You should be feeling pretty good at using the scanner to parse files by now. Don't be fooled though--parsing this kind of data can be tricky! Just think about reading the file line by line, and then parsing each line character by character (HINT: using a loop and the charAt() method may be easier than trying to get a scanner to read the line).
The first thing you should do is make sure you can read and properly parse the grammar file. Try printing out the appropriate "type" of each token that you are reading. For example, with the partial grammar file
```
{
<START>
<SUBJECT> <PREDICATE>
}
{
<NOUN>
cat
dog
}
```
You might print out:
```
Rule for --START--
  RIGHTSIDE
    variable --SUBJECT--
    terminal -- --
    variable --PREDICATE--
Rule for --NOUN--
  RIGHTSIDE
    terminal --cat--
  RIGHTSIDE
    terminal --dog--
```
You don't have to match the precise appearance of this example, particularly not the indentation (though you may do it that way if you like). You just want to identify each token and what its function is; the dashes are there to confirm the presence of whitespace.

Ultimately, you'll remove or comment out this code, since this output isn't part of the program specification. However, your program should include the commented print-lines as demonstration of your testing! Don't be afraid to write code that is useful for development, even if it won't be in your final product.

Data Structures for the Grammar

From the definition of a grammar, we can draw the following specifications:
1. A grammar contains a collection of rules.
2. A rule is defined by a variable and a "right-hand side"
  - Each variable can have multiple rules represented by multiple "right-hand sides"
3. A "right-hand side" is a sequence of terminals and variables.
4. Each variable has a set of associated rules.
This should clearly suggest some dieas of how to design the classes and objects that will be used to represent a grammar. I expect you may likely want to use classes similar to the following:
- Terminal A class representing the String that makes up a terminal. This may be the simplest class.
- Variable A class representing the Variable that may get replaced. This may also be a simple class, but representing it as an object may help turn it into a usable ADT.
- Rule A class representing a single replacement rule (a Variable and its right-hand side). The right-hand side is a sequence of terminals, and so should be represented by an appropriate data structure. Which of the data structures we've talked about in class would be most appropriate? A stack? A queue? A simple list? Another class?
  - Since you will need to have a sequence that contains both Terminals and Variables, it makes sense to create an abstraction---either an abstract class or an interface---that can encompase both these types. For example, you might have a Symbol parent class.
  - Note that you may want to want to have a data structure (or even another class) to group together the rules for a single Variable so that you can pick from them easily.
- RightSide As mentioned above, having a class to represent a single RightSide will likely make your life easier; a Rule then represents a Variable and a list of its RightSides
- Grammar A class representing the set of rules and variables that make up the grammar, as well as the start variable. You will often need to search through this collection to look up rules, and so should keep that in mind when choosing an appropriate data structure to use.
  - A map is actually a highly appropriate structure for this (can you see why?), and you are welcome to use it if you wish ~~but for the purposes of this assignment you should use one of the data structure that we've already talked about~~.
- You will also probably want a class to represent a RandomSentence---what you are trying to produce!
Take some time to think carefully about how to structure and connect these classes--using an object-oriented approach like the one described above will lead to a clean, easy recursive algorithm for generating sentences (similar to how you generated expressions in the Random Art lab).
You will of course want to modify your parsing code to create appropriate instances of these objects. How can you test that you are creating the right objects with the right attributes?

Generating Random Sentences

The hard part is getting your grammar structured correctly. Once you have that, it's relatively straightfoward to implement a recursive algorithm to generate random sentences from the grammar. This algorithm will work similar to the one in Random Art lab, relying on the idea of generating sentence fragments and then putting those fragments back together.
Almost all of your classes will need to have some method that lets them be "evaluated" or to "generate" a String based on a Grammar object. To ensure that everyone has this functionality, you should create an interface for them to implement that has such a method. That way each class will have the functionality, but the implementation will obviously differ.
- Note that this method will need to have a reference to a Grammar object (likely passed as a parameter) because many of the objects need to look up information in the Grammar in order to do their work.
Below is a sketch of what your algorithm might look like:
The Grammar calls the method on its Start Variable (that you may recall represents a whole sentence), and returns the String that is produced from that. The Variable looks itself up in the Grammar and chooses a random Rule to use. It calls the method on this Rule, and returns the result as the String that it produced. The Rule goes through each Symbol in its right-hand side, and calls the method on those. It then returns the String that is the combination of the results of each of those Symbols If the method is called on a Terminal, the Terminal just returns its String. This is effectively the "stopping condition" of the recursion.
If you've organized your classes right, then you will be able to recurse through the data structures and produce output using VERY little code (these methods should be quite simple)--just like in the Random Art lab.
Also just like with the Random Art lab, depending on your grammar, there may be the possibility for infinite recursion. You may want to track your recursive depth and make sure you don't go too deep.
Make sure to test your program extensively. You may want to start by always picking the "first" option in a rule to make sure that your recursion works, and then introducing randomness later.

Creating the Output

After you've generated your sentences, you will need to write them out to file. This is actually pretty straight forward: once you've created a File object representing the file you want to write to, you can make a PrintWriter object that gives you a PrintStream to write to. PrintWriter has a println() method that works exactly like System.out.println() (in fact, it is the same method--System.out is aso a PrintStream!)
For a more detailed example, check out (once again) the Movie class from the first homework, and how that wrote movies to file.
You are welcome to also print the output to the console for easy testing :)
Remember, you need to write to the file specified in the command-line arguments, not using a JFileChooser!

Submitting

BEFORE YOU SUBMIT: make sure your code is fully documented and functional! If your code doesn't run, I can't give you credit!

Also make sure you that have filled out the README.txt! There are actually some questions of note here that you will need to answer.

Submit your program to the Hwk5 submission folder on hedwig, following the instructions detailed in Lab A. You should upload all you classes, as well as the README.txt.

You also must include a brand new custom grammar file of your own design, following the format described above. Don't just replace words in the cat_and_dog example--try to come up with something fun and creative :) Bonus points are available for awesome grammars!
Make sure to include a BNF-style reporting of your grammar at the top of the file, as a comment

The homework is due at midnight on Fri Apr 05.

Extensions

Add in the ability to handle special escaped characters, such as new lines.
Add appropriate methods (e.g., toString()) that allows you to print out the structure of your grammar. This can either look like your testing output, or use a different tree structure
Add a second parsing method that will properly read and parse grammars in BNF format. Be sure to include a BNF grammar that demonstrates your work!
Add in the ability to print a parse tree for the inputted grammar.

Grading

This assignment will be graded based on approximately the following criteria:

Your program can properly parse a grammar file (including escaped character!) (testing prints should be commented out) [15%]
Your program uses good OOP design and appropriate data structures in implementing the Grammar and its constituent classes [25%]
Your program can generate random sentences from a grammar [20%]
Your program writes the random sentences to a file [5%]
Your program uses command-line arguments to specify its execution [5%]
Your submission includes a new custom grammar file [5%]
Your program demonstrates good Java style. Coupling & cohesion are particularly relevant [10%]
Program is fully documented with complete Javadoc comments [10%]
The README is completed [5%]