Ling/CSE 472: Assignment 1: Regular expressions
Due October 9th, by the start of class
1. Elizalike
This part of the assignment asks you to create a program that
behaves like Weizenbaum's ELIZA (see pp. 32-33 of the text). We have
provided a skeleton of a script that handles input and output, and
provides an example of the Perl syntax for using regular expressions
to modify strings.
Each student should develop their own program, although you are
welcome to ask each other questions (in person, over email, or on the
EPost bulletin board). You will need to find a partner for this
project, as one of the tasks is to test each other's programs (see
below).
Specifications: The basic strategy is to read in a string of input
from the user, modifying it successively (sometimes subtly, sometimes
drastically, depending on the input string), and print out the result.
To maintain the illusion of AI, it is crucial that elizalike print out
grammatical strings. (You may assume that it is given grammatical input.)
Furthermore, elizalike should be able to handle person deixis, referring
to itself in the first person and to the user in the second person.
Before you start, look at the list of items to
turn in below, so you know what you'll need to save.
Your tasks:
- Develop a list of sentences that you will use to test your program
to make sure it handles the person deixis correctly. Be sure to
include all ways in which 1st and 2nd person are marked in English
(full pronoun paradigms, and subject-verb agreement with the verb be),
and all possible forms each of those elements can appear in
(including variation in capitalization).
- Modify the Perl script to implement the handling of person deixis.
The basic strategy is to first replace any second person reference in
the input with THIRD person reference to Eliza (or some other string
that's unlikely to show up otherwise). Then replace any first person
reference in the input with second person reference. Finally, replace
third person reference to Eliza with first person reference. Each
of these steps will take several lines as you handle pronouns and verbs
and upper and lower case letters (for example, if the user types "My friend...",
Elizalike's output should be "Your friend..." and not "your friend...").
Be sure to read all of the comments in the file (lines starting with #,
which are for human consumption and ignored by Perl).
You should probably test each line as you add it, by running the
program again and using an appropriate sentence from your test file.
Note that before you make any changes, the program runs, just in a
boring way: It repeats whatever the user types in.
Instructions on using Perl
- Add at least two statements that find one keyword in the input and
change the whole string to something different. (See the second and
third examples on page 33 of the textbook for a model, but don't copy
them exactly!)
- Add at least two statements that find some keyword in the input,
and return a significantly changed output that nonetheless contains some
part of the input that may vary from time to time. (See the first
example on page 33, but feel free to get fancier than that!)
- Find a partner and exchange programs. Looking at the code
for your partner's program, try to find at least 2 interestingly
different inputs that cause their program to produce ungrammatical
output. (Keep your inputs grammatical!) We're pretty sure you'll
be able to find these, but if your partner's program is too perfect,
you can get full credit for this part of the assignment by turning
in an explanation of 5 pitfalls you looked for and how they were
avoided.
- Modify your program to avoid the ungrammatical outputs your
partner found (if any). It is preferable to keep the original
functionality of your program and fix the bugs, but if that's
impossible, you can replace the problematic statement(s) with simpler
ones with different behavior.
- In 2-4 paragraphs, discuss why English morphology and syntax
make this program relatively straightforward, and how it would
be more complicated in some other specific language.
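The three-step swap and the two kinds of keyword statement described in the tasks above can be sketched as follows. This is a minimal illustration under stated assumptions, not the provided skeleton: the sub names swap_possessives and keyword_rules, the placeholder strings, and the keywords are all hypothetical, and only the possessives my/your are handled. The rest of the pronoun paradigm and the forms of "be" are left out on purpose.

```perl
use strict;
use warnings;

# Hypothetical placeholders: any strings unlikely to occur in real input.
my $ELIZA_LC = '__eliza_poss__';
my $ELIZA_UC = '__Eliza_poss__';

# Swap first and second person possessives, preserving capitalization.
sub swap_possessives {
    my ($s) = @_;
    # Step 1: second person -> placeholder (third person stand-in for Eliza).
    $s =~ s/\byour\b/$ELIZA_LC/g;
    $s =~ s/\bYour\b/$ELIZA_UC/g;
    # Step 2: first person -> second person.
    $s =~ s/\bmy\b/your/g;
    $s =~ s/\bMy\b/Your/g;
    # Step 3: placeholder -> first person.
    $s =~ s/\Q$ELIZA_LC\E/my/g;
    $s =~ s/\Q$ELIZA_UC\E/My/g;
    return $s;
}

# Two illustrative keyword statements, applied here to the raw input.
sub keyword_rules {
    my ($s) = @_;
    # A keyword that replaces the whole string with a fixed response.
    $s =~ s/^.*\bnever\b.*$/Can you think of an exception?/i;
    # A keyword whose response reuses a captured part of the input.
    $s =~ s/^.*\bmy (mother|father|sister|brother)\b.*$/Tell me more about your $1./i;
    return $s;
}

print swap_possessives('My friend likes your dog.'), "\n";
# prints: Your friend likes my dog.
print keyword_rules('I argued with my sister yesterday.'), "\n";
# prints: Tell me more about your sister.
```

Without the placeholder in step 1, the second step would turn "my" into "your" and a later "your" to "my" replacement would immediately undo it, which is why the swap goes through an intermediate string.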
To turn in:
- Your list of test sentences.
- The first version of your program (that you gave to your partner).
- The name of your partner and the problems you found with their
program, or an explanation of how they avoided 5 pitfalls you thought up.
- The second version of your program.
- Your discussion of English and other language morphology and
syntax --- see the last task above.
All of the above should be turned in via email to ebender@u and davidgg@u by 2:30pm on Thursday, October 9.
2. Tokenizer
This part of the assignment asks you to write a Perl script that
will take an ordinary text file, and return a file with the same
content, reformatted to be one sentence per line.
Each student should develop their own program, although you are
welcome to ask each other questions (in person, over email, or on the
EPost bulletin board).
Once again, we will supply a skeleton perl script which handles
input and output (this time reading in a file and writing out to a
file). We will also supply a test file that you will use to develop
the script.
The basic algorithm is the following:
- Read the input file in one line at a time, and modify the
lines as follows, before printing them to the output file:
- Remove all existing newlines.
- Replace all periods that do not indicate the end of a sentence
with a special string.
- Do the same for other typical sentence-final punctuation marks
(using a different special string for each one).
- Put in a newline after every remaining sentence-final punctuation
mark.
- Replace the special strings you put in with the punctuation marks
they correspond to.
Specifications: Treat .?!: as sentence-ending punctuation.
Quotation marks after a sentence-final element should be on the same
line as that element. Don't worry about breaking a single quote across different
lines.
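The algorithm and specifications above can be sketched roughly as follows. This is an illustration under assumptions, not the skeleton script: the sub name split_sentences and the marker string are hypothetical, the sub works on a string that has already been read in rather than on a file, and only periods in three abbreviations and in decimal numbers are protected. A real solution needs more cases, including the other punctuation marks.

```perl
use strict;
use warnings;

# Hypothetical marker: any string that cannot occur in the text will do.
my $DOT = '<<DOT>>';

sub split_sentences {
    my ($text) = @_;
    # Remove existing newlines (joining with a space is an assumption).
    $text =~ s/\s*\n\s*/ /g;
    # Protect periods after a few abbreviations (tiny illustrative list).
    $text =~ s/\b(Mr|Mrs|Dr)\./$1$DOT/g;
    # Protect periods between digits, as in 3.50.
    $text =~ s/(?<=\d)\.(?=\d)/$DOT/g;
    # Break after each remaining . ? ! or : keeping a closing quote
    # on the same line as the punctuation mark.
    $text =~ s/([.?!:]+["']?)\s+/$1\n/g;
    # Put the protected periods back.
    $text =~ s/\Q$DOT\E/./g;
    return $text;
}

print split_sentences(qq{Dr. Smith paid 3.50 for tea.\nShe said "Why not?" Then she left.}), "\n";
```

The marker must be chosen so that restoring it in the last step cannot touch anything the input already contained.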
Your tasks:
- Download the skeleton script, the test file, and a tokenized version of the test
file for comparison (the gold standard) to
your local machine.
- Modify the Perl script as specified above so that it produces
an output file that matches the gold standard. You call this script
with an argument designating the input file:
perl tokenizer.pl inputfile
- Find some other text file to run it over, such as an article from
Yahoo! news. Identify at least 3 cases your script doesn't yet handle
properly, 2 where it overgenerates (splitting where it shouldn't) and
1 where it undergenerates (not splitting where it should). (If you
don't find them in the first file you try, run more text
through.)
- Modify your script to handle those 3 cases properly.
- One might prefer to have entire quotes show up on one line in the
output, even if they contain complete sentences within them. Discuss
why this isn't possible given the general algorithm for this script,
or modify the script so that it does keep entire quotes on one line.
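(Hint: To make it work, you'll need to put Perl into paragraph mode
and enable multiline matches. Paragraph mode is set with the following
statement:
$/ = "";
Older versions of Perl enabled multiline matching with
$* = 1;
but that variable was removed in Perl 5.10. In current versions, add the
/m modifier to each regular expression instead (and /s if . should also
match newlines).
There's a bit more to it than that, though.)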
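As a small illustration of the hint, the sketch below reads text in paragraph mode and uses the /m modifier so that ^ matches at the start of every line within a paragraph. It assumes an in-memory filehandle in place of the real input file, and the sub name read_paragraphs is hypothetical. It does not solve the quote-handling task itself.

```perl
use strict;
use warnings;

# Read a string paragraph by paragraph, as <FILE> would in paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    # In-memory filehandle, used here only so the example is self-contained.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "";           # paragraph mode: blank lines separate records
    my @paragraphs = <$fh>;
    chomp @paragraphs;       # in paragraph mode, strips all trailing newlines
    close $fh;
    return @paragraphs;
}

my @paras = read_paragraphs("First line.\nSecond line.\n\n\nNext paragraph.\n");
print scalar(@paras), " paragraphs\n";

# With /m, ^ matches after every embedded newline, not just at the start.
my @starts = $paras[0] =~ /^(\w+)/mg;
print join(', ', @starts), "\n";
```

Using local keeps the change to $/ confined to the sub, so other reads in the same script keep their usual line-by-line behavior.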
To turn in:
- The first version of your script.
- A brief description of the cases you didn't handle properly.
- The second version of your script.
- The discussion requested in the last task above, or a third
version of your script.
All of the above should be turned in via email to ebender@u and davidgg@u by 2:30pm on Thursday, October 9.