LING 571

LING571

Project 1: Hw2 — Hw4

Hw2 due on 10/13/05

1. Project Overview

The task: to implement CYK algorithm using one of the following languages:

Perl, C/C++, or Java.

The project is divided into three parts: (see Sections 4-5 for input/output format)

· Hw2, due 10/13/05: 10 pts: write code that reads grammar, lexicon, input sentence, and stores them in some appropriate data structures. Output a flat tree with the words and their POS tags.

· Hw3, due 10/20/05: 40 pts: add the CYK part, and output a single parse tree with the highest probability.

· Hw4, due 10/27/05: 250 pts: modify the code so that it outputs the N-best parse trees, where N is an input parameter.

As I explained in the class, the reason that Hw4 is worth more points is because I don’t want you to lose too many points if you could not make your code work during the first two weeks of the project.

Once you are done with Hw2, you are encouraged to work on Hw3 and Hw4 as soon as possible, as they take more time to complete. In fact, Hw3 is really a special case of Hw4, so you can work on Hw4 directly after finishing Hw2.

2. Major steps for Hw2:

(1) Copy files to your local directory your_hw2_dir

cp ~fxia/dropbox/571/Hw2/* your_hw2_dir/

(2) Write your own code.

a. The main file should be called Hw2.*. The file extension * is “C”(C++), “c” (C), “java” (java), or “pl” (perl), depending on programming languages that you use.

b. The command to run the code should be

cat input | ./run_hw2 syntactic_rule_file lexicon_file {N_best} > output”

· I have created two files named “grammar” and “lexicon” to show you what syntactic_rule_file and lexicon_file should look like.

· N_best is an optional parameter. If it is absent, the default value is 1, which means the code outputs the parse tree with the highest probability.

· Test your code with the example grammar/lexicon/test_input files. Also test it with your own sentences or grammar files.

(3) Write a README file:

a. Specify the programming language that you use.

b. List the names of all the source code files

c. Give a short description of data structures you used (something similar to the list in Section 7)

d. Write the command used to compile the code.

e.g., make Makefile

e. Write the command used to run the code, which should be the same as specified in (2b).

f. List any bugs/problems/comments that you want us to be aware of.

(4) Check your directory and tar it.

Your directory should contain the following files:

a. Files copied from ~fxia/dropbox/571/Hw2: grammar, lexicon, and test_input.

b. A file called “Readme”

c. A file called “run_hw2” which is a one-line shell script that runs java, a perl code, or a binary C/C++ code.

è this is the file that the grader will run to test your program.

d. Source codes, header file, binary codes, etc. The file contains the main function should be called “Hw2.*” (e.g., Hw2.C, Hw2.H, Hw2.pl, Hw2.java)

Tar the directory. Suppose the directory is called your_hw2_dir

cd to the parent directory of your_hw2.dir

tar –cvf userid_hw2.tar your_hw2_dir

userid is your pongo userid. e.g., my tar file should be called “fxia_hw2.tar”.

(5) Submit your homework using ESubmit.

a. I am not sure whether ESubmit works on linux. If not, you have to sftp the file from pongo to your local windows machine. Suppose the file is still called userid_hw2.tar.

b. Open a web browser, and go to our course website: http://faculty.washington.edu/fxia/courses/LING571-Fall2005.html

c. Click “Homework submission” on the left-hand side. Then Log in using your UW id and password.

d. Choose “Turn in document”, and click Next

e. On the next screen, click “Next” again.

f. Type in “Document Title” (any title will do), and click “Browse” to choose the file (userid_hw2.tar) to upload. Then click the arrow for the “Format” field and choose “Self Extracting Archive”. Then click “Submit Documents”.

g. A window will pop up, and you click “Ok”. Wait patiently as it could take a couple of minutes for the tool to upload your file.

h. Once the uploading is complete, you should see a “Digital Receipt”. Click “click here” to have a copy of the receipt emailed to me.

For more information on using ESubmit, please check the website:

http://catalyst.washington.edu/student/esubmit.html

3. Major steps for Hw3 and Hw4:

The steps are the same as the ones for Hw2, except that the file names will be called “run_hw3/run_hw4”, “Hw3.C/Hw4.C”, etc. I will create the ESubmit turn-in area for “Hw3” and “Hw4” later to reduce the potential confusion.

4. Format of Input files:

The input files for all three homework are the same. The blank lines in the input files should

be ignored.

(1) The input sentence file is a text file in which each line is a sentence. See test_input as an example

(2) The syntactic rule file is of the following format. See “grammar” as an example.

     Y => X1 X2 | prob | head_pos | word1 word2

     prob is the prob of the the rule. head_pos is the position of the head child (the index for children starts with 0). I will explain the term “head child” in class this week.

     word1 and word2 are terminals such as English words. They are the head words for X1 and X2 respectively.

     For instance,
                    S => NP VP | 0.8 | 1 | he bought

     means the probability of the rule “S=>NP(he) VP(bought)” is 0.8, and VP is the head child of S.

For the fields “head_pos”, “word1” and “word2”, we use “-“ to indicate that the values of the fields are unknown.

            For instance,
                 S => NP VP | 0.8 | 1 | - -
             means the probability of the unlexicalized rule S=>NP VP is 0.8, and VP is the head child of S.

(3) The lexicon is of the following format. See “lexicon” as an example

                  word pos_tag1 prob1 pos_tag2 prob2 ...

               For instance,
                  book N 0.01 V 0.05

means P(book|N)=0.01, and P(book|V)=0.05

5. Format of the output files:

· The output is a text file in the following format: See test_output.hw2 etc.

(Note: the sample files are meant to show you the output format; they are not necessarily the correct answers)

Sent $sent_id: parse_tree=$num_of_parse_tree: $sent

#1: $prob $parse_tree_1

#2: $prob $parse_tree_2

….

#n: $prob $parse_tree_n

Output the first line only if there are no parse trees for that sentence.

The format of parse tree: Each parse tree should be on one line.

The parse trees are in Penn Treebank format.

For instance,

NPè Det N

is expressed as

(NP (Det …) (N ….))

Note: there are two brackets (instead of one) for the top-level S.

For instance

SèNP VP on the top level

is expressed as

((S (NP ….) (VP …))),

not as

(S (NP …) (VP …))

For Hw2: one tree per sentence. The tree is a flat one-level tree, as I do not require you to complete the CYK parser. If a word has more than one POS tag, choose the tag with the highest probability. So the parse tree for a sentence such as “the student bought a book” looks like the following:

((S (Det a) (N student) (V bought) (Det a) (N book)))

Here the $prob is defined to be the multiplication of P(w|tag). In this case, it is

P(a|Det) * P(student|N) * P(bought|V) * P(a|Det) * P(book|N).

For Hw3: one tree per sentence. The tree is the parse tree with the highest probability.

For Hw4: n tree per sentence, where n=min(N-best, # of parse trees for the sentence). The trees should be ordered according to the probability in descending order.

6. Grade:

0%: if the homework is not submitted.

25%: if the code does not run on the test_input

25%-50%: if the code runs on test_input, but crash on the real test data.

50-100%: if the code does not crash, but the output is not totally correct.

100%: the output is correct on the real test data.

For instance, if you submit Hw2, but it does not run on the test input, you will get 10*25% = 2.5 pts.

7. Data Structure

The CYK algorithm is very simple conceptually, but you need to think about what data structures you want to use to store various kind of info. Here are some suggestions to store input:

T: a 1-dimension array to store all the terminals (e.g., words in the lexicon)
N: a 1-dimension array to store all the non-terminals, including POS tags. You might want to store POS tags before other non-terminals.
Sent: 1-dimension array to store the input sentence.

Lexicon: a 2-dimension array of float that stores the probability P (word | POS-tag).
Unlex[][][]: Unlexicalized syntactic rules can be stored in a 3-dimension array, e.g., Unlex[i][j][k]=prob, where A=>B C, and i, j, k are the indices of A, B, and C in N[] respectively.
Lex[][][]: Lexicalized rules can be stored in a 3-dimension array, e.g., Lex[l][i][j]=prob, where i and j are indices of word1 and word2 in T[], and l is the index of the corresponding unlexicalized rule in Unlex[][][].

Notice that the last three arrays (especially the last one) are sparse. You might want to use sparse matrixes or other data structures to store the info. For instance, instead of using a 3-dimension array for Unlex, you might want to define a new type t that stores a context-free rule, and use a 1-dimension array of type t to store unlexicalized rules.

You also need several data structures to store the parsing results.

If you have any question about this project, please ask me as soon as possible. Don’t wait until the last minute.