Team Project 2 (Hw6-10, excluding Hw9)



      1. Project Overview


The task: extract a PCFG from the English Penn Treebank and use it for CYK parsing.


The project is divided into four parts:

·        Hw6, due 11/10/05: 30 pts: write code to read Penn Treebank trees, extract context-free rules, sort them and output the lexicon, the original grammar and the grammar in CNF.


·        Hw7, due 11/17/05: 30 pts: use the grammar for CYK parsing, output the best parse trees for the sentences in Section 00, and calculate the parsing results with EVALB.


·        Hw8, due 11/23/05: 240 pts: try to improve the results with smoothing and better treatment for unknown words. Get a new parsing result.


·        Hw10, due 12/8/05: 100 pts: write a short report and prepare for a presentation.



2. Major components:


    (1) Training stage:  (Hw6)

-         read in Treebank trees

-         extract the rules and store them in some data structure

-         calculate the rule probabilities and apply smoothing if needed (smoothing can be deferred to Hw8)

-          convert the grammar into CNF

-          output the grammars and lexicons: (seven output files in total.):

o       smoothed/unsmoothed original/CNF grammars (4 files),

o       smoothed/unsmoothed lexicons (2 files)

o       one summary file.
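
The training steps above can be sketched as follows. This is a minimal illustration, not a required design: the function names (read_trees, count_rules, rule_probs) and the nested-list tree representation are assumptions made for the example, and the sketch does not strip function tags or empty elements, which a real Treebank reader may need to handle.

```python
from collections import Counter

def read_trees(text):
    """Yield each top-level tree as a nested list [label, child, ...]; a word is a plain str."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                          # consume "("
        if tokens[pos] in ("(", ")"):
            node = [""]                   # unlabeled outer bracket of a Treebank tree
        else:
            node = [tokens[pos]]
            pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                          # consume ")"
        return node

    while pos < len(tokens):
        yield parse()

def count_rules(tree, rule_counts, lex_counts):
    """Accumulate syntactic rule counts and (word, POS tag) counts from one tree."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        lex_counts[(children[0], label)] += 1
        return
    if label:                             # skip the unlabeled outer bracket
        rule_counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_rules(child, rule_counts, lex_counts)

def rule_probs(rule_counts):
    """Unsmoothed estimate: P(Y => X1 ... Xn | Y) = count(rule) / count(Y)."""
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}
```

Keeping the counts alongside the probabilities is convenient because the output files require both.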


   (2) Parsing stage: (Hw7)

-         use the grammar as input to your CYK parser: make sure your parser can handle such a large grammar and a large lexicon.

-         deal with unknown words: make sure that your code works well when there are unknown words in input sentences. This step can be done in Hw8.

-         write code that extracts the sentences from a *.mrg file. Use the sentences as input to your parser.

-         output the best parse tree for each sentence in section 00.
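
As a rough illustration of the parsing stage, here is a sketch of probabilistic CYK that returns the single best tree. It assumes a strict CNF grammar (every rule is either A => B C or a tag-to-word lexicon entry), and the data layout (binary_rules keyed by the pair of children, lexicon keyed by word) is a hypothetical choice for the example. It does not handle unknown words: a sentence containing one simply gets no parse.

```python
import math

def cyk_best_parse(words, lexicon, binary_rules, start="S"):
    """Return the most probable parse as a bracketed string, or None if there is none.

    lexicon:      word -> list of (tag, prob)
    binary_rules: (B, C) -> list of (A, prob) for each rule A => B C
    """
    n = len(words)
    # chart[i][j] maps a nonterminal to (log_prob, backpointer) for words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag, p in lexicon.get(word, []):
            chart[i][i + 1][tag] = (math.log(p), word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart[i][j]
            for k in range(i + 1, j):
                for b, (lp_b, _) in chart[i][k].items():
                    for c, (lp_c, _) in chart[k][j].items():
                        for a, p in binary_rules.get((b, c), []):
                            lp = math.log(p) + lp_b + lp_c
                            if a not in cell or lp > cell[a][0]:
                                cell[a] = (lp, (k, b, c))
    if start not in chart[0][n]:
        return None

    def build(i, j, sym):
        _, back = chart[i][j][sym]
        if isinstance(back, str):         # lexical entry: backpointer is the word
            return "(%s %s)" % (sym, back)
        k, b, c = back
        return "(%s %s %s)" % (sym, build(i, k, b), build(k, j, c))

    return build(0, n, start)
```

Indexing the binary rules by their right-hand side keeps the inner loop proportional to the symbols actually present in the two sub-spans, which matters with a grammar of Treebank size. Log probabilities avoid underflow on long sentences.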


(3) Evaluation: (Hw7)

-         write code that converts the format of a Treebank file so that each parse tree is on one line: multi-line parse tree => one-line parse tree.

-          calculate various measures using EVALB.

-          write a tool that indents a parse tree file: one-line parse tree => multi-line indented parse tree
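
The multi-line to one-line conversion can be sketched as below. This is only one possible approach: it tracks bracket depth to find where each tree ends, then normalizes whitespace so the result matches the un-indented format shown in Section 8. The function name trees_to_lines is made up for the example.

```python
import re

def trees_to_lines(text):
    """Split Treebank text into trees and return each tree as a single line."""
    lines, buf, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        if depth > 0:
            buf.append(ch)                # keep only characters inside a tree
        if ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:                # the outermost bracket just closed
                tree = " ".join("".join(buf).split())
                tree = re.sub(r"\(\s+", "(", tree)
                tree = re.sub(r"\s+\)", ")", tree)
                lines.append(tree)
                buf = []
    return lines
```

Counting brackets character by character is safe here because literal parentheses in Penn Treebank text are written as -LRB- and -RRB-.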


(4) Improving the system: (Hw8)

-         better smoothing

-         better unknown word treatment

-         fix bugs

-         get new parsing results
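
One simple way to combine smoothing with unknown-word handling is add-lambda smoothing over the lexicon, with one extra vocabulary slot reserved for unseen words. The sketch below is only an example of the idea, not a recommended or required method. It assumes the lexicon probability is P(word | tag); the function name and the default lam=0.5 are arbitrary choices for illustration.

```python
from collections import Counter

def make_smoothed_lexicon(lex_counts, lam=0.5):
    """Return a function prob(word, tag) giving add-lambda smoothed P(word | tag).

    lex_counts maps (word, tag) -> count. Any word never seen in training
    shares the single extra vocabulary slot reserved for unknown words.
    """
    tag_totals = Counter()
    vocab = set()
    for (word, tag), n in lex_counts.items():
        tag_totals[tag] += n
        vocab.add(word)
    v = len(vocab) + 1                    # +1 for the unknown-word slot

    def prob(word, tag):
        count = lex_counts.get((word, tag), 0)
        return (count + lam) / (tag_totals[tag] + lam * v)

    return prob
```

With this scheme every tag can emit every word, so the parser never fails on an unseen word, but the chart grows much denser. Restricting unknown words to open-class tags is a common refinement.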


(5) Writing a report and wrap-up:  (Hw10)

-         improve the parsing results (optional)

-         write a report (see Section 9 for details)

-         prepare for a presentation



3. Files that are provided for this project


The files are all under ~fxia/dropbox/571/Project 2:

(1)   The English Treebank data:

         Training data: wsj-02-21.mrg 

         Development data: wsj-00.mrg (This file is needed for Hw7 and Hw8)


(2)   The EVALB tool:

a.       To run the EVALB tool, type: $dir/evalb -p $dir/COLLINS.prm gold_standard system_output > evalb_result

b.      $dir is the directory where EVALB tool is stored.

c.       gold_standard comes from wsj-00.mrg, but each parse tree should be on one line.

d.      system_output is the output of your parser.

e.       For more details, see README in the EVALB directory.



4. Major steps for Hw6:


(1)   Write your own code.

a.       The main file should be called Hw6.*.

b.      The command to run the code should be

    cat treebank_file | ./run_hw6 output_prefix > summary_info


                         Your code should produce seven files (see Section 8 for their formats):

·        output_prefix.grammar_orig:  the original grammar with no smoothing

·        output_prefix.grammar_orig_sm: the original grammar with smoothing

·        output_prefix.grammar_cnf: the grammar in CNF (according to Definition 3)

·        output_prefix.grammar_cnf_sm: the grammar in CNF with smoothing

·        output_prefix.lexicon: the lexicon

·        output_prefix.lexicon_sm:  the lexicon with smoothing

·        summary_info: the summary info such as the number of trees processed, the number of syntactic rules, the size of the lexicon etc.
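
For the CNF grammar files, the sketch below shows one common way to binarize long rules while preserving probabilities. It is not necessarily the conversion that Definition 3 calls for: it only splits right-hand sides longer than two symbols and leaves unary rules untouched. The naming scheme for the new nonterminals ("VP|<NP-PP>") is an assumption made for the example.

```python
def binarize(rules):
    """Split every rule with more than two right-hand-side symbols.

    rules maps (lhs, rhs_tuple) -> prob. Each new nonterminal is named after
    the original lhs and the remaining symbols, and expands with probability
    1.0, so the probability of every derivation is unchanged.
    """
    out = {}
    for (lhs, rhs), p in rules.items():
        cur_lhs, cur_p = lhs, p
        while len(rhs) > 2:
            new_sym = "%s|<%s>" % (lhs, "-".join(rhs[1:]))
            out[(cur_lhs, (rhs[0], new_sym))] = cur_p
            cur_lhs, rhs, cur_p = new_sym, rhs[1:], 1.0
        out[(cur_lhs, rhs)] = cur_p
    return out
```

Because the new symbol's name encodes the remaining children, two original rules that share a suffix reuse the same intermediate rule, which keeps the CNF grammar smaller. The parser output then needs the intermediate nodes removed before evaluation against the gold trees.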



(2)   Write a draft of the report



(3)   Check your directory and tar it.


Your directory should contain the following files:

a.       A file called “Report”

b.      A file called “run_hw6”, which is a one-line shell script that runs your Java, Perl, or compiled C/C++ program.

=> This is the file that the grader will run to test your program.

c.       Source code, header files, binaries, etc. The file that contains the main function should be called “Hw6.*” (e.g., Hw6.C, Hw6.H).

d.      The seven files named wsj_02_21.* (four grammar files, two lexicon files, and one summary file) created by run_hw6 when running the command line:

                              cat wsj_02_21.mrg | ./run_hw6 wsj_02_21 > wsj_02_21.summary



Tar the directory.


(4)   Submit your homework using ESubmit.



   5. Major steps for Hw7:

         (1) Programming

-         Fix the bugs in code for Hw6 if any.

-         Write a CYK parser:  you can modify one of your CYK parsers (from Hw4) so that it can handle a real grammar/lexicon with slightly different formats.

-         Get the sentence strings from wsj_00.mrg, and use them as input to the parser.

-         Run EVALB to get the parsing results.


(2)   Modify the report to add information about parsing results


(3)   Check your directory and tar it: Your directory should include all the files for Hw6, plus the following:

                  (a) The source code for the parser: the main file should be called “Hw7.*”


         (b) The parser: a file called “run_hw7”. And the command line for run_hw7 should be

                        cat sent_file | ./run_hw7 grammar lexicon {N_best} > output_file


                  (c) Conversion tool #1: a binary/Perl code called “parse_conv” which converts

                      the *.mrg files so that each parse tree is on one line.


                  (d) An input to EVALB: a file called “wsj_00.mrg_line”, which is created by running

                         cat wsj_00.mrg | ./parse_conv > wsj_00.mrg_line 


                  (e) Conversion tool #2: a binary/Perl code called “get_sent”, which gets sentences from *.mrg files.


                  (f) An input to the parser: a file called “wsj_00.sent”, which includes all the sentences in wsj_00.mrg. It is created by running

                         cat wsj_00.mrg | ./get_sent > wsj_00.sent
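
A possible core for get_sent is sketched below: it pulls the (tag, word) leaves out of one bracketed tree with a regular expression and joins the words. The function name and the skip_tags parameter are hypothetical. Whether to drop empty elements (tagged -NONE- in the Penn Treebank) is left to you, so the default keeps every leaf.

```python
import re

# A leaf looks like "(TAG word)": two bracket-free tokens inside one pair of parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def sentence_from_tree(tree, skip_tags=()):
    """Return the words of one bracketed tree, in order, separated by spaces."""
    return " ".join(word for tag, word in LEAF.findall(tree) if tag not in skip_tags)
```

For example, passing skip_tags=("-NONE-",) removes traces so that the parser input contains only the words that actually appear in the sentence.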


                  (g) The parsing result: a file called “wsj_00.parse”, which is created by running

                        cat wsj_00.sent | ./run_hw7 grammar lexicon 1  > wsj_00.parse


                        choose the grammar and lexicon so that the parsing result is the best.


                  (h) Conversion tool #3: a tool called “indent_parse_tree”, which takes a one-line parse tree and outputs the indented tree (i.e., the format used in *.mrg files)


                  (i) The indented parse trees: a file called “wsj_00.parse_indented”, which is created by running

                               cat wsj_00.parse | ./indent_parse_tree > wsj_00.parse_indented


                  (j) The EVALB result: a file called “wsj_00.evalb_result”, which is created by running

                         evalb -p COLLINS.prm wsj_00.mrg_line wsj_00.parse > wsj_00.evalb_result


                  (k) An updated “Report” file.



(4)   Submit your homework using ESubmit



  6. Major steps for Hw8:


      The main task of Hw8 is to fix the bugs in Hw6 and Hw7, and try to improve parsing results.

      Your submitted tar file should include all the files listed for Hw6 and Hw7, except that “hw7” in the file names will be replaced by “hw8”.



  7. Major steps for Hw10:


     The main task of Hw10 is to write a report and prepare for a presentation.


      If you want to improve parsing results, feel free to do so. If you get better results, you can submit the new results by including the same files as the ones for Hw8 (except that “hw8” in the file names is replaced by “hw10”).


      Note: you must submit the report and presentation for Hw10. Submission of the code and parsing results is required ONLY IF you believe that you got better results than before.


    The final report should be called “Report”. The presentation should be called “present.ppt”. You will give a presentation on the last day of the class (Dec 8), and the presentation should be about 10-12 minutes (including Q&A).



  8. Format of output files:

(1)   The syntactic rule files:


For unsmoothed files:

            Y => X1 ... Xn | prob count

             prob is P(Y => X1 ... Xn | Y), and count is the number of occurrences of the rule in the corpus.


      For smoothed files:

            Y => X1 ... Xn | prob1 count1 prob2 count2


            prob2 and count2 are the real prob and count.

            prob1 and count1 are the ones after smoothing


      Sort the rules according to the frequency of Y: the most common Y is listed first.

      For each Y, sort the rules so that the most frequent rule is listed first.
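
The unsmoothed rule format and the two-level sort can be sketched as follows. The function name, the six-decimal probability format, and the alphabetical tie-breaking are assumptions for the example; the format above does not specify them.

```python
from collections import Counter

def format_rules(rule_counts):
    """Return unsmoothed rule lines "Y => X1 ... Xn | prob count" in the required order.

    rule_counts maps (lhs, rhs_tuple) -> count. Rules are grouped by lhs, the
    most frequent lhs first, and within each lhs the most frequent rule first.
    """
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n

    def sort_key(item):
        (lhs, rhs), n = item
        return (-lhs_totals[lhs], lhs, -n, rhs)   # ties broken alphabetically

    return ["%s => %s | %.6f %d" % (lhs, " ".join(rhs), n / lhs_totals[lhs], n)
            for (lhs, rhs), n in sorted(rule_counts.items(), key=sort_key)]
```

Including the lhs itself in the sort key keeps all rules for one Y together even when two left-hand sides have the same total count.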



(2)   The lexicon:


For both smoothed and unsmoothed lexicon, the format is

     word  pos_tag1 prob1 count1 pos_tag2 prob2 count2 …


Sort the file according to word frequency. That is, the first line is for the most frequent word, and so on.


Within a line, if a word has more than one POS tag, sort the list so that the most frequent POS tag and its count are listed first.



(3)   The parse tree files.


%%%% indented parse tree file

Same format as the English Penn Treebank, e.g.,

( (S  (NP (NN xx))

        (VP (VBD yy))) )


%%%% un-indented parse tree file

Each line is a parse tree. e.g.,

((S (NP (NN xx)) (VP (VBD yy))))




   9. The report

·        The report should be in text, Word, or PDF format.

·        The report should include the following sections:

-         System overview:

o       team members

o       programming language

o       modules

o       job division

-         Major issues

o       treatment of unknown words

o       smoothing

o       data structures for chart, syntactic rules, lexicon, parse trees, …

-         Results

o       size of syntactic rules and lexicon

o       parsing results with different smoothing and OOV treatments

o       speed of the parser: sentences parsed per minute.

-         Conclusion and future work

o       What you have learned from this project

o       What are the remaining problems

o       Any suggestions and comments about the project design, team work, etc.




   10. The presentation


   The presentation should include all the major points in the report, and anything else you want to add. Make sure that you can finish your presentation within 10 minutes (excluding Q&A time).



  11. Grade for Hw6, Hw7, and Hw8: 

        0%: if the homework is not submitted.

        25%: if the code does not run on section 00.

        25%-50%: if the code runs on section 00, but crashes on other test data.

        50%-100%: if the code does not crash, but the output looks problematic.


         For instance, if you submit Hw6, but it does not run on the test input, you will get 30*25% = 7.5 pts.



   12. Grade for Hw10:

       30% for report

         30% for presentation

         40% for parsing results: the team with the best parsing results gets 40 points; the team or teams with the second best results get 35 points, and so on.


  13. Teamwork:

        The default is that all the members in the same team will get exactly the same score for this project. However, if you feel that this default is not fair for whatever reasons and want a different grading system, please let me know ASAP (no later than December 1).

        If there are any problems with your team, please try to resolve the issue by talking to your teammates first. If that does not work, please let me know. We will hold a meeting with all the members in that team.