Ling/CSE 472: Assignment 5: PCFGs, Treebanks


Part 1: Probabilities from a treebank

This part of the assignment concerns counting rule frequency in a treebank, such as would need to be done in order to build a PCFG.

The treebank in question is the Redwoods Treebank (version of 2004). Unlike the Penn Treebank and similar resources, the Redwoods Treebank is built by using a broad-coverage precision grammar (the English Resource Grammar (ERG)) to parse a corpus. Human annotators then choose among the candidate parses proposed by the ERG, potentially rejecting all of them if the ERG does not in fact propose an appropriate parse for a given sentence. The Redwoods system facilitates this annotation by calculating binary discriminators which distinguish sets of trees, and furthermore records the decisions that annotators make for each discriminator. As the grammar evolves, the corpus can be reparsed and the decisions can be rerun to produce a nearly disambiguated treebank consistent with the current grammar version. The additional annotation required is minimized.

Another distinguishing factor of the Redwoods Treebank is that the analyses stored for each sentence are not CFG trees, but rather full-fledged HPSG (feature-structure/unification grammar) representations, including both syntax and semantics. For the purposes of this assignment, we'll be working with simplified representations of those trees which are CFG-like: simple atomic symbols for the nodes. However, rather than use categories like S, NP, VP, we use the actual ERG rule identifiers for each node. Here is a list of rule identifiers and a brief description for each rule. On the surface, this is not unlike the improved CFG which we saw in lecture, with subcategorization frames etc. (Note: when the left-hand side of a rule is empty, this means the right-hand side is the root of the tree.)

This version of the Redwoods treebank covers two domains: customer service email and spoken dialogues regarding travel and meeting planning. The files we'll be working with represent a sample of each. Here is a short snippet of each one, to give you the flavor: customer service email and travel/meeting planning.

The files you will actually be working with have been preprocessed with a script written using Jeremy Kahn's Lingua::Treebank Perl library. This script takes each tree and takes it apart, printing out the CFG rule that would generate each subtree of depth one.
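The preprocessing step described above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual Lingua::Treebank script: the bracketed tree format and the rule identifiers in the example are invented for illustration.

```python
# Hypothetical sketch: extract depth-one CFG rules from a bracketed tree,
# similar in spirit to the Lingua::Treebank preprocessing script.
# The tree format and the rule names below are invented for illustration.

def tokenize(text):
    """Split a bracketed tree string into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse_tree(tokens):
    """Parse a LISP-style bracketed tree into (label, children) tuples."""
    token = tokens.pop(0)
    if token == "(":
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            children.append(parse_tree(tokens))
        tokens.pop(0)  # discard the closing ")"
        return (label, children)
    return (token, [])  # leaf

def rules(tree):
    """Yield each depth-one rewrite rule as 'LHS -> RHS1 RHS2 ...'."""
    label, children = tree
    if children:
        yield label + " -> " + " ".join(child[0] for child in children)
        for child in children:
            yield from rules(child)

example = "(subjh (bare_np (plur_noun dogs)) (bare_vp (v_unerg bark)))"
for rule in rules(parse_tree(tokenize(example))):
    print(rule)
```

Each subtree of depth one becomes one printed rule, so the output files you are given contain one rewrite rule per line.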

Your tasks

Treating each of the files (email and dialogue) separately, find:

  1. The rewrite possibilities for each of the following symbols and the relative frequency (= probability estimate) for each of the possibilities (consider using a program to print all of them out).
    1. subjh
    2. extracomp
    3. nadj_rc
    4. extradj_i_vp
    5. frag_np
    6. imper
  2. A symbol for which the relative frequencies of the different possibilities are substantially different across the two corpora.
  3. A potential explanation for the difference.

Turn in your answers to the three questions above (for each of the two files), along with a write-up of 3-4 coherent paragraphs addressing:

  1. How exactly you did what you did (or what you couldn't do and what you tried) and
  2. what you've learned about:
    1. linux/unix tools or python (depending on what you used);
    2. treebanks;
    3. CFGs;
    4. frequency counting;
    5. anything else

Some hints:

Hint #1: The unix utilities uniq and sort might be helpful in this. sort takes a file and sorts the lines in ascii order. uniq takes a file and condenses consecutive occurrences of the same line down to one. When called with the -c flag, it produces a count for each of these. sort can be made to sort on something other than the first column of text. For example, sort -k 2 foo sorts the contents of the file foo based on the second thing in each line, where things are separated by whitespace (default). For more on sort and uniq, see this brief tutorial.

Hint #2: You can opt to just use python for everything and not to use unix tools. In order to do that, you will need to learn/know how to open files in python; read text from file and split it in a way that is convenient for your purpose; store data and counts from the file in the dictionary data structure. There should be plenty of accessible tutorials for this. Also, use the Canvas bulletin board and office hours!
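As a starting point for the all-Python route, here is a minimal sketch of counting identical rule lines with a dictionary-like Counter. The one-rule-per-line file format and the sample rule strings are assumptions; adjust them to the actual output of the preprocessing script.

```python
# A minimal sketch of counting rewrite-rule lines in Python,
# roughly what `sort file | uniq -c` does on the command line.
# The rule strings and file name below are assumptions.
from collections import Counter

def rule_counts(lines):
    """Count how many times each (stripped, non-empty) line occurs."""
    return Counter(line.strip() for line in lines if line.strip())

# Example with in-memory lines standing in for a real file:
sample = [
    "subjh -> bare_np bare_vp\n",
    "subjh -> bare_np bare_vp\n",
    "subjh -> np hadj_i_vp\n",
]
counts = rule_counts(sample)
for rule, n in counts.most_common():
    print(n, rule)

# To read a real file instead (hypothetical file name):
# with open("email.rules") as f:
#     counts = rule_counts(f)
```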

Hint #3: Once you've gotten the counts for each of the rewrite rules, you might consider calculating relative frequencies using a python or other (perl, bash...) script, or a spreadsheet program. There are more unix tools, such as awk, that can do the job or part of the job for you. Part of this assignment is working out how to use the different tools at your disposal to solve problems. Please ask lots of questions!
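For the relative-frequency step, one possible Python sketch is below: group the rules by their left-hand side and divide each count by its group's total. The "LHS -> RHS" rule-string format is an assumption carried over from the counting step.

```python
# A sketch of turning raw rule counts into relative frequencies:
# P(LHS -> RHS) is estimated as count(LHS -> RHS) / count(LHS).
# The "LHS -> RHS" string format is an assumption.
from collections import Counter, defaultdict

def relative_frequencies(counts):
    """Map each rule to count / (total count of rules sharing its LHS)."""
    totals = defaultdict(int)
    for rule, n in counts.items():
        lhs = rule.split("->")[0].strip()
        totals[lhs] += n
    return {rule: n / totals[rule.split("->")[0].strip()]
            for rule, n in counts.items()}

counts = Counter({
    "subjh -> bare_np bare_vp": 3,
    "subjh -> np hadj_i_vp": 1,
})
for rule, p in relative_frequencies(counts).items():
    print(f"{rule}\t{p:.2f}")
```

Note that the relative frequencies for all rules sharing a left-hand side should sum to 1, which is a quick sanity check on your output.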

Part 2: Probabilistic parsing

Step through the pseudocode for probabilistic CKY (see Figure 14.3 on p.465 of the text) to see how it would parse the following sentence, given this grammar (the numeric indices are the "third dimension" of the table mentioned in the book):

Kim adores snow in Oslo.

Your write-up should include three things:

  1. A diagram showing the final state of the chart.
  2. A list of edge descriptions in the order in which the edges were added to the chart. Each edge description should indicate the rule that licenses it, its span, and its probability.
  3. 1-2 paragraphs reflecting on what you learned while stepping through the CKY algorithm. It can be about parsing, about algorithms in general, or something else related to class material.
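You are expected to step through the algorithm by hand, but it may help to see its shape in code. The sketch below is a simplified probabilistic CKY over a toy grammar in Chomsky normal form; the grammar, lexicon, and probabilities are invented for illustration and are NOT the grammar linked above, so follow Figure 14.3 in the text for the actual walk-through.

```python
# A sketch of probabilistic CKY. The toy grammar and probabilities below
# are invented and are NOT the assignment's grammar.

def pcky(words, lexicon, rules):
    """Return a chart: table[(i, j)] maps a nonterminal to its best
    probability over the span from position i to position j."""
    n = len(words)
    table = {}
    for j in range(1, n + 1):
        # Lexical cell for the j-th word.
        table[(j - 1, j)] = {A: p for (A, w), p in lexicon.items()
                             if w == words[j - 1]}
        # Combine pairs of smaller spans into larger ones.
        for i in range(j - 2, -1, -1):
            cell = {}
            for k in range(i + 1, j):          # split point
                for (A, B, C), p in rules.items():
                    prob = (p * table[(i, k)].get(B, 0.0)
                              * table[(k, j)].get(C, 0.0))
                    if prob > cell.get(A, 0.0):
                        cell[A] = prob         # keep the best analysis
            table[(i, j)] = cell
    return table

# Invented toy grammar:
lexicon = {("NP", "Kim"): 0.1, ("V", "adores"): 0.2, ("NP", "snow"): 0.3}
rules = {("S", "NP", "VP"): 0.8, ("VP", "V", "NP"): 0.7}

table = pcky(["Kim", "adores", "snow"], lexicon, rules)
print(table[(0, 3)])  # probability of S spanning the whole string
```

Note that the chart you draw by hand stores the same information as `table` here: one cell per span, with the best probability recorded for each nonterminal.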

Use the following format for displaying the chart: (NB - This figure is intended solely to illustrate the display method and bears no resemblance to the chart for the sentence you are considering.)
         1            2            3
   0     Nom: .004                 S: .00008
         N: .08
         S: .0006
   1                  VP: .0001
                      V: .09
                      VP: .003
   2                               NP: .01
                                   Nom: .05
                                   N: .08