Ling/CSE 472: Assignment 4: PCFGs

Due November 18th, by the start of class. The assignment consists of two parts.

Part 1: Probabilistic parsing

Step through the corrected version of the pseudocode for probabilistic CKY (see slide 7 of 11/9) to see how it would parse the following sentence, given this grammar:

Kim adores snow in Oslo.

Your write-up should include two things:

We recommend using a format like the following for displaying the chart: (NB - This figure is intended solely to illustrate the display method and bears no resemblance to the chart for the sentence you are considering.)
        1              2              3
  1     Nom: .004                     S: .0008
        N: .08
  2                    VP: .0001     VP: .003
                       V: .09
  3                                  NP: .01
                                     Nom: .05
                                     N: .08
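The chart-filling procedure above can be sketched in code. The following is a minimal Python sketch of probabilistic CKY for a grammar in Chomsky normal form; the toy grammar and probabilities here are illustrative only and are NOT the grammar linked above.

```python
from collections import defaultdict

# A toy PCFG in Chomsky normal form. These rules and probabilities are
# illustrative only -- they are NOT the grammar linked in the assignment.
unary = {          # word -> [(nonterminal, prob), ...]
    "Kim":    [("NP", 0.2)],
    "adores": [("V", 0.9)],
    "snow":   [("N", 0.4), ("NP", 0.1)],
}
binary = [         # (lhs, left_child, right_child, prob)
    ("S",  "NP", "VP", 1.0),
    ("VP", "V",  "NP", 0.6),
]

def pcky(words):
    """Fill a probabilistic CKY chart; chart[(i, j)] maps each nonterminal
    to the probability of its best derivation of words[i:j]."""
    n = len(words)
    chart = defaultdict(dict)
    # Base case: length-1 spans from the lexical rules.
    for i, w in enumerate(words):
        for nt, p in unary.get(w, []):
            chart[(i, i + 1)][nt] = p
    # Longer spans, shortest first, trying every split point k.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for lhs, left, right, p in binary:
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        cand = p * chart[(i, k)][left] * chart[(k, j)][right]
                        if cand > chart[(i, j)].get(lhs, 0.0):
                            chart[(i, j)][lhs] = cand
    return chart

chart = pcky(["Kim", "adores", "snow"])
# chart[(0, 3)]["S"] holds the probability of the best parse:
# 1.0 * 0.2 * (0.6 * 0.9 * 0.1) = 0.0108
```

When you step through the pseudocode by hand, you are doing exactly what the nested loops above do: filling each cell of the chart with the best-scoring nonterminal for that span.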

Part 2: Probabilities from a treebank

This part of the assignment concerns counting rule frequency in a treebank, such as would need to be done in order to build a PCFG.

The treebank in question is the Redwoods Treebank. Unlike the Penn Treebank and similar resources, the Redwoods Treebank is built by using a broad-coverage precision grammar (the English Resource Grammar (ERG)) to parse a corpus. Human annotators then choose among the candidate parses proposed by the ERG, potentially rejecting all of them if the ERG does not in fact propose an appropriate parse for a given sentence. The Redwoods system facilitates this annotation by calculating binary discriminators which distinguish sets of trees, and furthermore records the decisions that annotators make for each discriminator. As the grammar evolves, the corpus can be reparsed and the decisions can be rerun to produce a nearly disambiguated treebank consistent with the current grammar version. The additional annotation required is minimized.

Another distinguishing factor of the Redwoods Treebank is that the analyses stored for each sentence are not CFG trees, but rather full-fledged HPSG (feature-structure/unification grammar) representations, including both syntax and semantics. For the purposes of this assignment, we'll be working with simplified representations of those trees which are CFG-like: simple atomic symbols for the nodes. However, rather than use categories like S, NP, VP, I've chosen to export the trees using the rule identifiers for each node. (This is because the grammar is strictly binary-branching, so the different rules one could extract for S, NP, VP would be less interesting than what you might expect from the Penn Treebank.) Here is a list of rule identifiers and a brief description for each rule.

The Redwoods treebank covers two domains: customer service email and spoken dialogues regarding travel and meeting planning. The files we'll be working with represent a sample of each. Here is a short snippet of each one, to give you the flavor: customer service email and travel/meeting planning.

The files you will actually be working with have been preprocessed with a script written using Jeremy Kahn's Lingua::Treebank Perl library. This script takes each tree apart, printing out the CFG rule that would generate each subtree of depth one.
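The preprocessing script itself is Perl, but the idea is easy to see in a short sketch. The following Python code (not the actual script) assumes trees arrive in a simple bracketed format like `(S (NP Kim) (VP (V adores) (NP snow)))`, which is a hypothetical stand-in for the real export format, and emits one depth-one rule per internal node.

```python
import re

def parse_tree(tokens):
    """Read one (LABEL child child ...) bracketed tree; leaves are bare tokens."""
    tok = tokens.pop(0)
    if tok == "(":
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            children.append(parse_tree(tokens))
        tokens.pop(0)  # discard the closing ")"
        return (label, children)
    return (tok, [])   # leaf node

def rules(tree):
    """Yield the depth-one CFG rule rooted at every internal node."""
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules(child)

# Hypothetical input; the real files use the ERG's rule identifiers as labels.
s = "(S (NP Kim) (VP (V adores) (NP snow)))"
tokens = re.findall(r"\(|\)|[^\s()]+", s)
rule_list = list(rules(parse_tree(tokens)))
for lhs, rhs in rule_list:
    print(lhs, "->", " ".join(rhs))
# prints:
# S -> NP VP
# NP -> Kim
# VP -> V NP
# V -> adores
# NP -> snow
```

One rule per output line is exactly the shape that makes the sort/uniq pipeline in the hints below convenient.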

Your tasks

Treating each of the files (email and dialogue) separately, find:

Turn in the above, along with an explanation of how you did it.

(Hint #1: The unix utilities uniq and sort might be helpful in this. sort takes a file and sorts the lines in ASCII order. uniq takes a file and condenses consecutive occurrences of the same line down to one. When called with the -c flag, it produces a count for each of these. sort can be made to sort on something other than the first column of text. For example, sort -k 2 foo sorts the contents of the file foo based on the second thing in each line, where things are separated by whitespace (the default). For more on sort and uniq, see this brief tutorial.)

(Hint #2: Once you've gotten the counts for each of the rewrite rules, you might consider calculating relative frequencies using a Perl script or a spreadsheet program.)
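As one concrete way to carry out Hint #2, here is a short Python sketch (an alternative to a Perl script or spreadsheet) that counts identical rule lines and converts the counts to relative frequencies, grouped by each rule's left-hand side as a PCFG requires. The "LHS -> RHS" line format and the sample rules are hypothetical.

```python
from collections import Counter

def relative_frequencies(lines):
    """Count identical rule lines, then convert each count to a relative
    frequency P(rhs | lhs) by dividing by the total for that rule's LHS."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # Total up counts per left-hand side (assumes "LHS -> RHS" lines).
    lhs_totals = Counter()
    for rule, c in counts.items():
        lhs_totals[rule.split("->")[0].strip()] += c
    return {rule: c / lhs_totals[rule.split("->")[0].strip()]
            for rule, c in counts.items()}

# Hypothetical sample input: one rule per line, as the preprocessed files provide.
sample = ["S -> NP VP", "NP -> Kim", "NP -> snow", "NP -> Kim"]
freqs = relative_frequencies(sample)
# e.g. freqs["NP -> Kim"] == 2/3, since NP occurs 3 times as an LHS
```

In practice you would replace the sample list with the lines of the preprocessed email or dialogue file.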