Ling/CSE 472: Assignment 5: PCFGs, Treebanks

Due May 22nd by 11:59pm, consists of 2 parts

(NB: No questions answered after 5pm 5/22)

Part 1: Probabilities from a treebank

This part of the assignment concerns counting rule frequency in a treebank, such as would need to be done in order to build a PCFG.

The treebank in question is the Redwoods Treebank (version of 2004). Unlike the Penn Treebank and similar resources, the Redwoods Treebank is built by using a broad-coverage precision grammar (the English Resource Grammar (ERG)) to parse a corpus. Human annotators then choose among the candidate parses proposed by the ERG, potentially rejecting all of them if the ERG does not in fact propose an appropriate parse for a given sentence. The Redwoods system facilitates this annotation by calculating binary discriminators which distinguish sets of trees, and furthermore records the decisions that annotators make for each discriminator. As the grammar evolves, the corpus can be reparsed and the decisions can be rerun to produce a nearly disambiguated treebank consistent with the current grammar version. The additional annotation required is minimized.

Another distinguishing factor of the Redwoods Treebank is that the analyses stored for each sentence are not CFG trees, but rather full-fledged HPSG (feature-structure/unification grammar) representations, including both syntax and semantics. For the purposes of this assignment, we'll be working with simplified representations of those trees which are CFG-like: simple atomic symbols for the nodes. However, rather than use categories like S, NP, VP, I've chosen to export the trees using the rule identifiers for each node. Here is a list of rule identifiers and a brief description for each rule. (Note: when the left hand side of the rule is empty, this means the right hand side is the root of the tree)

This version of the Redwoods treebank covers two domains: customer service email and spoken dialogues regarding travel and meeting planning. The files we'll be working with represent a sample of each. Here is a short snippet of each one, to give you the flavor: customer service email and travel/meeting planning.

The files you will actually be working with have been preprocessed with a script written using Jeremy Kahn's Lingua::Treebank Perl library. This script takes each tree and takes it apart, printing out the CFG rule that would generate each subtree of depth one.

Your tasks

Treating each of the files (email and dialogue) separately, find:

The rewrite possibilities for each of the following symbols and the relative frequency (= probability estimate) of each.
1. subjh
2. extracomp
3. nadj_rc
4. extradj_i_vp
5. frag_np
6. imper
A symbol for which the relative frequencies of the different possibilities are substantially different across the two corpora.
A potential explanation for the difference.

Turn in the above, along with your write up. In your writeup, reflect on how exactly you did what you did (or what you couldn't do and what you tried) and what you've learned about: 1) linux tools; 2) treebanks; 3) CFGs; 4) frequency counting; 5) anything else. The writeup should be 3-4 coherent paragraphs.

(Hint #1: The unix utilities uniq and sort might be helpful in this. sort takes a file and sorts the lines in ascii order. uniq takes a file and condenses consecutive occurrences of the same line down to one. When called with the -c flag, it produces a count for each of these. sort can be made to sort on something other than the first column of text. For example, sort -k 2 foo sorts the contents of the file foo based on the second thing in each line, where things are separated by whitespace (default). For more on sort and unique, see this brief tutorial.)

(Hint #2: Once you've gotten the counts for each of the rewrite rules, you might consider calculating relative frequencies using a python or Perl script / bash script, or a spreadsheet program. There are more unix tools, such as awk, that can do the job or part of the job for you. Part of this assignment is learning how to use unix tools on your own.)

Part 2: Probabilistic parsing

Step through the pseudocode for probalisitic CKY (see Figure 14.3 on p.465 of the text) to see how it would parse the following sentence, given this grammar:

Kim adores snow in Oslo.

Your write-up should include three things:

A diagram showing the final state of the chart.
A list of edge descriptions in the order in which the edges were added to chart. Each edge description should indicate the rule that licenses it, its span, and its probability.
1-2 paragraphs reflecting on what you learned while stepping through the CKY algorithm. It can be about parsing, about algorithms in general, or something else related to class material.

Use the following format for displaying the chart: (NB - This figure is intended solely to illustrate the display method and bears no ressemblance to the chart for the sentence you are considering.)

1 2 3

1 Nom: .004
N: .08

2 VP: .0001
V: .09
VP: .003

3 S: .0008 NP: .01
Nom: .05
N: .08

	1	2	3
1	Nom: .004 N: .08
2	VP: .0001 V: .09	VP: .003
3	S: .0008		NP: .01 Nom: .05 N: .08