
Corpus methods in syntax: Labs/answer key

This page contains the lab assignments that we have done so far, as well as links to answers for some of the exercises. I recommend that you don't look at the answers until you think you have a final answer of your own. If you get stumped, ask me or someone else for help or a hint. You won't get nearly as much out of looking at the answers as you will out of figuring them out yourself. On the other hand, there are hints sprinkled through the labs as well. These you can look at earlier.

Note also that the answers provided here are by no means the only correct answers. There are typically multiple, equally valid solutions to these problems. In some cases, the "answer" will be more of a discussion.

Acknowledgment: These labs, especially labs 2 and 6, were constructed with reference to Dan Jurafsky's labs for a similar course.

Lab 1: September 1, 2000

This lab involved exploring web resources and getting used to using the computers in B21 Dwinelle.

Back to top

Lab 2: September 8, 2000

Basic Unix/Linux

Here is a description of some basic Linux/Unix commands.

Making freq

Use this link to download the file daffodils, used in this lab.

First, take a look at these additional unix commands.

Back to top

Lab 3: September 15, 2000

Corpora

Grep I

Grep/Egrep II: Regular Expressions

Here is a table describing grep and egrep regular expressions.

And here are some hints on using grep and egrep with regular expressions:

And now for the tasks:

Back to top

Lab 4: September 22, 2000

Tags and an intro to emacs

Replicating some results from the textbook, and other stats

Back to top

Lab 5: September 29, 2000

Catch-up lab; no new assignments.

Back to top

Lab 6: October 6, 2000

Getting acquainted

I plan to spend two weeks on tgrep, so don't despair if you don't get through this one this week. On the other hand, let me know if you finish even the advanced problems, so I'll know to dream up some more.

Here's a quick answer to the question "why tgrep?" provided in the tgrep documentation online:

Treebank bracketing is represented by multiple levels of parentheses that look a lot like Lisp:

   (TOP (S (NP-SBJ my best friend)
           (VP gave
               (NP me)
               (NP chocolate)
               (NP-TMP yesterday))
           .))

You can't effectively search this kind of thing using plain old grep, because grep can only look at single lines. So we have tgrep, which can look for geometric relationships in the "trees" represented above.

(end quote)

Tree structure

For this section, refer to section 3.4 of the tgrep documentation handout. (Or go to the LDC's tgrep documentation page, or type man tgrepdoc at the prompt.) You might also want to look at AMALGAM's table for the Penn Treebank tagset (linked from the standards section of our class web page).

Some advanced problems

As a general guideline, it's always best to snoop around a little bit to see how the Treebank folks decided to parse the structure you're looking into.

Back to top

Lab 7: October 13, 2000

This week is planned as a continuation of the tgrep lab from last week. Here are a couple more advanced exercises if you finish those above:

Back to top

Lab 8: October 20, 2000

Enough people missed one of the two previous weeks that this week will be a catch-up lab. I have added information and answers for the advanced problems in Lab 6. Feel free to work on your own projects as well.

Back to top

Lab 9: October 27, 2000

Getting set up

In this lab and for the rest of the semester, we will be writing short programs in Perl (and maybe long ones if you get into something). These programs are executable files with Perl commands in them. Linux has something called a path, a list of directories where it looks for executables. To make sure that Linux can find executables in your home directory, you need to add your home directory to your path. To do this:

Perl basics

A tiny program

A not-so-tiny program

Paragraph mode and multiline matches

Opening and closing files

While loops

Substitute operator

Printing to an output file

Improvements

Back to top

Lab 10: November 3, 2000

Background

In class on 11/20 we will be discussing an article by Sampson regarding the variety of possible sequences of immediate constituents of NPs. That article makes some surprising claims and doesn't provide much specific data. The goal of this lab is to write a program that will get the specific data. In particular, we are going to find all of the different sequences of immediate constituents of NPs in a corpus and keep track of how many times each occurs.

The file /home/handouts/wsj_np contains all of the NPs in the corpus /home/corpora/english/tgrep/wsj_mrg.corpus. It was produced with this command:

tgrep -a '/^NP/' /home/corpora/english/tgrep/wsj_mrg.corpus > wsj_np

The file /home/handouts/wsj_np_short contains the first 200 or so lines of the larger file.

Pseudocode

Here is one way to do it. (The hints generally tell you exactly which Perl functions you'll need for each step; some give more information.)

The program described by this pseudocode creates an output file with one NP-description per line. You can then use sort and uniq -c on that file to get a frequency count of the different NP types.
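
For comparison, the same kind of frequency count can also be done inside Perl. This is only a sketch of an alternative to the sort and uniq -c route, not part of the pseudocode: it uses a hash (%count), which this lab does not otherwise require, and the NP descriptions in @descriptions are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical NP descriptions, standing in for the lines of the output file.
my @descriptions = ("DT_NN", "PRP", "DT_NN", "DT_JJ_NN", "PRP", "DT_NN");

# A hash maps each description to the number of times it has been seen.
my %count;
foreach my $description (@descriptions) {
    $count{$description}++;
}

# Print each distinct description with its count, in alphabetical order.
foreach my $description (sort keys %count) {
    print "$count{$description} $description\n";
}
```

With the made-up data above, DT_NN is counted three times, PRP twice, and DT_JJ_NN once.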

Helpful bits of perl

This section describes the (new) Perl commands you'll need for this program. You'll also need to use some you learned about last week, so I recommend that you finish that lab first (although you don't need to finish all of the "improvements" in that lab first).


Scalar variables

Scalar variables are variables which store strings or numbers. All scalar variable names start with $. You can assign a value to a scalar variable with the = operator:

$foo = 3;              # The value of $foo is now 3.
$foo = $foo + 2;       # The value of $foo is the previous value plus 2, or 5.
$foo = "hello";        # The value of $foo is now the string "hello".
$foo = $foo." world";  # The value of $foo is now the string "hello world".
                       # . is the concatenation operator.
$bar = $foo;           # The value of $bar is now whatever the value of $foo was,
                       # i.e., the string "hello world". The value of $foo is unchanged.

The concatenation operator (.) will be useful in this program because you can use a scalar variable to store the sequence of immediate constituents, building it up as you go along by concatenating each node label onto whatever you had before. (You might want to separate them with some character like _, for readability.)
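
A minimal sketch of that build-up, assuming three made-up node labels (DT, JJ, NN) rather than labels read from the corpus. It is written with use strict and my, which the short snippets in this lab leave out; your own program will get each label from the input instead.

```perl
use strict;
use warnings;

# 'my' declares a variable; it is required under 'use strict'.
my $sequence = "DT";                    # first node label

# Add each later label, with _ in front of it as a separator.
$sequence = $sequence . "_" . "JJ";     # $sequence is now DT_JJ
$sequence .= "_" . "NN";                # .= is shorthand for $sequence = $sequence . ...

print "$sequence\n";                    # prints DT_JJ_NN
```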

For more on scalar variables, see Chapter 2.

Back to "helpful bits of perl"

if-then-else

If statements, like while loops, are control structures. They look like this:

if (test) {
    do_this_1;
    do_this_2;
    do_this_3;
} else {
    do_this_4;
    do_this_5;
    do_this_6;
}

If the test returns true, only do_this statements 1 through 3 are executed. If the test returns false, only do_this statements 4 through 6 are executed. The else part is optional; if it is omitted and the test returns false, Perl simply goes on to the next part of the program. You can also string together elsifs like this:

if (test) {
    do_this_1;
} elsif (test) {
    do_this_2;
} elsif (test) {
    do_this_3; 
} else {
    do_this_4;
}

The elsif saves you embedding ifs in each other and keeps the number of curly brackets down.
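
Here is the same shape with real tests filled in, as an illustration only. The variable names $count and $size and the value 3 are made up; the block uses use strict and my, unlike the schematic examples above.

```perl
use strict;
use warnings;

my $count = 3;      # made-up value to test
my $size;

if ($count == 0) {
    $size = "none";
} elsif ($count == 1) {
    $size = "one";
} else {
    # Reached only when both tests above are false.
    $size = "many";
}

print "$size\n";    # prints many
```

Only one branch ever runs: Perl tries the tests from top to bottom and stops at the first one that is true.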

For more on if-then-else and other control structures, see Chapters 4 and 9.

Back to "helpful bits of perl"

Matching $_

The match operator is =~:

$a = "hello world";
$a =~ /^he/;   # true: the regular expression /^he/ matches
               # the string "hello world"

However, if the variable you want to match is $_, you don't have to specify it.

if (/^he/) {
   do_this;
} 

is equivalent to:

if ($_ =~ /^he/) {
   do_this;
} 
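
A small sketch of the short form in use, assuming a made-up list of lines in place of a real input file. A foreach loop with no loop variable puts each element into $_ in turn, so the bare match applies to the current element. The names @lines and $matches are hypothetical.

```perl
use strict;
use warnings;

# Made-up lines standing in for input read from a file.
my @lines = ("hello world", "goodbye world", "help me");

my $matches = 0;
foreach (@lines) {              # each element is placed in $_ in turn
    if (/^he/) {                # same as: if ($_ =~ /^he/)
        $matches = $matches + 1;
        print "matched: $_\n";
    }
}

print "$matches lines start with he\n";   # prints 2 lines start with he
```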

For more on the match operator, see Chapter 7.

Back to "helpful bits of perl"

Array variables

Another kind of variable is an array variable. An array variable stores an array, or an ordered list of scalar values. Array variable names always start with @. For example:

@foo = (2, 9, "bar");

The value of @foo is now a three-element array. The first element is 2, the second is 9, and the third is the string "bar". Because the array is ordered, we can get at each of these values separately. For example, the value of $foo[0] is 2, the value of $foo[1] is 9, and the value of $foo[2] is "bar". We can also assign the value of particular elements separately:

$foo[1] = "hi";    # The value of @foo is now (2, "hi", "bar")

Note that when referring to specific elements of an array, you use $ instead of @. Also note that the first element is indexed with 0.
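
The following sketch pulls these points together. Two things in it go beyond what is described above and are offered as extras: scalar(@foo), which gives the number of elements, and $#foo, which gives the index of the last element. The block also uses use strict and my.

```perl
use strict;
use warnings;

my @foo = (2, 9, "bar");
$foo[1] = "hi";                      # @foo is now (2, "hi", "bar")

my $count = scalar(@foo);            # number of elements: 3
my $last  = $foo[$#foo];             # $#foo is the last index (2), so this is "bar"

print "$foo[0] $foo[1] $foo[2]\n";   # prints 2 hi bar
print "$count elements, last is $last\n";
```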

The special variable @ARGV (introduced in the improvements section last week) is an array variable.

For more on array variables, see Chapter 3.

Back to "helpful bits of perl"

split

The split function takes a string and turns it into a sequence of values (which can then be stored in an array variable) by breaking it up at every occurrence of some regular expression. The syntax of split is as follows:

split(/delimiter/,string)

For example:

$line = "a:bbb:cd:e";
@fields = split(/:/,$line); #The value of @fields is ("a","bbb","cd","e")
#The value of $line is unchanged.

The default value for the delimiter is whitespace, and the default value for the string is $_. So, the following command is all you need to make @fields be an array containing all of the words in a line just read in (assuming that's what the current value of $_ is).

@fields = split;
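
To see the defaults at work, here is a sketch that sets $_ by hand to a made-up line instead of reading one in. It also uses join, the reverse of split, which is not covered in this section; treat it as an optional extra.

```perl
use strict;
use warnings;

# Made-up line with extra spaces, assigned to $_ directly for illustration.
$_ = "  the quick   brown fox ";

# With no arguments, split breaks $_ at runs of whitespace and
# ignores whitespace at the start of the string.
my @fields = split;

my $count  = scalar(@fields);       # 4
my $joined = join("_", @fields);    # glue the words back together with _

print "$count words: $joined\n";    # prints 4 words: the_quick_brown_fox
```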

For more on split, see pages 89--90.

Back to "helpful bits of perl"

length

The function length returns the length in characters of its argument. For example:

$foo = "hello";
$bar = length($foo); #The value of $bar is 5.

The function length isn't really described in Learning Perl, although the index lists a couple of examples.

Back to "helpful bits of perl"

substr

The function substr takes a string as input and returns a string as output. Its syntax is as follows:

substr(string,offset,length)

where string is the string you want a substring of, offset is the point in the string that you want the substring to start at, and length is how long you want the substring to be. (If you leave the length unspecified, you get a substring that goes to the end of the original string.) Note that the first character is at offset 0.

Examples:

$string = "chocolate";
$sub = substr($string,0,3); # The value of $sub is the string "cho".
$sub = substr($string,5);   # The value of $sub is the string "late".
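
As a further sketch, substr and length can be combined to pick apart a single token. The token "(NP-SBJ" below is a made-up example in the style of the Treebank bracketing, and the variable names are hypothetical; this is one possible approach, not the required one.

```perl
use strict;
use warnings;

my $token = "(NP-SBJ";                            # made-up token

my $first = substr($token, 0, 1);                 # first character: (
my $label = substr($token, 1);                    # everything after it: NP-SBJ
my $len   = length($label);                       # 6
my $last  = substr($token, length($token) - 1);   # last character: J

print "$first $label $len $last\n";               # prints ( NP-SBJ 6 J
```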

For more on substr, see pages 154--156.

Back to "helpful bits of perl"

Improvements

Here are some things you might add to your program:

Back to top

Lab 11: November 17, 2000

Testing Sampson's results

In class on Monday, we will be discussing Sampson (1987) (in the reader). He makes some (I believe) outrageous claims about the notion of grammaticality on the basis of some frequency data, namely the frequency with which different patterns of immediate constituents of NPs appear. In this section, you will be asked to see how one of the treebank corpora (Brown or WSJ) compares to his smaller corpus. In the next section, you will be asked to provide some data he doesn't, namely, information about what the rare types really look like. Both sections assume that you've completed Lab 10 (minus the "Improvements" section), so if you haven't done Lab 10 yet, do it first.

Back to top

-----

Emily M. Bender
Last modified: Nov 6 2001