Corpus methods in syntax: Labs/answer key

This page contains the lab assignments that we have done so far, as well as links to answers for some of the exercises. I recommend that you don't look at the answers until you think you have a final answer of your own. If you get stumped, ask me or someone else for help or a hint. You won't get nearly as much out of looking at the answers as you will out of figuring them out yourself. On the other hand, there are hints sprinkled through the labs as well. These you can look at earlier.

Note also that the answers provided here are by no means the only correct answers. There are typically multiple, equally valid solutions to these problems. In some cases, the "answer" will be more of a discussion.

Acknowledgment: These labs, especially labs 2 and 6, were constructed with reference to Dan Jurafsky's labs for a similar course.

Jump to:

Lab 1
Lab 2
Lab 3
Lab 4
Lab 5
Lab 6
Lab 7
Lab 8
Lab 9
Lab 10
Lab 11

Lab 1: September 1, 2000

This lab involved exploring web resources and getting used to using the computers in B21 Dwinelle.

Lab 2: September 8, 2000

Basic Unix/Linux

Here is a description of some basic linux/unix commands.

Using cd explore the directory structure that you have access to.
Draw the directory tree under /home/ on a piece of scratch paper.
Find a file with over 10,000 characters. (Answer)
Use less to look at the big file. See what happens when you type space, b, f, h and /the.
Using head and > make a file with the first 7 lines of that big file in your home directory. (Answer)
Copy the file /home/handouts/daffodils to your home directory. (Answer)
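
The redirection and copy steps above can be tried on a throwaway file first. This is only a sketch: the scratch directory and the file names bigfile, first7 and first7.copy are made up for illustration and are not part of the lab setup.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
scratch=$(mktemp -d)

# Build a hypothetical "big" file: 2000 short lines, well over 10,000 characters.
i=1
while [ "$i" -le 2000 ]; do
    echo "line $i of the big file"
    i=$((i + 1))
done > "$scratch/bigfile"

# wc -c reports the size in bytes (the same as characters for plain ASCII).
size=$(wc -c < "$scratch/bigfile")

# head prints the first lines of a file; > sends that output to a new file.
# (head -n 7 is the POSIX spelling of head -7.)
head -n 7 "$scratch/bigfile" > "$scratch/first7"

# cp copies a file; the original stays where it is.
cp "$scratch/first7" "$scratch/first7.copy"

lines=$(wc -l < "$scratch/first7.copy")
echo "bigfile has $size characters; the copy has $lines lines"
# Remove the scratch directory with rm -r "$scratch" when you are done.
```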

Making freq

Use this link to download the file daffodils, used in this lab.

First, take a look at these additional unix commands.

How many words are there in the file daffodils? (Answer)
What does wc -w do? (Answer)
Does wc count spaces? new lines? (Answer)
Use tr to reformat daffodils with one word per line. (Hint: the octal code for a newline is '\012') (Answer)
Use sort to sort the output of the tr command. (Answer)
Use uniq -c to count the unique words in the output of the sort command. (Answer)
Use sort again on the output of the uniq -c command so that the most frequent words come out first. (Answer)
Does this command do everything you'd like it to do? If not, take a look at the man page for tr (man tr) and see what you can do. (Answer)
Why won't uniq -c alone do the trick? (Answer)
Check out the man pages for uniq, sort and wc, by typing man sort, etc.
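
The tr, sort and uniq steps above chain together with pipes. Here is a minimal sketch on a two-line sample string rather than on daffodils; the sample text and the variable names are assumptions for illustration, and the sketch deliberately ignores punctuation and capitalization.

```shell
sample='the cat saw the dog
the dog saw a bird'

freq=$(printf '%s\n' "$sample" |
    tr ' ' '\012' |   # one word per line: every space becomes a newline
    sort |            # put identical words on neighboring lines
    uniq -c |         # collapse each run of identical lines, prefixed by its count
    sort -rn)         # numeric sort, largest count first

printf '%s\n' "$freq"
```

Each stage reads the output of the one before it, so you can build the pipeline up one command at a time and look at the intermediate result with less.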

Lab 3: September 15, 2000

Corpora

Go to /home/corpora and poke around to see what corpora have been added.
Use less on some of the corpus files as well as on README files to see how these corpora are structured.
If you have time, look through GUTINDEX.ALL for texts you'd like added to this directory.

Grep I

Syntax:
```
 grep pattern filenames 
```
Use grep to find all lines of the poem daffodils that contain the word daffodils. (If you don't have a copy of that file in your home directory, you can get it from /home/handouts/daffodils or here) (Answer)
What does the -i option do? (Hint: try adding -i after grep to your previous query.) (Answer)
What does the -v option do? (Answer)
Find all lines in the poem that contain the word and. (Hint: Use single quotes around your search pattern.) (Answer)
Find all lines that contain the word and but not the word but. (Answer)
The -C option (e.g., grep -C2) returns preceding and following lines for each match. Use this option to find every line that contains the word daffodils, with one line of context before and one after. (Answer)
Normally if you give grep a directory name instead of a file name, it returns an error message. The -r option tells it instead to look at all files in that directory (and recursively in subdirectories, too). Use grep -r to find a file with the word tendrils in it. (Answer)
Find examples of the phrase even if from the .txt files in /home/corpora/english/gutenberg/etexts00. (Answer)
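
The options in this section can be tried out on a small file of your own before you turn to the poem. The sketch below uses a made-up four-line text in a scratch directory; the file name poem and the variable names are hypothetical, and -c and -l (count matching lines, list matching file names) are standard grep options that the tasks above don't mention.

```shell
scratch=$(mktemp -d)
cat > "$scratch/poem" <<'EOF'
The rain came down
and the river rose
But the bridge held
and nobody but the miller saw
EOF

# -i ignores case, so 'but' also matches "But"; -c counts matching lines.
but_count=$(grep -i -c 'but' "$scratch/poem")

# -v inverts the match: only lines that do NOT contain the pattern survive.
# The first grep keeps lines with "and", the second drops those with "but".
and_not_but=$(grep 'and' "$scratch/poem" | grep -v 'but')

# -C 1 prints one line of context before and after each matching line.
context=$(grep -C 1 'river' "$scratch/poem")

# -r searches every file under a directory; -l prints only the file names.
found=$(grep -r -l 'miller' "$scratch")

printf '%s\n' "$and_not_but"
```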

Grep/Egrep II: Regular Expressions

Here is a table describing grep and egrep regular expressions.

And here are some hints on using grep and egrep with regular expressions:

Enclose regular expressions in single quotes (' ').
You can put the pattern you're working on in a file, and then call it with the -f option:
```
grep -f patfile searchfiles
```
In this case, you don't need the single quotes. In fact, you can have multiple patterns on separate lines in the file. Be sure that the file ends with exactly one newline.
More information is available in the man page (man grep) and by typing grep --help.
Some of the corpora in the ICAME directory are tagged. It's worthwhile to take a peek in the files you're searching to see what the formatting is like.
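
As an illustration of the -f hint above, here is a minimal sketch with a two-pattern file. The file names text and patfile and the sample sentences are made up; only the -f option itself comes from the hints.

```shell
scratch=$(mktemp -d)

# A small file to search.
printf '%s\n' 'I am here' 'you are there' 'she is away' 'they went home' > "$scratch/text"

# The pattern file: one pattern per line, no quotes, one newline at the end.
printf '%s\n' ' am ' ' is ' > "$scratch/patfile"

# A line is printed if it matches any one of the patterns in patfile.
matches=$(grep -f "$scratch/patfile" "$scratch/text")
printf '%s\n' "$matches"
```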

And now for the tasks:

Find a line in some file that begins with the word to. (Answer)
Find a line in some file with exactly one x in it. (Hint) (Answer)
Using |, search a short file for all present-tense forms of be. Be sure to exclude cases where the string you're looking for is just part of a longer word. (Answer)
Now find all gerunds not preceded by a form of be. (Hint: It may be easier to accomplish this as two searches, connected by |.) (Answer)
Using [A-Za-z] search for all instances of the followed by thing in your short file with exactly one word in between. (Answer)
What happens if you use .* instead of [A-Za-z]? (Answer)
Devise a regular expression that finds all words in which the letters are in alphabetical order. Now find all such words of six or more letters in some corpus. (Hint: This is easier to do if you have one word per line, like we did in making freq.) (Answer)
Devise a regular expression that finds all words in which all five vowels appear, exactly once each, in alphabetical order. Now find all such words in some corpus. (Hint: Again, this is easier to do if you have one word per line, like we did in making freq.) (Answer)
Using backreferences with grep, search /usr/dict/words for 5-letter palindromes. (Answer)
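
Before tackling the tasks above, it can help to see anchors, alternation and backreferences at work on a short list with one word per line. This is a sketch on a made-up word list, not a solution to any of the tasks; grep -E is the standard spelling of egrep.

```shell
words='level
noon
radar
to
top
tot
stop'

# ^ anchors the pattern to the start of the line, so "stop" does not match.
starts_with_to=$(printf '%s\n' "$words" | grep '^to')

# With -E, | separates alternatives; ^ and $ together force the whole
# line to be one of them.
exact=$(printf '%s\n' "$words" | grep -E '^(to|top)$')

# In basic grep, \( \) remembers what it matched and \1 must match the same
# text again: a three-letter line whose first and last letters are identical.
pal3=$(printf '%s\n' "$words" | grep '^\(.\).\1$')

printf '%s\n' "$pal3"
```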

Lab 4: September 22, 2000

Tags and an intro to emacs

The only tagged corpus we have at the moment is the Brown corpus, in /home/corpora/english/ICAME/texts/browntag/. What files are in that directory, and what do they contain?
The tagging scheme used in the Brown corpus is described on the AMALGAM website. The other handout gives an abbreviated version of their description. (A link to the full description on the AMALGAM site is in the standards section of the class website: www.linguistics.berkeley.edu/~bender/290A/standards.html.) Look over the tagset to familiarize yourself with the system.
Using the search function in emacs, find an example of each of the following tags in brown_n.tag: FW-UH, DO+PPSS, EX+HVD. Here are some detailed instructions:
- To start emacs, type emacs & in your xterm. The & means that emacs is running in the background, so you can still use your xterm for other things.
- Open the file brown_n.tag. To do this, type Ctrl-X Ctrl-F, and then the path to the file, i.e.:
```
/home/corpora/english/ICAME/texts/browntag/brown_n.tag
```
- This file is write protected, so you can't type in it.
- Type Ctrl-S to search, and type in the string you want to search for.
- To exit a search, type Ctrl-G or use the arrow keys to move your cursor away from the found item.
- To exit emacs when you're done, type Ctrl-X Ctrl-C.

Replicating some results from the textbook, and other stats

Kennedy reports that the Brown corpus is actually 1,014,312 words long. What does wc say (use the file btl.txt)? (Answer)
Use less to look at btl.txt and see why wc might have found a different result. (Answer)
Kennedy (p.103) gives the percentages of the total corpus accounted for by each major word class. Pick one and verify it. (Hint: Look at the list of tags to find out which tags to count, then use grep and wc, but first make sure you have one word per line. You can get a calculator by typing xcalc & at the prompt.) (Answer)
On p.105 Kennedy reports that the tag combination NN JJ appears 932 times per million words in the informative prose sections of the LOB compared to 743 times per million words in the imaginative prose sections. How do the analogous sections of the Brown corpus compare? (Answer)
Take a look at the sentences matching that pattern. Is the example he gives (the enquiry proper) typical?
Find a tag that occurs in only one text type in the Brown corpus. (Answer)
The files brown_h.tag, brown_n.tag and brown_p.tag all have roughly 68,000 words. Do they have the same proportion of foreign words? (The different text categories in the Brown corpus are described in Kennedy, pp. 24-26.) (Answer)
What proportion of foreign words used in the corpus are nouns? (Answer)
Is this proportion constant across the different text categories? (Consider, say, brown_g.tag and brown_n.tag.) (Answer)
Find some examples of two verbs with exactly the same tag immediately adjacent to each other. (Answer)
Does love occur more frequently as a noun or as a verb in the Brown corpus? (Hint: use sort, uniq and grep.) (Answer)
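
The counting tasks above all follow the same recipe: get one token per line, grep for the tags you care about, count, and divide by the total. The sketch below assumes a word_TAG format with one token per line; this is an assumption made for illustration, so check the actual layout of the brown_*.tag files with less before adapting it.

```shell
# Hypothetical tagged text, one word_TAG token per line (an assumed format).
tagged='the_AT
love_NN
of_IN
books_NNS
I_PPSS
love_VB
you_PPO
love_NN'

total=$(printf '%s\n' "$tagged" | wc -l | tr -d ' ')

# Which tags does "love" occur with? Cut off the word, keep and count the tag.
love_tags=$(printf '%s\n' "$tagged" | grep '^love_' | cut -d_ -f2 | sort | uniq -c | sort -rn)

# Share of tokens whose tag starts with NN (NN, NNS, ...), as a percentage.
nouns=$(printf '%s\n' "$tagged" | grep -c '_NN')
percent=$(awk -v n="$nouns" -v t="$total" 'BEGIN { printf "%.1f", 100 * n / t }')

echo "$nouns of $total tokens are nouns ($percent%)"
```

For a rate per million words rather than a percentage, multiply by 1000000 instead of 100.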

Lab 5: September 29, 2000

Catch-up lab; no new assignments.

Lab 6: October 6, 2000

Getting acquainted

I plan to spend two weeks on tgrep, so don't despair if you don't get through this one this week. On the other hand, let me know if you finish even the advanced problems, so I'll know to dream up some more.

Here's a quick answer to the question "why tgrep?" provided in the tgrep documentation online:

Treebank bracketing is represented by multiple levels of parentheses that look a lot like Lisp:

   (TOP (S (NP-SBJ my best friend)
           (VP gave
               (NP me)
               (NP chocolate)
               (NP-TMP yesterday))
           .))

You can't effectively search this kind of thing using plain old grep, because grep can only look at single lines. So we have tgrep, which can look for geometric relationships in the "trees" represented above.

(end quote)

Try this tgrep query (don't worry about breaking it down right now):

tgrep '/^NP/ << chocolate' /home/corpora/english/tgrep/brown.corpus | more

The basic syntax for tgrep is
```
     tgrep pattern corpus
```
In order to use tgrep, the corpus must first be prepared. Such prepared corpora will live in the directory /home/corpora/english/tgrep. If you're going to be working with the same corpus most of the time, you can store it in the environment variable TGREP_CORPUS, by typing:
```
setenv TGREP_CORPUS /home/corpora/english/tgrep/brown.corpus
```
if you want to use the Brown corpus. (This environment variable is like the variable DISPLAY: once you set it, it stays set until a) you log out or b) you reset it. You can put the statement above in your .cshrc and it will get set every time you log in.)
Once you've set the value of TGREP_CORPUS, try the tgrep query again:
```
tgrep '/^NP/ << chocolate' | more
```
What does the -w flag do? (Answer)

Tree structure

For this section, refer to section 3.4 of the tgrep documentation handout. (Or go to the LDC's tgrep documentation page, or type man tgrepdoc at the prompt.) You might also want to look at AMALGAM's table for the Penn Treebank tagset (linked from the standards section of our class web page).

Find all instances of race as a verb in your corpus. (Hint: Find all instances of race immediately dominated by a verb tag.) (Answer)
Modify the query to print out the whole sentence containing race. (Answer)
Modify the query to print out just the VP containing race. (Answer)
Write a search pattern to find transitive uses of the verb walk. (Answer) Does your pattern turn up anything that you wouldn't call a bona fide transitive use?
Write a search pattern to find attributive uses of adjectives. (Hint: Look for mere first to get an idea of what the structure looks like. (How do I do this?)) (Answer)
Write a search pattern to find predicative uses of adjectives (in particular, adjectives that appear after the copula). (Hint: Look for happy first to get an idea of what the structure looks like.) (Answer)
Now find the 25 most common predicative adjectives and the 25 most common attributive adjectives. (Hint: Use the backquote (`, see section 3.5 of the tgrep documentation handout) to print out just the adjectives, then use sort and uniq -c like we did for freq.) (Answer)
Find examples of verbs other than give that occur in the double object construction. (Hint: You'll need a negated operator, and some nth daughter operators.) (Answer)

Some advanced problems

As a general guideline, it's always best to snoop around a little bit to see how the TREEBANK folks decided to parse the structure you're looking into.

Design a tgrep query that will find pseudopassives. (Answer)
In pseudopassives, the verb is passive, but the subject corresponds not to the object of the (active) verb, but rather to the object of a preposition which is in turn a dependent of the verb. An example is given in (1).
```
(1) This bed was slept on by George Washington.
```
A further fact about pseudopassives is that they are only grammatical if the PP with a missing object is immediately adjacent to the verb:
```
(2) a.?The protesters have called loudly for a change in government.
    b. A change in government is what is called for.
    c.*A change in government is what is called loudly for.
```
Design a tgrep query that will find examples of with-less absolutes. (Hint) (Answer)
Absolutes are adjuncts comprising a subject and a non-finite predicate (NP, AP, PP predicates or non-finite verbal predicates). They can be marked by with, as in (3), or appear without with, as in (4). Both with and with-less absolutes can appear sentence initially and sentence finally (compare the a and b sentences).
```
(3) a. With the weather so nice, we really should go outside and enjoy it.
    b. The game is tied with one minute left on the clock.

(4) a. Weather permitting, the game will be held on Sunday.
    b. The child asleep on the floor, toys spread out all around her.
       
```
Design a tgrep query that will find long distance dependencies spanning at least three clauses, e.g.:
```
(5) Who do you think they said would come?
```
(Hint: Try tgrepping for what and which to get an idea of what long distance dependencies look like in the TREEBANK system.) (Hint) (Answer)

Lab 7: October 13, 2000

This week is planned as a continuation of the tgrep lab from last week. Here are a couple more advanced exercises if you finish those above:

The answers provided to the question about attributive and predicative uses of adjectives above assumed that attributive adjectives never occur as ADJP, but only as single JJ nodes, sister to the head noun. This is not, in fact, the case.
- Find examples of an ADJP in attributive position.
- Design a query to count these attributive adjectives as well.
- Do you get any difference in the membership of the 25 most frequent attributive adjectives?
- Do any similar problems occur with the predicative uses?

Lab 8: October 20, 2000

Enough people missed one of the two previous weeks that this week will be a catch-up lab. I have added information and answers for the advanced problems in Lab 6. Feel free to work on your own projects as well.

Lab 9: October 27, 2000

Getting set up

In this lab and for the rest of the semester, we will be writing short programs in Perl (and maybe long ones if you get into something). These programs are files with Perl commands in them that are executable. Linux has something called a path where it looks for executables. In order to make sure that linux can find executables in your home directory, you need to add your home directory to your path. To do this:

Open your .cshrc file with emacs or vi:
```
emacs ~/.cshrc &
```
Add the following line to your .cshrc:
```
setenv PATH /home/your_login_name:$PATH
```
(Be sure to replace your_login_name with your actual login name.)
Save the file. (In emacs, this is done with ctrl-x ctrl-s. The emacs window will stay open, which is good, because you'll need it later. To close the file without quitting emacs, do ctrl-x k.)
In your xterm, type:
```
source ~/.cshrc
```
This tells the computer to look at the contents of your .cshrc file. It always does this when you log in, so you won't have to do this again.

Perl basics

All perl programs must start with this line:
```
#!/usr/bin/perl
```
The comment character is #. That is (with the exception of that first line), anything after a # on a line is ignored by perl, so you can use # to write comments in your code, to help you remember what it's supposed to be doing.
With the exception of that first line, all perl statements must end with a semicolon (;).
To get perl to print something to the terminal, use the command print, like this:
```
print ("your message");
```
The quotes (") are important, because without them, perl tries to interpret "your message" as commands. The quotes tell perl that you mean it to see "your message" as a string.
This print command is very useful in debugging: if you want to see if perl is actually getting to a certain part of your code, insert a print statement there.
If you want a newline at the end of your print message, add a \n at the end, inside the quotes.

A tiny program

Write a perl program that prints "Hello world" to the screen. (Answer)
- In your emacs window, open a new file. (You can do this with the menus [Files--Open File] or with the command ctrl-x ctrl-f. Either way, text will appear at the bottom of the screen asking you to name the file.)
- Write your program in that file, and save it (Files--Save Buffer or ctrl-x ctrl-s).
- In your xterm, make the file executable:
```
chmod u+x filename
```
  (For more on chmod do man chmod.)
- In your xterm, try running the file by typing its name. Did it work?
- If it says "filename: command not found" then you probably haven't set your path correctly (see Getting set up above). Ask for help.
- If it doesn't work, but doesn't say "command not found", there is probably a bug in your short program. Go back to your emacs window, and see if you can see what the bug is and/or ask for help.
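
The three ingredients above (an interpreter line, execute permission, and a directory on the path) can be seen working together in the following sketch. It is an illustration under stated assumptions: it uses a tiny sh script named hello_lab as a stand-in for your Perl program, a scratch directory instead of your home directory, and sh syntax for setting PATH where your .cshrc uses setenv.

```shell
scratch=$(mktemp -d)

# A stand-in script. Its first line names the interpreter, just as
# #!/usr/bin/perl does for a Perl program.
cat > "$scratch/hello_lab" <<'EOF'
#!/bin/sh
echo "Hello world"
EOF

# u+x gives the file's owner (u) permission to execute (x) it.
chmod u+x "$scratch/hello_lab"

# Put the directory at the front of the search path, so that typing the
# bare file name is enough to run it.
PATH="$scratch:$PATH"
export PATH

greeting=$(hello_lab)
echo "$greeting"
```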

A not-so-tiny program

In this section, we're going to write a program that takes a text file as input, and outputs a file that contains the same text, but formatted one sentence per line. (Such a program is useful because grep is much more useful for linguistic purposes when you have files formatted this way, but many electronic corpora aren't.)
Start by pretending that all sentences end in a period, and all periods in the text mark the end of sentences. (This is obviously wrong, but it's better to get the basic structure of the program together and then refine it, than to put in all the rest of the details at the beginning.)
To write this program, we'll need the following perl commands/operators. (Don't worry if these things sound mysterious; they're explained below.) Chapter references are to Learning Perl, but this handout should contain everything you need to know.
- The open and close operators for filehandles. (Chapter 10)
- The substitute operator (s/inputstring/outputstring/flags). (Ch. 7, pages 88ff)
- A while loop, which will let us look at the input file paragraph by paragraph. (Ch. 4)
- The print operator, to write our output to the output file. (Ch. 10)
- A couple of commands that tell perl to take things paragraph by paragraph (instead of line by line), and which allow the substitution operator to match patterns that cross newlines. (Not really handled in Learning Perl.)
The general outline of this program is as follows ("pseudocode"): (Hint)
- Set perl to paragraph mode and multiline matches.
- Open the input file.
- Open the output file.
- While there are paragraphs in the input file, take them one at a time and:
  - Reformat them. (Hint)
  - Print them to the output file.
- Close the input and output files.
You should make a test file called "senttest" in your home directory, so that you have something to test this program on as you develop and debug it. To do this, pick your favorite (untagged, unparsed) corpus and do this:
```
head -100 favorite_corpus > ~/senttest
```
(Some of the corpora were originally formatted on other computer systems with different end-of-line characters (technically, with both a carriage return and a linefeed, where unix/linux has only a linefeed, which it calls a newline). The corpora in /home/corpora/english/soc.text/ definitely don't have this problem.)
Making reference to the information in the following subsections, write a perl program that reformats a corpus to one sentence per line, and call it bysent. (Answer: The whole program)

Paragraph mode and multiline matches

Put the following lines at the beginning of your program:
```
$/ = "";  # paragraph mode
$* = 1;   # multiline matches
```
($/ and $* are special variables that perl checks to see what mode it's in. These lines set the value of those variables to the behavior we want. The default values are such that perl takes one line at a time and doesn't allow multiline matches. Everything after the #s is comments. Note that $* was removed in Perl 5.10; on newer versions, leave that line out and add the m flag to the match or substitution instead.)

Opening and closing files

To open a file to read from:
```
open (FILEHANDLE, "file_path_name");
```
To open a file to write to:
```
open (FILEHANDLE, ">file_path_name");
```
Filehandles allow you to refer to the files you've opened. Filehandles are (by convention) in all caps. Be sure to use different filehandles for your input and output files. Also, be sure to use a different (new) file for your output file. Otherwise, perl will overwrite your input file.
To close a file, once you're done with it (in this case, at the end of the program):
```
close (FILEHANDLE);
```
Perl will automatically close all open files when it finishes a program, but it's good form to do this anyway.

While loops

While loops are one way of getting perl to repeat actions. The basic form of while loops is this:
```
while (test) {
     do_this;
     do_this;
     do_this;
}
```
The effect is that perl will do the test, and if it turns up true, execute, in order, all of the do_this commands. (In perl, the number 0, the string "0", the empty string, and undef are false. Everything else is true.) When it's done with those, it runs the test again, and if it's still true, repeats. Note that unless something in the test or do_this commands potentially changes the outcome of the test, while loops can run forever. If your program is taking much longer than you think it should, stop it with ctrl-c and take a look at your while loops.
One nifty operator in perl is <>. It can be used in the test of a while loop as follows:
```
while (<FILEHANDLE>) {
     do_this;
     do_this;
     do_this;
}
```
Here, the test is whether there is anything left in the file FILEHANDLE. But it has a side-effect: every time it's called, it reads one line (or, in paragraph mode, one paragraph) into the variable $_ and gets ready to read the next one. So, we can use it to go through a file one line (or paragraph) at a time. Note that, by moving through the file, it will eventually get to the end, so that the test will eventually return false. The variable $_ is helpful, because it serves as a default value for many operators, including the substitute operator we will be using.

Substitute operator

The substitute operator looks like this:
```
s/inputstring/outputstring/flags
```
That is, it looks for the input string, and, if it finds it, replaces it with the output string. Where is it looking? If you don't specify where, it will take $_ as its input value and then change the value of $_, so that the next time you use $_, it will have the new value. (Because so many commands use this default variable, you can write whole programs where it gets used in almost every line but is never mentioned.)
The input string is actually a regular expression, but you don't have to worry about that for the basic form of this program. (Perl regular expressions are very similar to tr and grep regular expressions. See Ch. 7 for details.) The output string is just a string, but it may have variables in it.
The flags are symbols that modify the behavior of the substitute operator. For example, the flag i makes it case insensitive, and the flag g makes sure it matches every possible instance in the input (like -a in tgrep). (Hint)
Some other helpful information:
- \n indicates a newline, in both the input string and the output string.
- The . is a special character, so to actually match a period, you have to use \.
- To indicate a space in the input or output string, just type a space between the slashes (or between whatever parts of the string it should appear between).

Printing to an output file

The tiny program in section 3 used the command print. Print actually takes two arguments: a place to print to and what to print. The default place to print to is the screen, or standard output, which is assigned the filehandle STDOUT. The default thing to print is, of course, $_.
So, for bysent, we only need to specify the output filehandle, like this:
```
print FILEHANDLE;
```

Improvements

The array @ARGV stores the arguments a perl program was called with. So, if you call your program like this:
```
bysent daphne.txt
```
The argument ("daphne.txt") is stored in the variable $ARGV[0]. Use this to change bysent so that you can tell it at runtime what you want the input to be. It is also useful for the output file to reflect the name of the input file. (Answer)
With this more general-purpose version of bysent, it is a good idea to test whether the output file already exists, before over-writing it. (Answer) This can be done with the -e operator:
```
-e filename
```
This operator returns true if the file exists and false if it doesn't. If it does exist, you want the program to quit. The best way to do this is with the die operator. If the program reaches a die operator, it stops executing. You can also get it to print a message when it dies, like this:
```
die "That file already exists!!";
```
In order to die only when the file already exists, you need an if-then statement. Here's the syntax:
```
if (test) {
    do_this;
} else {
    do_this;
}
```
The else part is optional. If there is no else part, but the test turns up false, perl won't do anything (that is, it'll go on to the next thing in the program). For more on the if-then and related elements, see Chapters 4 and 9.
Fix the program so that other end-of-sentence punctuation marks are also interpreted. (Answer)
Fix the program so that sentence-internal periods are not interpreted as end-of-sentence marks. (Hints: There are several sentence-internal contexts in which periods are used, and each will need to be handled separately. Once you've figured out how to identify the context, change those periods to some other symbol, before putting newlines after the remaining periods. Then change the "keeper" periods back. Chapter 7 has information about regular expressions.)
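
The protect, split, restore idea in the last hint can be tried at the command line before you write it in Perl. The sketch below is not the Perl solution: it swaps in sed and tr for Perl's substitute operator, handles only the single abbreviation "Mr.", and assumes that the placeholder characters # and @ never occur in the text.

```shell
text='Mr. Smith left early. He came back at noon. Nobody saw him.'

# Three substitutions, then one translation:
#   1. protect the period in "Mr." by turning it into #
#   2. mark each remaining period that is followed by spaces with @
#   3. turn the protected # back into a period
#   4. (tr) turn every @ marker into a newline
sentences=$(printf '%s\n' "$text" |
    sed -e 's/Mr\./Mr#/g' \
        -e 's/\.  */.@/g' \
        -e 's/#/./g' |
    tr '@' '\012')

printf '%s\n' "$sentences"
```

The order matters: if the split came first, "Mr." would already have been broken off onto its own line by the time you tried to protect it.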

Lab 10: November 3, 2000

Background

In class on 11/20 we will be discussing an article by Sampson regarding the variety of possible sequences of immediate constituents of NPs. That article makes some surprising claims and doesn't provide much specific data. The goal of this lab is to write a program that will get the specific data. In particular, we are going to find all of the different sequences of immediate constituents of NPs in a corpus and keep track of how many times each occurs.

The file /home/handouts/wsj_np contains all of the NPs in the corpus /home/corpora/english/tgrep/wsj_mrg.corpus. It was produced with this command:

tgrep -a '/^NP/' /home/corpora/english/tgrep/wsj_mrg.corpus > wsj_np

The file /home/handouts/wsj_np_short contains the first 200 or so lines of the larger file.

Copy /home/handouts/wsj_np_short to your home directory, and take a look at it. What cues are there in the formatting that will help you with this program?
Note that there are multiple NP tags in this corpus. The regular expression /^NP/ found things like NP-SBJ and NP-TMP. We want to consider all of these, of course.
Please do not copy the larger file wsj_np. It's BIG. Also, while you are developing your program, use the smaller file as input. It'll save lots of debugging time, and probably keep memory usage more manageable, too.

Pseudocode

Here is one way to do it. (The hints generally tell you exactly which perl functions you'll need for each step; some give more information.)

Open the input and output files. (Hint)
Step through input file, one line at a time, and: (Hint)
- Check to see if the line is the beginning of a new NP. (Hint)
- If so: (Hint)
  - The previous NP is finished, print it to the output file. (Hint)
  - See how long the label is (i.e., is it three characters, like (NP, or more, like (NP-SBJ?). (Hint)
  - Get the first node after the NP label, and remember it as the first part of a string describing the immediate constituents. (Hint)
- Otherwise: (Hint)
  - See if there is a node label in the right place for immediate constituents. (Hint)
  - If so: (Hint)
    - Get the label of the node. (Hint)
    - Add it to your string describing the immediate constituents. (Hint) (bigger hint)
Print the last NP found. (Hint)
Close input and output files. (Hint)

The program described by this pseudocode creates an output file with one NP-description per line. You can then use sort and uniq -c on that file to get a frequency count of the different NP types.

Copy the pseudocode into the file that will contain your perl program. (I called mine nptyper.) Be sure to comment out the pseudocode (i.e., use #).
Look through the pseudocode and be sure you understand how it will work. For instance, why is the printing to the output file done so early in the code?
Fill in the control structures (if statements and while loops).
With reference to the information in the next section, fill in the rest of the code. (Answer: the whole program)
Once you've got your program working with wsj_np_short, run it on the big file (/home/handouts/wsj_np).
How many different sequences of immediate constituents are there in this file? (Answer)
How many of them occur only once? (Answer)
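
The last two questions can be answered with the same sort and uniq -c recipe used for freq. Here is a minimal sketch on a made-up list of NP descriptions; the labels and the underscore-separated format are assumptions for illustration, so substitute the output file your own program writes.

```shell
# Hypothetical output of the NP-typing program: one description per line.
np_types='DT_NN
NNP
DT_JJ_NN
DT_NN
NP_PP
DT_NN
NNP'

counts=$(printf '%s\n' "$np_types" | sort | uniq -c | sort -rn)

# Each line of counts is one distinct sequence, so wc -l counts the types.
distinct=$(printf '%s\n' "$counts" | wc -l | tr -d ' ')

# awk looks at the count in the first column and keeps the lines where it is 1.
singletons=$(printf '%s\n' "$counts" | awk '$1 == 1' | wc -l | tr -d ' ')

echo "$distinct distinct sequences, $singletons of them occurring once"
```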

Helpful bits of perl

This section describes the (new) perl commands you'll need for this program. You'll also need to use some you learned about last week, so I recommend that you finish that lab first (although you don't need to finish all of the "improvements" in that lab first).

Jump to:

Scalar variables
if-then-else
Matching $_
Array variables
split
length
substr

Scalar variables

Scalar variables are variables which store strings or numbers. All scalar variable names start with $. You can assign a value to a scalar variable with the = operator:

$foo = 3; # The value of $foo is now 3.

$foo = $foo+2; # The value of $foo is the previous value plus 2, or 5.

$foo = "hello"; # The value of $foo is now the string "hello".

$foo = $foo." world"; # The value of $foo is now the string "hello world".

# . is the concatenation operator.

$bar = $foo; # The value of $bar is now whatever the value of $foo was,

# i.e., the string "hello world". The value of $foo is unchanged.

The concatenation operator (.) will be useful in this program because you can use a scalar variable to store the sequence of immediate constituents, building it up as you go along by concatenating each node label onto whatever you had before. (You might want to separate them with some character like _, for readability.)

For more on scalar variables, see Chapter 2.

if-then-else

An if statement is like a while loop in that they are both control structures. If statements look like this:

if (test) {
    do_this_1;
    do_this_2;
    do_this_3;
} else {
    do_this_4;
    do_this_5;
    do_this_6;
}

If the test returns true, then only the do_this statements 1 through 3 are executed. If the test returns false, then only the do_this statements 4 through 6 are executed. The else part is optional. If it is left out and the test returns false, Perl will go on to the next part of the program. You can also string together elsifs like this:

if (test) {
    do_this_1;
} elsif (test) {
    do_this_2;
} elsif (test) {
    do_this_3; 
} else {
    do_this_4;
}

The elsif saves you embedding ifs in each other and keeps the number of curly brackets down.

For more on if-then-else and other control structures, see Chapters 4 and 9.
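
Here is a small, concrete instance of the if/elsif/else pattern. The variable names and the three categories are hypothetical, chosen only to show how exactly one branch gets executed.

```perl
use strict;
use warnings;

my $count = 1;    # hypothetical: number of tokens seen for some NP type
my $kind;

if ($count == 0) {
    $kind = "unattested";
} elsif ($count == 1) {
    $kind = "hapax";
} else {
    $kind = "repeated";
}

print "$kind\n";  # prints hapax
```

Only the second branch runs here: the first test fails, the second succeeds, and Perl skips everything after it.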


Matching $_

The match operator is =~:

$a = "hello world";
$a =~ /^he/;     # true, the regular expression /^he/ matches
                 # the string "hello world"

However, if the variable you want to match is $_, you don't have to specify it.

if (/^he/) {
   do_this;
}

is equivalent to:

if ($_ =~ /^he/) {
   do_this;
}

For more on the match operator, see Chapter 7.
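
The following sketch shows the implicit match against $_ inside a loop. It assumes a small hand-made list of strings in place of lines read from a file; a foreach loop with no loop variable puts each element into $_ in turn.

```perl
use strict;
use warnings;

# Hypothetical input lines, standing in for lines read from a file.
my @lines = ("hello world", "goodbye", "help me");

my $matches = 0;
foreach (@lines) {        # each element is placed in $_ in turn
    if (/^he/) {          # same as: if ($_ =~ /^he/)
        $matches = $matches + 1;
    }
}

print "$matches\n";       # prints 2
```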


Array variables

Another kind of variable is an array variable. An array variable stores an array, or an ordered list of scalar values. Array variable names always start with @. For example:

@foo = (2, 9, "bar");

The value of @foo is now a three element array. The first element is 2, the second is 9, and the third is the string ``bar''. Because the array is ordered, we can get at each of these values separately. For example, the value of $foo[0] is 2, the value of $foo[1] is 9 and the value of $foo[2] is ``bar''. We can also assign the value of particular elements separately:

$foo[1] = "hi";    # The value of @foo is now (2, "hi", "bar")

Note that when referring to specific elements of an array, you use $ instead of @. Also note that the first element is indexed with 0.

The special variable @ARGV (introduced in the improvements section last week) is an array variable.

For more on array variables, see Chapter 3.
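
The short sketch below pulls the array facts above together. It also uses scalar(@foo), which gives the number of elements, and $#foo, which gives the index of the last element; neither is introduced in this lab, so treat them as optional extras.

```perl
use strict;
use warnings;

my @foo = (2, 9, "bar");
$foo[1] = "hi";               # @foo is now (2, "hi", "bar")

my $first = $foo[0];          # first element is at index 0
my $size  = scalar(@foo);     # number of elements: 3
my $last  = $foo[$#foo];      # $#foo is the last index, here 2

print "$first $size $last\n"; # prints 2 3 bar
```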


split

The split function takes a string and turns it into a sequence of values (which can then be stored in an array variable) by breaking it up at every occurrence of some regular expression. The syntax of split is as follows:

split(/delimiter/,string)

For example:

$line = "a:bbb:cd:e";
@fields = split(/:/, $line);   # The value of @fields is ("a","bbb","cd","e").
                               # The value of $line is unchanged.

The default value for the delimiter is whitespace, and the default value for the string is $_. So, the following command is all you need to make @fields be an array containing all of the words in a line just read in (assuming that's what the current value of $_ is).

@fields = split;

For more on split, see pages 89--90.
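
To see the defaults at work, here is a minimal sketch in which $_ is set by hand rather than read from a file (the sample text is made up). With no arguments, split treats any run of whitespace as a single delimiter, so the extra spaces do not produce empty fields.

```perl
use strict;
use warnings;

$_ = "the  daffodils   danced";   # hypothetical line, normally read from input

my @fields = split;               # split $_ on whitespace

print scalar(@fields), "\n";      # prints 3
print "$fields[1]\n";             # prints daffodils
```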


length

The function length returns the length in characters of its argument. For example:

$foo = "hello";

$bar = length($foo); #The value of $bar is 5.

The function length isn't really described in Learning Perl, although the index lists a couple of examples.


substr

The function substr takes a string as input and returns a string as output. Its syntax is as follows:

substr(string,offset,length)

where string is the string you want a substring of, offset is the point in the string that you want the substring to start at, and length is how long you want the substring to be. (If you leave the length unspecified, you get a substring that goes to the end of the original string.) Note that the first character is at offset 0.

Examples:

$string = "chocolate";
$sub = substr($string,0,3);   # The value of $sub is the string ``cho''.
$sub = substr($string,5);     # The value of $sub is the string ``late''.

For more on substr, see pages 154--156.
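
length and substr are often used together. The sketch below, with variable names chosen only for illustration, uses length to compute an offset so that substr returns the last three characters of a string, and then takes a two-character piece from the middle.

```perl
use strict;
use warnings;

my $string = "chocolate";

# Offset counted back from the end: 9 - 3 = 6, so the substring starts at "a".
my $tail   = substr($string, length($string) - 3);
my $middle = substr($string, 3, 2);   # two characters starting at offset 3

print "$tail $middle\n";              # prints ate co
```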


Improvements

Here are some things you might add to your program:

The way the pseudocode is written above, the output file will always start with one blank line. Fix this by checking whether you have anything saved in your NP-describing string before printing it.
Check whether the output file already exists, but instead of dying right away, have the program ask the user if they want to continue and overwrite the file.
Put the sort and uniq -c calls into the Perl program, so you don't have to do it separately each time. (See the use of | with the open operator on page 21 of Learning Perl.)
Alternatively, you could use an associative array (Ch. 5) to keep track of the number of times you encounter each constituent sequence as you go.
The way we used the formatting of the input file was really something of a hack. A more general solution would be to read in each NP and create a representation of the tree structure (apparent from the matching parentheses). Such a general solution would make the program easier to adapt to other needs. However, trees are a somewhat cumbersome data structure in perl.
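
The associative-array improvement mentioned above might look roughly like this. It is only a sketch: the list @sequences is hypothetical and stands in for the constituent sequences your program builds as it reads the corpus, and the output format merely imitates what sort and uniq -c would give you.

```perl
use strict;
use warnings;

# Hypothetical constituent sequences, one per NP token.
my @sequences = ("DT_NN", "NN", "DT_NN", "DT_JJ_NN", "NN", "DT_NN");

my %count;                      # associative array: sequence => number of tokens
foreach my $seq (@sequences) {
    $count{$seq}++;             # a missing entry is treated as 0, then incremented
}

foreach my $seq (sort keys %count) {
    print "$count{$seq} $seq\n";
}
```

With this approach the counts are available inside the program, so there is no need to pipe the output through sort and uniq -c afterwards.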

Lab 11: November 17, 2000

Testing Sampson's results

In class on Monday, we will be discussing Sampson (1987) (in the reader). He makes some (I believe) outrageous claims about the notion of grammaticality on the basis of some frequency data, namely the frequency with which different patterns of immediate constituents of NPs appear. In this section, you will be asked to see how one of the treebank corpora (Brown or WSJ) compares to his smaller corpus. In the next section, you will be asked to provide some data he doesn't, namely, information about what the rare types really look like. Both sections assume that you've completed Lab 10 (minus the `Improvements' section), so if you haven't done Lab 10 yet, do it first.

Fill in the following table. (Note that here an `NP type' is a sequence of immediate constituents of an NP. By `hapax NP types' I mean NP types instantiated by only one token in the corpus.)

Measurement	Sampson's result	Your result
Corpus	part of the LOB	.
Total `words' (including punctuation marks)	39,969	.
NP tokens	8,328	.
NP types	747	.
most common NP type	DT NN	.
most common type as # of tokens	1135	.
most common type as % of tokens	13.63%	.
2nd most common NP type	---	.
2 most common types as % of tokens	---	.
hapax NP types	468	.
% of types that are hapax	62.65%	.
% of tokens that are of hapax types	5.62%	.
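
The percentage rows can be derived from the count rows. As a check on the arithmetic, this sketch recomputes Sampson's three percentages from his counts in the table above; the variable names are made up, and you would substitute your own counts. Since each hapax type has exactly one token, the number of hapax tokens equals the number of hapax types.

```perl
use strict;
use warnings;

# Sampson's counts, copied from the table.
my $np_tokens       = 8328;
my $np_types        = 747;
my $top_type_tokens = 1135;
my $hapax_types     = 468;

my $top_pct         = sprintf("%.2f", 100 * $top_type_tokens / $np_tokens);
my $hapax_type_pct  = sprintf("%.2f", 100 * $hapax_types / $np_types);
my $hapax_token_pct = sprintf("%.2f", 100 * $hapax_types / $np_tokens);

print "most common type as % of tokens:      $top_pct\n";          # 13.63
print "% of types that are hapax:            $hapax_type_pct\n";   # 62.65
print "% of tokens that are of hapax types:  $hapax_token_pct\n";  # 5.62
```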

Some actual data

Look at a couple of screenfuls of the hapax NP types. Does each one require a separate rule? What generalizations can you make about the NPs you looked at?
Find three hapax NP types that look particularly unusual. Use tgrep to find the sentences they appear in.

Further things to do

Sampson states that his corpus contains only 47 different constituent types within noun phrases. How many does your corpus contain? (Hints: Consider only the immediate constituents of NPs. You don't need perl for this: take the output of your `nptyper' program (Lab 10), and use tr, sort, uniq and wc.)
Sampson finds that the mean length of an NP in his corpus is 2.32 immediate constituents (ICs), with a standard deviation of 0.94 ICs. What is the mean length of an NP in your corpus? (Hint: To get the mean length of the NP tokens, and not the types, take the output of your nptyper before sending it through sort and uniq. You don't need a perl script for this one either; wc will do.)
What is the mean length of the hapax NPs?
Rounding the mean length of the hapax NPs to the nearest whole number, how many possible sequences are there of that length or shorter?
Find the standard deviation of the length of the NPs in your corpus. (Here's how to calculate the standard deviation. This time, you will need Perl.)
Write a perl script that will take an input file of NP types (i.e., lists of immediate constituents) and return an example for each type (by querying a corpus with tgrep). It should return both the NP instantiating the type and the sentence the NP is found in. (If you know LaTeX, you might consider formatting the output as a .tex file.)
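
For the standard deviation item in the list above, one possible shape for the calculation is sketched here. The lengths are invented so the arithmetic is easy to follow, and the sketch assumes the population formula (dividing by N rather than N-1); check that against the definition you are asked to use.

```perl
use strict;
use warnings;

# Hypothetical NP lengths (number of ICs per NP token), not real corpus data.
my @lengths = (2, 4, 4, 4, 5, 5, 7, 9);

my $n   = scalar(@lengths);
my $sum = 0;
foreach my $len (@lengths) {
    $sum = $sum + $len;
}
my $mean = $sum / $n;

my $squares = 0;                          # sum of squared deviations from the mean
foreach my $len (@lengths) {
    $squares = $squares + ($len - $mean) ** 2;
}
my $sd = sqrt($squares / $n);             # assumption: population formula (divide by N)

print "mean = $mean, sd = $sd\n";         # prints mean = 5, sd = 2
```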

Emily M. Bender
Last modified: Nov 6 2001
