This page contains the lab assignments that we have done so far, as well as links to answers for some of the exercises. I recommend that you don't look at the answers until you think you have a final answer of your own. If you get stumped, ask me or someone else for help or a hint. You won't get nearly as much out of looking at the answers as you will out of figuring them out yourself. On the other hand, there are hints sprinkled through the labs as well. These you can look at earlier.
Note also that the answers provided here are by no means the only correct answers. There are typically multiple, equally valid solutions to these problems. In some cases, the 'answer' will be more of a discussion.
Acknowledgment: These labs, especially labs 2 and 6, were constructed with reference to Dan Jurafsky's labs for a similar course.
This lab involved exploring web resources and getting used to using the computers in B21 Dwinelle.
Here is a description of some basic linux/unix commands.
Use this link to download the file daffodils, used in this lab.
First, take a look at these additional unix commands.
grep pattern filenames
Here is a table describing grep and egrep regular expressions.
And here are some hints on using grep and egrep with regular expressions:
grep -f patfile searchfiles
In this case, you don't need the single quotes. In fact, you can have multiple patterns on separate lines in the file. Be sure that the file ends with exactly one newline.
/home/corpora/english/ICAME/texts/browntag/brown_n.tag
Catch-up lab; no new assignments.
I plan to spend two weeks on tgrep, so don't despair if you don't get through this one this week. On the other hand, let me know if you finish even the advanced problems, so I'll know to dream up some more.
Here's a quick answer to the question "why tgrep?" provided in the tgrep documentation online:
Treebank bracketing is represented by multiple levels of parentheses that look a lot like Lisp:
(TOP (S (NP-SBJ my best friend) (VP gave (NP me) (NP chocolate) (NP-TMP yesterday)) .))
You can't effectively search this kind of thing using plain old grep, because grep can only look at single lines. So we have tgrep, which can look for geometric relationships in the "trees" represented above.
(end quote)
tgrep '/^NP/ << chocolate' /home/corpora/english/tgrep/brown.corpus | more
tgrep pattern corpus
In order to use tgrep, the corpus must first be prepared. Such prepared corpora will live in the directory /home/corpora/english/tgrep. If you're going to be working with the same corpus most of the time, you can store it in the environment variable TGREP_CORPUS, by typing:
setenv TGREP_CORPUS /home/corpora/english/tgrep/brown.corpus
if you want to use the brown corpus. (This environment variable is like the variable DISPLAY: once you set it, it stays set until (a) you log out or (b) you reset it. You can put the statement above in your .cshrc and it will get set every time you log in.)
tgrep '/^NP/ << chocolate' | more
For this section, refer to section 3.4 of the tgrep documentation handout. (Or go to the LDC's tgrep documentation page, or type man tgrepdoc at the prompt.) You might also want to look at AMALGAM's table for the Penn Treebank tagset (linked from the standards section of our class web page).
As a general guideline, it's always best to snoop around a little bit to see how the TREEBANK folks decided to parse the structure you're looking into.
In pseudopassives, the verb is passive, but the subject corresponds not to the object of the (active) verb, but rather to the object of a preposition which is in turn a dependent of the verb. An example is given in (1).
(1) This bed was slept on by George Washington.
A further fact about pseudopassives is that they are only grammatical if the PP with a missing object is immediately adjacent to the verb:
(2) a. ?The protesters have called loudly for a change in government.
b. A change in government is what is called for.
c. *A change in government is what is called loudly for.
Absolutes are adjuncts comprising a subject and a non-finite predicate (NP, AP, PP predicates or non-finite verbal predicates). They can be marked by with, as in (3), or appear without with, as in (4). Both with and with-less absolutes can appear sentence initially and sentence finally (compare the a and b sentences).
(3) a. With the weather so nice, we really should go outside and enjoy it.
b. The game is tied with one minute left on the clock.
(4) a. Weather permitting, the game will be held on Sunday.
b. The child asleep on the floor, toys spread out all around her.
(5) Who do you think they said would come?
(Hint: Try tgrepping for what and which to get an idea of what long distance dependencies look like in the TREEBANK system.) (Hint) (Answer)
This week is planned as a continuation of the tgrep lab from last week. Here are a couple more advanced exercises if you finish those above:
Enough people missed one of the two previous weeks that this week will be a catch-up lab. I have added information and answers for the advanced problems in Lab 6. Feel free to work on your own projects as well.
In this lab and for the rest of the semester, we will be writing short programs in Perl (and maybe long ones if you get into something). These programs are files with Perl commands in them that are executable. Linux has something called a path where it looks for executables. In order to make sure that linux can find executables in your home directory, you need to add your home directory to your path. To do this:
emacs ~/.cshrc &
setenv PATH /home/your_login_name:$PATH
(Be sure to replace your_login_name with your actual login name.)
source .cshrc
This tells the computer to look at the contents of your .cshrc file. It always does this when you log in, so you won't have to do this again.
#!/usr/bin/perl
print ("your message");
chmod u+x filename
(For more on chmod do man chmod.)
head -100 favorite_corpus > ~/senttest
(Some of the corpora were originally formatted on other computer systems with different newline characters (technically, with both a carriage return and a linefeed, where unix/linux has only a linefeed). The corpora in /home/corpora/english/soc.text/ definitely don't have this problem.)
$/ = ""; # paragraph mode
$* = 1; # multiline matches
($/ and $* are special variables that perl checks to see what mode it's in. These lines set the value of those variables to the behavior we want. The default values are such that perl takes one line at a time and doesn't allow multiline matches. Everything after the #s is comments. Note that $* was removed in Perl 5.10; in newer versions, put the /m modifier on the individual match or substitution instead.)
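As a minimal sketch of what paragraph mode does, the following reads a made-up two-paragraph string as if it were a file. The sample text and variable names are assumptions for illustration, and the /m modifier is used for the multiline match, as newer versions of Perl require.

```perl
use strict;
use warnings;

# Hypothetical sample text: two paragraphs separated by a blank line.
my $text = "First line of one.\nSecond line of one.\n\nOnly line of two.\n";

# Open the string as an in-memory file so the sketch needs no real corpus.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my @paragraphs;
{
    local $/ = "";                  # paragraph mode, only inside this block
    while (my $para = <$fh>) {      # each read returns a whole paragraph
        push @paragraphs, $para;
    }
}
close($fh);

# With /m, ^ can match at the start of any line inside the paragraph.
my $found = 0;
if ($paragraphs[0] =~ /^Second/m) {
    $found = 1;
}

print scalar(@paragraphs), " paragraphs read\n";
print "multiline match found: $found\n";
```

Using local keeps the change to $/ from leaking into the rest of the program, which matters once a script reads more than one file.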
open (FILEHANDLE, "file_path_name");
open (FILEHANDLE, ">file_path_name");
close (FILEHANDLE);
Perl will automatically close all open files when it finishes a program, but it's good form to do this anyway.
while (test) {
    do_this;
    do_this;
    do_this;
}
The effect is that perl will do the test, and if it turns up true, execute, in order, all of the do_this commands. (In perl, 0, the empty string, and undef are false. Everything else is true.) When it's done with those, it runs the test again, and if it's still true, repeats. Note that unless something in the test or do_this commands potentially changes the outcome of the test, while loops can run forever. If your program is taking much longer than you think it should, stop it with ctrl-c and take a look at your while loops.
while (<FILEHANDLE>) {
    do_this;
    do_this;
    do_this;
}
Here, the test is whether there is anything left in the file FILEHANDLE. But it has a side-effect: every time it's called, it reads one line (or, in paragraph mode, one paragraph) into the variable $_ and gets ready to read the next one. So, we can use it to go through a file one line (or paragraph) at a time. Note that, by moving through the file, it will eventually get to the end, so that the test will eventually return false. The variable $_ is helpful, because it serves as a default value for many operators, including the substitute operator we will be using.
s/inputstring/outputstring/flags
That is, it looks for the input string, and, if it finds it, replaces it with the output string. Where is it looking? If you don't specify where, it will take $_ as its input value and then change the value of $_, so that the next time you use $_, it will have the new value. (Because so many commands use this default variable, you can write whole programs where it gets used in almost every line but is never mentioned.)
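To see the read loop and the substitute operator working together through $_, here is a small sketch. The input text is made up, and it is read from a string so that nothing depends on a particular file being present.

```perl
use strict;
use warnings;

my $text = "the cat sat\nthe dog ran\n";    # hypothetical input
open(my $in, '<', \$text) or die "Cannot open input: $!";

my $output = "";
while (<$in>) {                  # each pass reads one line into $_
    s/the/a/g;                   # no target named, so s/// changes $_
    $output = $output . $_;      # $_ now holds the changed line
}
close($in);

print $output;
```

The g flag asks for every occurrence on the line to be replaced, not just the first one.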
print FILEHANDLE;
bysent daphne.txt
The argument ("daphne.txt") is stored in the variable $ARGV[0]. Use this to change bysent so that you can tell it at runtime what you want the input to be. It is also useful for the output file to reflect the name of the input file. (Answer)
-e filename
This operator returns true if the file exists and false if it doesn't. If it does exist, you want the program to quit. The best way to do this is with the die operator. If the program reaches a die operator, it stops executing. You can also get it to print a message when it dies, like this:
die "That file already exists!!"
In order to die only when the file already exists, you need an if-then statement. Here's the syntax:
if (test) {
    do_this;
} else {
    do_this;
}
The else part is optional. If there is no else part, but the test turns up false, perl won't do anything (that is, it'll go on to the next thing in the program). For more on the if-then and related elements, see Chapters 4 and 9.
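The pieces above (the -e test, die, and if/else) can be tried out in isolation with a sketch like the following. The file names are hypothetical: a scratch file is created so that one name certainly exists, and eval is used only so that the sketch can show the die message and keep running. In a real script you would let die stop the program.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so that we have one name that certainly exists.
my ($tmpfh, $existing) = tempfile(UNLINK => 1);
my $missing = $existing . ".not_there";     # assumed not to exist

my $status_existing;
if (-e $existing) {
    $status_existing = "exists";
} else {
    $status_existing = "free";
}

my $status_missing = "free";
if (-e $missing) {                # no else part: nothing happens if false
    $status_missing = "exists";
}

# die normally ends the program; eval traps it here for demonstration.
eval {
    if (-e $existing) {
        die "That file already exists!!\n";
    }
};
my $message = $@;

print "scratch file: $status_existing\n";
print "other name:   $status_missing\n";
print "die said:     $message";
```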
In class on 11/20 we will be discussing an article by Sampson regarding the variety of possible sequences of immediate constituents of NPs. That article makes some surprising claims and doesn't provide much specific data. The goal of this lab is to write a program that will get the specific data. In particular, we are going to find all of the different sequences of immediate constituents of NPs in a corpus and keep track of how many times each occurs.
The file /home/handouts/wsj_np contains all of the NPs in the corpus /home/corpora/english/tgrep/wsj_mrg.corpus. It was produced with this command:
tgrep -a '/^NP/' /home/corpora/english/tgrep/wsj_mrg.corpus > wsj_np
The file /home/handouts/wsj_np_short contains the first 200 or so lines of the larger file.
Here is one way to do it (the hints generally tell you exactly which perl functions you'll need for each step; some give more information):
The program described by this pseudocode creates an output file with one NP-description per line. You can then use sort and uniq -c on that file to get a frequency count of the different NP types.
This section describes the (new) perl commands you'll need for this program. You'll also need to use some you learned about last week, so I recommend that you finish that lab first (although you don't need to finish all of the "improvements" in that lab first).
Scalar variables are variables which store strings or numbers. All scalar variable names start with $. You can assign a value to a scalar variable with the = operator:
$foo = 3;                # The value of $foo is now 3.
$foo = $foo + 2;         # The value of $foo is the previous value plus 2, or 5.
$foo = "hello";          # The value of $foo is now the string "hello".
$foo = $foo . " world";  # The value of $foo is now the string "hello world".
                         # . is the concatenation operator.
$bar = $foo;             # The value of $bar is now whatever the value of $foo was,
                         # i.e., the string "hello world". The value of $foo is unchanged.
The concatenation operator (.) will be useful in this program because you can use a scalar variable to store the sequence of immediate constituents, building it up as you go along by concatenating each node label onto whatever you had before. (You might want to separate them with some character like _, for readability.)
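As a rough illustration of building up a sequence by concatenation, the sketch below adds three made-up node labels one at a time, separated by _. The labels are assumptions for the example, not output from the corpus.

```perl
use strict;
use warnings;

my $sequence = "";                        # nothing seen yet
$sequence = $sequence . "DT";             # first label, no separator
$sequence = $sequence . "_" . "JJ";       # later labels get a _ in front
$sequence = $sequence . "_" . "NN";

print "$sequence\n";
```

In the real program the labels would come from the line being processed instead of being typed in, but the building-up step is the same.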
For more on scalar variables, see Chapter 2.
An if statement is like a while loop in that they are both control structures. If statements look like this:
if (test) {
    do_this_1;
    do_this_2;
    do_this_3;
} else {
    do_this_4;
    do_this_5;
    do_this_6;
}
If the test returns true, then only the do_this statements 1 through 3 are executed. If the test returns false, then only the do_this statements 4 through 6 are executed. The else part is optional. If it is left out and the test returns false, Perl will simply go on to the next part of the program. You can also string together elsifs like this:
if (test) {
    do_this_1;
} elsif (test) {
    do_this_2;
} elsif (test) {
    do_this_3;
} else {
    do_this_4;
}
The elsif saves you embedding ifs in each other and keeps the number of curly brackets down.
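A small sketch of an elsif chain, using a made-up count of constituents: only the first branch whose test is true gets run, and the else branch catches everything left over.

```perl
use strict;
use warnings;

my $count = 2;      # hypothetical number of constituents in an NP
my $kind;

if ($count == 0) {
    $kind = "empty";
} elsif ($count == 1) {
    $kind = "one constituent";
} elsif ($count == 2) {
    $kind = "two constituents";
} else {
    $kind = "three or more constituents";
}

print "$kind\n";
```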
For more on if-then-else and other control structures, see Chapters 4 and 9.
The match operator is =~:
$a = "hello world";
$a =~ /^he/;    # true, the regular expression /^he/ matches
                # the string "hello world"
However, if the variable you want to match is $_, you don't have to specify it.
if (/^he/) { do_this; }
is equivalent to:
if ($_ =~ /^he/) { do_this; }
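The two forms can be compared side by side in a sketch like this one. The sample line is made up; note that the opening parenthesis has to be escaped with a backslash inside the regular expression.

```perl
use strict;
use warnings;

my $line = "(NP-SBJ my best friend)";    # hypothetical line of input

# Explicit form: name the variable to be matched.
my $explicit = 0;
if ($line =~ /^\(NP/) {
    $explicit = 1;
}

# Implicit form: with no variable named, the match looks at $_.
$_ = $line;
my $implicit = 0;
if (/^\(NP/) {
    $implicit = 1;
}

# A pattern that does not match leaves the flag at 0.
my $is_vp = 0;
if (/^\(VP/) {
    $is_vp = 1;
}

print "explicit: $explicit, implicit: $implicit, VP: $is_vp\n";
```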
For more on the match operator, see Chapter 7.
Another kind of variable is an array variable. An array variable stores an array, or an ordered list of scalar values. Array variable names always start with @. For example:
@foo = (2, 9, "bar");
The value of @foo is now a three element array. The first element is 2, the second is 9, and the third is the string "bar". Because the array is ordered, we can get at each of these values separately. For example, the value of $foo[0] is 2, the value of $foo[1] is 9 and the value of $foo[2] is "bar". We can also assign the value of particular elements separately:
$foo[1] = "hi"; # The value of @foo is now (2, "hi", "bar")
Note that when referring to specific elements of an array, you use $ instead of @. Also note that the first element is indexed with 0.
The special variable @ARGV (introduced in the improvements section last week) is an array variable.
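The array operations described in this section can be checked with a short sketch. The scalar and join functions are not covered above; they are standard Perl functions used here only to show the size and contents of the array.

```perl
use strict;
use warnings;

my @foo = (2, 9, "bar");

my $first = $foo[0];           # elements are read with $, starting at 0
$foo[1] = "hi";                # replace the second element

my $size   = scalar(@foo);     # number of elements in the array
my $joined = join(" ", @foo);  # all elements as one string

print "first: $first, size: $size, contents: $joined\n";
```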
For more on array variables, see Chapter 3.
The split function takes a string and turns it into a sequence of values (which can then be stored in an array variable) by breaking it up at every occurrence of some regular expression. The syntax of split is as follows:
split(/delimiter/,string)
For example:
$line = "a:bbb:cd:e";
@fields = split(/:/,$line);    # The value of @fields is ("a","bbb","cd","e")
                               # The value of $line is unchanged.
The default value for the delimiter is whitespace, and the default value for the string is $_. So, the following command is all you need to make @fields be an array containing all of the words in a line just read in (assuming that's what the current value of $_ is).
@fields = split;
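Both uses of split, with an explicit delimiter and with the defaults, can be tried in a sketch like this. The sample sentence is made up.

```perl
use strict;
use warnings;

# Explicit delimiter and explicit string.
my $line   = "a:bbb:cd:e";
my @fields = split(/:/, $line);

# Defaults: split $_ on whitespace.
$_ = "my best friend gave me chocolate";
my @words = split;

print scalar(@fields), " fields, the second is $fields[1]\n";
print scalar(@words), " words, the last is $words[5]\n";
```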
For more on split, see pages 89-90.
The function length returns the length in characters of its argument. For example:
$foo = "hello";
$bar = length($foo);    # The value of $bar is 5.
The function length isn't really described in Learning Perl, although the index lists a couple of examples.
The function substr takes a string as input and returns a string as output. Its syntax is as follows:
substr(string,offset,length)
where string is the string you want a substring of, offset is the point in the string that you want the substring to start at, and length is how long you want the substring to be. (If you leave the length unspecified, you get a substring that goes to the end of the original string.) Note that the first character is at offset 0.
Examples:
$string = "chocolate";
$sub = substr($string,0,3);    # The value of $sub is the string "cho".
$sub = substr($string,5);      # The value of $sub is the string "late".
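Since length and substr are often used together, here is a brief sketch that combines them. The idea of picking out the last character is just an illustration, not a step the lab requires.

```perl
use strict;
use warnings;

my $string = "chocolate";

my $len    = length($string);            # number of characters
my $last   = substr($string, $len - 1);  # from the final offset to the end
my $middle = substr($string, 3, 3);      # three characters starting at offset 3

print "length: $len, last: $last, middle: $middle\n";
```

Because offsets start at 0, the last character sits at offset length minus 1.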
For more on substr, see pages 154-156.
Here are some things you might add to your program:
In class on Monday, we will be discussing Sampson (1987) (in the reader). He makes some (I believe) outrageous claims about the notion of grammaticality on the basis of some frequency data, namely the frequency with which different patterns of immediate constituents of NPs appear. In this section, you will be asked to see how one of the treebank corpora (Brown or WSJ) compares to his smaller corpus. In the next section, you will be asked to provide some data he doesn't, namely, information about what the rare types really look like. Both sections assume that you've completed Lab 10 (minus the 'Improvements' section), so if you haven't done Lab 10 yet, do it first.
Measurement | Sampson's result | Your result |
Corpus | part of the LOB | . |
Total 'words' (including punctuation marks) | 39,969 | . |
NP tokens | 8,328 | . |
NP types | 747 | . |
most common NP type | DT NN | . |
most common type as # of tokens | 1135 | . |
most common type as % of tokens | 13.63% | . |
2nd most common NP type | --- | . |
2 most common types as % of tokens | --- | . |
hapax NP types | 468 | . |
% of types that are hapax | 62.65% | . |
% of tokens that are of hapax types | 5.62% | . |
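The percentage rows in the table follow from the raw counts, and the same arithmetic applies to your own numbers. The sketch below redoes the calculation for Sampson's column. The variable names are made up, and it relies on the fact that each hapax type occurs exactly once, so the number of tokens of hapax types equals the number of hapax types.

```perl
use strict;
use warnings;

# Counts taken from the "Sampson's result" column of the table.
my $np_tokens   = 8328;
my $np_types    = 747;
my $top_tokens  = 1135;     # tokens of the most common type
my $hapax_types = 468;      # types that occur exactly once

# sprintf with %.2f rounds to two decimal places.
my $top_pct         = sprintf("%.2f", 100 * $top_tokens  / $np_tokens);
my $hapax_type_pct  = sprintf("%.2f", 100 * $hapax_types / $np_types);
my $hapax_token_pct = sprintf("%.2f", 100 * $hapax_types / $np_tokens);

print "most common type as % of tokens:      $top_pct%\n";
print "% of types that are hapax:            $hapax_type_pct%\n";
print "% of tokens that are of hapax types:  $hapax_token_pct%\n";
```

Replacing the four counts with the ones from your own corpus fills in the corresponding rows of the "Your result" column.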