
The following command will find all sentences with NN JJ sequences in btl.txt.

grep '_NN [A-Za-z]*_JJ' btl.txt 

However, grep prints each matching line only once, so a sentence that contains two NN JJ sequences is still counted once. Piping the output through wc would therefore undercount the instances.
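To make the difference between matching lines and matching instances concrete, here is a minimal sketch on made-up data. The file sample.txt and its contents are hypothetical, and the -o option (print each match on its own line) is assumed to be available, as it is in GNU and BSD grep. This is an alternative way to count, not the method used in the rest of this answer key.

```shell
# Hypothetical sample data: two tagged sentences, one per line
printf '%s\n' \
  'the_AT attorney_NN general_JJ met_VBD the_AT secretary_NN general_JJ ._.' \
  'a_AT court_NN martial_JJ was_BEDZ held_VBN ._.' > sample.txt

# Number of lines that contain at least one NN JJ sequence
lines=$(grep -c '_NN [A-Za-z]*_JJ' sample.txt)

# Number of individual NN JJ sequences (-o prints one match per line)
instances=$(grep -o '_NN [A-Za-z]*_JJ' sample.txt | wc -l | tr -d ' ')

echo "lines: $lines, instances: $instances"
```

On this sample the first sentence has two sequences, so the line count and the instance count differ.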

The following command will find all sentences with two or more NN JJ sequences in btl.txt.

grep '_NN [A-Za-z]*_JJ .*_NN [A-Za-z]*_JJ' btl.txt

Likewise, extending the pattern with a third '.*_NN [A-Za-z]*_JJ' would find all sentences with three NN JJ sequences, but there aren't any.
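A sketch of the three-sequence pattern, built by repeating the basic pattern. The variable name p and the file three.txt are only for illustration, and the sample sentence is invented so that there is something to match.

```shell
# Hypothetical sample line containing three NN JJ sequences
printf '%s\n' \
  'a_AT heir_NN apparent_JJ ,_, court_NN martial_JJ and_CC body_NN politic_JJ ._.' > three.txt

# The basic NN JJ pattern, stored once so it can be repeated
p='_NN [A-Za-z]*_JJ'

# Lines with at least three NN JJ sequences
three=$(grep -c "$p .*$p .*$p" three.txt)
echo "lines with three sequences: $three"
```

Running the same grep against btl.txt instead of three.txt should print no lines, in keeping with the observation above.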

Finally, to search all of the files in categories a-j, but not the rest, in a single command, we can use a bracket range in the filename. (Strictly speaking this is shell filename expansion rather than a grep regular expression, but the bracket syntax is the same.)

grep '_NN [A-Za-z]*_JJ' brown_[a-j].tag | wc
    979   30514  278880
grep '_NN [A-Za-z]*_JJ .*_NN [A-Za-z]*_JJ' brown_[a-j].tag | wc
     23     994    8786
grep '_NN [A-Za-z]*_JJ' brown_[k-r].tag | wc
    190    4826   42472 
grep '_NN [A-Za-z]*_JJ .*_NN [A-Za-z]*_JJ' brown_[k-r].tag | wc
      4     161    1424 
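In each wc result the first column is the number of lines, which here is the number of matching sentences. To see which files a bracket range picks up before running grep on it, something like the following can help. It is only a sketch: the empty files are hypothetical stand-ins for the corpus files, created in a scratch directory.

```shell
# Hypothetical stand-ins for the corpus files, created in a scratch directory
dir=$(mktemp -d)
cd "$dir" || exit 1
touch brown_a.tag brown_j.tag brown_k.tag brown_r.tag

# The shell expands each pattern into filenames before grep ever sees it
informative=$(echo brown_[a-j].tag)
imaginative=$(echo brown_[k-r].tag)

echo "a-j: $informative"
echo "k-r: $imaginative"
```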

Each sentence with two sequences contributes one instance beyond the one already counted, so we add the second count to the first: 979 + 23 = 1002 instances of NN JJ in categories a-j and 190 + 4 = 194 in categories k-r. However, these numbers are not directly comparable to the results reported for the LOB, because those results are given as instances per million words. The Brown corpus has about one million words, of which 75% are in categories a-j and 25% in categories k-r. Since the corpus actually contains slightly more than one million words, it won't do to assume 750,000 for a-j and 250,000 for k-r. To find out how many words there actually are in these categories, I edited copies of the files in brown1 to remove the line numbers. Using wc, I still got 640 more words than the official figure. Assuming that these extra words are evenly distributed across the texts, I subtracted 480 and 160 words (75% and 25% of 640) from the counts I was getting for a-j and k-r respectively. This gives:

A-J:  760,108
K-R:  254,204

Using these word counts, there are 1318 instances of NN JJ per million words in categories a-j (1002 / 760,108 x 1,000,000) and 763 instances per million words in categories k-r (194 / 254,204 x 1,000,000). Thus the results for the Brown corpus are similar to what Kennedy reports for the LOB corpus for imaginative prose (categories k-r), but considerably higher for informative prose (categories a-j). (That is, NN JJ appears to be more frequent in American informative prose than in British informative prose.)
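The arithmetic behind these per-million figures can be checked with a short script. The variable names are mine; the numbers are the ones reported above.

```shell
# Instance counts: lines with at least one sequence plus lines with at least two
aj_hits=$((979 + 23))
kr_hits=$((190 + 4))

# Adjusted word counts for each group of categories
aj_words=760108
kr_words=254204

# Normalize to instances per million words, rounded to the nearest integer
aj_rate=$(awk -v h="$aj_hits" -v w="$aj_words" 'BEGIN { printf "%.0f", h / w * 1000000 }')
kr_rate=$(awk -v h="$kr_hits" -v w="$kr_words" 'BEGIN { printf "%.0f", h / w * 1000000 }')

echo "a-j: $aj_hits instances, $aj_rate per million words"
echo "k-r: $kr_hits instances, $kr_rate per million words"
```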


-----

Emily M. Bender
Last modified: Fri Dec 8 12:02:49 2000