Ling/CSE 472: Assignment 3:
N-grams and Corpora
Due April 28th, by 11:59pm
Part 1: Corpora
This is a Treasure Hunt -- it is designed to introduce you to a range of linguistic corpora and to give you an idea of what is available to you on our server.
Please start at the UW Linguistics Treehouse Wiki:
http://depts.washington.edu/uwcl/twiki/bin/view.cgi/Main/WebHome
Begin by finding and reading the Corpus Usage Guidelines. Note: Anyone with a Patas account counts as a lab member. Then explore the information in the CompLing Database and answer the following questions about corpora. Some of these questions will be easier than others; many require only a single-word answer, while others may take a few sentences. Be sure to use your own words in your short answers -- you will not get credit for copied answers! (For certain types of answers, e.g., yes/no, numbers, paths, this obviously does not apply.)
Note: You may need to follow web links from the database entry pages to get some of the information. For example, for data from the LDC, you may find it helpful to use the LDC's Catalog Search feature. In other cases, you may need to investigate actual corpus files on Patas. This exercise will be most beneficial to you if you take this opportunity to really explore the corpora.
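If you do end up investigating corpus files directly, standard shell tools are all you need. For example (the path below is a hypothetical placeholder; substitute the actual path from the database entry):
ls /corpora/SomeCorpus                # see what files the corpus contains
less /corpora/SomeCorpus/README       # page through its documentation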
To Turn In
The file CorporaQ.txt has an outline of the questions to be answered (also below). Modify CorporaQ.txt by adding your answers, then print it to a PDF file, and turn it in via Canvas.
- Are you permitted to copy a corpus onto your own machine for research purposes?
- What is the process for requesting 'Available' corpora?
- Is there a single set of license conditions for all of these corpora?
- Are all of the 'Installed' corpora accessible immediately?
- How many corpora are Installed on our server?
- How many more are Available for installation upon request?
- What is the LDC? Specifically, what does the acronym stand for and what is its mission/purpose (in your own words)?
- How many of the Installed or Available corpora include:
- Finnish data?
- Arabic data?
- English data?
- Are all of the corpora collections of written text? If not, what other kind of content is there?
- Find the corpus with this description:
"TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance."
- What is the name of the corpus?
- What sentence is used in the audio example given in the description for the above corpus?
- What demographic data is included in this corpus?
- What is the path to the corpus on Patas?
- Find the Europarl Parallel Corpus. Hint: It includes Finnish data.
- Give the reference for the paper that must be cited if you use this corpus:
- What is the purpose of the corpus?
- How many languages are included in this corpus?
- What is the original source of the data?
- What is the format of the data files?
- What version of the corpus do we have installed?
- Find the Google N-gram corpus (note: Ignore the GALE version of the corpus).
- Describe this corpus in a couple of sentences.
- What is the path to the corpus on Patas?
- What preprocessing was done to the data?
- What information is contained in the file 5gm.idx?
- What is the unigram count for the word 'the' in this corpus?
- Can you find a unigram of count 1 in this corpus? (A sketch of shell commands for searching the unigram file appears after this question list.)
- Give 5 specific examples of unigrams that reveal problems with tokenization or filtering.
- In a couple of sentences, describe one specific question you could answer given this data.
- List the names of 4 installed corpora that include dialogue act annotation and the languages they include.
- name:
language:
- name:
language:
- name:
language:
- name:
language:
- Find a POS-tagged version of the Brown Corpus.
- What is ICAME? Specifically, what does the acronym stand for and what is its mission/purpose (in your own words)?
- What is the full path to the file on Patas containing the Brown POS-tagged text?
- Describe briefly the format of the text file.
- There are two English corpora installed that have syntactically annotated text: Treebank-2 and CCG.
- More specifically (but briefly), what do these corpora contain? What are the main points of these projects?
- Sticking to the WSJ texts, compare the first parsed file in each corpus (wsj_0001). What English sentences are parsed?
- Give the complete path of the file in the CCG corpus that provides a short explanation of the format of the parsed files:
- Give the complete path of the file in the Treebank-2 corpus that illustrates some problems that occurred when the POS tags were merged with the parsing results:
- In a paragraph, describe three other things of interest you found among the corpora, i.e., something you were not asked about above.
- In a paragraph, describe a type of language use that is not represented in any of the corpora you explored and answer the question: How might lack of such language use in the resources used for training NLP systems affect the functioning of the technology trained on corpus data?
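For the Google N-gram questions above, ordinary shell tools suffice to search the unigram counts. This is only a sketch: the path and file name are hypothetical placeholders (use the actual location you find on Patas), and it assumes gzip-compressed files with one tab-separated 'token<TAB>count' entry per line:
# Look up the count for 'the'
zcat /path/to/ngrams/1gms/vocab.gz | awk -F'\t' '$1 == "the"'
# Find unigrams with a count of 1, if any exist
zcat /path/to/ngrams/1gms/vocab.gz | awk -F'\t' '$2 == 1' | head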
Part 2: N-grams
The SRI Language Modeling Toolkit (SRILM) is a toolkit for creating and using N-gram language models. It is installed on Patas, at /NLP_TOOLS/ml_tools/lm/srilm. In this part of the exercise, you will use it to train a series of language models, and see how well they model various sets of test data.
The Data
Copy these files to a directory on Patas.

holmes.txt (614,774 words): The complete Sherlock Holmes novels and short stories by A. Conan Doyle, with the exception of the collection of stories His Last Bow (see below) and the collection The Case Book of Sherlock Holmes (which is not yet in the public domain in this country). We will use this corpus to train the language models.

hislastbow.txt (91,144 words): The collection of Sherlock Holmes short stories His Last Bow by A. Conan Doyle.

lostworld.txt (89,600 words): The novel The Lost World by A. Conan Doyle.

otherauthors.txt (52,516 words): Stories by English Authors: London, a collection of short stories written around the same time as the Sherlock Holmes canon and The Lost World.
We will use two utilities, ngram-count and ngram, both found
in /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/. I suggest setting your PATH variable to
include this path, at least for the duration of this assignment, by adding the following to the end of the file .bashrc in your home directory:
PATH=/NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64:$PATH
Changes to your .bashrc are registered when you log in. So to see this take effect, you can log out and log back in, or (just once, when you've made the change) type:
. ~/.bashrc
Then, to confirm that it worked, type:
which ngram-count
The system should respond with the path to ngram-count, i.e., /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/ngram-count.
You can find basic documentation for ngram and ngram-count here, and more extensive documentation here.
Step 1: Build a language model
The following command will create a bigram language model called wbbigram.bo,
using Witten-Bell discounting, from the text file holmes.txt:
ngram-count -text holmes.txt -order 2 -wbdiscount -lm wbbigram.bo
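The -order flag is what controls the N-gram order, so the same pattern extends directly to higher-order models. For example, a trigram model (the output file name here is just a suggestion):
ngram-count -text holmes.txt -order 3 -wbdiscount -lm wbtrigram.bo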
Step 2: Test the model
The following command will evaluate the language model wbbigram.bo against the test file hislastbow.txt:
ngram -lm wbbigram.bo -order 2 -ppl hislastbow.txt
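ngram prints two lines of statistics per test file. The numbers below are made up for illustration, but the format is what you should expect:
file hislastbow.txt: 5231 sentences, 91144 words, 312 OOVs
0 zeroprobs, logprob= -195432 ppl= 241.73 ppl1= 287.56
The ppl field is the perplexity figure the questions below ask for (ppl1 is the same measure computed without the end-of-sentence tokens).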
To Turn In
The file NgramQ.txt has an outline of the items to turn
in. Modify the file by adding your answers and then turn it in (as a plain text file) via Canvas. Note: Questions 1, 5, and 8-10 require answers in the form of one or more paragraphs.
- What does each of the flags you used with ngram-count and ngram do (in your own words)?
Evaluate this language model against the other test sets, lostworld.txt and otherauthors.txt. In your writeup, tell us:
- The perplexity (ppl) against hislastbow.txt
- The perplexity against lostworld.txt
- The perplexity against otherauthors.txt
- Why do you think the files with the higher perplexity got the higher perplexity?
Now build trigram and 4-gram language models from the same training data (still using Witten-Bell discounting). Tell us:
- The six perplexity figures: one for each combination of language model and test set.
Build more language models using different smoothing methods. In particular, use "Ristad's natural discounting law" (the -ndiscount flag) and Kneser-Ney discounting (the -kndiscount flag). Tell us:
- Which combination of N-gram order, discounting method, and test file gives the best perplexity result? (A script sketch for running the full grid of experiments appears after this list.)
In addition, answer these questions:
- How are the data files you used formatted, i.e., what preprocessing was done on the texts?
- What additional preprocessing step should have been taken?
- Discuss how/whether this does or does not affect the quality of the language models built.
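Rather than typing the nine training commands and twenty-seven evaluation commands by hand, you can loop over the whole grid from the shell. This is only a sketch: the model file naming scheme is made up, and it assumes the data files are in the current directory and your PATH is set up as described above.
# Train one model per (discounting method, order) pair, then
# evaluate each model against all three test files.
for disc in -wbdiscount -ndiscount -kndiscount; do
  for order in 2 3 4; do
    lm="holmes${disc}.${order}.bo"
    ngram-count -text holmes.txt -order $order $disc -lm $lm
    for test in hislastbow.txt lostworld.txt otherauthors.txt; do
      echo "== $lm vs $test =="
      ngram -lm $lm -order $order -ppl $test
    done
  done
done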