The SRI Language Modeling Toolkit (SRILM) is a collection of tools for building and using N-gram language models. It is installed on Patas, at /NLP_TOOLS/ml_tools/lm/srilm. In this part of the exercise, you will use it to train a series of language models and see how well they model various sets of test data.
Copy these files to a directory on Patas.
File | Size | Description
holmes.txt | 614,774 words | The complete Sherlock Holmes novels and short stories by A. Conan Doyle, with the exception of the collection of stories His Last Bow (see below) and the collection The Case Book of Sherlock Holmes (which is not yet in the public domain in this country). We will use this corpus to train the language models.
hislastbow.txt | 91,144 words | The collection of Sherlock Holmes short stories His Last Bow by A. Conan Doyle.
lostworld.txt | 89,600 words | The novel The Lost World by A. Conan Doyle.
otherauthors.txt | 52,516 words | Stories by English Authors: London, a collection of short stories written around the same time as the Sherlock Holmes canon and The Lost World.
We will use two utilities, ngram-count and ngram, both found in /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/. I suggest setting your PATH variable to include this path, at least for the duration of this assignment, by adding the following to the end of the file .bashrc in your home directory:
PATH=/NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64:$PATH
Changes to your .bashrc are registered when you log in. To see this take effect, you can log out and log back in, or (just once, when you have made the change) source the file by typing: . ~/.bashrc
Then, to confirm that it worked, type:
which ngram-count
The system should respond with the path to ngram-count, i.e. /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/ngram-count.
You can find basic documentation for ngram and ngram-count here, and more extensive documentation here.
The following command will create a bigram language model called wbbigram.bo, using Witten-Bell discounting, from the text file holmes.txt:
ngram-count -text holmes.txt -order 2 -wbdiscount -lm wbbigram.bo
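As an aside (not required for the assignment), the model file that ngram-count writes with -lm is a plain-text ARPA-format backoff model, so you can peek at it directly, for example:
head -20 wbbigram.bo
The \data\ section at the top lists how many unigrams and bigrams the model contains, followed by the log-probabilities and backoff weights themselves.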
The following command will evaluate the language model wbbigram.bo against the test file hislastbow.txt:
ngram -lm wbbigram.bo -order 2 -ppl hislastbow.txt
The file NgramQ.txt has an outline of the items to turn in. Include answers to all questions in your write-up, and add discussion and framing in the form of an introduction and a conclusion. Your write-up may include a numbered list with one item per question, but it should still have an introduction and a conclusion, and no sentence should assume the reader is already thinking about the specific question being answered unless you include the question in your answer. Turn in your write-up as a PDF file via Canvas. Note: Questions 1, 5, and 8-10 require answers in the form of one or more paragraphs.
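When you run ngram with -ppl, its report typically has roughly the following shape (the letters S, W, V, Z, L, P, and P1 below are placeholders for the actual counts and scores, not real results):
file hislastbow.txt: S sentences, W words, V OOVs
Z zeroprobs, logprob= L ppl= P ppl1= P1
Here logprob is the total log (base 10) probability of the test text, ppl is the perplexity including end-of-sentence tokens, and ppl1 is the perplexity excluding them.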
Evaluate this language model against the other test sets, lostworld.txt and otherauthors.txt (example commands are sketched below). In your write-up, tell us:
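These evaluations parallel the command above; assuming the same model file, the commands would be:
ngram -lm wbbigram.bo -order 2 -ppl lostworld.txt
ngram -lm wbbigram.bo -order 2 -ppl otherauthors.txt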
Now build trigram and 4-gram language models from the same training data, still using Witten-Bell discounting (example commands are sketched below). Tell us:
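For example (the model file names wbtrigram.bo and wb4gram.bo are just illustrative; any names will do):
ngram-count -text holmes.txt -order 3 -wbdiscount -lm wbtrigram.bo
ngram-count -text holmes.txt -order 4 -wbdiscount -lm wb4gram.bo
Remember to pass the matching -order value to ngram when you evaluate each model.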
Build more language models using different smoothing methods. In particular, use "Ristad's natural discounting law" (the -ndiscount flag) and Kneser-Ney discounting (the -kndiscount flag); one way to organize all of these runs is sketched below. Tell us:
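If you would rather not type every combination by hand, a small shell loop can build and evaluate all of them. This is only a sketch; the model file names are illustrative, and you may prefer to run the commands individually:
# Build and evaluate a model for each discounting method and N-gram order.
for disc in wbdiscount ndiscount kndiscount; do
    for order in 2 3 4; do
        ngram-count -text holmes.txt -order $order -$disc -lm $disc$order.bo
        for test in hislastbow.txt lostworld.txt otherauthors.txt; do
            echo "=== $disc, order $order, test set $test ==="
            ngram -lm $disc$order.bo -order $order -ppl $test
        done
    done
done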
In addition, answer these questions:
In this part of the assignment, you will interact with GPT-2, a large, modern language model accessible via a web demo, and reflect on the possible impacts of fluent-sounding language models.