Ling/CSE 472: Assignment 4: Language models

Part 1: N-grams

The SRI Language Modeling Toolkit is a toolkit for creating and using N-gram language models. It is installed on Patas, at /NLP_TOOLS/ml_tools/lm/srilm. In this part of the exercise, you will use it to train a series of language models, and see how well they model various sets of test data.

The Data

Copy these files to a directory on Patas.

holmes.txt	614,774 words	The complete Sherlock Holmes novels and short stories by A. Conan Doyle, with the exception of the collection of stories His Last Bow (see below) and the collection The Case Book of Sherlock Holmes (which is not yet in the public domain in this country). We will use this corpus to train the language models.
hislastbow.txt	91,144 words	The collection of Sherlock Holmes short stories His Last Bow by A. Conan Doyle.
lostworld.txt	89,600 words	The novel The Lost World by A. Conan Doyle
otherauthors.txt	52,516 words	Stories by English Authors: London, a collection of short stories written around the same time as the Sherlock Holmes canon and The Lost World.

We will use two utilities, ngram-count and ngram, both found in /NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64/. I suggest setting your PATH variable to include this path, at least for the duration of this assignment, by adding the following to the end of the file .bashrc in your home directory:

PATH=/NLP_TOOLS/ml_tools/lm/srilm/srilm-1.5.3/bin/i686-m64:$PATH

Changes to your .bashrc are regsitered when you log in. So to see this take effect, you can log out and log back in, or just once (when you've made the change), type: . ~/.bashrc

Then to confirm that it worked type

which ngram-count

The system should respond with the path to ngram-count, i.e. /NLP_TOOLS/ml_tools/lm/srilm/latest/bin/i686-m64/ngram-count.

You can find basic documentation for ngram and ngram-count here, and more extensive documentation here .

Step 1: Build a language model

The following command will create a bigram language model called wbbigram.bo, using Witten-Bell discounting, from the text file holmes.txt:

ngram-count -text holmes.txt -order 2 -wbdiscount -lm wbbigram.bo

Step 2: Test the model

The following command will evaluate the language model wbbigram.bo against the test file hislastbow.txt

ngram -lm wbbigram.bo -order 2 -ppl hislastbow.txt

To Turn In

The file NgramQ.txt has an outline of the items to turn in. Include answers to all questions in your write up and add more discussion/framing in the form of an introduction and a conclusion. You may have a numbered list in your write up, an item per question, but there should still be an introduction and a conclusion and none of the sentences should assume that the reader is already thinking about the specific question you are answering, unless you have included the question in your answer. Turn in (as a PDF file) via Canvas. Note: Questions 1, 5, and 8-10 require answers in the form of one or more paragraphs.

What are each of the flags that you used in ngram and ngram-count for (in your own words)?

Evaluate this language model against the other test sets, lostworld.txt and otherauthors.txt. In your writeup, tell us:

The perplexity (ppl) against hislastbow.txt
The perplexity against lostworld.txt
The perplexity against otherauthors.txt
Why do you think the files with the higher perplexity got the higher perplexity? (Make sure to reflect on each comparison.) **FIXME** what do we mean by comparisons here?

Now build trigram and 4gram language models against the same training data (still using Witten-Bell discounting). Tell us:

The six perplexity figures: one for each combination of language model and test set.

Build more language models using different smoothing methods. In particular, use "Ristad's natural discounting law" (the -ndiscount flag) and Kneser-Ney discounting (the -kndiscount flag). Tell us:

Which combination - N-gram order, discounting method, test file - gives the best perplexity result?

In addition, answer these questions:

How are the data files you used formatted, i.e., what preprocessing was done on the texts?
What additional preprocessing step should have been taken?
Discuss how/whether this does or does not affect the quality of the language models built.

Part 2: GPT-2

In this part of the assignment, you will interact with GPT-2, a large, modern language model, accessible via web demo, and reflect on possible impacts of fluent-sounding language models.

Visit the GPT-2 demo and play with it a bit to get a sense of how it works.
Now try the Talk to Transformer demo of GPT-2 (which will output whole paragraphs at a time).
Find two different articles on the same topic that vary in some salient way. It could be that one is an opinion piece and the other is an encyclopedia (e.g. Wikipedia) article, or perhaps both are news articles written from different points of view. Take the first sentence (or two) of each and use it as input to the Talk to Transformer demo and save those outputs.
In your write up, include answers to the following:
1. What did you give Talk to Transformer as input and what were the corresponding outputs?
2. What stylistic choices of the outputs reflect the intpus?
3. What properies of GPT-2's the training data might explain those differences?
4. Does the output you got indicate that GPT-2 understood your input? If so, what about it shows that? If not, what explains the degree to which the output is coherent?
5. If you encountered either of your input+output pairs presented as if they were an actual news article, would you be able to tell they weren't? If so, how? If not, why not?
6. Nathan et al 2007 encourage us to think of pervasive use. What kinds of applications of GPT-2 or similar technology could lead to a scenario where GPT-2 written "news" articles are pervasive on the web and frequently shared via social media without the people sharing them knowing they were written by GPT-2? In 2-3 sentences, describe such an application.
7. In 2-3 sentences, reflect on the possible negative consequences of wide-spread GPT-2 authored "news" articles, for individuals and/or society.

Back to main course page