Homework 4: Language Models [100 points]

Instructions

In this assignment, you will gain experience working with Markov models on text.

A skeleton file homework4.py containing empty definitions for each question has been provided. Since portions of this assignment will be graded automatically, none of the names or function signatures in this file should be modified. However, you are free to introduce additional variables or functions if needed.

You may import definitions from any standard Python library, and are encouraged to do so in case you find yourself reinventing the wheel.

You will find that in addition to a problem specification, most programming questions also include a pair of examples from the Python interpreter. These are meant to illustrate typical use cases, and should not be taken as comprehensive test suites.

You are strongly encouraged to follow the Python style guidelines set forth in PEP 8, which was written in part by the creator of Python. However, your code will not be graded for style.

Once you have completed the assignment, you should submit your file on Gradescope. You may submit as many times as you would like before the deadline, but only the last submission will be saved.

1. N-Gram Model [95 points]

In this section, you will build a simple language model that can be used to generate random text resembling a source document. Your use of external code should be limited to built-in Python modules, which excludes, for example, NumPy and NLTK.

[5 points] Write a simple tokenization function tokenize(text) which takes as input a string of text and returns a list of tokens derived from that text. Here, we define a token to be a contiguous sequence of non-whitespace characters, with the exception that any punctuation mark should be treated as an individual token. Hint: Use the built-in constant string.punctuation, found in the string module.
```
>>> tokenize("  This is an example.  ")
['This', 'is', 'an', 'example', '.']
```
```
>>> tokenize("'Medium-rare,' she said.")
["'", 'Medium', '-', 'rare', ',', "'", 'she', 'said', '.']
```
[10 points] Write a function ngrams(n, tokens) that produces a list of all $n$-grams of the specified size from the input token list. Each $n$-gram should consist of a 2-element tuple (context, token), where the context is itself an $(n−1)$-element tuple comprised of the $n−1$ words preceding the current token. The sentence should be padded with $n−1$ "<START>" tokens at the beginning and a single "<END>" token at the end. If $n=1$, all contexts should be empty tuples. You may assume that $n\ge1$.
```
>>> ngrams(1, ["a", "b", "c"])
[((), 'a'), ((), 'b'), ((), 'c'), ((), '<END>')]
```
```
>>> ngrams(2, ["a", "b", "c"])
[(('<START>',), 'a'), (('a',), 'b'), (('b',), 'c'), (('c',), '<END>')]
```
```
>>> ngrams(3, ["a", "b", "c"])
[(('<START>', '<START>'), 'a'), (('<START>', 'a'), 'b'), (('a', 'b'), 'c'), (('b', 'c'), '<END>')]
```
[10 points] In the NgramModel class, write an initialization method __init__(self, n) which stores the order $n$ of the model and initializes any necessary internal variables. Then write a method update(self, sentence) which computes the $n$-grams for the input sentence and updates the internal counts. Lastly, write a method prob(self, context, token) which accepts an $(n−1)$-tuple representing a context and a token, and returns the probability of that token occuring, given the preceding context.
```
>>> m = NgramModel(1)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> m.prob((), "a")
0.3
>>> m.prob((), "c")
0.1
>>> m.prob((), "<END>")
0.2
```
```
>>> m = NgramModel(2)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> m.prob(("<START>",), "a")
1.0
>>> m.prob(("b",), "c")
0.3333333333333333
>>> m.prob(("a",), "x")
0.0
```
[20 points] In the NgramModel class, write a method random_token(self, context) which returns a random token according to the probability distribution determined by the given context. Specifically, let $T=\langle t_1,t_2, \cdots, t_n \rangle$ be the set of tokens which can occur in the given context, sorted according to Python’s natural lexicographic ordering, and let $0\le r<1$ be a random number between 0 and 1. Your method should return the token $t_i$ such that
\[\sum_{j=1}^{i-1} P(t_j\ |\ \text{context}) \le r < \sum_{j=1}^i P(t_j\ | \ \text{context}).\]
You should use a single call to the random.random() function to generate $r$.
```
>>> m = NgramModel(1)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> random.seed(1)
>>> [m.random_token(())
     for i in range(25)]
['<END>', 'c', 'b', 'a', 'a', 'a', 'b', 'b', '<END>', '<END>', 'c', 'a', 'b', '<END>', 'a', 'b', 'a', 'd', 'd', '<END>', '<END>', 'b', 'd', 'a', 'a']
```
```
>>> m = NgramModel(2)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> random.seed(2)
>>> [m.random_token(("<START>",)) for i in range(6)]
['a', 'a', 'a', 'a', 'a', 'a']
>>> [m.random_token(("b",)) for i in range(6)]
['c', '<END>', 'a', 'a', 'a', '<END>']
```
[20 points] In the NgramModel class, write a method random_text(self, token_count) which returns a string of space-separated tokens chosen at random using the random_token(self, context) method. Your starting context should always be the $(n−1)$-tuple ("<START>", ..., "<START>"), and the context should be updated as tokens are generated. If $n=1$, your context should always be the empty tuple. Whenever the special token "<END>" is encountered, you should reset the context to the starting context.
```
>>> m = NgramModel(1)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> random.seed(1)
>>> m.random_text(13)
'<END> c b a a a b b <END> <END> c a b'
```
```
>>> m = NgramModel(2)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> random.seed(2)
>>> m.random_text(15)
'a b <END> a b c d <END> a b a b a b c'
```

[15 points] Write a function create_ngram_model(n, path) which loads the text at the given path and creates an $n$-gram model from the resulting data. Each line in the file should be treated as a separate sentence.

# No random seeds, so your results may vary
>>> m = create_ngram_model(1, "frankenstein.txt"); m.random_text(15)
'beat astonishment brought his for how , door <END> his . pertinacity to I felt'
>>> m = create_ngram_model(2, "frankenstein.txt"); m.random_text(15)
'As the great was extreme during the end of being . <END> Fortunately the sun'
>>> m = create_ngram_model(3, "frankenstein.txt"); m.random_text(15)
'I had so long inhabited . <END> You were thrown , by returning with greater'
>>> m = create_ngram_model(4, "frankenstein.txt"); m.random_text(15)
'We were soon joined by Elizabeth . <END> At these moments I wept bitterly and'

[15 points] Suppose we define the perplexity of a sequence of m tokens $\langle w_1, w_2, \cdots, w_m \rangle$ to be
\[\sqrt[m]{\frac{1}{P(w_1, w_2, \cdots, w_m)}}.\]
For example, in the case of a bigram model under the framework used in the rest of the assignment, we would generate the bigrams $\langle (w_0=\langle \text{START} \rangle , w_1), (w_1, w_2), \cdots,(w_{m−1}, w_m), (w_m, w_{m+1} = \langle \text{END}\rangle)\rangle$, and would then compute the perplexity as
\[\sqrt[m+1]{\prod_{i=1}^{m+1} \frac{1}{P(w_i\ | \ w_{i-1})}}.\]
Intuitively, the lower the perplexity, the better the input sequence is explained by the model. Higher values indicate the input was “perplexing” from the model’s point of view, hence the term perplexity.

In the NgramModel class, write a method perplexity(self, sentence) which computes the $n$-grams for the input sentence and returns their perplexity under the current model. Hint: Consider performing an intermediate computation in log-space and re-exponentiating at the end, so as to avoid numerical overflow.
```
>>> m = NgramModel(1)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> m.perplexity("a b")
3.815714141844439
```
```
>>> m = NgramModel(2)
>>> m.update("a b c d")
>>> m.update("a b a b")
>>> m.perplexity("a b")
1.4422495703074083
```

2. Use AI to Emulate a Corpus of Your Choice [EXTRA CREDIT: up to 10 points]

I have provided you a corpus of Mary Shelley’s Frankenstein for you to train your language model on. You have also been able to generate fake Frankenstein sentences (very thematic) using this language model. But what if you wanted to generate a language model based on a different training corpus? Perhaps you might be a Harry Potter fan who wants to write some fan fiction, or maybe you want to download a ton of Tweets with a particular hashtag to model how people are talking about some particular thing.

Generate an English corpus in the form of a file corpus.txt where, like in frankenstein.txt, each line contains a single sentence.
1. Document your source for the corpus. Something copyrighted is OK as long as you don’t share it further than here; this constitutes fair use for educational purposes.
2. Do not use a corpus that has already been generated in this exact format. Part of your assignment here is to use Python to do this bookkeeping work.
3. Submit corpus.txt as well as some .py script that you used to collate the text into this specific format.
Produce a few dozen sentences, some with random starts and some with chosen first words.
1. You may need to write another method to achieve a random sentence with a chosen first word.
2. Submit generated.txt containing your generated sentences.
In a file named reflections.txt, write a quick reflection on how you gathered your corpus and how well your outputs resemble (or do not resemble) typical English sentences. What limitations does this type of language model have in generating new text? What steps could be taken to improve your model?

In order to receive full extra credit points for this assignment, you must do all of the above steps. Partial credit may be awarded at instructor discretion, although be warned that no credit will be awarded without submitting a reflection (reflections.txt).

3. Feedback [5 points]

[1 point] Approximately how many hours did you spend on this assignment?
[2 point] Which aspects of this assignment did you find most challenging? Were there any significant stumbling blocks?
[2 point] Which aspects of this assignment did you like? Is there anything you would have changed?