This assignment is due on Friday, February 28, 2020 before 01:30PM.

# Instructions

In this homework, you will fine-tune a pre-trained language model and analyze the text it generates. You will do this on a dataset of presidential speeches as well as on another dataset of your choice. You will be using GPT-2, a large Transformer-based language model that was originally trained on text from the web, as your pre-trained model.

## Part 1 - Fine-tune on Presidential Speeches

Run through the provided Colab to fine-tune GPT-2 on presidential speeches. Then modify the Colab to answer the following questions.

1. Compute the perplexity of the test and validation sets according to GPT-2 without fine-tuning and according to GPT-2 with fine-tuning. Does perplexity go down after fine-tuning?
2. Generate at least 100 samples from GPT-2 with fine-tuning and 100 without fine-tuning. Compute the word (or token) overlap of the generated text with the text in the test set. Which set of generated sentences has more words in common with the text in the test set? Is this what you expected?
3. The provided code uses top-k with k=50 for generation. Experiment with different sampling strategies and observe how this impacts the quality and diversity of the generations. If you’d like, implement a measure of text diversity such as self-BLEU or dist-1 (the number of unique generated words divided by the total number of generated words), and plot how it changes as you vary the value of either temperature, k, or p.
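
The three measurements above can be sketched in plain Python. This is a minimal sketch, not the Colab's own code: the function names are hypothetical, the tokenizer is a simple lowercased whitespace split (an assumption; the GPT-2 tokenizer would work equally well), and `perplexity` assumes you have already extracted per-token natural-log probabilities from the model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def tokenize(text):
    """Lowercased whitespace tokenization (an assumption, not the GPT-2 tokenizer)."""
    return text.lower().split()


def token_overlap(generated_texts, reference_texts):
    """Fraction of generated tokens that also occur somewhere in the reference texts."""
    reference_vocab = {tok for text in reference_texts for tok in tokenize(text)}
    generated = [tok for text in generated_texts for tok in tokenize(text)]
    if not generated:
        return 0.0
    return sum(tok in reference_vocab for tok in generated) / len(generated)


def dist_1(generated_texts):
    """dist-1: number of unique generated tokens divided by total generated tokens."""
    generated = [tok for text in generated_texts for tok in tokenize(text)]
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)
```

To compare models, call `token_overlap` once with the fine-tuned samples and once with the samples from the original GPT-2, using the same test-set text as the reference both times. For the diversity plot, compute `dist_1` on a fresh batch of samples at each value of temperature, k, or p.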

## Part 2 - Build Your Own Dataset

Build a text dataset and fine-tune GPT-2 on it. Your dataset can be any text you want. For best results, I recommend finding a text source that is between 5 and 100 MB: a dataset that is too small risks overfitting, and one that is too large takes a long time to train on. If the dataset you find is too large, I recommend sampling a subset of it.

Here is a very non-exhaustive list of ideas:

However, feel free to get as creative as you like, and pick a dataset that interests you!

For your chosen dataset, write a script to process the dataset into three files: train.txt, valid.txt, and test.txt. About 90% of your data should go into train.txt, and the rest should be split evenly between valid.txt and test.txt.
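
One possible shape for such a script is sketched below. It assumes your data has already been loaded as a list of strings, one per document (or per line), and that shuffling before splitting is acceptable for your dataset. The function names, the `seed` parameter, and the newline-joined output format are illustrative choices, not requirements of the assignment.

```python
import random
from pathlib import Path


def split_documents(documents, train_fraction=0.9, seed=0):
    """Shuffle the documents and split them into train, valid, and test lists."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_fraction)
    n_valid = (len(shuffled) - n_train) // 2  # remainder is split evenly
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def write_splits(documents, output_dir, train_fraction=0.9, seed=0):
    """Write train.txt, valid.txt, and test.txt into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    names = ("train.txt", "valid.txt", "test.txt")
    splits = split_documents(documents, train_fraction, seed)
    for name, split in zip(names, splits):
        (output_dir / name).write_text("\n".join(split) + "\n", encoding="utf-8")
```

For example, `write_splits(my_documents, "data")` would create the three files under a `data` directory, where `my_documents` stands in for whatever list your own loading code produces. If your text has a natural ordering that matters (for instance, chapters of a book), you may prefer to split without shuffling.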

6. Did you have to tweak any of the flags passed to run_language_modeling.py to get fine-tuning working on your dataset? If so, which ones did you have to change?

Submit a file report.pdf with your answers to the above questions.