{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# L2: Language modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab you will experiment with $n$-gram models. You will test various parameters that influence these models’ quality and train to estimate models with additive smoothing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following lines of code import the Python modules needed for this lab:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lt2\n", "import ngrams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data for this lab consists of Arthur Conan Doyle’s stories about Sherlock Holmes: *The Adventures of Sherlock Holmes*, *The Memoirs of Sherlock Holmes*, *The Return of Sherlock Holmes*, *His Last Bow* and *The Case-Book of Sherlock Holmes*. The next piece of code loads the first three of these as training data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data = lt2.read_data(\"/courses/729G17/labs/l2/data/advs.txt\",\n", " \"/courses/729G17/labs/l2/data/mems.txt\",\n", " \"/courses/729G17/labs/l2/data/retn.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is represented as a list of sentences, where one sentence is represented as a list of tokens (strings). The next line prints the 101th sentence:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_data[100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relation between a model’s quality and its order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first part of this lab you will examine the relation between an $n$-gram model’s quality and its **order**, i.e. the value of $n$. 
You will do both a qualitative and a quantitative evaluation, using the entropy measure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Qualitative evaluation\n", "\n", "The following line trains a bigram model of the class `ngrams.Model` on the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = lt2.train(ngrams.Model, 2, training_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this model you can generate random sentences. Every time you run the following code cell, a new sentence is generated." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\" \".join(model.generate()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the sentences. Do they sound natural?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "Write a new implementation of the method `prob()`, such that it estimates probabilities with additive smoothing.
\n", "\n", " \n", "Evaluate the system on `test.txt` with new new class using the entropy measure from Problem 2. Choose the following values for the smoothing constant $k$: 0.00, 0.01, 0.10, 1.00. For $k=0$ you should get the same results as in Problem 2.\n", "
\n", "\n", " \n", "Why and how does the smoothing constant influence the model’s entropy? Provide an explanation based on your understanding of what smoothing does to the distribution of the probability mass among observed and hallucinated occurrences.\n", "
\n", "k = 0.00 | k = 0.01 | k = 0.10 | k = 1.00 | |
n = 1 | to fill | to fill | to fill | to fill |
n = 2 | to fill | to fill | to fill | to fill |
n = 3 | to fill | to fill | to fill | to fill |
n = 4 | to fill | to fill | to fill | to fill |
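The entropy values in the table are averages of the negative log probability per token. A standalone sketch of that computation, assuming a hypothetical `model_prob(word, context)` interface rather than the course's `lt2`/`ngrams` API, and restricted to bigram contexts:

```python
import math

def entropy(model_prob, sentences):
    """Cross-entropy in bits per token: the average of -log2 P(w | context).

    model_prob(word, context) is any callable returning P(word | context);
    this is a hypothetical interface for illustration only. Note that a
    zero probability (possible when k = 0) would make the log undefined.
    """
    log_prob_sum, n_tokens = 0.0, 0
    for sentence in sentences:
        context = ()
        for word in sentence:
            log_prob_sum += math.log2(model_prob(word, context))
            context = (word,)  # bigram context: the previous word only
            n_tokens += 1
    return -log_prob_sum / n_tokens

# Sanity check: a uniform model over a 4-word vocabulary assigns
# probability 0.25 everywhere, i.e. exactly 2 bits per token.
uniform = lambda word, context: 0.25
h = entropy(uniform, [["a", "b"], ["c", "d"]])
```

Lower entropy means the model assigns higher probability to the test data; this is why the $k=0$ column should match the unsmoothed results from Problem 2.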