# L3: Part-of-speech tagging

## Introduction

Part-of-speech (POS) tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb. In this lab you will experiment with POS taggers trained on the [Stockholm Umeå Corpus (SUC)](http://spraakbanken.gu.se/eng/resources/suc), a Swedish text corpus containing more than 74,000 sentences (1.1&nbsp;million tokens) that were manually annotated with parts of speech, among other information. The corpus is divided into two files:

<table align="left">
<tr><td><code>suc-train.txt</code></td><td style="text-align: right">72,594 sentences</td><td style="text-align: right">1,142,802 tokens</td></tr>
<tr><td><code>suc-test.txt</code></td><td style="text-align: right">1,569 sentences</td><td style="text-align: right">23,319 tokens</td></tr>
</table>

Start by importing the Python module that is required for this lab:

In [None]:
import lt3

The next code cell loads the data:

In [None]:
training_data = lt3.read_data("/courses/729G17/labs/l3/data/suc-train.txt")
test_data = lt3.read_data("/courses/729G17/labs/l3/data/suc-test.txt")

Both data sets consist of tagged sentences. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word and the second component represents a part-of-speech tag. Run the following code cell to see an example:

In [None]:
training_data[42]
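As a self-contained illustration of this representation, the following sketch builds a tagged sentence by hand and separates the words from the tags. The sentence and its tags are made up for illustration and were not copied from SUC; the variable names are hypothetical.

```python
# Hypothetical tagged sentence: a list of (word, tag) string pairs.
# The tags are illustrative and were not looked up in the corpus.
example_sentence = [("Smygrustning", "NN"), ("av", "PP"), ("raketvapen", "NN")]

# Separate the words from the tags.
example_words = [word for word, tag in example_sentence]
example_tags = [tag for word, tag in example_sentence]

print(example_words)
print(example_tags)
```

Unpacking each pair as `word, tag` in a loop or comprehension is the same pattern that the tag-extraction code in this lab uses.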

The next code cell extracts all unique tags from the training data and stores them for future use. The tags are explained and exemplified in Table&nbsp;12 (page&nbsp;20) of the [SUC 2.0 Manual](https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf).

In [None]:
suc_tags = set()
for tagged_sentence in training_data:
    for word, tag in tagged_sentence:
        suc_tags.add(tag)
suc_tags = sorted(suc_tags)
print(" ".join(suc_tags))

Now train a tagger with these tags by running the next code cell. (This will take some time.)

In [None]:
tagger = lt3.PerceptronTagger(suc_tags)
tagger.train(training_data)

## Implement methods for evaluation

Just like other systems that solve a classification task, a POS tagger can be evaluated using accuracy, precision, and recall relative to gold standard data. All three measures can be calculated from a confusion matrix. The following line of code creates a confusion matrix for the trained tagger relative to the test data.

In [None]:
matrix = lt3.confusion_matrix(tagger, test_data)

In Python, a confusion matrix is represented as a two-layer dictionary where the first layer corresponds to the matrix rows (tags according to the gold standard) and the second layer corresponds to the matrix columns (automatically predicted tags). For example, the next line of code prints the number of times that the trained tagger wrongly tagged a noun (tag `NN`) as a verb (tag `VB`):

In [None]:
print(matrix['NN']['VB'])
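To make the two-layer structure concrete, here is a minimal sketch with a hand-made matrix over two tags. All counts are invented, and `toy_matrix` is a hypothetical name chosen so that it does not overwrite `matrix`.

```python
# Toy confusion matrix over two tags; all counts are invented for illustration.
# Outer keys are gold-standard tags (rows), inner keys are predicted tags (columns).
toy_matrix = {
    "NN": {"NN": 8, "VB": 2},
    "VB": {"NN": 1, "VB": 9},
}

# Nouns wrongly tagged as verbs: row "NN", column "VB".
nouns_as_verbs = toy_matrix["NN"]["VB"]

# Total number of gold-standard nouns: the sum over row "NN".
gold_nouns = sum(toy_matrix["NN"].values())

# Total number of tokens predicted as nouns: the sum over column "NN".
predicted_nouns = sum(row["NN"] for row in toy_matrix.values())

print(nouns_as_verbs, gold_nouns, predicted_nouns)
```

Row sums and column sums like these are the quantities that recall and precision are built from.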

The evaluation measures themselves are implemented via the functions `lt3.accuracy()`, `lt3.precision()`, and `lt3.recall()`.

In [None]:
a = lt3.accuracy(matrix)
print("accuracy = {:.2%}".format(a))
for tag in tagger.tags:
    p = lt3.precision(matrix, tag)
    r = lt3.recall(matrix, tag)
    print("tag {}: precision = {:.2%}, recall = {:.2%}".format(tag, p, r))

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Write your own implementation of the three evaluation functions. Test your code by evaluating the trained tagger with the new functions. You should get the same results as before.
</div>
</div>

To solve this problem you should change the function definitions in the code cells below. Initially, these functions call their respective counterparts from the module `lt3`; replace these function calls with your own code.

In [None]:
def accuracy(matrix):
    """Computes accuracy based on the specified confusion matrix."""
    # TODO: Replace the following line with your own code.
    return lt3.accuracy(matrix)

The function `accuracy()` takes as its argument a confusion matrix `matrix` (a dictionary as defined above) and computes the tagger&rsquo;s accuracy as a floating-point number between&nbsp;0 and&nbsp;1. If the measure is not defined, it returns `float('NaN')`.
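For reference, and assuming the standard definition, accuracy can be written as a formula. The notation $M[g][p]$ is introduced here for the number of tokens with gold-standard tag $g$ and predicted tag $p$; it is not part of `lt3`.

$$
\mathrm{accuracy} = \frac{\sum_{g} M[g][g]}{\sum_{g} \sum_{p} M[g][p]}
$$

The numerator sums the diagonal of the matrix (correctly tagged tokens) and the denominator sums all cells (all tokens). The fraction is undefined when the denominator is zero, that is, when the matrix contains no counts at all.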

In [None]:
def precision(matrix, tag):
    """Computes precision for tag `tag` based on the specified confusion matrix."""
    # TODO: Replace the following line with your own code.
    return lt3.precision(matrix, tag)

In [None]:
def recall(matrix, tag):
    """Computes recall for tag `tag` based on the specified confusion matrix."""
    # TODO: Replace the following line with your own code.
    return lt3.recall(matrix, tag)

The functions `precision()` and `recall()` take as their arguments a confusion matrix `matrix` and a tag `tag` (a string) and compute the tagger&rsquo;s precision/recall for that tag as a floating-point number between&nbsp;0 and&nbsp;1. As before, they return `float('NaN')` if the measure is not defined.
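Assuming the standard definitions, and writing $M[g][p]$ for the number of tokens with gold-standard tag $g$ and predicted tag $p$ (a notation introduced here, not one used by `lt3`), precision and recall for a tag $t$ are:

$$
\mathrm{precision}(t) = \frac{M[t][t]}{\sum_{g} M[g][t]}
\qquad
\mathrm{recall}(t) = \frac{M[t][t]}{\sum_{p} M[t][p]}
$$

Precision divides by the column sum for $t$ (all tokens that the tagger labelled $t$), whereas recall divides by the row sum for $t$ (all tokens whose gold-standard tag is $t$). Precision is therefore undefined when the tagger never predicts $t$, and recall is undefined when $t$ never occurs in the gold-standard data.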

To test your code, you can run this code cell:

In [None]:
a = accuracy(matrix)
print("accuracy = {:.2%}".format(a))
for tag in tagger.tags:
    p = precision(matrix, tag)
    r = recall(matrix, tag)
    print("tag {}: precision = {:.2%}, recall = {:.2%}".format(tag, p, r))

## Create a confusion matrix

In the next problem you will create the confusion matrix that you have used above by writing your own implementation of the function `lt3.confusion_matrix()`. This function takes as its arguments a tagger and a gold-standard dataset and returns the appropriate matrix.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
    
Provide your own implementation of the function `confusion_matrix()`. Test your implementation by redoing the evaluation from Problem&nbsp;1. You should get the same results as before.
</div>
</div>

To solve this problem you should change the code in the following cell:

In [None]:
def confusion_matrix(tagger, tagged_sentences):
    """Returns the confusion matrix for the specified gold-standard data."""
    matrix = {}
    for g in tagger.tags:
        matrix[g] = {}
        for p in tagger.tags:
            matrix[g][p] = 0
    # TODO: Fill the matrix with the appropriate counts.
    return matrix

To create the confusion matrix, you have to compare the gold-standard tags and the tagger&rsquo;s predictions for every tagged sentence in the gold-standard data (`tagged_sentences`). To obtain the predicted tags, call the method `tagger.tag()`. This method expects a list of tokens as its argument and returns a tagged sentence, i.e. a list of token&ndash;tag pairs (see above). Here is an example that uses the tagger from Problem&nbsp;1:

In [None]:
tagger.tag(["Smygrustning", "av", "raketvapen"])
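The comparison step can be sketched on a single sentence without the trained tagger. The class `ToyTagger` below is a hypothetical stand-in that only mimics the `tag()` interface described above, and the gold-standard sentence is made up. The sketch stops at pairing up the tags and does not fill a matrix.

```python
class ToyTagger:
    """Hypothetical stand-in for a trained tagger; it tags every word as NN."""

    def tag(self, words):
        return [(word, "NN") for word in words]


# Made-up gold-standard sentence (tags are illustrative only).
gold_sentence = [("Smygrustning", "NN"), ("av", "PP"), ("raketvapen", "NN")]

toy_tagger = ToyTagger()

# Strip the gold tags, then let the tagger predict tags for the bare words.
bare_words = [word for word, gold_tag in gold_sentence]
predicted_sentence = toy_tagger.tag(bare_words)

# Both lists have one entry per token, so they can be walked in parallel.
tag_pairs = [
    (gold_tag, predicted_tag)
    for (_, gold_tag), (_, predicted_tag) in zip(gold_sentence, predicted_sentence)
]
print(tag_pairs)
```

Each resulting pair names one cell of the confusion matrix: the first component is the row (gold tag) and the second is the column (predicted tag).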

Test your solution to Problem&nbsp;2 by running the following code cell:

In [None]:
# Create the confusion matrix
our_matrix = confusion_matrix(tagger, test_data)

# Compute accuracy, precision, recall
a = lt3.accuracy(our_matrix)
print("accuracy = {:.2%}".format(a))
for tag in tagger.tags:
    p = lt3.precision(our_matrix, tag)
    r = lt3.recall(our_matrix, tag)
    print("tag {}: precision = {:.2%}, recall = {:.2%}".format(tag, p, r))

## A detailed error analysis

The confusion matrix is an important tool if you want to do a detailed error analysis for a tagger, e.g. if you want to know which parts of speech in particular the tagger struggles with. The following line of code prints the confusion matrix from Problem&nbsp;1 in a readable format:

In [None]:
lt3.show_matrix(matrix)

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Use the confusion matrix to do a detailed error analysis for the trained tagger. Which three pairs of part-of-speech tags does the tagger confuse the most, based on absolute numbers? Why does the tagger struggle especially with those pairs of part-of-speech tags? Illustrate the errors that the tagger makes by providing concrete example sentences from the data. 
</div>
</div>

You can use the function `lt3.show_examples()` for your analysis. This function takes a trained tagger (`tagger`), a gold-standard dataset (`gold_data`), and two tags, `gold` and `pred`, and returns those sentences in the dataset where a word was tagged with `gold` while the tagger predicted `pred`. Use the optional parameter `n_examples` to limit the number of sentences that are shown; the returned sentences are chosen randomly. The following code cell shows an example where an adjective (`JJ`) was incorrectly tagged as a proper noun (`PM`):

In [None]:
lt3.show_examples(tagger, test_data, 'JJ', 'PM', n_examples=3)
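Reading the largest error counts off a big matrix by eye is error-prone. One possible way to rank the confusions programmatically is sketched below on a hand-made matrix with invented counts. The sketch treats each direction (gold&nbsp;→&nbsp;predicted) as a separate pair, which is an assumption about how to count.

```python
# Toy confusion matrix (rows: gold tags, columns: predicted tags); counts are invented.
toy_matrix = {
    "NN": {"NN": 50, "PM": 4, "JJ": 1},
    "PM": {"NN": 7, "PM": 30, "JJ": 0},
    "JJ": {"NN": 3, "PM": 2, "JJ": 20},
}

# Collect every off-diagonal cell, i.e. every (gold, predicted) pair that is an error.
errors = [
    (count, gold, pred)
    for gold, row in toy_matrix.items()
    for pred, count in row.items()
    if gold != pred
]

# Sort by count, largest first, and keep the three most frequent confusions.
top_three = sorted(errors, reverse=True)[:3]
for count, gold, pred in top_three:
    print("gold {} predicted as {}: {} times".format(gold, pred, count))
```

The tag pairs found this way can then be passed to `lt3.show_examples()` as `gold` and `pred` to retrieve concrete sentences.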

*TODO: Insert your answers to Problem 3 here*

## Implement the Most Frequent Class baseline

A common baseline for classification problems is Most Frequent Class. In connection with POS tagging, this baseline always predicts the tag that appears most frequently in the training data – independent of which word is being tagged.
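The counting behind this baseline can be sketched with `collections.Counter` from the standard library. The training data below is made up, and the sketch deliberately stays outside the `lt3.BaseTagger` class structure used in this lab.

```python
from collections import Counter

# Made-up training data: two tagged sentences (tags are illustrative only).
toy_training_data = [
    [("en", "DT"), ("bil", "NN"), ("rullar", "VB")],
    [("bilen", "NN"), ("är", "VB"), ("en", "DT"), ("gåva", "NN")],
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(
    tag for sentence in toy_training_data for word, tag in sentence
)

# most_common(1) returns a list with the single most frequent (tag, count) pair.
toy_most_frequent_tag, toy_count = tag_counts.most_common(1)[0]
print(toy_most_frequent_tag, toy_count)
```

On this toy data the baseline would tag every word as `NN`, including words such as *en* that are never nouns in the sample.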

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Implement a tagger for the Most Frequent Class baseline and evaluate it. How do you interpret the results?
</div>
</div>

In order to solve this problem you should modify the next code cell:

In [None]:
class MostFrequentClassTagger(lt3.BaseTagger):
    
    def train(self, tagged_sentences):
        """Identifies the most frequent tag in the specified gold-standard data."""
        # TODO: Replace the following line with your own code
        self.most_frequent_tag = suc_tags[0]
    
    def tag(self, words):
        """Tags the specified words with the most frequent tag."""
        return [(w, self.most_frequent_tag) for w in words]

Evaluate the tagger by calculating its accuracy:

In [None]:
mfc_tagger = MostFrequentClassTagger(suc_tags)
mfc_tagger.train(training_data)
mfc_tagger_matrix = lt3.confusion_matrix(mfc_tagger, test_data)
print("accuracy = {:.2%}".format(lt3.accuracy(mfc_tagger_matrix)))

*TODO: Insert your interpretation of the results here*

## Implement a more advanced baseline

Your next task is to implement a refined version of the Most Frequent Class baseline. This builds on the observation that for most words in the data, the distribution of their tags is very skewed. For example, the word *en* is tagged as a determiner (`DT`) 16,179&nbsp;times, but as a noun (`NN`) only once.

To implement the more advanced baseline, you should tag every word with the tag that it occurs with most often in the training data. For words that do not appear in the training data, you should fall back on the previous, global Most Frequent Class strategy.
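The per-word counting and the fallback can be sketched as follows on made-up data. The names `toy_tagdict` and `fallback_tag` are hypothetical, and the fallback value is simply assumed here rather than computed.

```python
from collections import Counter, defaultdict

# Made-up training data (tags are illustrative only).
toy_training_data = [
    [("en", "DT"), ("bil", "NN")],
    [("en", "DT"), ("en", "NN")],
    [("bil", "NN"), ("rullar", "VB")],
]

# For every word, count how often it occurs with each tag.
counts_per_word = defaultdict(Counter)
for sentence in toy_training_data:
    for word, tag in sentence:
        counts_per_word[word][tag] += 1

# Keep only the most frequent tag of each word.
toy_tagdict = {
    word: counts.most_common(1)[0][0] for word, counts in counts_per_word.items()
}

# Look up a known word, and fall back to a global default for an unknown one.
fallback_tag = "NN"  # assumed here; in the lab this is the global most frequent tag
print(toy_tagdict["en"])
print(toy_tagdict.get("cykel", fallback_tag))
```

The word *en* occurs twice as `DT` and once as `NN` in the toy data, so the mapping stores `DT` for it, while the unseen word *cykel* receives the fallback tag.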

<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
Implement a tagger for the more advanced baseline and evaluate it by computing its accuracy.
</div>
</div>

To solve this problem, you should modify the following code cell. We suggest that you use the method `train()` to create a **tag dictionary** that maps each word to its most frequent tag. You can then use this tag dictionary in the `tag()` method.

In [None]:
class TagdictTagger(lt3.BaseTagger):
    
    def train(self, tagged_sentences):
        """Creates a tag dictionary from the specified gold-standard data."""
        # TODO: Insert your code
        self.tagdict = {}
        self.most_frequent_tag = suc_tags[0]
    
    def tag(self, words):
        """Tags the specified words using the tag dictionary."""
        def tag_word(word):
            if word in self.tagdict:
                return self.tagdict[word]
            else:
                return self.most_frequent_tag
        return [(word, tag_word(word)) for word in words]

Evaluate the tagger by calculating its accuracy:

In [None]:
tagdict_tagger = TagdictTagger(suc_tags)
tagdict_tagger.train(training_data)
tagdict_matrix = lt3.confusion_matrix(tagdict_tagger, test_data)
print("accuracy = {:.2%}".format(lt3.accuracy(tagdict_matrix)))

*TODO: Insert your discussion of the results here*

## Reflection

Now that you have completed the lab, take a step back and think about what you have learned.

<div class="panel panel-primary">
<div class="panel-heading">Problem 6 (Reflection)</div>
<div class="panel-body">
    <p>Pick one of the problems that you worked on in this lab and write a short reflection piece about your experience. Use the following structure:</p>
    <ul>
        <li>What did you do? Which results did you get? Target a reader who has not worked on the problem themselves.</li>
        <li>Which specific concepts or skills are relevant to your experience? How did they help you make sense of your results?</li>
        <li>What new things did you learn? How, exactly, did you learn them? Why does this learning matter?</li>
    </ul>
</div>
</div>

The guide on Reflection Papers (available from the course web site) provides more prompts that you may find useful.

<div class="alert alert-info">
    Do not forget to read the General Information section on the lab web page before submitting this notebook!
</div>