# L1: Text classification

## Introduction

Text classification is the task of sorting text documents into predefined classes. In this lab, you will classify texts according to their political bloc (right-wing or left-wing). The texts are speeches held in the Swedish parliament: the classifier reads a speech and predicts whether the speaker belongs to a right-wing or a left-wing party.

As usual, start by importing the Python module required for this lab:

In [None]:
import lt1

## Read the data

The data used in this lab consists of all speeches held in the Swedish parliament in the 2016/2017 and 2017/2018 sessions. The raw data is taken from [Riksdag's open data](http://data.riksdagen.se/). Speeches are divided into two files:
* `anforande-201617.txt` with 12,637 speeches
* `anforande-201718.txt` with 12,343 speeches

In order to read the data files, we define a helper function. The function opens a given file and returns a list with speeches.

In [None]:
import bz2

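def read_data(filename):
    """Read a bz2-compressed file with one speech per line and return a list of (label, tokens) pairs."""
    speeches = []
    with bz2.open(filename, 'rt') as f:
        for line in f:
            # The first token is the class label; the remaining tokens are the words
            tokens = line.split()
            speeches.append((tokens[0], tokens[1:]))
    return speeches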

speeches_201617 = read_data('/courses/729G17/labs/l1/data/anforande-201617.txt.bz2')
speeches_201718 = read_data('/courses/729G17/labs/l1/data/anforande-201718.txt.bz2')

In Python, a speech is represented as a pair consisting of

* a string representing the gold-standard class for the speech: either `'L'` (left) or `'R'` (right)
* a list of strings representing the words in the speech (already tokenised)

Here is an example:

In [None]:
sample = speeches_201617[42]
print(sample)
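
To make the file format concrete, the following sketch writes two made-up lines to a temporary bz2 file and reads them back with the same logic as `read_data`. The file contents, the explicit UTF-8 encoding, and the name `read_data_sketch` are assumptions made for illustration; the real files contain full speeches.

```python
import bz2
import os
import tempfile

def read_data_sketch(filename):
    """Same logic as read_data: one speech per line, class label first."""
    speeches = []
    with bz2.open(filename, 'rt', encoding='utf-8') as f:
        for line in f:
            tokens = line.split()
            speeches.append((tokens[0], tokens[1:]))
    return speeches

# Two made-up lines in the assumed format: class label, then the tokens
toy_text = "L fru talman tack\nR herr talman\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.txt.bz2')
    with bz2.open(path, 'wt', encoding='utf-8') as f:
        f.write(toy_text)
    toy_speeches = read_data_sketch(path)

# Each speech is a pair that can be unpacked into label and words
label, words = toy_speeches[0]
print(label)   # the gold-standard class
print(words)   # the list of tokens
```

Unpacking a pair as `label, words = ...` is often more readable than indexing with `[0]` and `[1]`.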

## Train and evaluate a classifier

The next code cell creates a new Naive Bayes classifier and trains it on the speeches from the 2016/17 session:

In [None]:
classifier1 = lt1.Classifier.train(speeches_201617)

You can use the trained classifier for predicting the class for a new speech:

In [None]:
classifier1.predict(sample[1])

Was it correct? Your first task is to evaluate the classifier with respect to accuracy, precision, and recall.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Evaluate the trained classifier by computing its accuracy, precision, recall, and F1-score on the speeches from the 2017/2018 session. Enter the results into Table&nbsp;1 below.
</div>
</div>

To solve this problem you can use the function `lt1.evaluate()` from the lab module. This function takes a classifier and a list of gold-standard samples and prints out the evaluation measures on this data.

In [None]:
# Enter your code for Problem 1 into this cell

**Table 1: Evaluation of the classifier on the speeches from 2017/2018**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td></tr>
</tbody>
</table>

## Implement functions for evaluation

The function `lt1.evaluate()` you used above internally calls three functions from the module `lt1`, one for each of the three evaluation measures. Your next task is to do your own implementation of these three functions.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Do your own implementation of the evaluation functions (accuracy, precision, recall). Test your implementation by redoing the evaluation from Problem&nbsp;1 with your new functions. You should get the same results as earlier.
</div>
</div>

Write the code for the three functions in the cells below. The only method of the classifier you will have to call in your implementation is `predict()`. (Scroll up in case you do not remember how to call it.)

### Accuracy

This function takes a classifier (`classifier`) and a list of gold-standard samples (`samples`) and returns the classifier&rsquo;s accuracy on the samples as a floating-point number between 0 and 1. If the measure is not defined, the function should return 0 instead.
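
As a reminder of the definition, accuracy is the proportion of samples whose predicted class equals the gold-standard class. The sketch below computes it from two plain lists of labels rather than from a classifier. The function name and the toy labels are invented for illustration; the lab function has to obtain its predictions by calling `predict()`.

```python
def accuracy_from_labels(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) == 0:
        return 0  # the measure is undefined for an empty list
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Made-up gold-standard and predicted labels for five documents
gold = ['L', 'L', 'L', 'R', 'R']
predicted = ['L', 'R', 'R', 'R', 'R']
print(accuracy_from_labels(gold, predicted))  # 3 of the 5 labels match
```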

In [None]:
def accuracy(classifier, samples):
    """Compute the accuracy of a classifier on a list of gold-standard samples."""
    # TODO: Implement this method to solve Problem 2
    return 0

### Precision

This function takes a classifier (`classifier`), a target class (`c`), and a list of gold-standard samples (`samples`) and returns the classifier&rsquo;s precision at predicting documents with class `c` as a floating-point number between 0 and 1. If the measure is not defined, the function should return 0 instead.
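
Precision for a class `c` is the proportion of documents predicted as `c` whose gold-standard class really is `c`. It is undefined when `c` is never predicted. The sketch below illustrates this on plain label lists; the function name and the toy labels are invented for illustration and do not involve a classifier.

```python
def precision_from_labels(c, gold, predicted):
    """Of all documents predicted as class c, the fraction whose gold label is c."""
    predicted_c = sum(1 for p in predicted if p == c)
    if predicted_c == 0:
        return 0  # undefined when class c is never predicted
    true_positives = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
    return true_positives / predicted_c

# Made-up gold-standard and predicted labels for five documents
gold = ['L', 'L', 'L', 'R', 'R']
predicted = ['L', 'R', 'R', 'R', 'R']
print(precision_from_labels('L', gold, predicted))  # the single 'L' prediction is correct
print(precision_from_labels('R', gold, predicted))  # 2 of the 4 'R' predictions are correct
```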

In [None]:
def precision(classifier, c, samples):
    """Compute the class-specific precision of a classifier on a list of gold-standard samples."""
    # TODO: Implement this method to solve Problem 2
    return 0

### Recall

This function takes the same arguments as `precision()` but returns the class-specific recall instead. If the measure is not defined, the function should return 0.
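
Recall for a class `c` is the proportion of documents with gold-standard class `c` that are also predicted as `c`. It is undefined when no document has gold-standard class `c`. The sketch below illustrates this on plain label lists; the function name and the toy labels are invented for illustration and do not involve a classifier.

```python
def recall_from_labels(c, gold, predicted):
    """Of all documents whose gold label is c, the fraction predicted as c."""
    gold_c = sum(1 for g in gold if g == c)
    if gold_c == 0:
        return 0  # undefined when no document has gold label c
    true_positives = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
    return true_positives / gold_c

# Made-up gold-standard and predicted labels for five documents
gold = ['L', 'L', 'L', 'R', 'R']
predicted = ['L', 'R', 'R', 'R', 'R']
print(recall_from_labels('L', gold, predicted))  # 1 of the 3 'L' documents is found
print(recall_from_labels('R', gold, predicted))  # both 'R' documents are found
```

On this toy data, class `'L'` has low recall although its single `'L'` prediction is correct, which is why precision and recall are reported side by side.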

In [None]:
def recall(classifier, c, samples):
    """Compute the class-specific recall of a classifier on a list of gold-standard samples."""
    # TODO: Implement this method to solve Problem 2
    return 0

### Putting it all together

Use this version of `evaluate()` in order to test your implementation. Note that you will have to change the code below to compute the F1-score.
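
The F1-score is the harmonic mean of precision and recall. A minimal sketch, assuming that the score should be 0 when precision and recall are both 0 (the function name `f1_score` is invented for illustration):

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r; 0 when both are 0."""
    if p + r == 0:
        return 0  # avoid division by zero
    return 2 * p * r / (p + r)

# The harmonic mean is pulled towards the lower of the two values
print(f1_score(1.0, 1 / 3))
print(f1_score(0, 0))
```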

In [None]:
def our_evaluate(classifier, samples):
    print("accuracy = {:.2%}".format(accuracy(classifier, samples)))
    for c in sorted(classifier.classes):
        p = precision(classifier, c, samples)
        r = recall(classifier, c, samples)
        # TODO: Change the next line to compute the F1-score
        f = 0
        print("class {}: precision = {:.2%}, recall = {:.2%}, f1 = {:.2%}".format(c, p, r, f))

In [None]:
our_evaluate(classifier1, speeches_201718)

## Compare to a baseline

Accuracy, precision, and recall should not be understood as absolute performance measures &ndash; they make sense only when we use them to compare a classifier against a **baseline**. When other classifiers are not available, a simple baseline for text classification is *Most Frequent Class*. This method always predicts the class that appears most often among the training documents &ndash; without even looking at the words that appear in the documents. We would hope that a Naive Bayes classifier has higher accuracy, precision, and recall than this simple baseline.
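
Finding the most frequent class amounts to counting labels. The following sketch does this with `collections.Counter` on a made-up list of `(label, words)` pairs; the function name and the toy data are invented for illustration.

```python
from collections import Counter

def most_frequent_class(samples):
    """Return the label that occurs most often in a list of (label, words) pairs."""
    counts = Counter(label for label, _ in samples)
    return counts.most_common(1)[0][0]

# Made-up training data: one 'L' document and two 'R' documents
toy_training = [('L', ['a']), ('R', ['b']), ('R', ['c'])]
baseline_class = most_frequent_class(toy_training)
print(baseline_class)
```

The baseline then predicts `baseline_class` for every document in the test data, whatever its content.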

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
What are the accuracy, precision, and recall for the Most Frequent Class baseline when computed on the speeches from 2017/2018? In order to answer this question, you will have to write code that inspects both those speeches and the speeches from 2016/2017.
</div>
</div>

In [None]:
# Enter your code for Problem 3 into this cell

**Table 2: Baseline results (Most Frequent Class) for the speeches from 2017/2018**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td></tr>
</tbody>
</table>

## Implement the predict function

In the main part of this lab you will implement the classification rule for a Naive Bayes classifier.

### How a classifier is represented in Python

Remember that the core of a Naive Bayes classifier is a probabilistic model with four components: a set of possible classes $C$, a vocabulary $V$, class probabilities $P(c)$, and word probabilities $P(w \mid c)$. In order to implement the classification rule, you will need to know how these components are represented in Python. Each of them is available as an attribute of objects of the class `lt1.Classifier`:

#### The set of possible classes

This set is represented as a set of strings. The following cell sorts the classes of `classifier1` and prints them:

In [None]:
# classes: Set[str]
print(sorted(classifier1.classes))

#### The vocabulary

The vocabulary is also represented as a set of strings. The following cell prints 20&nbsp;words from the vocabulary of `classifier1`:

In [None]:
# vocabulary: Set[str]
print(sorted(classifier1.vocabulary)[20000:20020])

#### Class probabilities

For every possible class $c \in C$ there is a probability $P(c)$ that specifies how likely it is that a given document has class $c$. These class probabilities are represented as a dictionary that maps classes (strings) to log probabilities (floating-point numbers). As an example, the following cell prints the class probability for the class `'L'` (left):

In [None]:
# pc: Mapping[str, float]
print(classifier1.pc['L'])

<div class="alert alert-danger">
Note that the implementation uses log probabilities!
</div>
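
A common reason for working with log probabilities is numerical: multiplying many small probabilities quickly underflows to zero in floating-point arithmetic, whereas adding their logarithms does not. The sketch below illustrates this with made-up numbers; it does not use the lab module.

```python
import math

p = 1e-5   # a made-up word probability
n = 1000   # a made-up number of word occurrences

# Multiplying ordinary probabilities underflows to 0.0
product = 1.0
for _ in range(n):
    product *= p

# Adding log probabilities stays within the floating-point range
log_sum = sum(math.log(p) for _ in range(n))

print(product)
print(log_sum)
```

A single log probability can be turned back into an ordinary probability with `math.exp()`.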

#### Word probabilities

For every possible class $c \in C$ and every word&nbsp;$w$ in the vocabulary, there is a probability $P(w \mid c)$ that specifies how likely it is for $w$ to occur in a document with class&nbsp;$c$. These word probabilities are represented as a nested dictionary that maps classes (strings) to class-specific dictionaries mapping words (strings) to log probabilities (floating-point numbers). As an example, the following cell prints the word probabilities for the word *behandlingsbehov* for the two possible classes.

In [None]:
# pw: Mapping[str, Mapping[str, float]]
print(classifier1.pw['L']['behandlingsbehov'])
print(classifier1.pw['R']['behandlingsbehov'])
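
The nested structure can be pictured with a toy dictionary. The words and probabilities below are made up and do not come from `classifier1`; the sketch only shows the two-level lookup, first by class and then by word.

```python
import math

# Made-up word probabilities, stored as log probabilities
toy_pw = {
    'L': {'skatt': math.log(0.02), 'vård': math.log(0.05)},
    'R': {'skatt': math.log(0.06), 'vård': math.log(0.01)},
}

for c in sorted(toy_pw):
    log_p = toy_pw[c]['skatt']          # outer key: class, inner key: word
    print(c, log_p, math.exp(log_p))    # math.exp recovers the ordinary probability
```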

### The classification rule

Remember that for a Naive Bayes classifier, the predicted class $\hat{c}$ for a document $d$ is given by the equation

$$
\hat{c} = \mathop{\text{arg max}}_{c \in C} P(c) \cdot \prod_{w \in V} P(w\mid c)^{\#(w)}
$$

where $\#(w)$ denotes the number of occurrences of word $w$ in document $d$.

Note that this equation uses normal probabilities, while the implementation uses log probabilities. In order to implement the prediction rule, you will need to think about what the equation would look like if it were formulated for log probabilities.
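
As a hint, recall that the logarithm is monotonically increasing, so taking logs does not change which class attains the maximum; it turns products into sums and exponents into factors. A sketch of the resulting rule, using the same symbols as the equation above:

$$
\hat{c} = \mathop{\text{arg max}}_{c \in C} \left( \log P(c) + \sum_{w \in V} \#(w) \cdot \log P(w \mid c) \right)
$$

Because the sum ranges over the vocabulary $V$, words in the document that are not in $V$ do not contribute to the score.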

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
    
Do your own implementation of the method `predict()`. Test your implementation by redoing the evaluation from Problem&nbsp;1 with the new implementation. You should get the same results as previously.
</div>
</div>

In [None]:
class OurClassifier(lt1.Classifier):
    
    # TODO: Implement this method to solve Problem 4
    def predict(self, words):
        """Predict the class of the specified document."""
        return super().predict(words)

Note that you do not need to implement the training method; this functionality is inherited from `lt1.Classifier`.

In [None]:
classifier2 = OurClassifier.train(speeches_201617)
lt1.evaluate(classifier2, speeches_201718)

## Reflection

In the last problem of this lab, you are asked to reflect on general machine learning methodology.

<div class="panel panel-primary">
<div class="panel-heading">Problem 5 (Reflection)</div>
<div class="panel-body">
    <p>Redo the evaluation from Problem&nbsp;1 on the speeches from 2016/2017 and enter the results into Table&nbsp;3 below. Write a short reflection piece about your experience. Use the following prompts:</p>
    <ul>
        <li>What did you do? What were the results? Refer to Table&nbsp;3.</li>
        <li>How and why do the results differ from the first evaluation? How does this relate to general machine learning methodology?</li>
        <li>What did you learn from this problem? How, exactly, did you learn it? Why does this learning matter?</li>
    </ul>
</div>
</div>

In [None]:
# Enter your code for Problem 5 into this cell

**Table 3: Evaluation of the classifier on the speeches from 2016/2017**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3" style="width: auto">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td><td>fill this cell</td></tr>
</tbody>
</table>

*TODO: Insert your text for Problem 5 here*

<div class="alert alert-info">
    Do not forget to read the General Information section on the lab web page before submitting this notebook!
</div>