# L1X: Text classification (Level B)

In this lab you will complete your implementation of a Naive Bayes classifier with a training method. You will also see what effect different document representations have on classification accuracy.

The data set that you will use in this lab is different from the one you used in lab&nbsp;L1; it is the [review polarity data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/) first used by [Pang and Lee (2004)](http://www.aclweb.org/anthology/P04-1035). This data set consists of 2,000 movie reviews, each of which has been tagged as either positive or negative towards the movie at hand. The data is originally distributed as a collection of text files. For this lab we have collected the reviews into two JSON files, one for training and one for testing.

## Introduction

Start by importing the module for this lab.

In [None]:
import lt1x

The next cell loads the training data and the test data:

In [None]:
training_data = lt1x.load_data("/courses/729G17/labs/l1x/data/review_polarity.train.json")
test_data = lt1x.load_data("/courses/729G17/labs/l1x/data/review_polarity.test.json")

As you will see, each data instance is a dictionary with two entries: the key `words` maps to the document, represented as a list of tokens, and the key `class` maps to the gold-standard polarity of the review&nbsp;&ndash; either positive (`pos`) or negative (`neg`).

In [None]:
print(training_data[813])
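
To make the structure concrete, the following sketch builds a small instance by hand and reads both entries. The instance is invented for illustration and is not taken from the data set; only the keys `words` and `class` and the labels `pos` and `neg` come from the description above.

```python
# Hypothetical toy instance with the same shape as the lab data:
# a dictionary with a token list under "words" and a label under "class".
toy_instance = {
    "words": ["a", "gripping", ",", "clever", "film"],
    "class": "pos",
}

tokens = toy_instance["words"]   # the document as a list of tokens
label = toy_instance["class"]    # the gold-standard polarity

print(len(tokens), label)
```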

## Evaluation

The first thing that you will have to do is to implement a function

`accuracy(classifier, data)`

that computes the accuracy of a classifier on test data. This function will be essentially identical to the corresponding function from lab&nbsp;L1.
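
As a reminder of what the metric measures, here is a minimal sketch that computes accuracy from two parallel lists of labels. The helper `fraction_correct()` and the toy labels are hypothetical and only illustrate the calculation; your `accuracy()` function must obtain the predictions from the classifier and the gold labels from the data.

```python
def fraction_correct(predicted, gold):
    # Hypothetical helper: the share of positions where the predicted
    # label agrees with the gold-standard label.
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Toy labels, invented for illustration.
predicted = ["pos", "neg", "pos", "pos"]
gold = ["pos", "neg", "neg", "pos"]

score = fraction_correct(predicted, gold)
print(score)  # 3 of the 4 labels agree
```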

In [None]:
def accuracy(classifier, data):
    # TODO: Replace the following line with your own code to solve Problem 1
    return lt1x.accuracy(classifier, data)

You can test this function by computing the accuracy of a Naive Bayes classifier on the test data:

In [None]:
classifier = lt1x.Classifier.train(training_data)
print(accuracy(classifier, test_data))

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
    <p>Provide your own implementation of the <code>accuracy()</code> function. Test your implementation by redoing the evaluation. You should get exactly the same results as before.</p>
</div>
</div>

## A complete Naive Bayes classifier

To implement the Naive Bayes classifier, you should complete the following code:

In [None]:
class OurClassifier(object):

    def predict(self, d):
        # TODO: Replace the following line with your own code to solve Problem 2
        return None

    @classmethod
    def train(cls, data, k=1):
        # The following line creates a new object of type OurClassifier:
        classifier = cls()
        # The next few lines initialise the four attributes of the classifier:
        classifier.classes = set()
        classifier.vocabulary = set()
        classifier.pc = {}
        classifier.pw = {}
        # TODO: Insert code to solve Problem 2
        return classifier

The `predict()` method will be essentially identical to the corresponding method from lab&nbsp;L1.

The class method `train()` should return a new classifier (an object of type `OurClassifier`) that has been trained on the specified training data using maximum likelihood estimation with add-$k$ smoothing. This method will have to compute the four attributes that you use in the `predict()` method: the set of possible classes, the vocabulary, the dictionary containing the class probabilities, and the dictionary containing the word probabilities. (In the code skeleton, the four attributes are initialised to appropriate empty values.)
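
For reference, one common way to write these estimates is shown below. The notation is an assumption and may differ from the one used in the lectures: $N$ is the number of training documents, $N_c$ the number of training documents with class $c$, $\#(w, c)$ the number of occurrences of word $w$ in documents of class $c$, and $V$ the vocabulary. In this formulation only the word probabilities are smoothed.

$$
P(c) = \frac{N_c}{N}
\qquad
P(w \mid c) = \frac{\#(w, c) + k}{\sum_{w' \in V} \#(w', c) + k\,|V|}
$$

As a small worked example, suppose $k = 1$, $|V| = 3$, and the three words occur 2, 1 and 0 times in class $c$. The smoothed probabilities are then $\frac{3}{6}$, $\frac{2}{6}$ and $\frac{1}{6}$: they sum to one, and the unseen word no longer has probability zero.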

Note that the method `train()` is a so-called class method: it is not tied to a specific object but to the class itself. The `train()` method implements a design pattern known as Factory Method. If you want to read more about this, we recommend this [guide on how to use static, class or abstract methods in Python](https://julien.danjou.info/blog/2013/guide-python-static-class-abstract-methods) (scroll down to the section on class methods).
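
The pattern can be sketched on a much smaller example. The classes `WordCounter` and `SpecialWordCounter` below are hypothetical and unrelated to the lab code; the point is only that the class method receives the class as `cls`, creates the object itself, fills in its attributes, and returns it.

```python
class WordCounter(object):

    @classmethod
    def from_tokens(cls, tokens):
        # cls is the class the method was called on, so subclasses
        # automatically get an object of their own type.
        counter = cls()
        counter.counts = {}
        for token in tokens:
            counter.counts[token] = counter.counts.get(token, 0) + 1
        return counter


class SpecialWordCounter(WordCounter):
    pass


# The method is called on the class, not on an existing object.
wc = WordCounter.from_tokens(["good", "good", "bad"])
swc = SpecialWordCounter.from_tokens(["good"])

print(wc.counts)
print(type(swc).__name__)
```

Calling `OurClassifier.train(training_data)` follows the same pattern: no classifier object needs to exist before the call.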

To test your implementation, you can re-do the evaluation from above:

In [None]:
classifier1 = OurClassifier.train(training_data)
print(accuracy(classifier1, test_data))

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
    <p>Implement the two methods in <code>OurClassifier</code>. Test your implementation by evaluating on the test data. Your results should be very similar to the ones that you got when you evaluated your accuracy function in Problem&nbsp;1.</p>
</div>
</div>

## Changing the document representation

In the lab so far, a document is represented as a list of the words that occur in it. For sentiment classification, several authors have suggested that a *binary* document representation, where each word is represented as either occurring or not occurring in the document (instead of representing the number of times it occurs), can produce better results. In the last problem you will try to confirm this finding.
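
The difference between the two representations can be seen on a single token list. The sketch below is one possible reading of the binary representation, in which every distinct token is kept exactly once; the tokens are invented for illustration, and whether the original order is preserved is a design choice that does not matter to a bag-of-words model.

```python
# Toy document, invented for illustration.
tokens = ["great", ",", "great", "fun", ",", "great"]

# Count-based view: repeated tokens contribute several times.
count_of_great = tokens.count("great")

# Binary view: each distinct token occurs exactly once.
# dict.fromkeys() keeps the first occurrence of each token in order.
binary_tokens = list(dict.fromkeys(tokens))

print(count_of_great)
print(binary_tokens)
```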

Your task is to implement a function `binarize()` that converts data into the binary representation:

In [None]:
def binarize(data):
    # TODO: Replace the next line to solve Problem 3
    return []

The function is to be used in the following context:

In [None]:
binarized_training_data = binarize(training_data)
binarized_test_data = binarize(test_data)

classifier3 = OurClassifier.train(binarized_training_data)
print(accuracy(classifier3, binarized_test_data))

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
    <p>Implement the <code>binarize()</code> function and run the evaluation. What do you observe? Summarise your results in one or two sentences.</p>
</div>
</div>

*TODO: Insert the summary of your evaluation from Problem 3 here*