{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# L1X: Text classification (Level B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab you will complete your implementation of a Naive Bayes classifier with a training method. You will also see what effect different document representations have on classification accuracy.\n", "\n", "The data set that you will use in this lab is different from the one you used in lab L1; it is the [review polarity data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/) first used by [Pang and Lee (2004)](http://www.aclweb.org/anthology/P04-1035). This data set consists of 2,000 movie reviews, each of which has been tagged as either positive or negative towards the movie at hand. The data is originally distributed as a collection of text files. For this lab we have put all files into two JSON files, one for training and one for testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by importing the module for this lab." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lt1x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell loads the training data and the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data = lt1x.load_data(\"/courses/729G17/labs/l1x/data/review_polarity.train.json\")\n", "test_data = lt1x.load_data(\"/courses/729G17/labs/l1x/data/review_polarity.test.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you will see, each data instance is a dictionary whose first component is a document (accessible using the key `words`), represented as a list of tokens, and whose second component (accessible using the key `class`) is the gold-standard polarity of the review – either positive (`pos`) or negative (`neg`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_data[813])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing that you will have to do is to implement a function\n", "\n", "`accuracy(classifier, data)`\n", "\n", "that computes the accuracy of a classifier on test data. This function will be essentially identical to the corresponding function from lab L1." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def accuracy(classifier, data):\n", "    # TODO: Replace the following line with your own code to solve Problem 1\n", "    return lt1x.accuracy(classifier, data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can test this function by computing the accuracy of a Naive Bayes classifier on the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classifier = lt1x.Classifier.train(training_data)\n", "print(accuracy(classifier, test_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Provide your own implementation of the `accuracy()` function. Test your implementation by redoing the evaluation. You should get exactly the same results as before.
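
As a starting point, accuracy is the number of correctly classified instances divided by the total number of instances. The following is a minimal sketch, not the reference solution: it assumes that the classifier offers a `predict(words)` method, which the lab text does not confirm, and the names `accuracy_sketch`, `AlwaysPos` and `toy_data` are made up for illustration.

```python
def accuracy_sketch(classifier, data):
    # Fraction of instances whose predicted class equals the gold class.
    # Assumption: the classifier exposes a predict(words) method.
    if not data:
        return 0.0
    correct = sum(
        1 for d in data if classifier.predict(d['words']) == d['class']
    )
    return correct / len(data)


class AlwaysPos:
    # Hypothetical stand-in classifier, only for trying out the sketch.
    def predict(self, words):
        return 'pos'


toy_data = [
    {'words': ['great', 'film'], 'class': 'pos'},
    {'words': ['dull', 'plot'], 'class': 'neg'},
    {'words': ['loved', 'it'], 'class': 'pos'},
    {'words': ['boring'], 'class': 'neg'},
]
toy_accuracy = accuracy_sketch(AlwaysPos(), toy_data)
print(toy_accuracy)
```

A classifier that always answers `pos` gets half of this toy data right, which is a useful sanity check before running on the real test set.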
Implement the two methods in `OurClassifier`. Test your implementation by evaluating on the test data. Your results should be very similar to the ones that you got when you evaluated your accuracy function in Problem 1.
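
For orientation, here is one possible shape of a multinomial Naive Bayes classifier with a training method and a prediction method. This is a sketch under assumptions: the lab text does not name the two methods, so `train` and `predict`, the class name `NaiveBayesSketch`, the smoothing parameter `k` and the choice to skip unknown words are all illustrative and may differ from what `OurClassifier` expects.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    def __init__(self, log_priors, log_likelihoods, vocabulary):
        self.log_priors = log_priors
        self.log_likelihoods = log_likelihoods
        self.vocabulary = vocabulary

    @classmethod
    def train(cls, data, k=1.0):
        # Count documents per class and word occurrences per class.
        class_counts = Counter(d['class'] for d in data)
        word_counts = defaultdict(Counter)
        for d in data:
            word_counts[d['class']].update(d['words'])
        vocabulary = {w for counts in word_counts.values() for w in counts}

        # Log prior: relative frequency of each class among the documents.
        log_priors = {
            c: math.log(n / len(data)) for c, n in class_counts.items()
        }
        # Log likelihood with add-k smoothing over the whole vocabulary.
        log_likelihoods = {}
        for c in class_counts:
            total = sum(word_counts[c].values()) + k * len(vocabulary)
            log_likelihoods[c] = {
                w: math.log((word_counts[c][w] + k) / total)
                for w in vocabulary
            }
        return cls(log_priors, log_likelihoods, vocabulary)

    def predict(self, words):
        # Sum log probabilities; words unseen in training are skipped.
        scores = {}
        for c, prior in self.log_priors.items():
            scores[c] = prior + sum(
                self.log_likelihoods[c][w]
                for w in words
                if w in self.vocabulary
            )
        return max(scores, key=scores.get)


toy_training = [
    {'words': ['good', 'good', 'fun'], 'class': 'pos'},
    {'words': ['bad', 'dull', 'bad'], 'class': 'neg'},
]
toy_classifier = NaiveBayesSketch.train(toy_training)
print(toy_classifier.predict(['good', 'film']))
```

Working with log probabilities avoids numerical underflow on long reviews, where a product of many small probabilities would otherwise round to zero.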
Implement the `binarize()` function and run the evaluation. What do you observe? Summarise your results in one or two sentences.
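
One common reading of binarization is that each document records only whether a word occurs, not how often. The sketch below follows that reading, which is an assumption rather than something the lab text spells out; the name `binarize_sketch` and its signature (a list of instances in, a new list out) are likewise illustrative.

```python
def binarize_sketch(data):
    # Return a copy of the data in which every document keeps only the
    # first occurrence of each word; the input list is left unchanged.
    result = []
    for d in data:
        # dict.fromkeys removes duplicates while preserving word order.
        unique_words = list(dict.fromkeys(d['words']))
        result.append({'words': unique_words, 'class': d['class']})
    return result


toy = [{'words': ['bad', 'bad', 'plot', 'bad'], 'class': 'neg'}]
binarized = binarize_sketch(toy)
print(binarized[0]['words'])
```

Under this reading, both the training data and the test data would be binarized before training and evaluating, so that the classifier sees the same document representation in both phases.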