# L0: Text segmentation

## Introduction

From a computer’s perspective, a text is a sequence of characters, such as letters and digits. Before we can process a text with language technology tools, we need to segment it into linguistically more meaningful units, such as paragraphs, sentences, or words. When the target units are words, segmentation is called **tokenisation**. In this lab you will implement and evaluate a simple tokeniser for running text.

Start by running the following cell, which will import the Python module for this lab:

In [None]:
import lt0

## Background: Data

The text you will be working with is an article from Swedish Wikipedia: [Gustav III](https://sv.wikipedia.org/wiki/Gustav_III). Have a look at the web page to see how it is structured.

### Look at the extracted text

A Wikipedia page does not only contain text but also other data, such as pictures and tables. Before you can tokenise the text, you would usually need to extract it from the page using a tool like [Scrapy](https://scrapy.org). For this lab, this part has already been done for you. The extracted text is available in the following file:

    /courses/729G17/labs/l0/data/text1.txt

Look at the extracted text in a text editor. You may notice that the text has a somewhat peculiar formatting; this is because it has been automatically extracted from the web page’s [HTML](https://en.wikipedia.org/wiki/HTML) code. Take a moment to think about which problems this peculiar formatting may create downstream.

### Load the extracted text

In order to help you load the extracted text into Python, we have defined a helper function `read_data()`. This function opens a given text file and returns its content as a list of strings, one string for each line in the file. The text file uses newline characters to end each line; these are removed using Python’s [`rstrip()`](https://docs.python.org/3.5/library/stdtypes.html#str.rstrip) string method. Note that `rstrip()`, when called without arguments, strips all trailing whitespace from a line, not only the newline character.

In [None]:
def read_data(filename):
    with open(filename) as f:
        return [line.rstrip() for line in f]
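As a small illustration of what `rstrip()` does inside `read_data()`, the following sketch compares the call without arguments to a call that only removes newline characters. The sample string is made up for this example and is not taken from the lab data.

```python
# Toy line as it might come out of a text file (made up for illustration).
raw_line = "Gustav III \t\n"

# Without arguments, rstrip() removes all trailing whitespace.
stripped_all = raw_line.rstrip()

# With an argument, only the listed characters are removed from the end.
stripped_newline = raw_line.rstrip("\n")

print(repr(stripped_all))      # 'Gustav III'
print(repr(stripped_newline))  # 'Gustav III \t'
```

For tokenisation the difference rarely matters, because trailing whitespace does not belong to any token.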

With the helper function, you can read in the extracted text as follows:

In [None]:
text1 = read_data("/courses/729G17/labs/l0/data/text1.txt")

The next cell prints a list with the first 50&nbsp;lines of the text:

In [None]:
print(text1[:50])

The next cell recreates lines 51 to 60 of the text file by joining the corresponding list elements with the newline character:

In [None]:
print("\n".join(text1[50:60]))

### Read the gold standard

We provide a gold-standard tokenisation for the extracted text. This tokenisation follows the rules used in the [Stockholm–Umeå Corpus (SUC)](https://spraakbanken.gu.se/swe/resurs/suc3), a standard corpus for Swedish. The file containing the gold-standard tokenisation consists of all tokens from the extracted text, with one token per line.

In [None]:
gold1 = read_data("/courses/729G17/labs/l0/data/gold1.txt")

Look at the gold standard and try to understand the principles it is based on. Most tokens are normal words or punctuation marks, but note that abbreviations are handled as one token.

In [None]:
print(gold1[:50])

## Whitespace tokenisation

The next cell contains a very simple tokeniser:

In [None]:
def tokenize_ws(lines):
    tokens = []
    for line in lines:
        for token in line.split():
            tokens.append(token)
    return tokens

This function takes a list of text lines, splits every line on whitespace using the string method [`split()`](https://docs.python.org/3.5/library/stdtypes.html#str.split), and collects the resulting strings in a list `tokens`.
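To see how `split()` behaves when called without arguments, here is a minimal sketch on a made-up sentence (not taken from the lab data). Runs of whitespace count as a single separator, and everything between two separators ends up in one string.

```python
# Made-up example sentence with a double space in the middle.
line = "Kungen dog  1792, i Stockholm."

# split() without arguments splits on runs of whitespace
# and never produces empty strings.
tokens = line.split()

print(tokens)  # ['Kungen', 'dog', '1792,', 'i', 'Stockholm.']
```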

### Compare the tokenisation with the gold standard

Test the tokeniser on the first 50 lines of the text:

In [None]:
print(tokenize_ws(text1[:50]))

Compare this tokenisation to the gold standard. Which differences do you find?

Most differences can be explained as **undersegmentation**, where the tokeniser has failed to split a character sequence that should really be several tokens. The opposite situation, where the tokeniser has split a character sequence that should really be one token, is called **oversegmentation**.

For examining the differences between the automatically tokenised text and the gold standard, the lab module provides a function `diff()`. This function expects two arguments: one list with gold-standard tokens and one list with automatically predicted tokens. It returns a new list that shows the differences between the two tokenisations in a compact way. The following command shows the first ten differences:

In [None]:
lt0.diff(gold1, tokenize_ws(text1))[:10]

The list contains pairs whose first component is a list of tokens that appear in the gold standard but not in the automatic tokenisation, and whose second component is a list of tokens that appear in the automatic tokenisation but not in the gold standard. The following code snippet prints this list in a slightly more readable way, also showing the number of tokens in each list:

In [None]:
column_width = 40

# Helper function that formats a list of tokens
def fmt_tokens(tokens):
    return "{} {}".format(len(tokens), " ".join(tokens).ljust(column_width))

# Print out information about divergent subsequences
print("Gold tokens".ljust(column_width), "Predicted tokens".ljust(column_width))
print()
for gold_tokens, pred_tokens in lt0.diff(gold1, tokenize_ws(text1)):
    print(fmt_tokens(gold_tokens), fmt_tokens(pred_tokens))
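To get a feel for the kind of pairs that `diff()` returns, the following sketch builds similar pairs with Python’s standard `difflib` module. This is only an illustration under the assumption that a sequence alignment is a reasonable stand-in; it is not the implementation used in the lab module, and the function name `toy_diff` and the token lists are made up for this example.

```python
from difflib import SequenceMatcher

def toy_diff(gold, pred):
    """Return (gold_tokens, pred_tokens) pairs for each divergent stretch."""
    matcher = SequenceMatcher(a=gold, b=pred, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip stretches where both tokenisations agree.
        if tag != "equal":
            pairs.append((gold[i1:i2], pred[j1:j2]))
    return pairs

# Made-up gold standard and whitespace tokenisation of one sentence.
toy_gold = ["Kungen", "dog", "1792", ",", "i", "Stockholm", "."]
toy_pred = ["Kungen", "dog", "1792,", "i", "Stockholm."]

toy_pairs = toy_diff(toy_gold, toy_pred)
for gold_tokens, pred_tokens in toy_pairs:
    print(gold_tokens, pred_tokens)
```

In this toy case both pairs are instances of undersegmentation: two gold-standard tokens correspond to a single predicted token.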

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
    <p>Examine the differences between the gold standard and the whitespace-based tokenisation.</p>
    <p>Provide at least three examples of undersegmentation. Try to find examples of different types, where segmentation goes wrong in different ways or for different reasons. For each example, briefly describe what goes wrong and how this particular type of undersegmentation could be &lsquo;fixed&rsquo;.</p>
    <p>Provide at least one example of oversegmentation.</p>
</div>
</div>

In order to solve this problem, you can either examine the output from the previous code cell by hand or write code to solve this task for you.

In [None]:
# You might want to write some code here.

*TODO: Insert your answer to Problem&nbsp;1 here*

Examples of different types of undersegmentation:

* Example 1
* Example 2
* Example 3

Example of oversegmentation:

### Compute precision and recall

Problem&nbsp;1 asks you to do what is called a *qualitative* evaluation of the whitespace tokeniser. One way to do a *quantitative* evaluation of the tokeniser is to compute its **precision** and its **recall**. With respect to tokenisation, precision is defined as the percentage of correctly predicted tokens among all tokens the tokeniser has predicted. Recall is defined as the percentage of correctly predicted tokens among all tokens in the gold standard.
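The two definitions can be made concrete with a small sketch. It assumes that a predicted token counts as correct if it covers exactly the same characters as some gold-standard token, and that both tokenisations contain the same non-whitespace characters in the same order. The helper names `spans` and `toy_precision_recall` are made up for this example; the lab module may compute its scores differently.

```python
def spans(tokens):
    """Map each token to its (start, end) offsets in the whitespace-free text."""
    result = set()
    start = 0
    for token in tokens:
        end = start + len(token)
        result.add((start, end))
        start = end
    return result

def toy_precision_recall(gold, pred):
    gold_spans, pred_spans = spans(gold), spans(pred)
    # A predicted token is correct if a gold token covers the same characters.
    n_correct = len(gold_spans & pred_spans)
    return n_correct / len(pred_spans), n_correct / len(gold_spans)

# Made-up gold standard and whitespace tokenisation of one sentence.
toy_gold = ["Kungen", "dog", "1792", ",", "i", "Stockholm", "."]
toy_pred = ["Kungen", "dog", "1792,", "i", "Stockholm."]

toy_p, toy_r = toy_precision_recall(toy_gold, toy_pred)
print("Precision: {:.2%}".format(toy_p))  # Precision: 60.00%
print("Recall: {:.2%}".format(toy_r))     # Recall: 42.86%
```

Here three tokens are correct (`Kungen`, `dog`, `i`), so precision is 3 out of 5 predicted tokens and recall is 3 out of 7 gold-standard tokens.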

The next code cell computes precision and recall for the whitespace tokeniser:

In [None]:
tokens_ws = tokenize_ws(text1)

print("Errors: {}".format(lt0.n_errors(gold1, tokens_ws)))
print("Precision: {:.2%}".format(lt0.precision(gold1, tokens_ws)))
print("Recall: {:.2%}".format(lt0.recall(gold1, tokens_ws)))

## Tokenisation based on regular expressions

In the second part of this lab you will replace the simple whitespace-based tokenisation with a more advanced tokenisation based on **regular expressions**. Before you can use regular expressions in Python you have to first load the relevant module:

In [None]:
import re

A simple tokeniser based on regular expressions looks like this:

In [None]:
def tokenize_re(regex, lines):
    output = []
    for line in lines:
        for match in re.finditer(regex, line):
            output.append(match.group(0))
    return output

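This function finds all non-overlapping matches of the pattern `regex` in each line of text and returns them as a list. Each line is scanned from left to right, and the matching substrings are returned in the same order. Note that a match is not necessarily the longest possible one: when the pattern contains alternatives separated by `|`, Python tries them from left to right and uses the first one that succeeds.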
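One detail of `re.finditer()` that is worth knowing when you write patterns with alternatives: Python tries the alternatives from left to right and stops at the first one that matches, even if a later alternative would match a longer substring. The following sketch shows this on a toy string; the variable names are made up for this example.

```python
import re

toy_text = "abab"

# The first alternative 'a' already matches, so 'ab' is never tried.
short_first = [m.group(0) for m in re.finditer(r"a|ab", toy_text)]

# With the longer alternative first, the whole 'ab' is matched.
long_first = [m.group(0) for m in re.finditer(r"ab|a", toy_text)]

print(short_first)  # ['a', 'a']
print(long_first)   # ['ab', 'ab']
```

As a rule of thumb, more specific alternatives should come before more general ones.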

You can use the regular expression-based tokeniser to simulate the whitespace-based tokeniser as follows:

In [None]:
# Regular expression that matches maximal runs of non-whitespace characters
regex = r'\S+'

tokens_re = tokenize_re(regex, text1)

print("Errors: {}".format(lt0.n_errors(gold1, tokens_re)))
print("Precision: {:.2%}".format(lt0.precision(gold1, tokens_re)))
print("Recall: {:.2%}".format(lt0.recall(gold1, tokens_re)))

# In order to debug the regex, you might want to uncomment the next line.
# lt0.diff(gold1, tokens_re)

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Find a regular expression that eliminates as many differences between the gold standard and the automatic tokenisation as possible. Use the insights from your qualitative evaluation in Problem&nbsp;1. Your final tokeniser should have at least 99.5% precision and recall.
</div>
</div>

Here are some hints that may help you solve this problem:

* Read the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) and [the documentation for the module `re`](https://docs.python.org/3.5/library/re.html).

* If you want to use grouping sub-expressions, you might want to use *non-capturing* groups.

* If your expression gets too long and hard to read, have a look at [`re.VERBOSE`](https://docs.python.org/3.5/library/re.html#re.VERBOSE) for writing the expression over multiple lines.

* If you want to practise your regex skills a little more, hop over to [RegexOne](https://regexone.com) or [RegExr](http://regexr.com).
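The hints about non-capturing groups and `re.VERBOSE` can be illustrated with a short sketch. The pattern below is a toy example that matches numbers with an optional decimal comma or runs of word characters; it is not meant as a solution to Problem 2, and all names and sample strings are made up for this example.

```python
import re

# With a capturing group, re.findall() returns the group, not the whole match.
capturing = re.findall(r"\d+(,\d+)?", "3,14 and 42")

# With a non-capturing group (?:...), the whole match is returned.
non_capturing = re.findall(r"\d+(?:,\d+)?", "3,14 and 42")

print(capturing)      # [',14', '']
print(non_capturing)  # ['3,14', '42']

# re.VERBOSE ignores whitespace in the pattern and allows comments.
toy_regex = re.compile(r"""
    \d+(?:,\d+)?    # a number, optionally with a decimal comma
  | \w+             # any other run of word characters
""", re.VERBOSE)

toy_matches = [m.group(0) for m in toy_regex.finditer("pi is 3,14 roughly")]
print(toy_matches)    # ['pi', 'is', '3,14', 'roughly']
```

A compiled pattern such as `toy_regex` can be passed to `re.finditer()` in place of a pattern string.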

## Evaluate the tokeniser on new text

Your last task is to evaluate your tokeniser on another article from Swedish Wikipedia: [Katarina II av Ryssland](https://sv.wikipedia.org/wiki/Katarina_II_av_Ryssland). (She was Gustav&nbsp;III&rsquo;s cousin.)

The raw text and the gold standard tokenisation for this second text are loaded like this:

In [None]:
text2 = read_data("/courses/729G17/labs/l0/data/text2.txt")
gold2 = read_data("/courses/729G17/labs/l0/data/gold2.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 3 (Reflection)</div>
<div class="panel-body">
    <p>Apply your regular expression tokeniser from Problem&nbsp;2 to the new text and analyse the results. Write a short (ca. 150&nbsp;words) report about your experience. Use the following prompts:</p>
    <ul>
        <li>What did you do to solve this problem? What were the results?</li>
        <li>Which concepts from the other problems relate to this new problem? Which new issues arise?</li>
        <li>What did you learn from this new problem? How, exactly, did you learn it? Why does this learning matter?</li>
    </ul>
</div>
</div>

In [None]:
# TODO: Add code here to solve Problem 3

*TODO: Insert your report for Problem&nbsp;3 here (ca. 150 words)*

<div class="alert alert-info">
    Do not forget to read the General Information section on the lab web page before submitting this notebook!
</div>