
TDDE09 Natural Language Processing


This page contains the instructions for the lab assignments, and specifies the central concepts and procedures that you are supposed to master after each assignment. For more information about how these contents are examined, see the page on Examination.

General information

Lab assignments should be done in pairs. Please contact the examiner in case you want to work on your own. Unfortunately, we do not generally have the resources necessary to tutor and give feedback on one-person labs.

Instructions: Submit your labs according to the instructions below. Please also read the Rules for hand-in assignments. Before you submit your first lab, you and your lab partner need to sign up in Webreg.

Format of the subject line: TDDE09-2019 lab code your LiU-ID your partner’s LiU-ID your lab assistant’s LiU-ID

Example: TDDE09-2019 L1 marjo123 erika456 fooba99

Lab assistants for this course:

  • Marco Kuhlmann: marku61
  • Riley Capshaw: rilca16
  • Robin Kurtz: robku08

Feedback: For each lab there are a number of scheduled hours where you can get oral feedback on your work from the lab assistants. If you submit in time for the first due date, you will also get written feedback. In addition, you can always get feedback from the examiner (drop-in office hours Thursdays 13:15-15 in building E, room 3G.476 – or book an appointment).

Information about notebooks

This course uses Jupyter notebooks for some of the lab assignments. Notebooks let you write and execute Python code in a web browser, and they make it very easy to mix code and text.

Lab environment. To work on a notebook, you need to be logged into LiU’s Linux system, which is installed in the computer labs at IDA, among other places. Remote access via ThinLinc is currently not recommended. At the start of each lab session, you have to activate the course’s lab environment by writing the following at the terminal prompt:

source /courses/TDDE09/labs/environment/bin/activate

Download and open the notebook. To start a new notebook, say L1.ipynb, download the notebook file to your computer and issue the following command at the terminal prompt:

jupyter notebook L1.ipynb

This will show the notebook in your web browser.

Rename the notebook. One of the first things that you should do with a notebook is to rename it, such that we can link the file to your LiU-IDs. Click on the notebook name (next to the Jupyter logo at the top of the browser page) and add your LiU-IDs, like so:

L1-marjo123-erika456

How to work with a notebook. Each notebook consists of a number of so-called cells, which may contain code or text. During the lab you write your own code or text into the cells according to the instructions. When you ‘run’ a code cell (by pressing Shift+Enter), you execute the code in that cell. The output of the code will be shown immediately below the cell.

Check the notebook and submit it. When you are done with a notebook, you should click on Kernel > Restart & Run All to run the code in the notebook and verify that everything works as expected and there are no errors. After this check you can save the notebook and submit it. Before doing so, please read the Rules for hand-in assignments.

Topic 0: Text segmentation

Text segmentation is the task of segmenting a text into linguistically meaningful units, such as paragraphs, sentences, or words.

Level A

When the target units of text segmentation are words or word-like units, text segmentation is called tokenisation. In this lab you will implement a simple tokeniser for text extracted from Wikipedia articles. The lab also gives you an opportunity to acquaint yourself with the technical framework that we will be using for the remainder of the lab series.

Lab L0: Text segmentation (due 2019-01-25)

After this lab you should be able to explain and apply the following concepts:

  • tokenisation
  • undersegmentation, oversegmentation
  • precision, recall

After this lab you should be able to perform the following procedures:

  • segment text into tokens (words and word-like units) using regular expressions
  • compare an automatic tokenisation with a gold standard
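As an illustration, a minimal tokeniser of this kind, together with a span-based precision/recall comparison against a gold standard, might look as follows. The regular expression and the span-based evaluation scheme are simplifying assumptions for this sketch, not the required solution:

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # a run of word characters, or one punctuation mark

def tokenise(text):
    # Return tokens as (start, end) character spans, so that two
    # tokenisations of the same text can be compared directly.
    return [m.span() for m in TOKEN.finditer(text)]

def precision_recall(predicted, gold):
    # A predicted token counts as correct only if exactly the same span
    # occurs in the gold standard; oversegmentation and undersegmentation
    # both show up as mismatched spans.
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For example, if the gold standard treats "world!" as a single token, the tokeniser above oversegments it into two tokens, which lowers both precision and recall.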

Topic 1: Text classification

Text classification is the task of categorising text documents into predefined classes.

Level A

In this lab you will implement and evaluate two text classifiers: one based on the averaged perceptron, and one based on logistic regression. You will evaluate these classifiers using accuracy. The concrete task that you will be working with is sentiment analysis, and more specifically the task of classifying movie reviews as either positive or negative.

Lab L1: Text classification (due 2019-02-01)

After this lab you should be able to explain and apply the following concepts:

  • accuracy, baseline
  • averaged perceptron classifier
  • logistic regression classifier

After this lab you should be able to perform the following procedures:

  • evaluate a text classifier based on accuracy
  • train an averaged perceptron classifier from data
  • train a logistic regression classifier from data
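To give a flavour of the first of these, a bare-bones averaged perceptron for binary classification over sparse feature dictionaries might be sketched as follows. The feature representation and the eager averaging loop are illustrative assumptions; the lab may prescribe a different interface:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Binary classifier over sparse feature dicts; labels are +1 / -1."""

    def __init__(self):
        self.weights = defaultdict(float)  # current weight vector
        self.totals = defaultdict(float)   # running sum of weight vectors
        self.steps = 0                     # number of training examples seen

    def score(self, features):
        return sum(self.weights[f] * v for f, v in features.items())

    def predict(self, features):
        return 1 if self.score(features) >= 0 else -1

    def update(self, features, label):
        # Standard perceptron update on a mistake; the running totals
        # implement the averaging that makes the classifier more stable.
        self.steps += 1
        if self.predict(features) != label:
            for f, v in features.items():
                self.weights[f] += label * v
        for f, v in self.weights.items():
            self.totals[f] += v

    def average(self):
        # Replace each weight by its average over all training steps.
        for f in self.weights:
            self.weights[f] = self.totals[f] / self.steps
```

In practice the averaging is usually implemented lazily for efficiency; the eager loop above is only there to keep the idea visible.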

Level B

In the level-A lab you implemented the logistic regression classifier using the PyTorch library. While this library is very useful (and will be even more useful when we move to more complicated models), it is important to realise that there is some ‘magic’ going on behind the scenes. In this lab you will implement logistic regression without this magic, using only a library for vector operations (NumPy).

Lab L1X: The logistic regression classifier in detail (due 2019-02-01)

After this lab you should be able to explain and apply the following concepts:

  • logistic regression classifier
  • gradient-based learning

After this lab you should be able to perform the following procedures:

  • train a logistic regression classifier from data
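For concreteness, batch gradient descent on the binary cross-entropy loss fits in a few lines of NumPy; the learning rate and epoch count below are arbitrary illustrative values, not recommendations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    # X: (n_samples, n_features) array; y: (n_samples,) array in {0, 1}.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)           # predicted probabilities
        error = p - y                    # gradient of cross-entropy wrt the logit
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)
```

The whole point of the exercise is that the gradient X^T (p - y) / n, which PyTorch would compute for you by automatic differentiation, is written out explicitly here.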

Topic 2: Language modelling

Language modelling is about building models of which words are more or less likely to occur in some language.

Level A

In this lab you will implement and experiment with n-gram models. You will test various parameters that influence these models’ quality and estimate models using maximum likelihood estimation with additive smoothing. The data set that you will be working on is the set of Arthur Conan Doyle’s novels about Sherlock Holmes.

Lab L2: Language modelling (due 2019-02-08)

After this lab you should be able to explain and apply the following concepts:

  • n-gram model
  • entropy
  • additive smoothing

After this lab you should be able to perform the following procedures:

  • estimate n-gram probabilities using the Maximum Likelihood method
  • estimate n-gram probabilities using additive smoothing
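The add-k estimate itself is a one-liner; the toy corpus below is only there to make the formula concrete (using unigram counts for the history count c(w1) is a simplification that works for this example):

```python
from collections import Counter

def bigram_probability(w1, w2, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # Additive (add-k) smoothing:
    #   P(w2 | w1) = (c(w1, w2) + k) / (c(w1) + k * |V|)
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)

tokens = "the dog saw the cat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)  # 4 types: the, dog, saw, cat
```

With k = 0 this reduces to the maximum likelihood estimate, which assigns probability zero to every unseen bigram – exactly the problem that smoothing is there to fix.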

Level C

To enter Chinese text, most younger users type the pronunciation of each character in Pinyin on a standard QWERTY keyboard. This became possible with the advent of language models that automatically guess what the right character is. In this assignment, you will write a program that reads pronunciations (simulating what a user might type) and predicts what the correct Chinese characters are.

Lab L2X: Chinese character prediction (due 2019-02-08)

After this lab you should be able to explain and apply the following concepts:

  • n-gram model
  • Witten–Bell smoothing

After this lab you should be able to perform the following procedures:

  • implement an n-gram model with Witten–Bell smoothing
  • implement the core parts of an autocompletion algorithm
  • evaluate a language model in the context of an autocompletion application
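In its interpolated form, Witten–Bell smoothing mixes the maximum likelihood bigram estimate with a unigram distribution, giving more weight to the latter for histories that have been followed by many distinct types. A sketch, where the data structures passed in are illustrative assumptions:

```python
def witten_bell(history, word, bigrams, history_counts, followers, unigram_prob):
    # Interpolated Witten–Bell smoothing:
    #   lambda(h) = c(h) / (c(h) + T(h))
    #   P(w | h)  = lambda(h) * c(h, w) / c(h) + (1 - lambda(h)) * P(w)
    # where T(h) is the number of distinct word types observed after h.
    c_h = history_counts[history]
    if c_h == 0:
        return unigram_prob(word)  # unseen history: back off entirely
    t_h = followers[history]
    lam = c_h / (c_h + t_h)
    return lam * bigrams[(history, word)] / c_h + (1 - lam) * unigram_prob(word)
```

Unlike additive smoothing, no constant k has to be chosen by hand: the interpolation weight is read off the training data itself.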

Topic 3: Part-of-speech tagging

Part-of-speech tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb.

Level A

In this lab you will implement part-of-speech taggers based on the averaged perceptron and a neural network architecture and evaluate them on the English treebank from the Universal Dependencies Project, a corpus containing more than 16,000 sentences (254,000 tokens) annotated with, among other things, parts of speech.

Lab L3: Part-of-speech tagging (due 2019-02-20; extended from 2019-02-15)

After this lab you should be able to explain and apply the following concepts:

  • part-of-speech tagging as sequence labelling
  • averaged perceptron classifier
  • neural network architecture for part-of-speech tagging

After this lab you should be able to perform the following procedures:

  • implement a part-of-speech tagger based on the averaged perceptron
  • implement a part-of-speech tagger based on a neural network architecture
  • evaluate a part-of-speech tagger based on accuracy
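The sequence-labelling view can be sketched as a greedy left-to-right loop in which a classifier (for instance the averaged perceptron) tags one token at a time, with the previously predicted tag available as a feature. The `classify` interface and the feature names below are assumptions made for illustration:

```python
def tag_sentence(words, classify):
    # Greedy left-to-right tagging: `classify` maps a feature dict to a tag.
    tags = []
    for word in words:
        features = {
            "word=" + word: 1,
            "prev_tag=" + (tags[-1] if tags else "<s>"): 1,
        }
        tags.append(classify(features))
    return tags

def accuracy(predicted, gold):
    # Token-level accuracy: fraction of tokens with the correct tag.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Because each decision sees only the tags predicted so far, an early mistake can propagate to later tokens; this is the price of the greedy strategy.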

Level B

In the advanced part of this lab, you will practise your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case the part-of-speech tagger that you implemented in the Level A lab.

Lab L3X: Feature engineering for part-of-speech tagging (due 2019-02-22; extended from 2019-02-15) (same file as for Level A)

After this lab you should be able to explain and apply the following concepts:

  • averaged perceptron classifier
  • feature engineering

After this lab you should be able to perform the following procedures:

  • improve a part-of-speech tagger using feature engineering
  • evaluate a part-of-speech tagger based on accuracy
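Feature engineering in this setting amounts to choosing feature templates like the ones below. Every template here is a common but purely illustrative choice; which of them actually improve your tagger is exactly the empirical question the lab asks:

```python
def extract_features(words, i, prev_tag):
    # Some typical feature templates for a perceptron part-of-speech tagger.
    word = words[i]
    return {
        "bias": 1,
        "word=" + word.lower(): 1,
        "suffix3=" + word.lower()[-3:]: 1,  # crude morphology
        "prev_tag=" + prev_tag: 1,
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
        "is_capitalised=" + str(word[0].isupper()): 1,
    }
```

A sensible workflow is to add one template at a time and keep it only if tagging accuracy on held-out data goes up.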

Topic 4: Syntactic analysis

Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure.

Level A

In this lab you will implement two transition-based dependency parsers, one based on the averaged perceptron and one based on a neural network, and evaluate them on the English Web Treebank from the Universal Dependencies Project.

Lab L4: Syntactic analysis (due 2019-02-22)

After this lab you should be able to explain and apply the following concepts:

  • transition-based dependency parsing
  • averaged perceptron classifier
  • neural network architecture for dependency parsing
  • unlabelled attachment score

After this lab you should be able to perform the following procedures:

  • implement a transition-based dependency parser based on the averaged perceptron
  • implement a transition-based dependency parser based on a neural network architecture
  • evaluate a dependency parser based on unlabelled attachment score
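A skeleton of the transition system might look as follows. The arc-standard moves (SHIFT, LEFT-ARC, RIGHT-ARC) are the standard ones, but the classifier interface `next_move` is an assumption for illustration; evaluating the result by unlabelled attachment score is then a one-liner:

```python
def parse(words, next_move):
    # Arc-standard parsing. Words are numbered 1..n; 0 is the artificial root.
    # `next_move` maps the current (stack, buffer) to "SH", "LA" or "RA".
    stack, buffer = [0], list(range(1, len(words) + 1))
    heads = [0] * (len(words) + 1)
    while buffer or len(stack) > 1:
        move = next_move(stack, buffer)
        if move == "SH":
            stack.append(buffer.pop(0))
        elif move == "LA":                # second-topmost takes topmost as head
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        else:                             # "RA": topmost takes second-topmost
            dependent = stack.pop()
            heads[dependent] = stack[-1]
    return heads[1:]

def uas(predicted_heads, gold_heads):
    # Unlabelled attachment score: fraction of words with the correct head.
    return sum(p == g for p, g in zip(predicted_heads, gold_heads)) / len(gold_heads)
```

For the sentence "the dog barks", the move sequence SH, SH, LA, SH, LA, RA attaches "the" to "dog", "dog" to "barks", and "barks" to the root.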

Level C

In this lab you will implement the Eisner algorithm for projective dependency parsing and use it to transform (possibly non-projective) dependency trees to projective trees. This transformation is necessary to be able to apply algorithms for projective dependency parsing to treebanks that may contain non-projective trees.

Lab L4X: Projectivisation (due 2019-02-22)

After this lab you should be able to explain and apply the following concepts:

  • Eisner algorithm
  • projectivisation, lifting

After this lab you should be able to perform the following procedures:

  • implement the Eisner algorithm
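The Eisner algorithm itself does not fit into a short sketch, but the property everything revolves around does: a dependency tree is projective if and only if no two of its arcs cross when drawn above the sentence. A check along those lines, assuming a simple head-list encoding of the tree, might look like this:

```python
def is_projective(heads):
    # heads[d] is the head of word d for d = 1..n; index 0 is the root.
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads) if d > 0]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:  # the two arcs cross
                return False
    return True
```

Projectivisation (lifting) repeatedly re-attaches the dependent of a crossing arc higher up in the tree until this check succeeds.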

Topic 5: Semantic analysis

These labs focus on word embeddings. A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of some linguistic property.

Level A

In this lab you will implement a simple word embedding algorithm via truncated singular value decomposition. You will use the algorithm to construct a concrete embedding from the English Wikipedia and explore the resulting word vectors.

Lab L5: Semantic analysis (due 2019-03-01)

After this lab you should be able to explain and apply the following concepts:

  • word embedding, cosine distance
  • co-occurrence matrix, positive pointwise mutual information
  • truncated singular value decomposition

After this lab you should be able to perform the following procedures:

  • derive a PPMI matrix from a document collection
  • derive word embeddings using truncated singular value decomposition
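With NumPy, both steps fit in a few lines. The dense matrix representation and the scaling of the left singular vectors by the singular values are simplifying choices for illustration:

```python
import numpy as np

def ppmi(counts):
    # counts: (n_words, n_contexts) co-occurrence matrix with no empty
    # rows or columns. PPMI = max(0, log [P(w, c) / (P(w) P(c))]).
    total = counts.sum()
    word_p = counts.sum(axis=1, keepdims=True) / total
    ctx_p = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / (word_p * ctx_p))
    return np.maximum(pmi, 0.0)  # clips negatives and the -inf from zero counts

def embed(counts, dim):
    # Truncated SVD: keep the first `dim` left singular vectors,
    # scaled by their singular values.
    u, s, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    return u[:, :dim] * s[:dim]
```

Each row of the result is the embedding of one word; similar rows of the PPMI matrix end up as nearby points in the truncated space.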

Level B

In this lab you will use state-of-the-art NLP libraries to train word space models on text data and evaluate them on a standard task, the synonym part of the Test of English as a Foreign Language (TOEFL).

Lab L5X: Synonym prediction (due 2019-03-01)

After this lab you should be able to explain and apply the following concepts:

  • word space model, cosine distance, semantic similarity
  • accuracy

After this lab you should be able to perform the following procedures:

  • train a word embedding on text data
  • use a word embedding to solve a synonym prediction task
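At evaluation time, a TOEFL synonym item reduces to a nearest-neighbour choice under cosine similarity; the word-to-vector mapping below, and the toy vectors in the test, are made up for illustration:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity; cosine distance is 1 minus this value.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def choose_synonym(target, candidates, vectors):
    # Pick the candidate whose vector is closest (by cosine similarity)
    # to the target word's vector. `vectors` maps words to NumPy arrays.
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))
```

Accuracy on the TOEFL task is then simply the fraction of items for which the chosen candidate is the gold-standard synonym.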

Page responsible: Marco Kuhlmann
Last updated: 2019-01-14