
TDP030 Language Technology


This page contains the instructions for the lab assignments, and specifies the central concepts and procedures that you are supposed to master after each assignment. For more information about how these contents are examined, see the page on Examination.

General information

Lab assignments should be done in pairs. Please contact the examiner in case you want to work on your own. Unfortunately, we do not generally have the resources necessary to tutor and give feedback on one-person labs.

Come prepared! We strongly recommend that you have a look at the lab instructions before you come to the tutored lab sessions. Otherwise you will have to spend time reading the instructions on-site, and will have less time to ask questions and get help.

Instructions: Submit your labs according to the instructions below. Please also read the general rules for hand-in assignments. Before you submit your first lab, you and your lab partner need to sign up in Webreg.

Format of the subject line: TDP030-2018 lab code your LiU-ID your partner’s LiU-ID your lab assistant’s LiU-ID

Example: TDP030-2018 L1 marjo123 erika456 fooba99

Lab assistants for this course:

  • Alice Reinaudo: alire41
  • Fabian Isaksson: fabis254
  • Marco Kuhlmann: marku61
  • Riley Capshaw: rilca426
  • Robin Kurtz: robku08
  • Wiktor Strandqvist: wikst813

Feedback: For each lab there are a number of scheduled hours where you can get oral feedback on your work from the lab assistants. If you submit in time for the first due date, you will also get written feedback. In addition, you can always get feedback from the examiner (drop-in office hours Thursdays 13:15-15 in building E, room 3G.476 – or book an appointment).

Information about notebooks

This course uses Jupyter notebooks for some of the lab assignments. Notebooks let you write and execute Python code in a web browser, and they make it very easy to mix code and text.

Lab environment. To work on a notebook, you need to be logged into one of IDA’s computers, either on-site or via ThinLinc. At the start of each lab session, you have to activate the course’s lab environment by writing the following at the terminal prompt:

source /home/TDP030/labs/environment/bin/activate

Download and open the notebook. To start a new notebook, say L1.ipynb, download the notebook file to your computer and issue the following command at the terminal prompt.

jupyter notebook L1.ipynb

This will show the notebook in your web browser.

Rename the notebook. One of the first things that you should do with a notebook is to rename it, so that we can link the file to your LiU-IDs. Click on the notebook name (next to the Jupyter logo at the top of the browser page) and add your LiU-IDs, like so:

L1-marjo123-erika456

How to work with a notebook. Each notebook consists of a number of so-called cells, which may contain code or text. During the lab you write your own code or text into the cells according to the instructions. When you ‘run’ a code cell (by pressing Shift+Enter), you execute the code in that cell. The output of the code will be shown immediately below the cell.

Check the notebook and submit it. When you are done with a notebook, you should click on Kernel > Restart & Run All to run the code in the notebook and verify that everything works as expected and there are no errors. After this check you can save the notebook and submit it according to the instructions below.

Topic 0: Text segmentation

Text segmentation is the task of segmenting a text into linguistically meaningful units, such as paragraphs, sentences, or words.

Level A

When the target units of text segmentation are words or word-like units, text segmentation is called tokenisation. In this lab you will implement a simple tokeniser for text extracted from Wikipedia articles. The lab also gives you an opportunity to acquaint yourself with the technical framework that we will be using for the remainder of the lab series.

Lab L0: Text segmentation (due 2018-01-22)

Content

After this lab you should be able to explain and apply the following concepts:

  • tokenisation
  • undersegmentation, oversegmentation
  • precision, recall

After this lab you should be able to perform the following procedures:

  • segment text into tokens (words and word-like units) using regular expressions
  • compare an automatic tokenisation with a gold standard
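
To make the procedures above concrete, here is a minimal sketch of a regular-expression tokeniser and its comparison with a gold standard. The pattern, the example sentence, and the gold-standard spans are made up for illustration; the lab's own data and requirements may differ.

```python
import re

# Hypothetical pattern: a token is a run of word characters or a single
# non-space symbol. A real tokeniser will need more careful rules.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenise(text):
    """Return the tokens of `text` as (start, end) character spans."""
    return [match.span() for match in TOKEN_PATTERN.finditer(text)]

def precision_recall(predicted, gold):
    """Compare predicted token spans with gold-standard token spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

text = "Dr. Watson didn't go."
# Made-up gold standard: "Dr.", "Watson", "didn't", "go", "."
gold = [(0, 3), (4, 10), (11, 17), (18, 20), (20, 21)]
precision, recall = precision_recall(tokenise(text), gold)
print(precision, recall)  # 0.375 0.6
```

The simple pattern splits "Dr." and "didn't" into several tokens (oversegmentation), so only 3 of its 8 predicted tokens match the 5 gold-standard tokens.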

Topic 1: Text classification

Text classification is the task of categorising text documents into predefined classes.

Level A

In this lab you will evaluate a text classifier using accuracy, precision, and recall; compare a trained classifier to a simple baseline; and implement the Naive Bayes classification rule. The concrete task that you will be working with is to classify speeches from the Swedish parliament as either right-wing or left-wing.

Lab L1: Text classification (Level A) (due 2018-01-29)

Content

After this lab you should be able to explain and apply the following concepts:

  • accuracy, precision, recall
  • baseline
  • Naive Bayes classifier

After this lab you should be able to perform the following procedures:

  • evaluate a text classifier based on accuracy, precision, and recall
  • compare a text classifier to a baseline
  • apply the classification rule of the Naive Bayes classifier to a text
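
As a rough illustration of these procedures, the sketch below applies the Naive Bayes classification rule with log probabilities and computes accuracy, precision, and recall. The class names, words, and probabilities are invented toy values, and skipping words outside the vocabulary is an assumption of this sketch, not necessarily what the lab prescribes.

```python
import math

def predict(tokens, log_priors, log_likelihoods):
    """Return the class c that maximises log P(c) + sum of log P(w | c).

    Assumes all classes share one vocabulary; words outside it are skipped.
    """
    best_class, best_score = None, -math.inf
    for c in log_priors:
        score = log_priors[c]
        for w in tokens:
            if w in log_likelihoods[c]:
                score += log_likelihoods[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

def evaluate(gold, predicted, target):
    """Return accuracy, and precision and recall with respect to `target`."""
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    true_positives = sum(g == target and p == target for g, p in pairs)
    n_predicted = sum(p == target for p in predicted)
    n_gold = sum(g == target for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return accuracy, precision, recall

# Invented toy model with two classes, "L" and "R".
log_priors = {"L": math.log(0.5), "R": math.log(0.5)}
log_likelihoods = {
    "L": {"tax": math.log(0.2), "welfare": math.log(0.8)},
    "R": {"tax": math.log(0.7), "welfare": math.log(0.3)},
}
print(predict(["tax", "tax", "welfare"], log_priors, log_likelihoods))  # R
print(evaluate(["L", "L", "R", "R"], ["L", "R", "R", "R"], target="R"))
```

A simple baseline to compare against would be a classifier that always predicts the most frequent class in the training data.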

Level B

In this lab you will complete your implementation of the Naive Bayes classifier with a training procedure. You will also see how different document representations affect the accuracy of the classifier. The classification task that you will be working with is to classify movie reviews as either positive or negative.

Lab L1X: Text classification (Level B) (due 2018-01-29)

Content

After this lab you should be able to explain and apply the following concepts:

  • accuracy
  • Naive Bayes classifier
  • different document representations (advanced)

After this lab you should be able to perform the following procedures:

  • evaluate a text classifier based on accuracy
  • learn a Naive Bayes classifier from data
  • compare different document representations for text classification (advanced)
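
One possible way to learn a Naive Bayes classifier from data is sketched below. It assumes that documents are given as (tokens, label) pairs and uses additive smoothing with a constant `k`; both the data format and the choice of smoothing are assumptions of this sketch, and the two toy documents are made up.

```python
import math
from collections import Counter, defaultdict

def train(documents, k=1.0):
    """Estimate log priors and smoothed log likelihoods.

    `documents` is a list of (tokens, label) pairs; `k` is the smoothing
    constant (k=1 is add-one smoothing).
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_documents = sum(class_counts.values())
    log_priors = {c: math.log(n / total_documents)
                  for c, n in class_counts.items()}

    log_likelihoods = {}
    for c in class_counts:
        denominator = sum(word_counts[c].values()) + k * len(vocabulary)
        log_likelihoods[c] = {
            w: math.log((word_counts[c][w] + k) / denominator)
            for w in vocabulary
        }
    return log_priors, log_likelihoods

# Made-up toy training data.
documents = [(["good", "good"], "pos"), (["bad"], "neg")]
log_priors, log_likelihoods = train(documents)
print(math.exp(log_likelihoods["pos"]["good"]))  # approximately 0.75
```

To try a different document representation, you could, for example, pass `set(tokens)` instead of `tokens`, so that each word counts at most once per document.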

Topic 2: Language modelling

Language modelling is about building models of what words are more or less likely to occur in some language.

Level A

In this lab you will experiment with n-gram models. You will test various parameters that influence these models’ quality and estimate models using maximum likelihood estimation with additive smoothing. The data set that you will be working on is the set of Arthur Conan Doyle’s novels about Sherlock Holmes.

Lab L2: Language modelling (due 2018-02-05)

Content

After this lab you should be able to explain and apply the following concepts:

  • n-gram model
  • entropy
  • additive smoothing

After this lab you should be able to perform the following procedures:

  • estimate n-gram probabilities using the Maximum Likelihood method
  • estimate n-gram probabilities using additive smoothing
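
The two estimation procedures can be sketched for a bigram model as follows. The function names and the five-word toy text are hypothetical; setting the smoothing constant `k` to 0 gives the maximum likelihood estimate, and any `k` greater than 0 gives additive smoothing.

```python
import math
from collections import Counter

def train_bigrams(tokens):
    """Count bigrams and their one-word contexts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return bigrams, contexts, set(tokens)

def probability(word, context, bigrams, contexts, vocabulary, k=0.0):
    """Estimate P(word | context); k=0 is the maximum likelihood estimate."""
    numerator = bigrams[(context, word)] + k
    denominator = contexts[context] + k * len(vocabulary)
    return numerator / denominator if denominator else 0.0

def entropy(tokens, bigrams, contexts, vocabulary, k=0.0):
    """Average negative log2 probability per bigram in `tokens`.

    Fails with a math domain error if some bigram has probability 0,
    which is one reason to use smoothing on unseen text.
    """
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(math.log2(probability(w, c, bigrams, contexts, vocabulary, k))
                for c, w in pairs)
    return -total / len(pairs)

tokens = "the dog saw the cat".split()
bigrams, contexts, vocabulary = train_bigrams(tokens)
print(probability("dog", "the", bigrams, contexts, vocabulary))       # 0.5
print(probability("saw", "the", bigrams, contexts, vocabulary))       # 0.0
print(probability("saw", "the", bigrams, contexts, vocabulary, k=1))  # 1/6
print(entropy(tokens, bigrams, contexts, vocabulary))                 # 0.5
```

With smoothing, the unseen bigram "the saw" receives a small non-zero probability, at the cost of lowering the probabilities of the bigrams that were actually observed.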

Level C

In this lab you will implement a simple spelling corrector. The core of the implementation is the Wagner–Fischer algorithm for computing the Levenshtein distance between two input words. The data set that you will be working with is the same collection of Sherlock Holmes novels as for the Level A lab.

Lab L2X: Spelling correction (due 2018-02-05)

Materials
Content

After this lab you should be able to explain and apply the following concepts:

  • n-gram model
  • Levenshtein distance
  • Wagner–Fischer algorithm (advanced)

After this lab you should be able to perform the following procedures:

  • implement a limited-size language technology system (advanced)
  • implement the Wagner–Fischer algorithm (advanced)
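
For orientation, here is a compact sketch of the Wagner–Fischer algorithm. It assumes that insertions, deletions, and substitutions all cost 1; the lab may specify other costs, so treat this as an illustration and not as a reference solution.

```python
def levenshtein(source, target):
    """Compute the Levenshtein distance with the Wagner-Fischer algorithm.

    Assumes unit costs for insertion, deletion, and substitution.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # distance[i][j] = distance between source[:i] and target[:j]
    distance = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        distance[i][0] = i  # delete all i characters
    for j in range(cols):
        distance[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,                 # deletion
                distance[i][j - 1] + 1,                 # insertion
                distance[i - 1][j - 1] + substitution,  # match or substitution
            )
    return distance[-1][-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A spelling corrector could use such a function to rank candidate words by their distance to a misspelled input, possibly in combination with n-gram model probabilities.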

Topic 3: Part-of-speech tagging

Part-of-speech tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb.

Level A

In this lab you will experiment with POS taggers trained on the Stockholm Umeå Corpus (SUC), a Swedish corpus containing more than 74,000 sentences (1.1 million tokens), which were manually annotated with, among others, parts of speech.

Lab L3: Part-of-speech tagging (due 2018-02-12)

Content

After this lab you should be able to explain and apply the following concepts:

  • part-of-speech tagging as a sequence labelling task
  • accuracy, precision, recall
  • confusion matrix, error analysis
  • baseline

After this lab you should be able to perform the following procedures:

  • evaluate a part-of-speech tagger based on accuracy, precision, and recall
  • establish a baseline for a part-of-speech tagger
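
A common baseline for part-of-speech tagging assigns each word the tag it most often had in the training data. The sketch below shows this idea together with a simple confusion matrix; the tiny training set and its tags are invented for illustration and are not taken from SUC.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to its most frequent tag; also find the overall most frequent tag."""
    tag_counts_per_word = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts_per_word[word][tag] += 1
            tag_counts[tag] += 1
    default_tag = tag_counts.most_common(1)[0][0]
    lexicon = {word: counts.most_common(1)[0][0]
               for word, counts in tag_counts_per_word.items()}
    return lexicon, default_tag

def tag(words, lexicon, default_tag):
    """Tag known words from the lexicon and unknown words with the default tag."""
    return [lexicon.get(word, default_tag) for word in words]

def confusion_matrix(gold_tags, predicted_tags):
    """Count how often each gold tag was predicted as each tag."""
    return Counter(zip(gold_tags, predicted_tags))

# Invented toy training data.
training = [
    [("en", "DT"), ("hund", "NN")],
    [("en", "DT"), ("stor", "JJ"), ("katt", "NN")],
    [("hund", "NN"), ("springer", "VB")],
]
lexicon, default_tag = train_baseline(training)
predicted = tag(["en", "fisk", "sover"], lexicon, default_tag)
print(predicted)  # ['DT', 'NN', 'NN']
print(confusion_matrix(["DT", "NN", "VB"], predicted))
```

The off-diagonal entries of the confusion matrix, here the pair ("VB", "NN"), are a natural starting point for error analysis.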

Level B

In the advanced part of this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case the part-of-speech tagger that you worked with in the Level A lab.

Lab L3X: Feature engineering for part-of-speech tagging (due 2018-02-12)

Material
Content

After this lab you should be able to explain and apply the following concepts:

  • averaged perceptron classifier (advanced)
  • feature engineering (advanced)

After this lab you should be able to perform the following procedures:

  • improve a part-of-speech tagger using feature engineering (advanced)
  • evaluate a part-of-speech tagger based on accuracy
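
To give an idea of what feature engineering can look like in code, the sketch below defines a hypothetical feature extraction function for a tagger that predicts one tag at a time from left to right. The function name, its signature, and the features themselves are assumptions for illustration; the tagger used in the lab may expect a different interface.

```python
def extract_features(words, i, previous_tag):
    """Return string-valued features for the word at position i.

    Hypothetical interface: `words` is the sentence, `previous_tag` is the
    tag predicted for the preceding word.
    """
    word = words[i]
    previous_word = words[i - 1].lower() if i > 0 else "<BOS>"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "<EOS>"
    return [
        "word=" + word.lower(),
        "suffix3=" + word[-3:].lower(),
        "is_capitalised=" + str(word[:1].isupper()),
        "prev_tag=" + previous_tag,
        "prev_word=" + previous_word,
        "next_word=" + next_word,
    ]

print(extract_features(["Hunden", "springer"], 1, "NN"))
```

Whether a given feature actually helps is an empirical question: add or remove it, retrain the tagger, and compare the accuracy on held-out data.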

Topic 4: Syntactic analysis

Syntactic analysis is the task of mapping a sentence to a formal representation of its syntactic structure.

Level A

In this lab you will experiment with MaltParser, a standard tool for syntactic analysis. You will learn how to train MaltParser on treebank data, write code to evaluate the trained parser using standard evaluation measures, and reflect on how these evaluation measures change when we use automatically predicted tags instead of gold-standard tags for training.

Lab L4: Syntactic analysis (due 2018-02-19)

Concepts
  • dependency tree, dependency parsing
  • attachment score, exact match
Procedures
  • train a state-of-the-art dependency parser on treebank data
  • evaluate a dependency parser based on attachment score and exact match
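
The two evaluation measures can be sketched as follows. The sketch assumes that each dependency tree is represented simply as a list of head positions, one per token, with 0 for the root, and it computes the unlabelled variants of both measures; this representation is an assumption and not the format produced by MaltParser.

```python
def evaluate_parser(gold_trees, predicted_trees):
    """Return (unlabelled attachment score, exact match).

    Each tree is a list of head positions, one per token (0 = root).
    """
    total_tokens = correct_heads = exact_matches = 0
    for gold, predicted in zip(gold_trees, predicted_trees):
        total_tokens += len(gold)
        correct_heads += sum(g == p for g, p in zip(gold, predicted))
        exact_matches += gold == predicted
    attachment_score = correct_heads / total_tokens
    exact_match = exact_matches / len(gold_trees)
    return attachment_score, exact_match

# Made-up toy trees for two sentences.
gold_trees = [[2, 0, 2], [0, 1]]
predicted_trees = [[2, 0, 1], [0, 1]]
print(evaluate_parser(gold_trees, predicted_trees))  # (0.8, 0.5)
```

Four of the five tokens receive the correct head, but only one of the two sentences is parsed entirely correctly, which is why exact match is the stricter measure.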

Level C

In this lab you will use two freely available NLP tools, Stagger and MaltParser, to implement a simple system for information extraction.

Lab L4X: Information extraction (due 2018-02-19)

Concepts
  • part-of-speech tagging, dependency parsing
  • named entities, semantic relations (advanced)
  • IOB-tags (advanced)
Procedures
  • train a state-of-the-art part-of-speech tagger on treebank data
  • train a state-of-the-art dependency parser on treebank data
  • extract semantic triples from running text (advanced)
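
As an illustration of IOB-tags, the sketch below converts a tag sequence into named entity spans. The tag names (B-PER, I-LOC, and so on) and the example sentence are assumptions; the tag set used in the lab may differ.

```python
def iob_to_spans(tags):
    """Convert IOB tags into (start, end, type) spans, with `end` exclusive.

    A stray I- tag without a preceding B- tag of the same type is
    leniently treated as the start of a new entity.
    """
    spans = []
    start = label = None
    # The extra "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans

tokens = ["Sherlock", "Holmes", "lives", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
for start, end, label in iob_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

This prints "PER Sherlock Holmes" and "LOC London". Entity spans like these could then serve as the arguments of the semantic triples.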

Topic 5: Semantic analysis

These labs focus on word space models and semantic similarity.

Level A

In this lab you will explore a word space model which was trained on the Swedish Wikipedia using Google’s word2vec tool. You will learn how to use the model to measure the semantic similarity between words and apply it to solve a simple word analogy task.

Lab L5: Semantic analysis (due 2018-02-26)

Concepts
  • word space model, cosine distance, semantic similarity
  • accuracy
Procedures
  • use a pre-trained word space model to measure the semantic similarity between two words
  • use a pre-trained word space model to solve word analogy tasks
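
The sketch below illustrates both procedures on a hand-made toy model. The two-dimensional vectors are invented so that the example works out; a real word2vec model has hundreds of dimensions and is loaded from a file. Solving analogies by vector offset, as done here, is one common approach and is assumed for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between the vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Invented toy vectors, not taken from any trained model.
vectors = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "äpple": [0.0, 3.0],
}
print(analogy("man", "kvinna", "kung", vectors))  # drottning
```

Accuracy on an analogy task is then the proportion of analogy questions for which the predicted word equals the expected one.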

Level B

In this lab you will use state-of-the-art NLP libraries to train word space models on text data and evaluate them on a standard task, the synonym part of the Test of English as a Foreign Language (TOEFL).

Lab L5X: Semantic analysis (due 2018-02-26)

Concepts
  • word space model, cosine distance, semantic similarity
  • accuracy
Procedures
  • train a word space model on text data (advanced)
  • use a word space model to solve a synonym prediction task (advanced)
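
To show the overall idea without relying on a particular library, the sketch below builds a very simple count-based word space from co-occurrence counts and uses it to answer a synonym question. This is a deliberately simplified stand-in for the models you will train in the lab; the window size, the function names, and the three toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_word_space(sentences, window=2):
    """Represent each word by counts of the words within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            low, high = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(low, high):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def choose_synonym(target, candidates, vectors):
    """Pick the candidate whose vector is most similar to the target's."""
    return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

# Made-up toy corpus.
sentences = [
    ["the", "big", "dog", "barked"],
    ["the", "large", "dog", "barked"],
    ["a", "small", "cat", "slept"],
]
vectors = train_word_space(sentences)
print(choose_synonym("big", ["small", "large"], vectors))  # large
```

On a test such as the TOEFL synonym part, accuracy would be the proportion of questions for which the chosen candidate is the correct synonym.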

Reflection paper

After having completed all lab assignments, you are asked to write an individual reflection paper. The purpose of this assignment is to give you an opportunity to think about what you have learned from the labs. The paper should have three components:

  • your description of your work with the labs, with a focus on those aspects that you consider most important
  • your analysis of your experience based on concepts from the course
  • your conclusions about what you take away from this part of the course

For more detailed information, see the guide on Reflection papers.

Instructions: Write a paper according to the given specifications. The length of your paper should be around 1,000 words (approximately 2 pages). Submit your paper as a PDF document named as follows: TDP030-2018-LR-your LiU-ID.pdf

Due date: 2018-03-17

Format of the subject line: TDP030-2018 LR your LiU-ID marku61

Example: TDP030-2018 LR marjo123 marku61

Feedback: You will get written feedback on your paper from the examiner.


Page responsible: Marco Kuhlmann
Last updated: 2018-01-15