TDP030 Language Technology

Labs

This course website is no longer being maintained. Please refer to Lisam from 2024 onwards.

This page contains the instructions for the lab assignments, and specifies the central concepts and procedures that you are supposed to master after each assignment. For more information about how these contents are examined, see the page on Examination.

General information

Lab assignments are done in pairs. This is an explicit exception to the rules on Cheating and plagiarism. Before submitting your first lab, you and your lab partner must sign up in Webreg. If you cannot find a lab partner, let us know before the first lab session, and we will pair you up with a random student.

Submission and grading: Submit the required files through Lisam. Your lab assistant will grade your labs within 5 working days.

Feedback: For each lab, there are a number of scheduled hours where you can get feedback on your work from the lab assistants. Unless you submit late, you will also get written feedback. In addition, you can always get feedback from the examiner by booking an appointment.

Technical information

This course uses Jupyter notebooks for the lab assignments. To work on these notebooks, you can use the course’s lab environment, set things up on your personal computer, or use an external service such as Colab.

Using the lab environment. The course’s lab environment is available on computers connected to LiU’s Linux system, including those in the B-building computer labs. To activate the environment and start the Jupyter server, issue the following commands at the terminal prompt. This should open your web browser, where you can select the relevant notebook file.

source /courses/TDP030/venv/bin/activate
jupyter notebook

Topic 0: Text segmentation

Text segmentation is the task of segmenting a text into linguistically meaningful units, such as paragraphs, sentences, or words.

Level A

When the target units of text segmentation are words or word-like units, text segmentation is called tokenisation. In this lab you will implement a simple tokeniser for text extracted from Wikipedia articles. The lab also gives you an opportunity to acquaint yourself with the technical framework that we will be using for the remainder of the lab series.

Lab L0: Text segmentation (due 2023-01-23)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

tokenisation
undersegmentation, oversegmentation
precision, recall

After this lab you should be able to perform the following procedures:

segment text into tokens (words and word-like units) using regular expressions
compare an automatic tokenisation with a gold standard
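
As an illustration of the two procedures above, here is a minimal sketch of a regular-expression tokeniser together with a precision/recall comparison against a gold standard. The pattern, the function names, and the example sentence are assumptions made for this illustration and are not taken from the lab notebook. Tokens are represented as character spans so that the two segmentations can be compared directly.

```python
import re

# Hypothetical pattern: a run of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise(text):
    """Return the (start, end) character spans of the tokens in text."""
    return [match.span() for match in TOKEN_PATTERN.finditer(text)]


def precision_recall(predicted, gold):
    """Compare predicted token spans with gold-standard token spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


text = "Dr. Smith isn't here."
predicted = tokenise(text)
# Hand-made gold standard that keeps "Dr." and "isn't" as single tokens.
gold = [(0, 3), (4, 9), (10, 15), (16, 20), (20, 21)]
print(precision_recall(predicted, gold))  # (0.375, 0.6)
```

The simple pattern oversegments "Dr." and "isn't", so it proposes eight tokens where the gold standard has five; only three spans agree, which lowers both precision and recall.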

Topic 1: Text classification

Text classification is the task of categorising text documents into predefined classes.

Level A

In this lab you will evaluate a text classifier using accuracy, precision, and recall; compare a trained classifier to a simple baseline; and implement the Naive Bayes classification rule. The concrete task that you will be working with is to classify speeches from the Swedish parliament as either right-wing or left-wing.

Lab L1: Text classification (Level A) (due 2023-01-30)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

accuracy, precision, recall
baseline
Naive Bayes classifier

After this lab you should be able to perform the following procedures:

evaluate a text classifier based on accuracy, precision, and recall
compare a text classifier to a baseline
apply the classification rule of the Naive Bayes classifier to a text
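
The procedures above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: the class labels, the toy probabilities, and all function names are hypothetical. The classification rule picks the class that maximises the log prior plus the sum of the log likelihoods of the words in the document; words unknown to the model are skipped here, which is one possible convention.

```python
import math
from collections import Counter

# Hypothetical toy parameters; in practice they are estimated from data.
LOG_PRIORS = {"L": math.log(0.5), "R": math.log(0.5)}
LOG_LIKELIHOODS = {
    "L": {"welfare": math.log(0.6), "tax": math.log(0.4)},
    "R": {"welfare": math.log(0.3), "tax": math.log(0.7)},
}


def predict(tokens, log_priors, log_likelihoods):
    """Naive Bayes rule: return the class with the highest log score."""
    scores = {}
    for c, log_prior in log_priors.items():
        known = (t for t in tokens if t in log_likelihoods[c])
        scores[c] = log_prior + sum(log_likelihoods[c][t] for t in known)
    return max(scores, key=scores.get)


def accuracy(gold, predicted):
    """Fraction of documents whose predicted class equals the gold class."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall(gold, predicted, target):
    """Precision and recall with respect to one target class."""
    tp = sum(g == target and p == target for g, p in zip(gold, predicted))
    n_predicted = predicted.count(target)
    n_gold = gold.count(target)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


def most_frequent_class(train_labels):
    """Baseline: always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


gold = ["L", "L", "R", "R"]
predicted = ["L", "R", "R", "R"]
baseline = [most_frequent_class(["R", "R", "L"])] * len(gold)

print(predict(["tax", "tax", "welfare"], LOG_PRIORS, LOG_LIKELIHOODS))  # R
print(accuracy(gold, predicted), accuracy(gold, baseline))  # 0.75 0.5
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.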

Level B

In this lab you will complete your implementation of the Naive Bayes classifier with a training procedure. You will also see how different document representations affect the accuracy of the classifier. The classification task that you will be working with is to classify movie reviews as either positive or negative.

Lab L1X: Text classification (Level B) (due 2023-01-30)
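
A training procedure for Naive Bayes could look roughly like the sketch below. It assumes a bag-of-words document representation and additive smoothing with a constant k; both choices, as well as the function name and the toy reviews, are assumptions made for this illustration and may differ from what the lab asks for.

```python
import math
from collections import Counter, defaultdict


def train(documents, k=1.0):
    """Estimate log priors and smoothed log likelihoods.

    documents is a list of (tokens, label) pairs; k is the (assumed)
    additive smoothing constant.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_documents = sum(class_counts.values())
    log_priors = {c: math.log(n / n_documents) for c, n in class_counts.items()}

    log_likelihoods = {}
    for c in class_counts:
        denominator = sum(word_counts[c].values()) + k * len(vocabulary)
        log_likelihoods[c] = {
            w: math.log((word_counts[c][w] + k) / denominator)
            for w in vocabulary
        }
    return log_priors, log_likelihoods


# Hypothetical toy training data.
reviews = [
    (["good", "great"], "pos"),
    (["bad", "awful", "bad"], "neg"),
]
log_priors, log_likelihoods = train(reviews)
print(math.exp(log_likelihoods["pos"]["good"]))  # (1 + 1) / (2 + 4)
```

With smoothing, every word in the vocabulary receives a non-zero probability in every class, so a single unseen word cannot force the score of a class to minus infinity.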

Topic 2: Language modelling

Language modelling is about building models of what words are more or less likely to occur in some language.

Level A

In this lab you will experiment with n-gram models. You will test various parameters that influence these models’ quality and estimate models using maximum likelihood estimation with additive smoothing. The data set that you will be working on is the set of Arthur Conan Doyle’s novels about Sherlock Holmes.

Lab L2: Language modelling (due 2023-02-06)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

n-gram model
entropy
additive smoothing

After this lab you should be able to perform the following procedures:

estimate n-gram probabilities using the Maximum Likelihood method
estimate n-gram probabilities using additive smoothing
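
The two estimation procedures above can be illustrated for bigrams. The sketch below is an assumption-laden example, not the lab's own code: the function names and the toy sentence are hypothetical. With k = 0 the estimate is the maximum likelihood estimate, count(prev, word) / count(prev); with k > 0 it is the additively smoothed estimate, (count(prev, word) + k) / (count(prev) + k · V), where V is the vocabulary size.

```python
from collections import Counter


def count_bigrams(tokens):
    """Return counts of bigram contexts and of bigrams."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_probability(prev, word, contexts, bigrams, vocab_size, k=0.0):
    """Estimate P(word | prev); k=0 gives the maximum likelihood estimate.

    Note: with k=0 the estimate is undefined for an unseen context
    (division by zero).
    """
    numerator = bigrams[(prev, word)] + k
    denominator = contexts[prev] + k * vocab_size
    return numerator / denominator


tokens = "the dog saw the cat".split()
contexts, bigrams = count_bigrams(tokens)
vocab_size = len(set(tokens))

print(bigram_probability("the", "dog", contexts, bigrams, vocab_size))  # 0.5
print(bigram_probability("the", "saw", contexts, bigrams, vocab_size))  # 0.0
print(bigram_probability("the", "saw", contexts, bigrams, vocab_size, k=1.0))
```

The unseen bigram "the saw" has probability zero under maximum likelihood estimation but receives a small non-zero probability once smoothing is applied.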

Level C

In this lab you will implement a simple spelling corrector. The core of the implementation is the Wagner–Fischer algorithm for computing the Levenshtein distance between two input words. The data set that you will be working with is the same collection of Sherlock Holmes novels as for the Level A lab.

Lab L2X: Spelling correction (due 2023-02-06)
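
For orientation, the Wagner–Fischer algorithm fills a table in which cell (i, j) holds the Levenshtein distance between the first i characters of one word and the first j characters of the other. The sketch below assumes unit costs for insertion, deletion, and substitution; the lab may use different costs. The `correct` helper and its toy vocabulary are hypothetical and only show how the distance could be used to pick a correction.

```python
def levenshtein(source, target):
    """Levenshtein distance via the Wagner-Fischer dynamic programme."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


def correct(word, vocabulary):
    """Return the vocabulary word closest to word (first one on ties)."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


print(levenshtein("kitten", "sitting"))                   # 3
print(correct("holmse", ["holmes", "watson", "london"]))  # holmes
```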

Topic 3: Part-of-speech tagging

Part-of-speech tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb.

Level A

In this lab you will experiment with POS taggers trained on the Stockholm Umeå Corpus (SUC), a Swedish corpus containing more than 74,000 sentences (1.1 million tokens), which were manually annotated with, among other things, parts of speech.

Lab L3: Part-of-speech tagging (due 2023-02-13)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

part-of-speech tagging as a sequence labelling task
accuracy, precision, recall
confusion matrix, error analysis
baseline

After this lab you should be able to perform the following procedures:

evaluate a part-of-speech tagger based on accuracy, precision, and recall
establish a baseline for a part-of-speech tagger
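
One common way to establish a baseline for a part-of-speech tagger is to assign each word the tag it most often carries in the training data, and to fall back on the overall most frequent tag for unknown words. The sketch below illustrates this idea and the computation of accuracy; whether the lab uses exactly this baseline is an assumption, and the function names and toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag per word and the overall default tag."""
    tag_counts = Counter()
    word_tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[word][tag] += 1
    default_tag = tag_counts.most_common(1)[0][0]
    best_tag = {
        word: counts.most_common(1)[0][0]
        for word, counts in word_tag_counts.items()
    }
    return best_tag, default_tag


def tag(words, best_tag, default_tag):
    """Tag each word with its most frequent tag, or the default tag."""
    return [best_tag.get(word, default_tag) for word in words]


def accuracy(gold_sentences, best_tag, default_tag):
    """Fraction of tokens that receive the gold-standard tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        gold_tags = [gold_tag for _, gold_tag in sentence]
        predicted = tag(words, best_tag, default_tag)
        correct += sum(g == p for g, p in zip(gold_tags, predicted))
        total += len(sentence)
    return correct / total


# Hypothetical toy data.
train = [
    [("the", "DT"), ("dog", "NN"), ("chases", "VB"), ("cats", "NN")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
test = [[("the", "DT"), ("bird", "NN"), ("sleeps", "VB"), ("quietly", "AB")]]

best_tag, default_tag = train_baseline(train)
print(accuracy(test, best_tag, default_tag))  # 0.75
```

In the toy test sentence, the unknown word "bird" happens to be tagged correctly by the default tag, while the unknown adverb "quietly" is not.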

Level B

In the advanced part of this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case the part-of-speech tagger that you worked with in the Level A lab.

Lab L3X: Feature engineering for part-of-speech tagging (due 2023-02-13)

Topic 4: Syntactic analysis

Syntactic analysis is the task of mapping a sentence to a formal representation of its syntactic structure.

Level A

In this lab you will experiment with MaltParser, a standard tool for syntactic analysis. You will learn how to train MaltParser on treebank data, write code to evaluate the trained parser using standard evaluation measures, and reflect on how these evaluation measures change when we use automatically predicted tags instead of gold-standard tags for training.

Lab L4: Syntactic analysis (due 2023-02-20)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

dependency tree, dependency parsing
attachment score, exact match

After this lab you should be able to perform the following procedures:

train a state-of-the-art dependency parser on treebank data
evaluate a dependency parser based on attachment score and exact match
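
The two evaluation measures can be sketched as follows, assuming that each sentence is represented as a list of head positions (one per token, with 0 for the root). This representation, the function names, and the toy trees are assumptions for the illustration. The sketch computes the unlabelled attachment score, that is, the fraction of tokens with the correct head, and exact match, that is, the fraction of sentences in which every head is correct.

```python
def attachment_score(gold, predicted):
    """Unlabelled attachment score: fraction of tokens with correct head."""
    correct = total = 0
    for gold_heads, predicted_heads in zip(gold, predicted):
        correct += sum(g == p for g, p in zip(gold_heads, predicted_heads))
        total += len(gold_heads)
    return correct / total


def exact_match(gold, predicted):
    """Fraction of sentences whose predicted heads are all correct."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Hypothetical toy data: heads[i] is the head of token i + 1; 0 is the root.
gold = [[2, 0], [2, 0, 2]]
predicted = [[2, 0], [2, 0, 1]]

print(attachment_score(gold, predicted))  # 0.8
print(exact_match(gold, predicted))       # 0.5
```

A labelled attachment score would additionally require the dependency label of each token to be correct.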

Level C

In this lab you will use two freely available NLP tools, Stagger and MaltParser, to implement a simple system for information extraction.

Lab L4X: Information extraction (due 2023-02-20)

Topic 5: Semantic analysis

These labs focus on word space models and semantic similarity.

Level A

In this lab you will explore a word space model which was trained on the Swedish Wikipedia using Google’s word2vec tool. You will learn how to use the model to measure the semantic similarity between words and apply it to solve a simple word analogy task.

Lab L5: Semantic analysis (due 2023-03-27)

Learning objectives

After this lab you should be able to explain and apply the following concepts:

word space model, cosine distance, semantic similarity
accuracy

After this lab you should be able to perform the following procedures:

use a pre-trained word space model to measure the semantic similarity between two words
use a pre-trained word space model to solve word analogy tasks
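
Both procedures rest on comparing word vectors. The sketch below uses tiny hand-made two-dimensional vectors instead of a pre-trained model, so the vocabulary and the numbers are purely hypothetical. It measures similarity with cosine similarity (cosine distance is one minus this value) and solves an analogy "a is to b as c is to ?" by looking for the word whose vector is closest to b − a + c, which is a common approach but not necessarily the one prescribed in the lab.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vectors[b] - vectors[a] + vectors[c]."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(
        candidates,
        key=lambda word: cosine_similarity(vectors[word], target),
    )


# Hypothetical toy vectors; a real model has hundreds of dimensions.
vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

print(cosine_similarity(vectors["man"], vectors["king"]))  # 1.0
print(solve_analogy("man", "woman", "king", vectors))      # queen
```

The three query words are excluded from the candidates, since otherwise one of them is often the nearest neighbour of the target vector.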

Level B

In this lab you will use state-of-the-art NLP libraries to train word space models on text data and evaluate them on a standard task, the synonym part of the Test of English as a Foreign Language (TOEFL).

Lab L5X: Semantic analysis (due 2023-02-27)

Page responsible: Marco Kuhlmann
Last updated: 2023-01-14

IDA - Department of Computer and Information Science
