TDP030 Language Technology
Welcome to the course! This introductory module consists of a series of lectures that introduce you to language technology as an application area, the content and organisation of the course, and some essentials in linguistics and machine learning.
Detailed information about the course organisation and examination is available on this webpage.
- ambiguity, contextuality, multilinguality, combinatorial explosion
- tokenisation, word tokens, word types, normalisation, stop words
- morpheme, lexeme, lemma
- part-of-speech, constituent, syntactic head, phrase structure tree, dependency tree, treebank
- supervised and unsupervised machine learning
Topic 1: Text classification
Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, learn how to evaluate text classifiers using standard validation methods, and get to know the Naive Bayes classifier, a simple but effective probabilistic model for text classification.
- Slides: Text classification
- Videos: Text classification
- Naive Bayes and Sentiment Classification, chapter 6 in Jurafsky and Martin (2017), Sections 6.1–6.3, 6.6–6.7
- accuracy, precision, recall
- Naive Bayes classifier
- maximum likelihood estimation, additive smoothing
- evaluate a text classifier based on accuracy, precision, and recall
- apply the classification rule of the Naive Bayes classifier to a text
- learn the probabilities of a Naive Bayes classifier using maximum likelihood estimation and additive smoothing
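For a concrete picture of the classification rule and the estimation step, here is a minimal Python sketch of a Naive Bayes text classifier, with maximum likelihood estimation for the class priors and additive smoothing for the word probabilities. The class labels and training documents are invented toy data, not part of the course material.

```python
import math
from collections import Counter, defaultdict

# A minimal Naive Bayes sketch: MLE for class priors, additive smoothing
# for word probabilities, and log-space scoring at prediction time.
# The training documents below are invented toy data.

train = [
    ("pos", "great acting great plot".split()),
    ("pos", "wonderful film".split()),
    ("neg", "boring plot awful acting".split()),
]

alpha = 1.0                         # additive (Laplace) smoothing constant
vocab = {w for _, doc in train for w in doc}
doc_count = Counter(c for c, _ in train)
word_count = defaultdict(Counter)
for c, doc in train:
    word_count[c].update(doc)

def predict(doc):
    best_class, best_score = None, float("-inf")
    for c in doc_count:
        # log prior: MLE estimate of P(c)
        score = math.log(doc_count[c] / len(train))
        total = sum(word_count[c].values())
        for w in doc:
            if w not in vocab:      # ignore unknown words
                continue
            # smoothed estimate of P(w | c)
            p = (word_count[c][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict("great film".split()))   # expected: 'pos'
```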
In the advanced section you will be introduced to the multi-class perceptron classifier and how to implement it.
- Slides: Text classification (material on multi-class perceptrons)
- Videos: Text classification (advanced material)
- averaged perceptron classifier (advanced)
- apply the classification rule of the averaged perceptron classifier to a text (advanced)
- learn an averaged perceptron classifier from data (advanced)
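As a rough illustration of the advanced material, the sketch below trains a multi-class perceptron on bag-of-words features and averages the weights over all update steps. The toy training data and the fixed number of epochs are assumptions made for illustration; a real implementation would shuffle the data and tune the number of epochs.

```python
from collections import defaultdict

# A minimal multi-class perceptron sketch with weight averaging.
# Features are bag-of-words counts; the training data is invented toy data.

train = [
    ("pos", "great acting great plot".split()),
    ("pos", "wonderful film".split()),
    ("neg", "boring plot awful acting".split()),
]
classes = ["pos", "neg"]

weights = {c: defaultdict(float) for c in classes}   # current weights
totals = {c: defaultdict(float) for c in classes}    # summed weights for averaging
steps = 0

def score(w, doc):
    return sum(w.get(word, 0.0) for word in doc)

for epoch in range(5):
    for gold, doc in train:
        steps += 1
        # predict with the current weights
        guess = max(classes, key=lambda c: score(weights[c], doc))
        if guess != gold:
            # standard perceptron update: promote the gold class, demote the guess
            for word in doc:
                weights[gold][word] += 1.0
                weights[guess][word] -= 1.0
        # accumulate the weights after every example for the averaged perceptron
        for c in classes:
            for word, value in weights[c].items():
                totals[c][word] += value

# final (averaged) weights
averaged = {c: {w: v / steps for w, v in totals[c].items()} for c in classes}

doc = "great film".split()
print(max(classes, key=lambda c: score(averaged[c], doc)))   # expected: 'pos'
```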
Topic 2: Language modelling
Language modelling is about building models of which words are more or less likely to occur in a language. This section focuses on n-gram models, which have a wide range of applications, such as predictive text input, automatic spelling correction, and machine translation. You will be introduced to techniques for training n-gram models on data and for evaluating them. The final part of the section introduces edit distance, which in combination with a language model can be used for automatic spelling correction.
- Slides: Language Modelling
- Videos: Language Modelling
- Language Modeling with N-grams, chapter 4 in Jurafsky and Martin (2017), Sections 4.1–4.2, 4.4.1–4.4.2
- n-gram model
- additive smoothing
- perplexity, entropy
- learn an n-gram model using additive smoothing
- evaluate an n-gram model using entropy
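The sketch below estimates a bigram model with additive smoothing from a tiny invented corpus and evaluates it by per-token cross-entropy on a test sentence (perplexity is then 2 raised to the entropy). The corpus, the smoothing constant, and the sentence markers are illustrative assumptions.

```python
import math
from collections import Counter

# A minimal bigram-model sketch: counts from a toy corpus, additive smoothing,
# and evaluation by per-token cross-entropy (perplexity = 2 ** entropy).

corpus = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
]

alpha = 0.1
vocab = {w for sent in corpus for w in sent}
unigrams = Counter(w for sent in corpus for w in sent[:-1])   # context counts
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def prob(context, word):
    # additively smoothed conditional probability P(word | context)
    return (bigrams[(context, word)] + alpha) / (unigrams[context] + alpha * len(vocab))

def entropy(sent):
    # average negative log2 probability per predicted token
    log_prob = sum(math.log2(prob(sent[i], sent[i + 1])) for i in range(len(sent) - 1))
    return -log_prob / (len(sent) - 1)

test = "<s> the cat sat </s>".split()
print(entropy(test))          # cross-entropy in bits per token
print(2 ** entropy(test))     # perplexity
```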
The advanced material for this section is the Wagner–Fischer algorithm for computing the Levenshtein distance between two words.
- Slides: Language Modelling (material on the Wagner–Fischer algorithm)
- Videos: Language Modelling (advanced material)
- Levenshtein distance
- Wagner–Fischer algorithm
- simulate the Wagner–Fischer algorithm
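Here is a minimal sketch of the Wagner–Fischer dynamic programming algorithm, assuming unit costs for insertions, deletions, and substitutions.

```python
def levenshtein(source, target):
    """Wagner–Fischer dynamic programming for the Levenshtein distance
    (unit costs for insertion, deletion, and substitution)."""
    m, n = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                     # delete all of source[:i]
    for j in range(1, n + 1):
        d[0][j] = j                     # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + substitution)
    return d[m][n]

print(levenshtein("kitten", "sitting"))   # 3
```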
Topic 3: Part-of-speech tagging
A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech, such as noun, adjective, or verb. In this section you will learn how to evaluate part-of-speech taggers, and be introduced to two methods for part-of-speech tagging: hidden Markov models (which generalise the Markov models that you encountered in the section on language modelling) and the multi-class perceptron.
- Slides: Part-of-Speech Tagging
- Videos: Part-of-Speech Tagging
- Part-of-Speech Tagging, chapter 10 in Jurafsky and Martin (2017), Sections 10.1–10.4
- part of speech, part-of-speech tagger
- accuracy, precision, recall
- hidden Markov model
- multi-class perceptron, feature window
- evaluate a part-of-speech tagger based on accuracy, precision, and recall
- compute the probability of a tagged sentence in a hidden Markov model
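To illustrate how the probability of a tagged sentence is computed in a hidden Markov model, here is a small sketch in which the probability is the product of transition probabilities P(tag | previous tag) and emission probabilities P(word | tag); all numbers are invented toy values, not estimates from a real treebank.

```python
import math

# A minimal sketch of computing P(words, tags) in a hidden Markov model.
# All probabilities below are invented toy numbers.

transitions = {          # P(tag | previous tag), with <s> and </s> as boundary symbols
    ("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4, ("VB", "</s>"): 0.5,
}
emissions = {            # P(word | tag)
    ("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02,
}

def tagged_sentence_prob(words, tags):
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        prob *= transitions[(prev, tag)] * emissions[(tag, word)]
        prev = tag
    prob *= transitions[(prev, "</s>")]       # transition to the end symbol
    return prob

p = tagged_sentence_prob(["the", "dog", "barks"], ["DT", "NN", "VB"])
print(p, math.log(p))
```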
The advanced material for this section is the Viterbi algorithm for computing the most probable tag sequence for a sentence under a hidden Markov model.
- Slides: Part-of-Speech Tagging (material on the Viterbi algorithm)
- Videos: Part-of-Speech Tagging (advanced material)
- Viterbi algorithm
- simulate the Viterbi algorithm
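Below is a minimal sketch of the Viterbi algorithm over the same kind of toy hidden Markov model (invented transition and emission probabilities, no explicit end-of-sentence transition); it returns the most probable tag sequence for a sentence.

```python
# A minimal Viterbi sketch: dynamic programming over tag sequences in an HMM.
# Transition and emission probabilities are invented toy numbers.

tags = ["DT", "NN", "VB"]
transitions = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.3, ("<s>", "VB"): 0.1,
               ("DT", "NN"): 0.7, ("DT", "DT"): 0.1, ("DT", "VB"): 0.2,
               ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
               ("VB", "DT"): 0.4, ("VB", "NN"): 0.4, ("VB", "VB"): 0.2}
emissions = {("DT", "the"): 0.5, ("NN", "dog"): 0.4, ("VB", "dog"): 0.1,
             ("NN", "barks"): 0.1, ("VB", "barks"): 0.6, ("DT", "dog"): 0.01,
             ("DT", "barks"): 0.01, ("NN", "the"): 0.01, ("VB", "the"): 0.01}

def viterbi(words):
    # best[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: transitions[("<s>", t)] * emissions[(t, words[0])] for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * transitions[(p, t)] * emissions[(t, words[i])])
                 for p in tags),
                key=lambda item: item[1])
            best[i][t] = score
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    sequence = [last]
    for i in range(len(words) - 1, 0, -1):
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))

print(viterbi(["the", "dog", "barks"]))   # expected: ['DT', 'NN', 'VB']
```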
Topic 4: Syntactic analysis
Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure. In this lecture you will learn about parsing to two target representations: phrase structure trees and dependency trees. The central model for parsing to phrase structure trees is the probabilistic context-free grammar. For parsing to dependency trees, you will learn about transition-based dependency parsing, an approach that is also used by Google.
- phrase structure tree, dependency tree
- probabilistic context-free grammar
- transition-based dependency parser
- learn a probabilistic context-free grammar from a treebank
- simulate a transition-based dependency parser
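The following sketch illustrates how the rule probabilities of a probabilistic context-free grammar can be estimated from a treebank by relative-frequency counting; the two trees below form a tiny invented treebank, written as nested tuples.

```python
from collections import Counter

# A minimal sketch of estimating a PCFG from a treebank: count each rule
# occurrence and divide by the total count of its left-hand side.
# The treebank below contains two tiny invented phrase structure trees,
# written as nested tuples (label, child_1, ..., child_n).

treebank = [
    ("S", ("NP", ("PRON", "she")), ("VP", ("VB", "sleeps"))),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barks"))),
]

rule_count = Counter()
lhs_count = Counter()

def collect(node):
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rhs = (children[0],)                          # lexical rule, e.g. NN -> dog
    else:
        rhs = tuple(child[0] for child in children)   # e.g. S -> NP VP
        for child in children:
            collect(child)
    rule_count[(label, rhs)] += 1
    lhs_count[label] += 1

for tree in treebank:
    collect(tree)

# relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
for (lhs, rhs), count in sorted(rule_count.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {count / lhs_count[lhs]:.2f}")
```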
Topic 5: Semantic analysis
In this lecture you will learn about word senses and the problems they pose for language technology, as well as about two important tasks in semantic analysis: word sense disambiguation and word similarity. For each task you will learn about both knowledge-based and data-driven methods, including the popular continuous bag-of-words model used in Google’s word2vec software.
- Slides: Semantic Analysis
- Computing with Word Senses, chapter 17 in Jurafsky and Martin (2017) (excluding 17.7–17.9)
- word sense, homonymy, polysemy
- synonymy, antonymy, hyponymy, hypernymy, WordNet
- Simplified Lesk algorithm
- word similarity, distributional hypothesis, co-occurrence matrix
- simulate the Simplified Lesk algorithm
- compute the path length-based similarity of two words
- derive a co-occurrence matrix from a document collection
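To make the Simplified Lesk algorithm concrete, here is a small sketch that chooses the sense whose gloss shares the most (non-stop) words with the sentence context. The two senses of “bank” and their glosses are an invented stand-in for a real dictionary such as WordNet.

```python
# A minimal sketch of the Simplified Lesk algorithm: choose the sense whose
# gloss overlaps most with the context words (excluding the target word and
# a few stop words). The sense inventory below is invented toy data.

senses = {
    "bank#1": "sloping land beside a body of water such as a river",
    "bank#2": "a financial institution that accepts deposits and lends money",
}
stop_words = {"a", "the", "of", "and", "that", "was", "beside", "such", "as"}

def simplified_lesk(target, sentence):
    context = {w for w in sentence.lower().split()
               if w != target and w not in stop_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        signature = {w for w in gloss.lower().split() if w not in stop_words}
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "they pulled the canoe up on the bank of the river"))
# expected: 'bank#1' (overlap on 'river')
```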
Page responsible: Marco Kuhlmann
Last updated: 2017-09-22