TDP030 Language Technology
Welcome to the course! This introductory module consists of a series of lectures that introduce you to language technology as an application area, the content and organisation of the course, and some essentials in linguistics and machine learning.
Detailed information about the course organisation and examination is available on this webpage.
- ambiguity, contextuality, multilinguality, combinatorial explosion
- tokenisation, word tokens, word types, normalisation, stop words
- morpheme, lexeme, lemma
- part-of-speech, constituent, syntactic head, phrase structure tree, dependency tree, treebank
- supervised and unsupervised machine learning
Topic 1: Text classification
Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, learn how to evaluate text classifiers using standard validation methods, and get to know the Naive Bayes classifier, a simple but effective probabilistic model for text classification.
- Slides: Text classification
- Videos: Text classification
- Naive Bayes and Sentiment Classification, chapter 6 in Jurafsky and Martin (2017), Sections 6.1–6.3, 6.6–6.7
- accuracy, precision, recall
- Naive Bayes classifier
- maximum likelihood estimation, additive smoothing
- evaluate a text classifier based on accuracy, precision, and recall
- apply the classification rule of the Naive Bayes classifier to a text
- learn the probabilities of a Naive Bayes classifier using maximum likelihood estimation and additive smoothing
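For a concrete picture of the classification rule and the estimation step, here is a minimal Python sketch of a Naive Bayes text classifier, with maximum likelihood estimation for the class priors and additive smoothing for the word probabilities. The class labels and training documents are invented toy data, not part of the course material.

```python
import math
from collections import Counter, defaultdict

# A minimal Naive Bayes sketch: MLE for class priors, additive smoothing
# for word probabilities, and log-space scoring at prediction time.
# The training documents below are invented toy data.

train = [
    ("pos", "great acting great plot".split()),
    ("pos", "wonderful film".split()),
    ("neg", "boring plot awful acting".split()),
]

alpha = 1.0                         # additive (Laplace) smoothing constant
vocab = {w for _, doc in train for w in doc}
doc_count = Counter(c for c, _ in train)
word_count = defaultdict(Counter)
for c, doc in train:
    word_count[c].update(doc)

def predict(doc):
    best_class, best_score = None, float("-inf")
    for c in doc_count:
        # log prior: MLE estimate of P(c)
        score = math.log(doc_count[c] / len(train))
        total = sum(word_count[c].values())
        for w in doc:
            if w not in vocab:      # ignore unknown words
                continue
            # smoothed estimate of P(w | c)
            p = (word_count[c][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict("great film".split()))   # expected: 'pos'
```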
In the advanced section you will be introduced to the multi-class perceptron classifier and how to implement it.
- Slides: Text classification (material on multi-class perceptrons)
- Videos: Text classification (advanced material)
- averaged perceptron classifier (advanced)
- apply the classification rule of the averaged perceptron classifier to a text (advanced)
- learn an averaged perceptron classifier from data (advanced)
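As a rough illustration of the advanced material, the sketch below trains a multi-class perceptron on bag-of-words features and averages the weights over all update steps. The toy training data and the fixed number of epochs are assumptions made for illustration; a real implementation would shuffle the data and tune the number of epochs.

```python
from collections import defaultdict

# A minimal multi-class perceptron sketch with weight averaging.
# Features are bag-of-words counts; the training data is invented toy data.

train = [
    ("pos", "great acting great plot".split()),
    ("pos", "wonderful film".split()),
    ("neg", "boring plot awful acting".split()),
]
classes = ["pos", "neg"]

weights = {c: defaultdict(float) for c in classes}   # current weights
totals = {c: defaultdict(float) for c in classes}    # summed weights for averaging
steps = 0

def score(w, doc):
    return sum(w.get(word, 0.0) for word in doc)

for epoch in range(5):
    for gold, doc in train:
        steps += 1
        # predict with the current weights
        guess = max(classes, key=lambda c: score(weights[c], doc))
        if guess != gold:
            # standard perceptron update: promote the gold class, demote the guess
            for word in doc:
                weights[gold][word] += 1.0
                weights[guess][word] -= 1.0
        # accumulate the weights after every example for the averaged perceptron
        for c in classes:
            for word, value in weights[c].items():
                totals[c][word] += value

# final (averaged) weights
averaged = {c: {w: v / steps for w, v in totals[c].items()} for c in classes}

doc = "great film".split()
print(max(classes, key=lambda c: score(averaged[c], doc)))   # expected: 'pos'
```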
Topic 2: Language modelling
Language modelling is about building models of which words are more or less likely to occur in a language. This section focuses on n-gram models, which have a wide range of applications, such as predictive text input, automatic spelling correction, and machine translation. You will be introduced to techniques for training n-gram models on data and for evaluating them. The final part of the section introduces edit distance, which in combination with a language model can be used for automatic spelling correction.
- Slides: Language Modelling
- Videos: Language Modelling
- Language Modeling with N-grams, chapter 4 in Jurafsky and Martin (2017), Sections 4.1–4.2, 4.4.1–4.4.2
- n-gram model
- additive smoothing
- perplexity, entropy
- learn an n-gram model using additive smoothing
- evaluate an n-gram model using entropy
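The sketch below estimates a bigram model with additive smoothing from a tiny invented corpus and evaluates it by per-token cross-entropy on a test sentence (perplexity is then 2 raised to the entropy). The corpus, the smoothing constant, and the sentence markers are illustrative assumptions.

```python
import math
from collections import Counter

# A minimal bigram-model sketch: counts from a toy corpus, additive smoothing,
# and evaluation by per-token cross-entropy (perplexity = 2 ** entropy).

corpus = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
]

alpha = 0.1
vocab = {w for sent in corpus for w in sent}
unigrams = Counter(w for sent in corpus for w in sent[:-1])   # context counts
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def prob(context, word):
    # additively smoothed conditional probability P(word | context)
    return (bigrams[(context, word)] + alpha) / (unigrams[context] + alpha * len(vocab))

def entropy(sent):
    # average negative log2 probability per predicted token
    log_prob = sum(math.log2(prob(sent[i], sent[i + 1])) for i in range(len(sent) - 1))
    return -log_prob / (len(sent) - 1)

test = "<s> the cat sat </s>".split()
print(entropy(test))          # cross-entropy in bits per token
print(2 ** entropy(test))     # perplexity
```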
The advanced material for this section is the Wagner–Fischer algorithm for computing the Levenshtein distance between two words.
- Slides: Language Modelling (material on the Wagner–Fischer algorithm)
- Videos: Language Modelling (advanced material)
- Levenshtein distance
- Wagner–Fischer algorithm
- simulate the Wagner–Fischer algorithm
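Here is a minimal sketch of the Wagner–Fischer dynamic programming algorithm, assuming unit costs for insertions, deletions, and substitutions.

```python
def levenshtein(source, target):
    """Wagner–Fischer dynamic programming for the Levenshtein distance
    (unit costs for insertion, deletion, and substitution)."""
    m, n = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                     # delete all of source[:i]
    for j in range(1, n + 1):
        d[0][j] = j                     # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + substitution)
    return d[m][n]

print(levenshtein("kitten", "sitting"))   # 3
```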
Topic 3: Part-of-speech tagging
A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech, such as noun, adjective, or verb. In this section you will learn how to evaluate part-of-speech taggers, and be introduced to two methods for part-of-speech tagging: hidden Markov models (which generalise the Markov models that you encountered in the section on language modelling) and the multi-class perceptron.
- Slides: Part-of-Speech Tagging
- Videos: Part-of-Speech Tagging
- Part-of-Speech Tagging, chapter 10 in Jurafsky and Martin (2017), Sections 10.1–10.4
- part of speech, part-of-speech tagger
- accuracy, precision, recall
- hidden Markov model
- multi-class perceptron, feature window
- evaluate a part-of-speech tagger based on accuracy, precision, and recall
- compute the probability of a tagged sentence in a hidden Markov model
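To illustrate how the probability of a tagged sentence is computed in a hidden Markov model, here is a small sketch in which the probability is the product of transition probabilities P(tag | previous tag) and emission probabilities P(word | tag); all numbers are invented toy values, not estimates from a real treebank.

```python
import math

# A minimal sketch of computing P(words, tags) in a hidden Markov model.
# All probabilities below are invented toy numbers.

transitions = {          # P(tag | previous tag), with <s> and </s> as boundary symbols
    ("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4, ("VB", "</s>"): 0.5,
}
emissions = {            # P(word | tag)
    ("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02,
}

def tagged_sentence_prob(words, tags):
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        prob *= transitions[(prev, tag)] * emissions[(tag, word)]
        prev = tag
    prob *= transitions[(prev, "</s>")]       # transition to the end symbol
    return prob

p = tagged_sentence_prob(["the", "dog", "barks"], ["DT", "NN", "VB"])
print(p, math.log(p))
```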
The advanced material for this section is the Viterbi algorithm for computing the most probable tag sequence for a sentence under a hidden Markov model.
- Slides: Part-of-Speech Tagging (material on the Viterbi algorithm)
- Videos: Part-of-Speech Tagging (advanced material)
- Viterbi algorithm
- simulate the Viterbi algorithm
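Below is a minimal sketch of the Viterbi algorithm over the same kind of toy hidden Markov model (invented transition and emission probabilities, no explicit end-of-sentence transition); it returns the most probable tag sequence for a sentence.

```python
# A minimal Viterbi sketch: dynamic programming over tag sequences in an HMM.
# Transition and emission probabilities are invented toy numbers.

tags = ["DT", "NN", "VB"]
transitions = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.3, ("<s>", "VB"): 0.1,
               ("DT", "NN"): 0.7, ("DT", "DT"): 0.1, ("DT", "VB"): 0.2,
               ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
               ("VB", "DT"): 0.4, ("VB", "NN"): 0.4, ("VB", "VB"): 0.2}
emissions = {("DT", "the"): 0.5, ("NN", "dog"): 0.4, ("VB", "dog"): 0.1,
             ("NN", "barks"): 0.1, ("VB", "barks"): 0.6, ("DT", "dog"): 0.01,
             ("DT", "barks"): 0.01, ("NN", "the"): 0.01, ("VB", "the"): 0.01}

def viterbi(words):
    # best[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: transitions[("<s>", t)] * emissions[(t, words[0])] for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * transitions[(p, t)] * emissions[(t, words[i])])
                 for p in tags),
                key=lambda item: item[1])
            best[i][t] = score
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    sequence = [last]
    for i in range(len(words) - 1, 0, -1):
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))

print(viterbi(["the", "dog", "barks"]))   # expected: ['DT', 'NN', 'VB']
```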
Topic 4: Syntactic analysis
Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure. In this lecture you will learn about parsing to two target representations: phrase structure trees and dependency trees. The central model for parsing to phrase structure trees is the probabilistic context-free grammar. For parsing to dependency trees, you will learn about transition-based dependency parsing, an approach that is also used by Google.
- phrase structure tree, dependency tree
- probabilistic context-free grammar
- transition-based dependency parser
- learn a probabilistic context-free grammar from a treebank
- simulate a transition-based dependency parser
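The following sketch illustrates how the rule probabilities of a probabilistic context-free grammar can be estimated from a treebank by relative-frequency counting; the two trees below form a tiny invented treebank, written as nested tuples.

```python
from collections import Counter

# A minimal sketch of estimating a PCFG from a treebank: count each rule
# occurrence and divide by the total count of its left-hand side.
# The treebank below contains two tiny invented phrase structure trees,
# written as nested tuples (label, child_1, ..., child_n).

treebank = [
    ("S", ("NP", ("PRON", "she")), ("VP", ("VB", "sleeps"))),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barks"))),
]

rule_count = Counter()
lhs_count = Counter()

def collect(node):
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rhs = (children[0],)                          # lexical rule, e.g. NN -> dog
    else:
        rhs = tuple(child[0] for child in children)   # e.g. S -> NP VP
        for child in children:
            collect(child)
    rule_count[(label, rhs)] += 1
    lhs_count[label] += 1

for tree in treebank:
    collect(tree)

# relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
for (lhs, rhs), count in sorted(rule_count.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {count / lhs_count[lhs]:.2f}")
```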
Topic 5: Semantic analysis
In this lecture you will learn about word senses and the problems they pose for language technology, as well as about two important tasks in semantic analysis: word sense disambiguation and word similarity. For each task you will learn about both knowledge-based and data-driven methods, including the popular continuous bag-of-words model used in Google’s word2vec software.
- Slides: Semantic Analysis
- Computing with Word Senses, chapter 17 in Jurafsky and Martin (2017) (excluding 17.7–17.9)
- word sense, homonymy, polysemy
- synonymy, antonymy, hyponymy, hypernymy, WordNet
- Simplified Lesk algorithm
- word similarity, distributional hypothesis, co-occurrence matrix
- simulate the Simplified Lesk algorithm
- compute the path length-based similarity of two words
- derive a co-occurrence matrix from a document collection
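To make the Simplified Lesk algorithm concrete, here is a small sketch that chooses the sense whose gloss shares the most (non-stop) words with the sentence context. The two senses of “bank” and their glosses are an invented stand-in for a real dictionary such as WordNet.

```python
# A minimal sketch of the Simplified Lesk algorithm: choose the sense whose
# gloss overlaps most with the context words (excluding the target word and
# a few stop words). The sense inventory below is invented toy data.

senses = {
    "bank#1": "sloping land beside a body of water such as a river",
    "bank#2": "a financial institution that accepts deposits and lends money",
}
stop_words = {"a", "the", "of", "and", "that", "was", "beside", "such", "as"}

def simplified_lesk(target, sentence):
    context = {w for w in sentence.lower().split()
               if w != target and w not in stop_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        signature = {w for w in gloss.lower().split() if w not in stop_words}
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "they pulled the canoe up on the bank of the river"))
# expected: 'bank#1' (overlap on 'river')
```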
Page responsible: Marco Kuhlmann
Last updated: 2017-09-22