This page links to the video lectures and to the study materials for the interactive sessions (notebooks and additional reading), and lists the central concepts, models, and algorithms that you are expected to master after each unit.
Course introduction
Welcome to the course! This unit introduces you to natural language processing and to written language as a type of data, presents the course logistics, and reviews basic concepts from linguistics and machine learning. You will also learn how to implement a simple sentiment classifier based on the bag-of-words representation and softmax regression.
Teaching session
Video lectures (review)
Reading
- Eisenstein (2019), chapter 1 and sections 2.1, 2.5–2.7, 4.3
- Detailed information about the course logistics is available on this website.
Concepts, models, and algorithms
- search and learning, Zipf’s law, Heaps’ law
- lexeme, lemma, part-of-speech, dependency tree, synonym
- tokenization, vocabulary, word token, word type, normalization
- sentiment analysis, bag-of-words
- softmax regression, cross-entropy loss, gradient-based optimization
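The sentiment classifier mentioned in the unit description can be sketched in a few lines: a bag-of-words representation fed into softmax regression, trained with gradient descent on the cross-entropy loss. The toy corpus, learning rate, and epoch count below are invented for illustration; this is a minimal sketch, not the course's reference implementation.

```python
import math
from collections import Counter

# Toy corpus invented for this sketch: (document, label); 0 = negative, 1 = positive.
train_data = [
    ("good great fun", 1),
    ("great acting good plot", 1),
    ("boring bad plot", 0),
    ("bad awful boring", 0),
]

# Vocabulary = the set of word types in the training data.
vocab = sorted({w for doc, _ in train_data for w in doc.split()})

def bow(doc):
    """Map a document to its bag-of-words count vector."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

def softmax(zs):
    m = max(zs)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

K, V = 2, len(vocab)                 # number of classes, vocabulary size
W = [[0.0] * V for _ in range(K)]    # one weight row per class
b = [0.0] * K

def predict_proba(x):
    zs = [sum(W[k][i] * x[i] for i in range(V)) + b[k] for k in range(K)]
    return softmax(zs)

# Gradient descent on the cross-entropy loss; for softmax regression,
# the gradient of the loss w.r.t. the logit z_k is simply p_k - [k == y].
lr = 0.1
for epoch in range(100):
    for doc, y in train_data:
        x = bow(doc)
        p = predict_proba(x)
        for k in range(K):
            err = p[k] - (1.0 if k == y else 0.0)
            for i in range(V):
                W[k][i] -= lr * err * x[i]
            b[k] -= lr * err

def predict(doc):
    p = predict_proba(bow(doc))
    return max(range(K), key=lambda k: p[k])
```

Note that unseen words are simply ignored at prediction time, since the bag-of-words vector is indexed by the training vocabulary.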
Unit 1: Word representations
To process words using neural networks, we need to represent them as vectors of numerical values. In this unit you will learn different methods for learning these representations from data, including the widely-used skip-gram model. The unit also introduces the idea of subword representations, and in particular character-level representations, which can be learned using convolutional neural networks.
Video lectures and quizzes
- 1.1 Introduction to word representations (slides, video, quiz)
- 1.2 Learning word embeddings via singular value decomposition (slides, video, quiz)
- 1.3 Learning word embeddings with neural networks (slides, video, quiz, notebook, live notebook)
- 1.4 The skip-gram model (slides, video, quiz)
- 1.5 Subword models (slides, video, quiz)
- 1.6 Convolutional neural networks (slides, video, quiz)
Reading
Concepts, models, and algorithms
- one-hot vectors, word embeddings, distributional hypothesis, co-occurrence matrix
- truncated singular value decomposition, positive pointwise mutual information
- embedding layers, continuous bag-of-words classifier, representation learning, transfer learning
- skip-gram model, negative sampling
- word piece tokenization, byte pair encoding algorithm, character-level word representations, word dropout
- convolutional neural network, CNN architecture for text classification
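The count-based pipeline from this unit (co-occurrence matrix → positive PMI → truncated SVD) fits in a short sketch. The toy corpus, window size, and embedding dimensionality below are invented for illustration, assuming NumPy is available:

```python
import numpy as np

# Toy corpus invented for this sketch; co-occurrence is counted
# within a symmetric window of size 1.
corpus = [
    "cat sat on mat".split(),
    "dog sat on rug".split(),
    "cat chased dog".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Co-occurrence matrix: C[i, j] = how often word j appears next to word i.
C = np.zeros((V, V))
for sent in corpus:
    for pos, w in enumerate(sent):
        for ctx in sent[max(0, pos - 1):pos] + sent[pos + 1:pos + 2]:
            C[idx[w], idx[ctx]] += 1

# Positive pointwise mutual information:
# PPMI(i, j) = max(0, log P(i, j) / (P(i) P(j))).
total = C.sum()
Pij = C / total
Pi = C.sum(axis=1, keepdims=True) / total
Pj = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(Pij / (Pi * Pj))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# Truncated SVD: keep the top-d dimensions as word embeddings.
d = 2
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :d] * S[:d]    # one d-dimensional vector per word type
```

In practice the window size, the truncation rank d, and variants of the PPMI weighting are all hyperparameters; the values here are only meant to keep the example small.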
Unit 2: Language modelling
Language modelling is the task of predicting which word comes next in a sequence of words. This unit presents two types of language models: n-gram models and neural models, with a focus on models based on recurrent neural networks. You will also learn how these language models can be used to learn more powerful, contextualized word representations.
Lectures
- 2.1 Introduction to language modelling (slides, video, quiz)
- 2.2 N-gram language models (slides, video, quiz)
- 2.3 Recurrent neural networks (slides, video, quiz, live notebook)
- 2.4 Long Short-Term Memory (LSTM) networks (slides, video, quiz)
- 2.5 Recurrent neural network language models (slides, video, quiz, live notebook)
- 2.6 Contextualized word embeddings (slides, video, quiz)
Reading
Concepts, models, and algorithms
- language modelling as a prediction task and as a probability model, perplexity
- n-gram language models, maximum likelihood estimation, smoothing, interpolation
- recurrent neural networks, backpropagation through time, encoder/transducer/decoder architectures
- Long Short-Term Memory (LSTM) architecture, gating mechanism
- fixed-window language model, recurrent language model
- polysemy, contextualized word embeddings, bidirectional LSTM, ELMo architecture
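Several of the n-gram concepts above (maximum likelihood estimation, interpolation, perplexity) can be made concrete in a minimal bigram model. The toy corpus and the interpolation weight `lam` are invented for this sketch:

```python
import math
from collections import Counter

# Toy training corpus invented for this sketch,
# with sentence-boundary markers <s> and </s>.
sentences = [
    "<s> the cat sat </s>",
    "<s> the dog sat </s>",
    "<s> the cat ran </s>",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    ws = s.split()
    unigrams.update(ws)                 # unigram counts
    bigrams.update(zip(ws, ws[1:]))     # bigram counts within the sentence
N = sum(unigrams.values())

def p_interp(w, prev, lam=0.8):
    """Interpolated bigram probability:
    lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w)."""
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[w] / N
    return lam * p_bi + (1 - lam) * p_uni

def perplexity(sentence):
    """Perplexity = exp of the average negative log-probability per predicted token."""
    ws = sentence.split()
    logp = sum(math.log(p_interp(w, prev)) for prev, w in zip(ws, ws[1:]))
    return math.exp(-logp / (len(ws) - 1))
```

Interpolating with the unigram distribution is one simple smoothing scheme; it keeps seen-context probabilities high while reserving some mass for bigrams that never occurred in training.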
Unit 3: Large language models
Machine translation is one of the classical problems in artificial intelligence. In this unit you will learn about neural machine translation and one of its standard models, the encoder–decoder architecture. A crucial ingredient in this architecture is the attention mechanism. Attention is also the key to some of the most influential recent developments in NLP, in particular the Transformer architecture, which we cover in the last lectures of this unit.
Lectures
- 3.1 Introduction to machine translation (slides, video, quiz)
- 3.2 Neural machine translation (slides, video, quiz)
- 3.3 Attention (slides, video, quiz)
- 3.4 The Transformer architecture (slides, video, quiz)
- 3.5 Decoder-based language models (GPT) (slides, video, quiz)
- 3.6 Encoder-based language models (BERT) (slides, video, quiz)
Reading
- Eisenstein (2019), chapter 18
Concepts, models, and algorithms
- interlingual machine translation, noisy channel model, word alignments, BLEU score
- encoder–decoder architecture
- recency bias, attention, Bahdanau attention, scaled dot-product attention, multi-head attention
- Transformer architecture, self-attention
- GPT, pre-training and fine-tuning, zero-shot behaviour
- BERT, masked language modelling task, next sentence prediction task
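Scaled dot-product attention, the building block of the Transformer's multi-head self-attention, can be sketched directly from its formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The shapes and random toy inputs below are invented for illustration, assuming NumPy is available:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity scores
    # Softmax over the key dimension, shifted for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights          # weighted sum of values, plus the weights

# Self-attention: queries, keys, and values are all projections
# of the same input sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # a toy "sentence": 4 positions, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Each output position is a convex combination of the value vectors, with the attention weights for that position summing to one. Multi-head attention runs several such maps in parallel with different projections and concatenates the results.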
Unit 4: Sequence labelling
Sequence labelling is the task of assigning a class label to each item in an input sequence. Many tasks in natural language processing can be cast as sequence labelling problems over different sets of output labels, including part-of-speech tagging, word segmentation, and named entity recognition. This unit introduces several models for sequence labelling, with both local and global search.
Lectures
- 4.1 Introduction to sequence labelling (slides, video, quiz)
- 4.2 Sequence labelling with local search (slides, video, quiz)
- 4.3 Part-of-speech tagging with the perceptron (slides, video, quiz)
- 4.4 The perceptron learning algorithm (slides, video, quiz)
- 4.5 Sequence labelling with global search (slides, video, quiz)
- 4.6 The Viterbi algorithm (slides, video, quiz)
Reading
- Eisenstein (2019), chapters 7–8, sections 2.3.1–2.3.2
- Daumé, A Course in Machine Learning, section 4.6 (link)
Concepts, models, and algorithms
- different types of sequence labelling tasks: tagging, segmentation, bracketing; accuracy, precision, recall
- fixed-window model and bidirectional RNN model for sequence labelling, autoregressive models, teacher forcing
- perceptron, features in part-of-speech tagging, feature templates
- perceptron learning algorithm, averaged perceptron
- Maximum Entropy Markov Model (MEMM), label bias problem, Conditional Random Field (CRF)
- Viterbi algorithm, backpointers
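The Viterbi algorithm with backpointers can be sketched compactly. The two-tag model and all probabilities below are made up for this example; the point is the dynamic program, which works the same way over the scores of an HMM, an MEMM, or a CRF:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the highest-probability state sequence for `obs`,
    computed in log-space with backpointers."""
    # Initialise the chart with the first observation.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            best = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (V[-1][best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][o]))
            ptrs[s] = best               # backpointer: where the best path came from
        V.append(scores)
        back.append(ptrs)
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Hypothetical toy tagger: two tags, invented probabilities.
states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"fish": 0.6, "sleep": 0.1, "cats": 0.3},
          "V": {"fish": 0.3, "sleep": 0.6, "cats": 0.1}}
```

The chart has one cell per (position, state) pair, so the runtime is linear in the sequence length and quadratic in the number of states.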
Unit 5: Syntactic analysis
Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure. In this unit you will learn about two approaches to dependency parsing, where the target representations take the form of dependency trees: the Eisner algorithm, which casts dependency parsing as combinatorial optimisation over graphs, and transition-based dependency parsing, which builds the tree incrementally through a sequence of classifier-guided transitions and is the approach behind parsers such as Google's SyntaxNet.
Lectures
- 5.1 Introduction to dependency parsing (slides, video, quiz)
- 5.2 The arc-standard algorithm (slides, video, quiz)
- 5.3 The Eisner algorithm (slides, video, quiz)
- 5.4 Neural architectures for dependency parsing (slides, video, quiz)
- 5.5 Dynamic oracles (slides, video, quiz)
(Note that there is no Lecture 5.6.)
Reading
- Eisenstein (2019), chapter 11
Concepts, models, and algorithms
- dependency tree, head, dependent, graph-based parsing, transition-based parsing
- arc-standard algorithm, projective/non-projective dependency trees, static oracle
- Eisner algorithm, backpointers
- parsing architectures of Chen and Manning (2014), Kiperwasser and Goldberg (2016), Dozat and Manning (2017)
- dynamic oracle, transition cost, arc reachability, arc-hybrid algorithm
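The arc-standard transition system can be sketched as a small state machine over a stack and a buffer, with three transitions: SHIFT, LEFT-ARC, and RIGHT-ARC. The example sentence and its gold transition sequence below are invented for this sketch; in a real parser the transitions are predicted by a classifier rather than given:

```python
def arc_standard(words, transitions):
    """Run a sequence of arc-standard transitions over `words`
    (word 0 is an artificial root) and return the set of (head, dependent) arcs."""
    stack, buffer, arcs = [0], list(range(1, len(words))), set()
    for t in transitions:
        if t == "SH":                  # SHIFT: move the next buffer word to the stack
            stack.append(buffer.pop(0))
        elif t == "LA":                # LEFT-ARC: topmost becomes head of second-topmost
            dep = stack.pop(-2)
            arcs.add((stack[-1], dep))
        elif t == "RA":                # RIGHT-ARC: second-topmost becomes head of topmost
            dep = stack.pop()
            arcs.add((stack[-1], dep))
    return arcs

# Invented example: "the" depends on "cat", "cat" on "sleeps",
# and "sleeps" on the artificial root.
words = ["<root>", "the", "cat", "sleeps"]
arcs = arc_standard(words, ["SH", "SH", "LA", "SH", "LA", "RA"])
```

A static oracle derives one such gold transition sequence from a (projective) gold tree; a dynamic oracle instead scores every transition in every reachable configuration, which is what makes training on the parser's own mistakes possible.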