# Natural Language Processing (NLP)

### Lectures

This page links to the video lectures and study materials for the interactive sessions (notebooks and additional reading) and lists the central concepts, models, and algorithms that you are expected to master after each unit.

### Course introduction

Welcome to the course! This unit introduces you to natural language processing and to written language as a type of data, presents the course logistics, and reviews basic concepts from linguistics and machine learning. You will also learn how to implement a simple sentiment classifier based on the bag-of-words representation and softmax regression.

#### Lectures

- 0.0 Introduction to natural language processing (slides, video)
- 0.1 Course overview (slides, video)
- 0.2 Course logistics (slides, video)
- 0.3 Essentials of linguistics (slides, video)
- 0.4 Basic text processing (notebook, live notebook, video)
- 0.5 Softmax regression (slides, video)
- 0.6 Implementing softmax regression (notebook, live notebook, video)

#### Reading

- Eisenstein (2019), chapter 1 and sections 2.1, 2.5–2.7, 4.3
- Detailed information about the course logistics is available on this website.

#### Concepts, models, and algorithms

- search and learning, Zipf’s law, Heaps’ law
- lexeme, lemma, part-of-speech, dependency tree, synonym
- tokenization, vocabulary, word token, word type, normalization
- sentiment analysis, bag-of-words
- softmax regression, cross-entropy loss, gradient-based optimization
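
As a rough illustration of how the bag-of-words representation and softmax regression fit together, here is a minimal sketch in plain Python. It is not taken from the course notebooks: the function names, the three-word vocabulary, and the hand-set weights are assumptions made for this example, and training by gradient-based optimization is left out.

```python
import math
from collections import Counter


def bag_of_words(text, vocab):
    """Map a text to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, biases):
    """Softmax regression: one linear score per class, then softmax."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


def cross_entropy(probs, gold):
    """Negative log probability assigned to the gold-standard class."""
    return -math.log(probs[gold])


# Hypothetical toy setup: class 0 = positive, class 1 = negative.
vocab = ["good", "bad", "film"]
weights = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]  # hand-set, not learned
biases = [0.0, 0.0]

x = bag_of_words("Good film , good cast", vocab)
probs = predict(x, weights, biases)
```

Words outside the vocabulary are simply ignored here, and word order is lost, which is the defining simplification of the bag-of-words representation.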

### Unit 1: Word representations

To process words using neural networks, we need to represent them as vectors of numerical values. In this unit you will learn different methods for learning these representations from data, including the widely used skip-gram model. The unit also introduces the idea of subword representations, and in particular character-level representations, which can be learned using convolutional neural networks.

#### Lectures

- 1.1 Introduction to word representations (slides, video)
- 1.2 Learning word embeddings via singular value decomposition (slides, video)
- 1.3 Learning word embeddings with neural networks (slides, video, live notebook)
- 1.4 The skip-gram model (slides, video)
- 1.5 Subword models (slides, video)
- 1.6 Convolutional neural networks (slides, video)

#### Reading

- Eisenstein (2019), chapter 14
- Research article: Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes

#### Teaching session

2022-01-24: Word representations (video, padlet)

#### Concepts, models, and algorithms

- one-hot vectors, word embeddings, distributional hypothesis, co-occurrence matrix
- truncated singular value decomposition, positive pointwise mutual information
- embedding layers, continuous bag-of-words classifier, representation learning, transfer learning
- skip-gram model, negative sampling
- word piece tokenization, byte pair encoding algorithm, character-level word representations, word dropout
- convolutional neural network, CNN architecture for text classification
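
To make the co-occurrence matrix and positive pointwise mutual information (PPMI) from the list above more concrete, the following sketch computes both for a two-sentence toy corpus. The function names, the window size of 1, and the corpus are assumptions for illustration; a truncated singular value decomposition of the resulting matrix would be the next step and is not shown.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sentence[j])] += 1
    return counts


def ppmi(counts):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)
    return scores


# Hypothetical toy corpus, already tokenized.
sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
counts = cooccurrence_counts(sentences)
scores = ppmi(counts)
```

Pairs that never co-occur are absent from `scores`; in a matrix view they correspond to zero entries.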

### Unit 2: Language modelling

Language modelling is the task of predicting which word comes next in a sequence of words. This unit presents two types of language models: n-gram models and neural models, with a focus on models based on recurrent neural networks. You will also learn how these language models can be used to learn more powerful, contextualized word representations.

#### Lectures

- 2.1 Introduction to language modelling (slides, video)
- 2.2 N-gram language models (slides, video)
- 2.3 Recurrent neural networks (slides, video, live notebook)
- 2.4 Long Short-Term Memory (LSTM) networks (slides, video)
- 2.5 Recurrent neural network language models (slides, video, live notebook)
- 2.6 Contextualized word embeddings (slides, video)

#### Reading

- Eisenstein (2019), chapter 6
- Blog article: Understanding LSTM Networks

#### Teaching session

2022-01-31: Language modelling (video, padlet)

#### Concepts, models, and algorithms

- language modelling as a prediction task and as a probability model, perplexity
- n-gram language models, maximum likelihood estimation, smoothing, interpolation
- recurrent neural networks, backpropagation through time, encoder/transducer/decoder architectures
- Long Short-Term Memory (LSTM) architecture, gating mechanism
- fixed-window language model, recurrent language model
- polysemy, contextualized word embeddings, bidirectional LSTM, ELMo architecture
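
The n-gram concepts above (maximum likelihood estimation, smoothing, perplexity) can be sketched with a bigram model. This is an illustrative example rather than course code: the function names and the boundary markers `<s>` and `</s>` are assumptions, add-k smoothing stands in for the more refined smoothing methods, and unknown words are not handled.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed sentence boundary markers


def train_bigram(sentences):
    """Collect history counts and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [BOS] + sentence + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[prev] += 1  # how often prev occurs as a history
            bigrams[(prev, word)] += 1
    vocab = {w for s in sentences for w in s} | {EOS}
    return unigrams, bigrams, vocab


def prob(word, prev, unigrams, bigrams, vocab, k=1.0):
    """Add-k smoothed estimate of P(word | prev); k = 0 gives plain MLE."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))


def perplexity(sentences, unigrams, bigrams, vocab):
    """Exponentiated average negative log-likelihood per predicted token."""
    log_prob, n = 0.0, 0
    for sentence in sentences:
        tokens = [BOS] + sentence + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(prob(word, prev, unigrams, bigrams, vocab))
            n += 1
    return math.exp(-log_prob / n)


# Hypothetical toy data; all held-out words also occur in the training data.
train = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
held_out = [["sleeps", "the", "cat"]]

unigrams, bigrams, vocab = train_bigram(train)
train_ppl = perplexity(train, unigrams, bigrams, vocab)
held_out_ppl = perplexity(held_out, unigrams, bigrams, vocab)
```

The held-out sentence contains bigrams that were never observed in training, so the model assigns it a higher perplexity than the training data.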

### Unit 3: Sequence labelling

Sequence labelling is the task of assigning a class label to each item in an input sequence. Many tasks in natural language processing can be cast as sequence labelling problems over different sets of output labels, including part-of-speech tagging, word segmentation, and named entity recognition. This unit introduces several models for sequence labelling, with both local and global search.

#### Lectures

- 3.1 Introduction to sequence labelling (slides, video)
- 3.2 Sequence labelling with local search (slides, video)
- 3.3 Part-of-speech tagging with the perceptron (slides, video)
- 3.4 The perceptron learning algorithm (slides, video)
- 3.5 Sequence labelling with global search (slides, video)
- 3.6 The Viterbi algorithm (slides, video)

#### Reading

- Eisenstein (2019), chapters 7–8, sections 2.3.1–2.3.2
- Daumé, A Course in Machine Learning, section 4.6 (link)

#### Teaching session

2022-02-07: Sequence labelling (video, padlet)

#### Concepts, models, and algorithms

- different types of sequence labelling tasks: tagging, segmentation, bracketing; accuracy, precision, recall
- fixed-window model and bidirectional RNN model for sequence labelling, autoregressive models, teacher forcing
- perceptron, features in part-of-speech tagging, feature templates
- perceptron learning algorithm, averaged perceptron
- Maximum Entropy Markov Model (MEMM), label bias problem, Conditional Random Field (CRF)
- Viterbi algorithm, backpointers
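
As an illustration of global search with the Viterbi algorithm and backpointers, the sketch below finds the highest-scoring tag sequence under a first-order model. The interface (a `score` function over adjacent tags) and the toy emission and transition scores are assumptions made for this example, not the formulation used in the lectures.

```python
def viterbi(words, tags, score):
    """Return the highest-scoring tag sequence for words.

    score(prev_tag, tag, words, i) gives the local score of assigning tag
    to words[i] when the previous tag is prev_tag (None at position 0).
    """
    if not words:
        return []
    # best[i][t]: score of the best sequence for words[:i + 1] ending in t
    best = [{t: score(None, t, words, 0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best sequence
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(p, t, words, i))
            best[i][t] = best[i - 1][prev] + score(prev, t, words, i)
            back[i][t] = prev
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy scores; unlisted pairs score 0.
emission = {("the", "DET"): 2.0, ("dog", "NOUN"): 1.0,
            ("barks", "VERB"): 1.0, ("barks", "NOUN"): 1.2}
transition = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0}


def toy_score(prev_tag, tag, words, i):
    return emission.get((words[i], tag), 0.0) + transition.get((prev_tag, tag), 0.0)


tagged = viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], toy_score)
```

In this toy setup a purely local decision based on emission scores would tag "barks" as a noun; the transition score from NOUN to VERB makes the globally best sequence end in a verb.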

### Unit 4: Syntactic analysis

Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure. In this unit you will learn about two approaches to dependency parsing, where the target representations take the form of dependency trees: the Eisner algorithm, which casts dependency parsing as combinatorial optimisation over graphs, and transition-based dependency parsing, an approach that has also been used by Google.

#### Lectures

- 4.1 Introduction to dependency parsing (slides, video)
- 4.2 The arc-standard algorithm (slides, video)
- 4.3 The Eisner algorithm (slides, video)
- 4.4 Neural architectures for dependency parsing (slides, video)
- 4.5 Dynamic oracles (slides, video)

#### Reading

- Eisenstein (2019), chapter 11

#### Teaching session

2022-02-14: Syntactic analysis (video, padlet)

#### Concepts, models, and algorithms

- dependency tree, head, dependent, graph-based parsing, transition-based parsing
- arc-standard algorithm, projective/non-projective dependency trees, static oracle
- Eisner algorithm, backpointers
- parsing architectures of Chen and Manning (2014), Kiperwasser and Goldberg (2016), Dozat and Manning (2017)
- dynamic oracle, transition cost, arc reachability, arc-hybrid algorithm
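
To show how a transition-based parser builds a dependency tree, here is a small sketch of the arc-standard transition system. It assumes the common variant in which both arc transitions operate on the two topmost stack items; the transition names `SH`, `LA`, `RA` and the function name are chosen for this example, and no classifier or oracle is included, so the transition sequence is given by hand.

```python
def arc_standard(n_words, transitions):
    """Apply a transition sequence; return a dict mapping dependents to heads.

    Words are numbered 1..n_words; 0 is the artificial root node.
    """
    stack = [0]
    buffer = list(range(1, n_words + 1))
    heads = {}
    for t in transitions:
        if t == "SH" and buffer:
            # Shift: move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif t == "LA" and len(stack) >= 2 and stack[-2] != 0:
            # Left-arc: the second-topmost item becomes a dependent of the top.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif t == "RA" and len(stack) >= 2:
            # Right-arc: the top item becomes a dependent of the one below it.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"transition {t!r} is not valid here")
    return heads


# "the dog barks": the <- dog <- barks <- root
heads = arc_standard(3, ["SH", "SH", "LA", "SH", "LA", "RA"])
```

Each word is shifted once and removed from the stack once, so a complete parse of n words takes 2n transitions.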

### Unit 5: Machine translation & current research

Machine translation is one of the classical problems in artificial intelligence. In this unit you will learn about neural machine translation and one of its standard models, the encoder–decoder architecture. A crucial ingredient in this architecture is the mechanism of *attention*. This concept is also the key to one of the most recent developments in the field of NLP, the Transformer architecture, which we will cover in the last lectures of this unit.

#### Lectures

- 5.1 Introduction to machine translation (slides, video)
- 5.2 Neural machine translation (slides, video)
- 5.3 Attention (slides, video)
- 5.4 The Transformer architecture (slides, video)
- 5.5 BERT and other pre-trained transformer models (slides, video)

#### Reading

- Eisenstein (2019), chapter 18

#### Teaching session

2022-02-21: Machine translation (video, no padlet available)

#### Concepts, models, and algorithms

- interlingual machine translation, noisy channel model, word alignments, BLEU score
- encoder–decoder architecture
- recency bias, attention, context vector
- Transformer architecture, self-attention, scaled dot-product attention
- BERT, masked language modelling task, next sentence prediction task
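
Scaled dot-product attention, listed above, can be sketched in a few lines of plain Python. This is a single-head version without masking or learned projections, written with nested lists instead of tensors; the function names and toy vectors are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(q_j * k_j for q_j, k_j in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Identical keys receive equal weight, so the output is the mean of the values.
uniform = attention([[1.0, 1.0]],
                    [[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 2.0], [4.0, 6.0]])

# A query that matches one key much better than the other retrieves
# (almost exactly) the value stored under that key.
peaked = attention([[10.0, 0.0]],
                   [[10.0, 0.0], [0.0, 10.0]],
                   [[1.0, 2.0], [3.0, 4.0]])
```

In self-attention, queries, keys, and values are all computed from the same input sequence; in an encoder–decoder model, the queries come from the decoder and the keys and values from the encoder.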

Page responsible: Marco Kuhlmann

Last updated: 2022-01-13