732A81 Text Mining
Lectures
Course introduction
Welcome to the course! This lecture provides you with a bird’s-eye view of text mining and explains the course logistics. It also presents some concrete examples of text mining techniques to give you an idea of what’s to come.
Interactive session
Detailed information about the organisation and examination of the course is available on this website.
Reading
Zhai and Massung (2016), pages 3–13
Topic 1: Information retrieval
Information retrieval (IR) is about finding relevant documents in large collections of text – the first component of many text mining systems. This lecture introduces you to some of the basic concepts in IR, including two standard models of retrieval: the Boolean retrieval model and the vector space model. You will also learn about standard measures for the evaluation of IR systems.
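To make the two retrieval models concrete, here is a minimal sketch in plain Python. The toy documents, the whitespace tokeniser, and the function names `boolean_and` and `rank` are assumptions made for illustration and are not taken from the course material. The tf-idf weighting shown is one common variant among several.

```python
import math
from collections import Counter

# Toy collection, made up for illustration.
docs = {
    "d1": "text mining finds patterns in text",
    "d2": "information retrieval finds relevant documents",
    "d3": "cooking pasta takes ten minutes",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()

# Boolean retrieval: an inverted index maps each term to the set of
# documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in tokenize(text):
        index.setdefault(term, set()).add(doc_id)

def boolean_and(query):
    """Return the ids of the documents that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Vector space model: documents and queries become tf-idf weighted
# vectors, and documents are ranked by cosine similarity to the query.
n_docs = len(docs)
idf = {term: math.log(n_docs / len(ids)) for term, ids in index.items()}

def tfidf(text):
    """Sparse tf-idf vector; terms outside the collection are dropped."""
    counts = Counter(tokenize(text))
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query):
    """Return document ids sorted by decreasing similarity to the query."""
    q = tfidf(query)
    scores = {doc_id: cosine(q, tfidf(text)) for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `boolean_and("finds text")` returns only the documents that contain both terms, whereas `rank("relevant documents")` orders all documents by similarity, so partial matches still receive a score.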
Lectures
- 1.1 Introduction to information retrieval (video)
- 1.2 Index construction (video)
- 1.3 Ranked retrieval (video)
- 1.4 The vector space model (video)
- 1.5 Evaluation of information retrieval systems (video)
Reading
Manning, Raghavan, and Schütze (2008), sections 1.1–1.3, 2.1–2.2, 6.1–6.5, 8.1–8.4
Interactive session
Topic 2: Text classification
Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, and learn about some effective classification algorithms: Naive Bayes, logistic regression, and support vector machines. You will also learn how to evaluate text classifiers using standard validation methods.
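As an illustration of one of these algorithms, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training examples, the labels, and the names `train_nb` and `predict_nb` are invented for this sketch. It ignores words that were not seen in training, which is one common convention and not the only one.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (text, label) pairs.
train = [
    ("great film loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("boring film hated it", "neg"),
    ("terrible acting boring story", "neg"),
]

def train_nb(examples):
    """Collect the counts needed for the class priors and word likelihoods."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # assumption: unknown words are skipped
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(train)
```

With this toy data, `predict_nb(model, "great story")` yields `"pos"` and `predict_nb(model, "boring film")` yields `"neg"`. A realistic evaluation would hold out separate test data instead of checking hand-picked inputs.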
Lectures
- 2.1 Introduction to text classification (video)
- 2.2 Evaluation of text classifiers (video)
- 2.3 The Naive Bayes classifier (video)
- 2.4 The logistic regression classifier (video)
- 2.5 Support vector machines (video)
Reading
Jurafsky and Martin (2021), chapters 4 and 5
Interactive session
Topic 3: Text clustering and topic modelling
Whereas text classification sorts documents into predefined categories, clustering discovers categories by grouping documents into subsets of mutually similar texts. The first half of this lecture surveys algorithms for both flat clustering and hierarchical clustering. The second half covers topic modelling, a form of soft clustering particularly relevant for text mining.
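The relation between hierarchical and flat clustering can be sketched in a few lines of plain Python: single-link agglomerative clustering repeatedly merges the two most similar clusters, and stopping at k clusters gives a flat partition. The toy documents, the bag-of-words representation, and the names `bow` and `single_link` are assumptions for illustration. Single-link is only one of several linkage criteria.

```python
import math
from collections import Counter

# Made-up documents: two about finance, two about sports.
docs = [
    "stock market shares fall",
    "shares rise on stock market",
    "football team wins match",
    "team loses football match",
]

def bow(text):
    """Bag-of-words vector as a Counter of raw term counts."""
    return Counter(text.split())

def cosine(u, v):
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_link(vectors, k):
    """Merge the two most similar clusters until only k clusters remain.

    The similarity of two clusters is that of their most similar pair
    of members (single link). Clusters are lists of document indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(
                    cosine(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters

vectors = [bow(d) for d in docs]
```

Here `single_link(vectors, 2)` groups the two finance documents and the two sports documents. The pairwise search is quadratic per merge, which is acceptable for a toy example but not for large collections.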
Lectures
Slides: Text clustering and topic modelling
- 3.1 Introduction to text clustering (video)
- 3.2 Similarity measures (video)
- 3.3 An overview of hard clustering methods (video)
- 3.4 Evaluation of hard clustering (video)
- 3.5 Soft clustering: topic models (video)
Reading
- Manning, Raghavan, and Schütze (2008), chapters 16 and 17
- Blei (2012)
Interactive session
Topic 4: Word embeddings
A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. This lecture reviews standard models for deriving word embeddings from text (including Google’s word2vec), and discusses the current generation of contextualised word embeddings, which build on deep learning methods.
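The idea that similar words have similar distributional properties can be illustrated with raw co-occurrence counts, before any dimensionality reduction or neural training. The sketch below is plain Python. The toy corpus, the window size of 2, and the function name `cooccurrence_vectors` are assumptions for illustration, and this is not how word2vec itself is trained.

```python
import math
from collections import Counter, defaultdict

# Made-up corpus for illustration.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "the car needs fuel",
]

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within the window."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

In this corpus "cat" and "dog" share contexts such as "drinks" and "chases", so `sim_cat_dog` comes out higher than `sim_cat_car`. Learned embeddings compress this kind of co-occurrence information into dense, low-dimensional vectors.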
(This unit is largely identical to Unit 1 in the course TDDE09 Natural Language Processing.)
Lectures
- 4.1 Introduction to word embeddings (video)
- 4.2 Learning word embeddings via matrix factorisation (video)
- 4.3 Learning word embeddings with neural networks (video)
- 4.4 The skip-gram model (video)
- 4.5 Subword models (video)
- 4.6 Contextualised word embeddings (video)
Reading
- Jurafsky and Martin (2019), chapter 6
- Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes (Garg et al., 2018)
Interactive session
Topic 5: Information extraction
Information extraction is the task of identifying named entities and semantic relations between entities in text data. This information can be used to populate structured databases, or to answer questions based on a text. The lecture presents different approaches to information extraction and to linking extracted entities to structured information.
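A very simple approach to relation extraction is a hand-written pattern. The sketch below uses Python's standard `re` module to pull (person, relation, place) triples out of text that could then populate a database table. The pattern, the relation label `born_in`, the function name `extract_born_in`, and the sample text are assumptions for illustration. Treating any run of capitalised words as an entity is a crude stand-in for real named entity recognition.

```python
import re

# A run of capitalised words, used as a rough proxy for a named entity.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hand-written pattern for a single relation: "<person> was born in <place>".
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")

def extract_born_in(text):
    """Return (person, 'born_in', place) triples matched by the pattern."""
    return [
        (m.group("person"), "born_in", m.group("place"))
        for m in BORN_IN.finditer(text)
    ]

text = "Marie Curie was born in Warsaw. Alan Turing was born in London."
triples = extract_born_in(text)
```

Patterns like this tend to be precise but miss most paraphrases ("a native of Warsaw", "whose birthplace was London"), which is one motivation for learned extractors.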
Lectures
Slides: Information extraction
- 5.1 Introduction to information extraction (video)
- 5.2 Named entity recognition (video)
- 5.3 Entity linking (video)
- 5.4 Relation extraction (video)
Reading
Jurafsky and Martin (2019), chapters 8 and 17
Interactive session
Interactive session: Information extraction
Additional material
- Entity Recognition and Transfer Learning with Prodigy (real-life example of data annotation)
- Natural Language Annotation (book by Pustejovsky and Stubbs, 2012)
Page responsible: Marco Kuhlmann
Last updated: 2022-10-26