TDDE16 Text Mining


This course website is no longer being maintained. Please refer to Lisam for HT2023.

This page gives an overview of the lectures and links to related material. The concepts and techniques presented in the lectures are examined in the labs and the final project; see the Examination page for details.

Course introduction

Welcome to the course! This lecture provides you with a bird’s-eye view of text mining and explains the course logistics. It also presents some concrete examples of text mining techniques to give you an idea of what’s to come.

Interactive session

Detailed information about the organisation and examination of the course is available on this website.

Reading

Zai and Massung (2016), pages 3–13

Topic 1: Information retrieval

Information retrieval (IR) is about finding relevant documents in large collections of text – the first component of many text mining systems. This lecture introduces you to some of the basic concepts in IR, including two standard models of retrieval: the Boolean retrieval model and the vector space model. You will also learn about standard measures for the evaluation of IR systems.
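To make the two retrieval models and the evaluation measures concrete, here is a minimal sketch in plain Python. It is not taken from the course material: the toy documents, the function names (`boolean_and`, `rank`, `precision_recall`) and the particular tf-idf weighting (raw term frequency times log inverse document frequency) are assumptions chosen for illustration, and other weighting schemes are equally common.

```python
import math
from collections import Counter, defaultdict

# Toy collection (hypothetical documents, for illustration only).
docs = {
    1: "text mining finds patterns in text",
    2: "information retrieval finds relevant documents",
    3: "mining gold is hard work",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}

# Boolean model: an inverted index maps each term to the documents containing it.
index = defaultdict(set)
for doc_id, words in tokens.items():
    for word in words:
        index[word].add(doc_id)

def boolean_and(*terms):
    """Return the ids of the documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Vector space model: tf-idf vectors compared by cosine similarity.
n_docs = len(docs)
idf = {term: math.log(n_docs / len(ids)) for term, ids in index.items()}

def tfidf(words):
    """Sparse tf-idf vector (term -> weight); unseen terms get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(words).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query):
    """Return all document ids, ordered by decreasing similarity to the query."""
    q = tfidf(query.split())
    scores = {d: cosine(q, tfidf(words)) for d, words in tokens.items()}
    return sorted(scores, key=scores.get, reverse=True)

def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved set against a gold set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(boolean_and("text", "mining"))   # {1}
print(rank("text mining"))             # [1, 3, 2]
print(precision_recall({1, 3}, {1}))   # (0.5, 1.0)
```

Note the difference between the two models: the Boolean query returns an unordered set of exact matches, whereas the vector space query ranks every document, so document 3 still appears (with a low score) because it shares the term "mining" with the query.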

Lectures

Slides: Information retrieval

  • 1.1 Introduction to information retrieval (video)
  • 1.2 Index construction (video)
  • 1.3 Ranked retrieval (video)
  • 1.4 The vector space model (video)
  • 1.5 Evaluation of information retrieval systems (video)

Reading

Manning, Raghavan, and Schütze (2008), sections 1.1–1.3, 2.1–2.2, 6.1–6.5, 8.1–8.4

Interactive session

Interactive session: Information retrieval

Topic 2: Text classification

Text classification is the task of categorising text documents into predefined classes. In this module, you will be introduced to text classification and its applications, and learn about some effective classification algorithms: Naive Bayes, logistic regression, and support vector machines. You will also learn how to evaluate text classifiers using standard validation methods.
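As an illustration of the first of these algorithms, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training examples, labels and function names are hypothetical, and the choice to ignore words that never occurred in training is one common convention, assumed here rather than taken from the course material.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Collect class and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: relative frequency of the class in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # assumption: words unseen in training are skipped
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
training_data = [
    ("great film loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("boring film hated it", "neg"),
    ("terrible plot boring acting", "neg"),
]
model = train_naive_bayes(training_data)
print(predict(model, "great wonderful film"))  # pos
print(predict(model, "boring terrible"))       # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word-class combination from forcing a probability of zero.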

Lectures

Slides: Text classification

  • 2.1 Introduction to text classification (video)
  • 2.2 Evaluation of text classifiers (video)
  • 2.3 The Naive Bayes classifier (video)
  • 2.4 The logistic regression classifier (video)
  • 2.5 Support vector machines (video)

Reading

Jurafsky and Martin (2021), chapters 4 and 5

Interactive session

Interactive session: Text classification

Topic 3: Text clustering and topic modelling

Whereas text classification sorts documents into predefined categories, clustering discovers categories by grouping documents into subsets of mutually similar texts. The first half of this lecture surveys algorithms for both flat clustering and hierarchical clustering. The second half covers topic modelling, a form of soft clustering particularly relevant for text mining.
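The flat clustering part can be illustrated with a minimal k-means sketch in plain Python. This is only one of several flat clustering algorithms and is shown here as an assumption about what a typical example looks like; the document vectors (term counts over a hypothetical four-word vocabulary) and the fixed initial centroids are made up so that the run is deterministic. Hierarchical clustering and topic models are not covered by this sketch.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(mean(members) if members else old)
        centroids = new_centroids
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical term-count vectors over the vocabulary (ball, goal, vote, law).
documents = [(3, 2, 0, 0), (2, 3, 0, 1), (0, 0, 3, 2), (0, 1, 2, 3)]
labels, centroids = kmeans(documents, centroids=[documents[0], documents[2]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(2.5, 2.5, 0.0, 0.5), (0.0, 0.5, 2.5, 2.5)]
```

Unlike the classifier setting, no labels are supplied: the two groups emerge purely from the distances between the vectors. In practice the initial centroids are usually chosen at random, so different runs can produce different clusterings.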

Lectures

Slides: Text clustering and topic modelling

  • 3.1 Introduction to text clustering (video)
  • 3.2 Similarity measures (video)
  • 3.3 An overview of hard clustering methods (video)
  • 3.4 Evaluation of hard clustering (video)
  • 3.5 Soft clustering: topic models (video)

Reading

Interactive session

Interactive session: Text clustering and topic modelling

Topic 4: Word embeddings

A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. This lecture reviews standard models for deriving word embeddings from text (including Google’s word2vec), and discusses the current generation of contextualised word embeddings, which build on deep learning methods.

(This unit is largely identical to Unit 1 in the course TDDE09 Natural Language Processing.)
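The idea that words with similar distributional properties end up close together can be sketched with simple co-occurrence counts. The example below is an illustration under stated assumptions, not the method used by word2vec: it builds sparse count vectors from a hypothetical toy corpus and compares them with cosine similarity, whereas the models covered in the lecture learn dense, low-dimensional vectors.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus.
corpus = [
    "cat drinks milk",
    "dog drinks water",
    "cat eats fish",
    "dog eats meat",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start, end = max(0, i - window), min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["cat"], vectors["dog"]))   # ≈ 0.5
print(cosine(vectors["cat"], vectors["milk"]))  # ≈ 0.354
```

Here "cat" and "dog" never occur in the same sentence, yet they come out as more similar to each other than "cat" is to "milk", because they share the contexts "drinks" and "eats".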

Lectures

Slides: Word embeddings

  • 4.1 Introduction to word embeddings (video)
  • 4.2 Learning word embeddings via matrix factorisation (video)
  • 4.3 Learning word embeddings with neural networks (video)
  • 4.4 The skip-gram model (video)
  • 4.5 Subword models (video)
  • 4.6 Contextualised word embeddings (video)

Reading

Interactive session

Interactive session: Word embeddings

Topic 5: Information extraction

Information extraction is the task of identifying named entities and semantic relations between entities in text data. This information can be used to populate structured databases, or to answer questions based on a text. The lecture presents different approaches to information extraction and linking extracted entities to structured information.
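A very simple way to illustrate the step from text to structured data is a hand-written pattern. The sketch below is a deliberately naive, rule-based example and an assumption for illustration only: the relation name `born_in`, the regular expression and the example sentence are made up, and capitalised word sequences are used as a crude stand-in for proper named entity recognition.

```python
import re

# Crude stand-in for named entities: one or more capitalised words in a row.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hand-written pattern for a single relation type.
BORN_IN = re.compile(rf"({ENTITY}) was born in ({ENTITY})")

def extract_born_in(text):
    """Return (relation, person, place) triples matched by the pattern."""
    return [("born_in", person, place) for person, place in BORN_IN.findall(text)]

text = (
    "Alan Turing was born in London. "
    "Ada Lovelace was born in London and worked with Charles Babbage."
)
for triple in extract_born_in(text):
    print(triple)
# ('born_in', 'Alan Turing', 'London')
# ('born_in', 'Ada Lovelace', 'London')
```

The resulting triples could be inserted into a database table as they are. The pattern misses any paraphrase ("a native of London", "London-born"), which is one reason why hand-written rules tend to give way to learned models.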

Lectures

Slides: Information extraction

  • 5.1 Introduction to information extraction (video)
  • 5.2 Named entity recognition (video)
  • 5.3 Entity linking (video)
  • 5.4 Relation extraction (video)

Reading

Jurafsky and Martin (2019), chapters 8 and 17

Interactive session

Interactive session: Information extraction

Additional material


Page responsible: Marco Kuhlmann
Last updated: 2022-10-26