
TDDE16 Text Mining

This page gives an overview of the lectures and links to related material. The concepts and techniques presented in the lectures are examined in the labs and the final project; see the Examination page for details.

Course introduction

Welcome to the course! This lecture provides you with a bird’s-eye view of text mining and explains the course logistics. It also presents some concrete examples of text mining techniques to give you an idea of what’s to come.


Detailed information about the organisation and examination of the course is available on this website.

Topic 1: Information retrieval

Information retrieval (IR) is about finding relevant documents in large collections of text – the first component of many text mining systems. This lecture introduces you to some of the basic concepts in IR, including two standard models of retrieval: the Boolean retrieval model and the vector space model. You will also learn about standard measures for the evaluation of IR systems.
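
As a rough illustration of the two retrieval models and the evaluation measures mentioned above, here is a minimal sketch in plain Python. The toy collection, the whitespace tokenizer, the particular tf-idf weighting, and all function names are assumptions made for this example; they are not taken from the lecture material.

```python
import math
from collections import Counter

# Toy collection, invented for illustration only.
DOCS = {
    "d1": "text mining finds patterns in text",
    "d2": "information retrieval finds relevant documents",
    "d3": "boolean retrieval uses set operations",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean model: return ids of documents that contain every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def tfidf_vectors(docs):
    """Vector space model: one sparse tf-idf vector per document."""
    tokens = {d: tokenize(text) for d, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for d, toks in tokens.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Rank document ids by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant ids."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Calling `boolean_and("retrieval finds", DOCS)` returns only the documents that contain both terms, whereas `rank("relevant documents", DOCS)` orders all documents by similarity, which is the practical difference between the two models.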


Topic 2: Text classification

Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, and learn about some effective classification algorithms: Naive Bayes, logistic regression, and Support Vector Machines. You will also learn how to evaluate text classifiers using standard validation methods.
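
To make one of the algorithms named above more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training sentences, the labels, and the function names `train_nb` and `predict_nb` are hypothetical and chosen only for this example; a real system would more likely use an existing library implementation.

```python
import math
from collections import Counter, defaultdict

# Tiny labelled data set, invented for illustration only.
TRAIN = [
    ("great film loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("boring film hated it", "neg"),
    ("terrible plot boring acting", "neg"),
]


def train_nb(examples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
```

With this toy data, `predict_nb(model, "great wonderful film")` picks the positive class because those words occur more often in the positive training texts. Evaluating such a classifier properly requires held-out data, which this sketch leaves out.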


Topic 3: Text clustering and topic modelling

Whereas text classification sorts documents into predefined categories, clustering discovers categories by grouping documents into subsets of mutually similar texts. The first half of this lecture surveys algorithms for both flat clustering and hierarchical clustering. The second half covers topic modelling, a form of soft clustering particularly relevant for text mining.
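
As an example of flat clustering, here is a minimal k-means sketch in plain Python. It assumes that documents have already been turned into numeric vectors; the two-dimensional points, the choice of the first `k` points as initial centroids, and the fixed number of iterations are simplifications made for this illustration. Topic modelling is not covered by this sketch.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Return (assignment, centroids) after a fixed number of iterations."""
    # Assumption: use the first k points as initial centroids so that the
    # result is deterministic; random initialisation is more common.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids


# Four invented "document vectors" that form two obvious groups.
POINTS = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
assignment, centroids = kmeans(POINTS, k=2)
```

Here the first and third points end up in one cluster and the second and fourth in the other. Unlike classification, no labels are given in advance; the grouping emerges from the similarity of the vectors alone.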


Topic 4: Word embeddings

A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. This lecture reviews standard models for deriving word embeddings from text (including Google’s word2vec), and discusses the current generation of contextualised word embeddings, which build on deep learning methods.
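
The idea that nearby points correspond to similar words can be sketched with a nearest-neighbour lookup based on cosine similarity. The three-dimensional vectors below are invented by hand for this example and were not produced by word2vec or any other model; real embeddings typically have hundreds of dimensions and are learned from large corpora.

```python
import math

# Hand-made toy vectors, for illustration only.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, embeddings, n=1):
    """Return the n words whose vectors are most similar to that of `word`."""
    target = embeddings[word]
    candidates = [w for w in embeddings if w != word]
    candidates.sort(key=lambda w: cosine(target, embeddings[w]), reverse=True)
    return candidates[:n]
```

With these toy vectors, `nearest("king", EMBEDDINGS)` returns `["queen"]` and `nearest("apple", EMBEDDINGS)` returns `["pear"]`.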


Topic 5: Information extraction

Information extraction is the task of identifying named entities and semantic relations between entities in text data. This information can be used to populate structured databases, or to answer questions based on a text. The lecture presents different approaches to information extraction and linking extracted entities to structured information.
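
One very simple approach to relation extraction is a hand-written pattern. The sketch below uses a regular expression to pull (person, relation, place) triples out of sentences of the form "X was born in Y". The pattern, the relation name `born_in`, and the example text are assumptions made for this illustration; the sketch treats any run of capitalised words as an entity, which is far cruder than a real named entity recogniser.

```python
import re

# A capitalised word, optionally followed by more capitalised words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical pattern for a single relation type.
BORN_IN = re.compile(rf"({NAME}) was born in ({NAME})")


def extract_born_in(text):
    """Return (person, "born_in", place) triples found by the pattern."""
    return [(m.group(1), "born_in", m.group(2)) for m in BORN_IN.finditer(text)]


TEXT = "Alan Turing was born in London. Ada Lovelace was born in London too."
triples = extract_born_in(TEXT)
```

Triples of this kind are the sort of output that could be stored as rows in a structured database. A pattern like this is precise but misses every paraphrase it was not written for.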


Page responsible: Marco Kuhlmann
Last updated: 2020-11-02