
732A92 Text Mining


This page contains an overview of the lectures and links to material related to them. The concepts and techniques presented in the lectures are examined in the labs and the final project; see the Examination page for more information.

Course introduction

Welcome to the course! This lecture provides you with a bird’s-eye view of text mining and explains the organisation of the course. The second half gives you some more concrete ideas about what’s to come and introduces several useful Python libraries.

Teacher: Marco Kuhlmann

Unit 1: Information Retrieval

The goal of this unit is to enable you to use basic methods from the area of Information Retrieval (IR). These methods can be used to find relevant documents in large collections – the first component of many text mining systems.

Lecture 1: Information Retrieval 1

This lecture introduces you to some of the basic concepts in information retrieval such as the inverted index, and to three standard models for document retrieval: the boolean model, the vector space model, and the probabilistic model. The lecture also presents conceptual models for how retrieved information can be stored and queried efficiently.

Teacher: Patrick Lambrix
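
To make the inverted index and the boolean model from Lecture 1 a little more concrete, here is a minimal sketch in Python. The toy documents, the whitespace tokenisation, and the function names `build_inverted_index` and `boolean_and` are assumptions made for illustration; they are not taken from the lecture material.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Toy collection (invented for this example).
docs = {
    1: "free puzzle game",
    2: "photo editor with free filters",
    3: "puzzle solver",
}
index = build_inverted_index(docs)
result = boolean_and(index, ["free", "puzzle"])  # {1}
```

Answering a conjunctive query only requires intersecting the postings of the query terms, so the documents themselves never have to be scanned at query time.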

Lecture 2: Information Retrieval 2

This lecture completes the presentation of the three standard models for information retrieval and introduces precision and recall as two standard measures for evaluating IR systems. The second half of the lecture presents the lab assignment, in which you will construct a simple search engine for Android apps.

Teachers: Patrick Lambrix and Huanyu Liu
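
As a small illustration of the two evaluation measures introduced in Lecture 2: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below assumes that both sets are given as collections of document ids; the function name and the example ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    Both arguments are collections of document ids. Empty sets yield 0.0
    (a convention chosen here to avoid division by zero).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned documents 1-4; documents 2, 4 and 5 are relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
# p == 0.5 (2 of 4 retrieved are relevant), r == 2/3 (2 of 3 relevant were found)
```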

Unit 2: Natural Language Processing

This unit presents basic models and techniques from natural language processing (NLP). These can be used to uncover the linguistic structure of text documents and to extract relevant information expressed in natural language.

Lecture 3: Language Modelling

Language models are probability distributions over sequences of words. These models are useful in applications such as language detection and measuring textual similarity. This lecture starts by presenting the simplest language model, the bag-of-words model, proceeds to more complex n-gram models, and shows how these models can be effectively learned from data.

Teacher: Marco Kuhlmann
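
The following sketch shows one way an n-gram model of the kind mentioned in Lecture 3 could be estimated from data, here for n = 2. It assumes whitespace tokenisation, sentence boundary markers `<s>` and `</s>`, and add-one smoothing; these choices, like the function names, are illustrative assumptions and not necessarily the ones used in the lecture.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts in a list of sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}  # every word that can appear as the *next* token
    for sentence in sentences:
        words = sentence.lower().split()
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(words)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def bigram_prob(model, prev, word):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


model = train_bigram_model(["the cat sat", "the dog sat"])
p_cat = bigram_prob(model, "the", "cat")  # (1 + 1) / (2 + 5) = 2/7
```

With smoothing, word pairs that never occurred in the training data still receive a small non-zero probability, and the probabilities of all possible next words given a context sum to one.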

Lecture 4: Word Embeddings

The basic idea behind word embeddings is to represent words as points in a high-dimensional space in such a way that nearby words (points) have similar meanings. This idea has proven to be very fruitful in a wide range of applications. This lecture presents some standard word space models, including the models at the core of Google’s popular word2vec software.

Teacher: Marco Kuhlmann
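
To illustrate the idea from Lecture 4 that nearby points correspond to words with similar meanings, the sketch below looks up the nearest neighbour of a word. The three-dimensional vectors are invented by hand for this example (real embeddings are learned from data and have many more dimensions), and using cosine similarity as the notion of "nearby" is an assumption, albeit a common one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to that of `word`."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


# Hand-made toy vectors, not the output of any trained model.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
neighbour = nearest("cat", embeddings)  # "dog"
```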

Lecture 5: Information Extraction

Information extraction is the task of identifying named entities and semantic relations between entities in text data. This information can be used to populate structured databases, or to answer questions based on a text. The lecture presents different approaches to information extraction, as well as the related tasks of part-of-speech tagging and dependency parsing.

Teacher: Marco Kuhlmann

Unit 3: Probabilistic Modelling of Textual Data

The final unit of this course presents three of the most useful techniques for analysing text data: classification, clustering, and topic modelling.

Lecture 6: Introduction to Probabilistic Modelling of Textual Data. Text Classification

This lecture introduces the probabilistic perspective in text mining and gives examples of how previous methods can be presented from this perspective. The second part of the lecture concerns text classification, the task of categorising text documents into predefined classes.

Teacher: Måns Magnusson
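
As a sketch of what a probabilistic text classifier can look like, here is a small multinomial Naive Bayes classifier with add-one smoothing. This page does not say which classifiers Lecture 6 covers, so the choice of Naive Bayes is an assumption; the training examples and function names are likewise invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Collect class frequencies and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token in vocab:  # words unseen in training are ignored
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (invented for this example).
training = [
    ("great fun game", "positive"),
    ("fun and addictive", "positive"),
    ("crashes all the time", "negative"),
    ("boring and slow", "negative"),
]
model = train_naive_bayes(training)
label = classify(model, "fun game")  # "positive"
```

Working with sums of log probabilities rather than products of probabilities avoids numerical underflow on longer documents.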

Lecture 7: Document Clustering

Whereas text classification sorts documents into predefined categories, clustering discovers categories by grouping documents into subsets of mutually similar texts. This lecture surveys algorithms for flat clustering, where the document clusters have no explicit relation to one another, and hierarchical clustering, where one cluster may be contained in another one.

Teacher: Måns Magnusson
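
The sketch below illustrates flat clustering with k-means, one widely used algorithm of this kind; whether Lecture 7 covers exactly this algorithm is an assumption. Documents are represented as small numeric vectors that were made up for the example, and the initial centroids are passed in explicitly so that the result is deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            closest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[closest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return clusters, centroids


# Toy document vectors (invented): two documents per "theme".
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters, centroids = kmeans(points, centroids=[points[0], points[2]])
# clusters == [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
```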

Lecture 8: Topic Modelling

Topic models are mixed membership clustering models that allow a document to belong to more than one cluster. They are very successful in revealing the topical structure of a document collection – what the documents are about. This lecture introduces the standard topic model, Latent Dirichlet Allocation (LDA), and presents techniques for learning topic models from data using the Gibbs sampling MCMC algorithm.

Teacher: Måns Magnusson
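
For readers who want a feel for Gibbs sampling in LDA, here is a compact sketch of a collapsed Gibbs sampler. It is a teaching illustration under several assumptions: documents are given as lists of tokens, the hyperparameter values `alpha` and `beta` are arbitrary defaults, and the function name `lda_gibbs` is made up. It is not the implementation used in the course.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the final count tables."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]    # tokens per (topic, word)
    topic_total = [0] * n_topics                             # tokens per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


# Toy corpus (invented for this example).
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "match", "coach"],
    ["election", "vote", "party"],
    ["vote", "party", "election", "vote"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

Each row of `doc_topic` counts how many tokens of a document are assigned to each topic, which reflects the mixed membership idea: a document can have tokens in more than one topic.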

Introduction to the project

This lecture kicks off the project part of the course. It presents the formal requirements for the project report and explains how the project is graded. The lecture also features a presentation of projects from previous years, which will hopefully give you ideas for what to do in your own project.

Teacher: Marco Kuhlmann


Page responsible: Marco Kuhlmann
Last updated: 2017-10-11