732A92 Text Mining
Welcome to the course! This lecture provides you with a bird’s-eye view of text mining and explains the course logistics. It also presents some concrete examples of text mining techniques to give you an idea of what’s to come.
- Slides: Course introduction
- Notebook: Some simple examples (Launch on Binder)
- Video available on Stream
- Reading: Zai and Massung (2016), pages 3–13
Detailed information about the organisation and examination of the course is available on this website.
Topic 1: Information retrieval
Information retrieval (IR) is the task of finding relevant documents in large collections of text, and it is the first component of many text mining systems. This lecture introduces you to some of the basic concepts in IR, including two standard models of retrieval: the Boolean retrieval model and the vector space model. You will also learn about standard measures for the evaluation of IR systems.
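To give you a feel for the two retrieval models, here is a minimal sketch in plain Python. It is not taken from the course materials: the toy documents, the function names, and the relevance judgement at the end are all made up for illustration, and the vector space part uses raw term counts rather than the weighted vectors a real system would typically use.

```python
import math
from collections import Counter, defaultdict

# Toy collection, made up for illustration.
DOCS = {
    0: "text mining finds patterns in text",
    1: "information retrieval finds relevant documents",
    2: "mining gold is not the same as text mining",
}


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Boolean retrieval: ids of the documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Vector space retrieval: document ids sorted by similarity to the query."""
    query_vec = Counter(query.lower().split())
    scores = {
        doc_id: cosine(query_vec, Counter(text.lower().split()))
        for doc_id, text in docs.items()
    }
    return sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


index = build_inverted_index(DOCS)
matches = boolean_and(index, "text", "mining")  # {0, 2}
ranking = rank(DOCS, "text mining")             # [0, 2, 1]

# Hypothetical judgement: suppose only document 0 is truly relevant.
precision, recall = precision_recall(matches, {0})  # (0.5, 1.0)
```

Note the difference between the two models: the Boolean query returns an unordered set of matching documents, whereas the vector space model orders all documents by how similar they are to the query.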
Topic 2: Text classification
Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, and learn about some effective classification algorithms: Naive Bayes, logistic regression, and support vector machines. You will also learn how to evaluate text classifiers using standard validation methods.
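As a rough illustration of one of these algorithms, the following sketch implements a multinomial Naive Bayes classifier with add-one smoothing in plain Python. The training sentences, the labels, and the function names are invented for this example; in practice you would most likely use a library implementation and a much larger data set.

```python
import math
from collections import Counter, defaultdict

# Made-up training data for illustration.
TRAIN = [
    ("great film with a great cast", "pos"),
    ("a wonderful and moving story", "pos"),
    ("boring plot and terrible acting", "neg"),
    ("a terrible waste of time", "neg"),
]


def train_naive_bayes(examples):
    """Collect the counts needed for a multinomial Naive Bayes model."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
label_1 = predict(model, "a great story")    # "pos"
label_2 = predict(model, "terrible acting")  # "neg"
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from forcing a class probability to zero.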
Topic 3: Text clustering and topic modelling
Whereas text classification sorts documents into predefined categories, clustering discovers categories by grouping documents into subsets of mutually similar texts. The first half of this lecture surveys algorithms for both flat clustering and hierarchical clustering. The second half covers topic modelling, a form of soft clustering particularly relevant for text mining.
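The idea of grouping mutually similar texts can be sketched with a small agglomerative (bottom-up) clustering procedure. The code below is an illustration under simplifying assumptions, not the method used in the course: documents are compared by the Jaccard overlap of their word sets, clusters are merged by single-link similarity, and merging stops once `k` clusters remain, which turns the hierarchy into a flat clustering. The documents and function names are made up.

```python
# Made-up documents for illustration.
DOCS = [
    "cat sat mat",
    "cat chased mouse",
    "stocks fell market",
    "market rallied stocks",
]


def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_link_clustering(docs, k):
    """Merge the two most similar clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    token_sets = [set(doc.lower().split()) for doc in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document
    while len(clusters) > k:
        best = None  # (similarity, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    jaccard(token_sets[a], token_sets[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = single_link_clustering(DOCS, k=2)  # [[0, 1], [2, 3]]
```

Unlike a classifier, this procedure never sees any labels: the two groups emerge purely from the overlap between the documents.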
Topic 4: Word embeddings
A word embedding is a mapping of words to points in a vector space such that words whose points lie close together are similar in terms of their distributional properties, that is, they tend to occur in similar contexts. This lecture reviews standard models for deriving word embeddings from text (including Google’s word2vec), and discusses the current generation of contextualised word embeddings, which build on deep learning methods.
- Slides: Word embeddings
- Reading: Jurafsky and Martin (2019), chapters 6–7, 9–10
- Reading: Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes (Garg et al., 2018)
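The distributional idea behind word embeddings can be illustrated with simple count-based vectors. Note that this sketch is not word2vec and does not learn dense embeddings: it merely represents each word by the counts of the words that occur near it, and compares these sparse vectors using cosine similarity. The corpus, the window size, and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up corpus for illustration.
CORPUS = [
    "i drink coffee every morning",
    "i drink tea every morning",
    "i drive a car to work",
    "i drive a bus to work",
]


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by the counts of the words within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(CORPUS)
sim_coffee_tea = cosine(vectors["coffee"], vectors["tea"])
sim_coffee_car = cosine(vectors["coffee"], vectors["car"])
```

In this toy corpus, "coffee" and "tea" occur in identical contexts and therefore receive a higher similarity than "coffee" and "car", which share no context words at all.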
Topic 5: Information extraction
Information extraction is the task of identifying named entities and the semantic relations between them in text data. This information can be used to populate structured databases, or to answer questions based on a text. The lecture presents different approaches to information extraction and to linking extracted entities to structured information.
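As a very simple illustration of relation extraction, the sketch below uses a hand-written pattern to find "born in" relations and returns them as triples that could be stored in a database. This is only one possible approach, and a deliberately naive one: the pattern, the relation name `born_in`, and the example text are made up, and the capitalisation-based notion of an entity would fail on most real text.

```python
import re

# A capitalised word sequence serves as a crude stand-in for a named entity.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")


def extract_born_in(text):
    """Return (person, relation, place) triples matched by the pattern."""
    return [
        (match.group("person"), "born_in", match.group("place"))
        for match in BORN_IN.finditer(text)
    ]


# Example text written for this illustration.
TEXT = (
    "Alan Turing was born in London. "
    "Marie Curie was born in Warsaw and later moved to Paris."
)

triples = extract_born_in(TEXT)
# [("Alan Turing", "born_in", "London"), ("Marie Curie", "born_in", "Warsaw")]
```

The pattern misses the relation between Marie Curie and Paris because it only knows one way of expressing one relation, which hints at why more general approaches are needed.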
Page responsible: Marco Kuhlmann
Last updated: 2020-11-02