Number of lectures
1 lecture on Python, 2 lectures on information retrieval, 3 lectures on natural language processing, and 3 lectures on statistical methods for text mining.
PhD students in Statistics, Computer Science and the Engineering sciences.
The course was last given
The overall aim of the course is to provide an introduction to quantitative
analysis of text, with special focus on applying machine learning methods to
text documents. In particular, the student should learn the main steps in
working with text: i) efficient extraction of text, ii) natural language
processing of the text into a form suitable for iii) statistical machine
learning methods, which are subsequently used for iv) text prediction.
After completing the course the student should be able to:
• use basic methods for information extraction and retrieval of textual data
• apply text processing techniques to prepare documents for statistical modelling
• apply relevant machine learning models to analyze textual data and correctly interpret the results
• use machine learning models for text prediction
• evaluate the performance of machine learning models for textual data
Introduction to machine learning or equivalent. At least one course in probability and statistics.
The course consists of lectures, lab exercises and a text mining project. The
lectures are devoted to presentations of concepts and methods. The lab
exercises are devoted to the practical application of text mining tools. In the
project work, the student will get hands-on experience in solving a text mining
problem.
Language of instruction: English.
Introduction and overview of quantitative text analysis and its applications. Information extraction. Web crawling. Information retrieval. Tf-idf. Vector space models. Text preprocessing. Bag of words. N-grams. Sparsity and smoothing for text. Document classification. Sentiment analysis. Model evaluation. Topic models.
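As an informal illustration of two of the topics listed above (bag of words and tf-idf), the following Python sketch computes tf-idf weights for a toy corpus. It is not taken from the course material: the function names `tokenize` and `tf_idf`, the three example documents, and the weighting variant (relative term frequency times log(N/df)) are assumptions chosen for brevity, and other tf-idf variants are in common use.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    # Real preprocessing would also handle punctuation, stop words, etc.
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    # Bag of words: term counts per document, word order discarded.
    bags = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(bags)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Toy corpus, invented for this example.
docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document (here "the") receives weight zero, while a term unique to one document (here "ran") receives the largest weight within that document.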
Text mining project report. Written reports on lab assignments.
Page responsible: Director of Graduate Studies
Last updated: 2012-05-03