732A92 Text Mining
Intended learning outcomes
On completion of the course, you should be able to:
- use basic methods for information extraction and retrieval of textual data
- apply text processing techniques to prepare documents for statistical modelling
- apply relevant machine learning models for analyzing textual data and correctly interpret the results
- use machine learning models for text prediction
- evaluate the performance of machine learning models for textual data
For each learning outcome, there is a set of more specific knowledge requirements that outline what you need to demonstrate in order to earn a certain grade. These knowledge requirements are listed on the Examination page.
The course covers the following content:
- introduction and overview of quantitative text analysis and its applications
- information extraction
- web crawling
- information retrieval (tf-idf, vector space models)
- text preprocessing (bag-of-words, n-grams, sparsity and smoothing for text)
- document classification and sentiment analysis
- topic models
- model evaluation
Teaching and working methods
The course is taught in the form of lectures, lab sessions, and supervision in connection with an individual project. You are also expected to study independently, both individually and in groups. When you plan your time for the course, you should budget approximately:
- 53 hrs to prepare for, attend, and follow up on the lectures
- 27 hrs to prepare for, carry out, and follow up on the labs
- 80 hrs to plan, carry out, and follow up on the project
The course is co-taught with TDDE16 Text Mining at the Faculty of Science and Engineering.
There is no obligatory textbook for the course. Reading consists of individual chapters from the following books:
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Second edition. ACM Press Books, 2011. Relevant chapters will be distributed in class.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
Daniel Jurafsky and James H. Martin. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft chapters of the 3rd edition, November 2016.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. The complete book is available online.
James Pustejovsky and Amber Stubbs. Natural Language Annotation for Machine Learning. A Guide to Corpus-Building for Applications. O’Reilly Media, 2013.
What you can expect from us. We try our best to give you prompt, constructive, and meaningful feedback on how well you meet the knowledge requirements set out for the course. We offer feedback in various forms; you can find detailed information about this on the Examination page. Our focus is on non-examinatory, formative feedback, which you can use to improve your learning (and we can use to improve our teaching!) while the course is ongoing.
What we expect from you. We expect you to familiarise yourself with the knowledge requirements set out for the course, and to actively seek our feedback on how well you meet these requirements. We also expect you to reflect on the feedback that we provide, and to grasp opportunities to put it to good use.
What we expect from you. This webpage is the primary source of information about the course, and we expect you to keep yourself up to date with what we publish here. We also send out information via the University’s email list for the course, and we expect you to subscribe to this list and read your email on a regular basis while the course is ongoing.
What you can expect from us. When you contact us via email, you can expect an answer during standard working hours, 8–17. (We do not respond to email in the evenings or at weekends.) For more personal contact, you can drop by during the examiner’s office hours (Wednesdays 13–17 in Building E, Room 3G.476) or book an appointment.
Page responsible: Marco Kuhlmann
Last updated: 2017-09-22