
732A47 Text Mining

Course information

Course sections

The three introductory modules are meant to give you the necessary background for the rest of the course.
You need to pass two out of the three introductory modules, and you are free to choose which module (if any) to skip.

Course literature
The following books will be used, in parts, during the course:

  • Natural Language Processing with Python.
    This book contains a lot of practical hands-on material using the NLTK toolkit for Python.
    The book's website is here, where the book can be read for free in HTML format. The publisher O'Reilly also sells the book in PDF format.
  • Foundations of Statistical Natural Language Processing.
    This book describes the background theory for computational linguistics and statistical analysis of text data.
    It is available electronically for free here (for LiU students, and probably also for students at most other Swedish universities).
    The book's website is here.
  • Modern Information Retrieval by R Baeza-Yates and B Ribeiro-Neto, Addison-Wesley, 1999.
  • Extra material
Note: We will not order the books for the bookstores on campus. Both books are available from all major online bookstores.

Course Introduction


Introduction to Python Programming

Recommended literature Slides Useful stuff Computer lab

Introduction to Statistical Modeling

Computer lab

Introduction to Computational Linguistics

Computer lab

Data models and Information Retrieval for Textual Data

Recommended literature
  • Chapter 2 in Modern Information Retrieval by R Baeza-Yates and B Ribeiro-Neto, Addison-Wesley, 1999. Copies distributed at lectures.
Computer lab

Statistical Models for Textual Data

Recommended literature Slides Code Computer lab

Text Mining Project

Form: The project should be performed and reported individually.
Extent: The project comprises 3 credit points.
Grading: ECTS scale (A-F) for master's students, Pass/Fail for PhD students.
Examination: Written report + Oral presentation.

Suggested projects
You are encouraged to select your own topic for the project.
Here is a list with some directions for possible projects.
See also the list below with links to some publicly available corpora.

Public corpora
  • UCI Machine learning repository has a collection of text datasets
  • 20 Newsgroups data is a collection of approximately 20,000 newsgroup documents from 20 newsgroups
  • Språkbanken is a collection of Swedish corpora from many different sources, ranging from blogs to August Strindberg's personal letters.
  • Google ngrams - text from millions of books scanned by Google.
  • Wikipedia texts can be downloaded. See also the download instructions here. Maybe only for the truly brave ...

Project presentations from June 3, 2013

Page responsible: Mattias Villani
Last updated: 2014-05-06