732A47 Text Mining
- Introductory modules
- Introduction to Python Programming
- Introduction to Statistical Modeling
- Introduction to Computational Linguistics
You need to pass two out of the three introductory modules, and you are free to choose which module (if any) to skip.
Course literature
The following books will be used, in parts, during the course:
- Natural Language Processing with Python (NLTK).
This book contains a lot of practical hands-on material using the NLTK toolkit for Python.
The book's website is here, where the book can be read for free in HTML format. The publisher O'Reilly also sells the book in PDF format.
- Foundations of Statistical Natural Language Processing (MS).
This book describes the background theory for computational linguistics and statistical analysis of text data.
It is available electronically for free here (for LiU students, and probably also for students at most other Swedish universities).
The book's website is here.
- Modern Information Retrieval (BYRN) by R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley, 1999.
- Extra material
Course Introduction
Lecture 1: Course info. General introduction to text mining. Motivating applications.
Teacher: Mattias Villani
Read: MS 1 and NLTK 1-2 | Slides
Introduction to Python Programming
Lecture 1: Introduction to the Python programming language.
Teacher: Johan Falkenjack
Read: Chapter 4 in NLTK | Chapters 1-13 in Learning to Program Using Python | Cheat sheet | Slides
Other material: Interactive Python web tutorial | Python code visualization | Python tutorial | Codecademy | Infographic on R vs Python
Introduction to Statistical Modeling
Lecture 1: Basic statistics. Regression. Classification.
Teacher: Oleg Sysoev
Read: MS 1 and NLTK 1-2 | Slides
Introduction to Computational Linguistics
Lecture 1: Basic linguistics.
Teacher: Marco Kuhlmann
Read: NLTK Chapters 3,5 and 7 | Slides
Data Models and Information Retrieval for Textual Data
Lecture 1: Basic linguistics.
Teachers: Patrick Lambrix and Zlatan Dragisic
Read: Modern Information Retrieval Chapter 2 (distributed in class) | Slides PDF | Slides Powerpoint
Lab: Lab | Additional instructions
Statistical Models for Textual Data
Teacher: Måns Magnusson
Lecture 1: n-grams. Part-of-speech tagging.
Read: NLTK 5.1-5.2 and MS 2.2 and 6 | Slides
Lecture 2: Document classification.
Read: NLTK 6 | Slides
Code: TM package in R - demo
Other material: Introduction to the tm package in R and slides on using it for classification.
Lecture 3: Topic models.
Read: NLTK Chapters 3,5 and 7 | Article on topic models | Slides
Code: Topic models in R - demo
Lab: Lab. The lab is to be submitted in LISAM, where you also find the submission deadline.
Text Mining Project
Form: The project should be performed and reported individually.
Extent: The project comprises 3 credit points.
Grading: ECTS scale (A-F) for master's students, Pass/Fail for PhD students.
Examination: Written report + Oral presentation.
Deadline for submitting the written report: Jan 17, 2016.
Suggested projects
You are encouraged to select your own topic for the project.
Here is a list with some directions for possible projects.
See also the list below with links to some publicly available corpora.
- UCI Machine learning repository has a collection of text datasets
- 20 Newsgroups data is a collection of approximately 20,000 newsgroup documents from 20 newsgroups
- Språkbanken is a collection of Swedish corpora from many different sources, ranging from blogs to August Strindberg's personal letters.
- Google ngrams - text from millions of books scanned by Google.
- Wikipedia texts can be downloaded. See also the download instructions here. Maybe only for the truly brave ...
Project presentations from previous years
- Are they lovin' it? Sentiment Classification on #mcdonalds. Presentation
- Retweets Presentation
- Movie Reviews Scores Presentation (Ph.D. students)
- Topic Selection of Wikipedia-Articles Presentation
- Twitter sentiment vs. Stock price Presentation
- Twitter_opinions Presentation
Page responsible: Mattias Villani
Last updated: 2015-11-28