732A47 Text Mining
- Introductory modules
- Introduction to Python Programming
- Introduction to Statistical Modeling
- Introduction to Computational Linguistics
You need to pass two out of the three introductory modules, and you are free to choose which module (if any) to skip.
The following books will be used, in parts, during the course:
- Natural Language Processing with Python.
This book contains a lot of practical hands-on material using the NLTK toolkit for Python.
The book's website is here, where the book can be read for free in HTML format. The publisher O'Reilly also sells the book in PDF format.
- Foundations of Statistical Natural Language Processing.
This book describes the background theory for computational linguistics and statistical analysis of text data.
It is available electronically for free here (for LiU students, and probably also for students at most other Swedish universities).
The book's website is here.
- Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
- Extra material
- Chapter 4 in Natural Language Processing with Python
- Chapters 1-13 in Learning to Program Using Python by Cody Jackson.
- Cheat sheet that translates between Matlab, R and Python commands.
- Interactive Python web tutorial
- Python code visualization lets you see what happens at each step of your code.
- Python tutorial from the official Python.org site.
- Chapters 1, 2, 6 and 7 of Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
- Since some of you had problems with the web crawling part of the lab, here is a zip file containing 1400+ app description text files, which is the output of the crawling step.
- Sections 5.1-5.2 and Chapter 6 in Natural Language Processing with Python
- Section 2.2 and Chapter 6 of Foundations of Statistical Natural Language Processing
- Article on probabilistic topic models
- Introduction to the tm package in R and slides on using it for classification.
- Here is the code from my demo of the tm package in R.
- Here is the code from my demo of the topicmodels package in R.
- Slides L1 - 1 per page | Slides L1 - 4 per page
- Slides L2 - 1 per page | Slides L2 - 4 per page
- Slides L3 - 1 per page | Slides L3 - 4 per page
Form: The project should be performed and reported individually.
Extent: The project comprises 3 credit points.
Grading: ECTS scale (A-F) for master's students, Pass/Fail for PhD students.
Examination: Written report + Oral presentation.
Suggested projects: You are encouraged to select your own topic for the project.
Here is a list with some directions for possible projects.
See also the list below with links to some publicly available corpora.
- UCI Machine learning repository has a collection of text datasets
- 20 Newsgroups data is a collection of approximately 20,000 newsgroup documents from 20 newsgroups
- Språkbanken is a collection of Swedish corpora from many different sources, ranging from blogs to August Strindberg's personal letters.
- Google ngrams - text from millions of books scanned by Google.
- Wikipedia texts can be downloaded. See also the download instructions here. Maybe only for the truly brave ...
Page responsible: Mattias Villani
Last updated: 2013-05-11