TDDE16 Text Mining

This page contains the instructions for the lab assignments, as well as general information about how to work on and submit the labs. For more information about the examination of the lab component, see the Examination page.

General information

Lab assignments are done in pairs. This is an explicit exception to the rules on Cheating and plagiarism. Both you and your lab partner must work on every lab and be prepared to be examined on each of them.

Signup: Before the first scheduled lab session, you and your lab partner must sign up in Webreg. The deadline for the sign-up is 2020-11-03. If you do not sign up in Webreg, we will pair you up with a lab partner.

Remote labs: During the pandemic, labs will be supervised remotely via Teams. To this end, you and your lab partner will be assigned your own private channel in your lab assistant’s supervision team. You can use this channel to collaborate and to ask the assistant questions during the scheduled lab sessions. Please make sure that you can use Teams before the first lab.

Submission and grading: Submit the required files through Lisam. Make sure to state the name and LiU-ID of both yourself and your lab partner. Your labs will be graded by one of the lab assistants.

Lab assistants for this course:

  • Hao Chi Kiang
  • Jenny Kunz
  • Riley Capshaw

Feedback: For each lab there are a number of scheduled hours where you can get feedback on your work from the lab assistants. Unless you submit late, you will also get written feedback. In addition, you can always get feedback from the examiner. Book an appointment with the examiner now.

Technical information

This course uses Jupyter notebooks for the lab assignments. To work on these notebooks, you can either use the course’s lab environment, set things up on your own computer, or use an external service such as Colab.

Using the lab environment. The course’s lab environment is available on computers connected to LiU’s Linux system, which includes those in the computer labs in the B-building. To activate the environment and start the Jupyter server, issue the following commands at the terminal prompt. This should open your web browser, where you will be able to select the relevant notebook file.

source /courses/TDDE16/venv/bin/activate
jupyter notebook

Using your own computer or an external service. To work on your own computer, you will need a suitable software stack. To simplify your installation, you can have a look at the files here. Alternatively, you can use an external service such as Colab, which has all of the necessary Python packages pre-installed.

L1: Information retrieval

In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the Google Play Store. From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.
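
To give a feel for retrieval under the vector space model, here is a minimal sketch in plain Python. It is not the lab's reference solution: the tf-idf weighting, the whitespace tokenizer, the function names, and the three toy app descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; the lab may use something better).
    return text.lower().split()

def tfidf_vector(tokens, idf):
    # Sparse vector: term -> raw term frequency times inverse document frequency.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    vectors = [tfidf_vector(toks, idf) for toks in tokenized]
    return idf, vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, docs, idf, vectors, k=3):
    # Rank documents by cosine similarity to the query vector.
    q = tfidf_vector(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [docs[i] for score, i in scored[:k] if score > 0]

# Hypothetical toy collection standing in for the app descriptions.
docs = [
    "puzzle game with colorful blocks",
    "weather forecast and radar maps",
    "track your running and cycling workouts",
]
idf, vectors = build_index(docs)
results = search("weather radar", docs, idf, vectors)
```

With this toy collection, `results` contains only the weather app, since it is the only description that shares a term with the query.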

L2: Text classification

Text classification is the task of sorting text documents into predefined classes. The concrete problem you will be working on in this lab is the classification of texts with respect to their political affiliation. The specific texts you are going to classify are speeches held in the Riksdag, the Swedish national legislature.
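
As an illustration of what a text classifier does, the following sketch implements a small multinomial Naive Bayes classifier in plain Python. The choice of Naive Bayes is an assumption made here and not necessarily the method used in the lab; the labels "A" and "B" and the four example sentences are made up and are not taken from the Riksdag data.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}

def predict_nb(model, text):
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Log prior plus sum of log likelihoods with add-alpha smoothing.
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + alpha) /
                                  (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data with two made-up classes.
texts = [
    "lower taxes for business",
    "cut taxes and reduce spending",
    "invest in public healthcare",
    "more funding for public schools",
]
labels = ["A", "A", "B", "B"]
model = train_nb(texts, labels)
prediction = predict_nb(model, "reduce taxes")  # -> "A"
```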

L3: Text clustering and topic modelling

Text clustering groups documents in such a way that documents within a cluster are more ‘similar’ to one another than to documents in other clusters. In this lab you will experiment with both hard and soft clustering techniques. In the first part you will be using the k-means algorithm; in the second part you will be using a Latent Dirichlet Allocation (LDA) topic model.
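
The k-means algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below shows this loop in plain Python on hypothetical two-dimensional points; in the lab the points would be document vectors. Using the first k points as initial centroids is a simplifying assumption to keep the example deterministic, and real implementations typically use random or smarter initialisation.

```python
def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=10):
    # Assumption: initialise centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignments, centroids

# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
assignments, centroids = kmeans(points, k=2)
# assignments -> [0, 1, 0, 1, 0, 1]
```

Each point ends up in exactly one cluster, which is what makes k-means a hard clustering technique; a topic model such as LDA instead gives each document a distribution over topics.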

L4: Word embeddings

A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. In this lab you will use the word vectors that come with state-of-the-art NLP libraries to find similar words, and evaluate their usefulness in a natural language inference task.
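
Finding similar words usually comes down to ranking the vocabulary by cosine similarity to a target word's vector. The sketch below does this in plain Python. The three-dimensional vectors are invented for illustration; real embeddings from an NLP library have hundreds of dimensions, and the function name `most_similar` is an assumption rather than a reference to any particular library's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, n=2):
    # Rank all other words by cosine similarity to the target word.
    target = vectors[word]
    scored = [(cosine(target, vec), w) for w, vec in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:n]]

# Hypothetical toy embeddings (made-up numbers, not from a trained model).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}
nearest_to_cat = most_similar("cat", toy_vectors, n=1)  # -> ["dog"]
```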

L5: Information extraction

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks of IE: named entity recognition (identifying mentions of entities) and entity linking (matching these mentions to entities in a knowledge base).
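
To make the two sub-tasks concrete, the following sketch finds mentions by looking up known aliases in the text and links each mention to a knowledge base entry. This dictionary lookup is a deliberate simplification and not the approach of the lab, where mention detection would normally be done by a trained model. The alias table and the identifiers such as `kb:liu` are hypothetical.

```python
import re

# Hypothetical alias table: lower-cased surface form -> knowledge base id.
KB_ALIASES = {
    "linköping university": "kb:liu",
    "liu": "kb:liu",
    "sweden": "kb:sweden",
}

def link_entities(text, aliases):
    # Try longer aliases first so that multi-word names win over their parts.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = "|".join(re.escape(a) for a in ordered)
    regex = re.compile(r"\b(" + pattern + r")\b", flags=re.IGNORECASE)
    # Each match is a mention; the alias table gives the linked entity.
    return [(m.group(0), aliases[m.group(0).lower()]) for m in regex.finditer(text)]

links = link_entities("LiU is short for Linköping University in Sweden.", KB_ALIASES)
# links -> [("LiU", "kb:liu"), ("Linköping University", "kb:liu"), ("Sweden", "kb:sweden")]
```

Note that two different mentions, "LiU" and "Linköping University", are linked to the same entity, which is the point of entity linking.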

Page responsible: Marco Kuhlmann
Last updated: 2020-11-02