TDDE16 Text Mining

Labs

This course website is no longer being maintained. Please refer to Lisam for HT2023.

This page contains the instructions for the lab assignments, as well as general information about how to work on and how to submit labs. For more information about the examination of the lab component, see the Examination page.

General information

Lab assignments are done in pairs. This is an explicit exception to the rules on Cheating and plagiarism. Before submitting your first lab, you and your lab partner must sign up in Webreg. If you cannot find a lab partner, let us know before the first lab session, and we will pair you up with a random student.

Submission and grading: Submit the required files through Lisam. Your lab assistant will grade your labs within 5 working days.

Feedback: For each lab, there are a number of scheduled hours where you can get feedback on your work from the lab assistants. Unless you submit late, you will also get written feedback. In addition, you can always get feedback from the examiner by booking an appointment.

Technical information

This course uses Jupyter notebooks for the lab assignments. To work on these notebooks, you can use the course’s lab environment, set things up on your personal computer, or use an external service such as Colab.

Using the lab environment. The course’s lab environment is available on computers connected to LiU’s Linux system, including those in the B-building computer labs. To activate the environment and start the Jupyter server, issue the following commands at the terminal prompt. This should open your web browser, where you can select the relevant notebook file.

source /courses/TDDE16/venv/bin/activate
jupyter notebook

L1: Information retrieval

In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the Google Play Store. From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.
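
To make the idea of matching under the vector space model concrete, here is a minimal sketch in plain Python. It is not the lab's reference solution: the three app descriptions are invented, the helper names (`tokenize`, `build_index`, `search`) are hypothetical, and the tf-idf weighting (raw term frequency times `log(N/df) + 1`) is only one of several common variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict: term -> term frequency * idf.
    # Terms that never occurred in the collection are ignored.
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def build_index(docs):
    # Compute idf weights and one tf-idf vector per document.
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, idf, vectors, k=3):
    # Rank documents by cosine similarity to the query vector.
    q = tfidf_vector(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in scored[:k] if score > 0.0]


# Invented toy collection, for illustration only.
docs = [
    "puzzle game with colourful candy levels",
    "track your running and cycling workouts",
    "photo editor with filters and stickers",
]
idf, vectors = build_index(docs)
for description, score in search("running workouts", docs, idf, vectors):
    print(f"{score:.3f}  {description}")
```

Documents that share no term with the query get a similarity of zero and are dropped from the result list.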

Lab L1: Information retrieval (due 2022-11-08)
Complete lab on GitHub (including data)

L2: Text classification

Text classification is the task of sorting text documents into predefined classes. The concrete problem you will be working on in this lab is the classification of texts with respect to their political affiliation. The specific texts you are going to classify are speeches held in the Riksdag, the Swedish national legislature.
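
As an illustration of what sorting texts into predefined classes can look like in code, the following sketch implements a small multinomial Naive Bayes classifier in plain Python. Naive Bayes is one possible choice of classifier here, not necessarily the method used in the lab. The training sentences and the labels `party_a` and `party_b` are invented, and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, doc):
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus the sum of add-alpha smoothed log likelihoods.
        score = math.log(class_counts[label] / n_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in doc.lower().split():
            if word in vocab:  # words unseen in training are skipped
                score += math.log((word_counts[label][word] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, for illustration only.
train_docs = [
    "lower taxes for business",
    "cut taxes and support business",
    "more funding for public schools",
    "invest in schools and healthcare",
]
train_labels = ["party_a", "party_a", "party_b", "party_b"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "taxes on business"))
print(predict_nb(model, "funding for schools"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for longer texts such as speeches.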

Lab L2: Text classification (due 2022-11-15)
Complete lab on GitHub (including data)

L3: Text clustering and topic modelling

Text clustering groups documents in such a way that documents within a cluster are more ‘similar’ to each other than to documents outside the cluster. In this lab you will experiment with both hard and soft clustering techniques. In the first part you will be using the k-means algorithm; in the second part you will be using a Latent Dirichlet Allocation (LDA) topic model.
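
The hard-clustering part can be sketched in a few lines of plain Python. The example below runs k-means on bag-of-words count vectors. The four documents are invented, the helper names are hypothetical, and the initial centroids are picked by hand to keep the run deterministic, which a real setup would typically not do. The LDA part is not covered by this sketch.

```python
def vectorize(docs):
    # Bag-of-words count vectors over a sorted vocabulary.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = [[d.lower().split().count(w) for w in vocab] for d in docs]
    return vocab, vectors


def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(points, centroids):
    # Hard assignment: each point goes to its nearest centroid.
    return [
        min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    centroids = [list(map(float, c)) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New centroid is the component-wise mean of its members.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return assign(points, centroids), centroids


# Invented toy documents, for illustration only.
docs = [
    "goal match team",
    "team match win",
    "vote election party",
    "party election vote campaign",
]
vocab, vectors = vectorize(docs)
labels, centroids = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)
```

Each document ends up in exactly one cluster, which is what makes this a hard clustering; a topic model such as LDA instead gives each document a distribution over topics.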

Lab L3: Topic modelling (due 2022-11-22)
Complete lab on GitHub (including data)

L4: Word embeddings

A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. In this lab you will use the word vectors that come with state-of-the-art NLP libraries to find similar words, and evaluate their usefulness in a natural language inference task.
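
Finding similar words then amounts to a nearest-neighbour search in the vector space. The sketch below uses cosine similarity, a common choice for word vectors. The three-dimensional vectors are made up for illustration and do not come from any real library or model, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, k=2):
    # Rank all other words by cosine similarity to the target word.
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [(other, score) for score, other in scored[:k]]


# Made-up toy vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.90],
    "pear": [0.15, 0.10, 0.95],
}

for other, score in most_similar("king", toy_vectors):
    print(f"{other}: {score:.3f}")
```

Cosine similarity compares the direction of two vectors and ignores their length, so words with differently scaled vectors can still count as close neighbours.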

Lab L4: Word embeddings (due 2022-11-29)
Complete lab on GitHub (including data)

L5: Information extraction

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, named entity recognition (identifying mentions of entities) and entity linking (matching these mentions to entities in a knowledge base).
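
The division of labour between the two sub-tasks can be sketched as follows. This is a deliberately crude illustration, not the approach taken in the lab: mentions are detected with a capitalisation heuristic instead of a trained model, and linking is an exact lookup in a tiny alias table. The knowledge base, its identifiers (`E1`, `E2`), and the function names are all invented.

```python
import re

# Invented toy knowledge base: lowercased alias -> entity identifier.
KNOWLEDGE_BASE = {
    "riksdag": "E1",
    "swedish parliament": "E1",
    "stockholm": "E2",
}

# Heuristic: a mention is a run of one or more capitalised words.
MENTION_PATTERN = re.compile(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)*")


def find_mentions(text):
    # Named entity recognition step (heuristic stand-in).
    return [m.group(0) for m in MENTION_PATTERN.finditer(text)]


def link_entities(text, kb=KNOWLEDGE_BASE):
    # Entity linking step: map each mention to a knowledge base entry,
    # or to None if no alias matches.
    return [(mention, kb.get(mention.lower())) for mention in find_mentions(text)]


for mention, entity in link_entities("Yesterday the Riksdag met in Stockholm."):
    print(mention, "->", entity)
```

The sentence-initial word is picked up as a mention but links to nothing, which shows why recognition and linking are worth evaluating separately: an error in either step produces a wrong or missing link.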

Lab L5: Information extraction (due 2022-12-06)
Complete lab on GitHub (including data)

Diagnostic test

At the end of the lab series, you take a short diagnostic test. The purpose of this test is to assess your ability to analyse and summarise the results of the experiments that you ran in the labs (learning outcome 2).

Format: The diagnostic test is an oral test taken on campus and lasts approximately 10 minutes. It is assessed by two teachers. You will get three questions sampled from a pool of questions about the labs. These may include questions related to your lab solutions and the feedback you received on your solutions.

Preparations: To prepare for the test, you should review the reflection questions in the labs, your own solutions, and the feedback you received from your lab assistant.

You need to book a time! To take the diagnostic test, you need to book a time in advance. Pick a time from one of the following Doodles. Please do not use the ‘If need be’ option!

Page responsible: Marco Kuhlmann
Last updated: 2022-10-26

IDA - Department of Computer and Information Science