732A92 Text Mining


This page contains detailed information about the project module. For general information about the examination of this module, see the Examination page.

Slides from the project kick-off (2020-12-07)

Overview

The main purpose of the project module is to provide you with an opportunity to apply the text mining techniques covered in the course to real-world problems. You will also have the opportunity to deepen the knowledge that you have acquired in the lab module.

Project structure

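Like other applied projects, the project can be structured into four phases as follows:

  • Phase 1: Identify your problem
  • Phase 2: Design your approach
  • Phase 3: Evaluate your approach
  • Phase 4: Produce your report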

The project runs over the full course, but most of the work is concentrated in the second half, including the self-study period between Christmas and Epiphany. When you plan your time for the project, you should budget approximately 88 hours. A rough breakdown of this time over the four phases could look as follows:

  • Phase 1: 8 hours (W45–W49)
  • Phase 2: 32 hours (W50–W51)
  • Phase 3: 32 hours (W52–W1)
  • Phase 4: 16 hours (W2)

Feedback

At all times during the course, you can get personalised feedback on your project work from the examiner. Book an appointment with the examiner now.

Examination

The project is examined by a written report. Detailed information about the suggested structure, formal requirements, and assessment criteria for the report is available in the Instructions for the project report.

Phase 1: Identify your problem

In this phase, you identify a text mining problem that you want to work on.

A problem can be either a specific task (‘build a system for sentiment classification of movie reviews’) or a limited-scale research question to be answered (‘which text segmentation strategy yields the most coherent topic models?’).

Tips for this phase

  • A good way to identify a problem is to start with a data set that you find interesting. There are many websites that host public data sets (and associated software). One widely known such website is Kaggle.

  • You could also have a look at the project abstracts from previous years: 2019, 2018, 2017. It is perfectly acceptable to do a project that is based on a previous one. Perhaps you can find a better approach?

  • In previous years, several students have used their Text Mining project to test an idea for a thesis project. Have a look at CareerGate or Thesis opportunities at IDA to find interesting thesis proposals (and project ideas).

Phase 2: Design your approach

In this phase, you design an approach for solving the problem that you identified in Phase 1. This typically means

  • to select a text data set (either ready-made or self-compiled)
  • to write code to process the data (typically based on existing libraries, such as the ones used in the labs)
  • to choose a method for evaluating your results (evaluation measures, baselines)

To get ideas and points of comparison for your approach, you should review previous work related to your problem. Browsing the University Library and the Internet, you will find a lot of material both in the form of primary sources (research articles) and secondary sources (tutorials, teaching materials). Two well-known, specialised repositories for relevant research articles are:

At the end of this phase, you should draft the Data and Method sections of the written report.

Tips for this phase

  • Think about how you will evaluate your approach. How will you know whether your system does what it is supposed to do? What arguments can you give to motivate your conclusions? Think about these questions before implementing your approach.

  • Do not get lost in code. While coding can give many technical insights, to pass this module you will have to show that you can not only apply the techniques covered in the course, but also interpret the results obtained with them.

  • Collect references. When using and discussing ideas, code, or text from others, you must appropriately cite your sources. It is a good idea to collect proper references right from the start.

Phase 3: Evaluate your approach

In this phase, you evaluate your approach, typically by running relevant experiments and interpreting the results.

Running experiments on large data sets can take a lot of time. Think about what exactly you want to get out of an experiment before starting it. It is okay to be selective about which experiments you run (for example, you do not need to do extensive hyperparameter tuning), as long as you discuss the limitations of your results in your report.

When evaluating your approach, you should remember that most evaluation methods are relative in nature: You can never know that your solution performs well, you can only know that it performs better or worse than some other solution. A proper evaluation requires suitable baselines and points of comparison (perhaps from related work).
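To make the idea of a relative evaluation concrete, here is a minimal sketch that compares a system against a majority-class baseline using accuracy. The labels and the system predictions are invented toy data, and the function names are hypothetical; your own project will likely use a different evaluation measure and a stronger baseline.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items where the prediction matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_test


# Toy labels, invented for illustration only.
train_labels = ["pos", "pos", "pos", "neg"]
test_gold = ["pos", "neg", "pos", "neg", "pos"]
system_predictions = ["pos", "neg", "pos", "pos", "pos"]  # hypothetical system output

baseline_predictions = majority_baseline(train_labels, len(test_gold))
baseline_acc = accuracy(test_gold, baseline_predictions)
system_acc = accuracy(test_gold, system_predictions)
print(f"baseline: {baseline_acc:.2f}, system: {system_acc:.2f}")
```

A system score only becomes meaningful next to the baseline score: an accuracy that looks high in isolation may be no better than always predicting the most frequent class.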

At the end of this phase, you should draft the Results section.

Tips for this phase

  • Plan ahead. Many experiments can run overnight, but to really save time on that, you need to plan them carefully. Before starting a long computation, check that it will run without errors by doing a pilot on a smaller sample of the data.

  • Services such as Kaggle and Colab offer cloud-based Jupyter notebook environments and powerful computing resources. This can be an excellent alternative to running experiments on your own computer.

  • Document your work. Before moving on, it is a good idea to write down the main results of each experiment, as well as a sketch of your interpretation of these results. You can use the text when you finalise your report in Phase 4.
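The pilot run suggested in the tips above could be sketched as follows. This is only an illustration under stated assumptions: the corpus is invented, `run_experiment` is a hypothetical stand-in for your actual experiment, and the sample size and seed are arbitrary choices.

```python
import random


def pilot_sample(documents, k, seed=0):
    """Return a reproducible random sample of at most k documents."""
    rng = random.Random(seed)  # fixed seed so the pilot can be repeated
    return rng.sample(documents, min(k, len(documents)))


def run_experiment(documents):
    """Hypothetical stand-in for a real experiment: average token count."""
    return sum(len(doc.split()) for doc in documents) / len(documents)


# Invented toy corpus; replace with your own data loading.
corpus = [f"document number {i} with some text" for i in range(1000)]

pilot = pilot_sample(corpus, k=20)
print("pilot result:", run_experiment(pilot))  # quick check that the code runs

# Only after the pilot succeeds, start the long computation:
# full_result = run_experiment(corpus)
```

Fixing the random seed means that a failing pilot can be reproduced exactly while you debug it.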

Phase 4: Produce your report

In this phase, you finalise and submit your report.

The suggested structure for the report consists of the following sections: Introduction, Theory, Data, Method, Results, Discussion, Conclusion. In the previous phases you should already have produced drafts of the sections on Data, Method, and Results.

The purpose of the Introduction is to introduce the task or research question that you have addressed in your project. What did you do, and why did you do it? The Theory section should briefly present relevant theoretical background for your project, in particular machine learning models that were not covered in the course (if any).

Most of your work should go into the Discussion and the Conclusion. In the Discussion you present your analysis of the results that you obtained, discuss the possibilities and limitations of your approach, and compare your study to related work. The Conclusion should build on the Discussion and answer the question of what new knowledge you take away from your project.

Submission

When submitting your report, you must follow the same instructions as for other hand-in assignments in the course. Please read the Rules for hand-in assignments.

Before submission, you should check whether your report meets the formal requirements set out in the Instructions for the project report.

Instructions: Submit the following files:

  • your report as a PDF document, named as follows: 732A92-2020-PRA1-your LiU-ID.pdf
  • a plain text file with your project’s abstract (at most 200 words), named as follows: 732A92-2020-PRA1-your LiU-ID.txt
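If you want to double-check the file names and the abstract length before submitting, a small script along the following lines may help. It is a sketch, not an official checker: the LiU-ID `abcde123` is a made-up placeholder, the function names are hypothetical, and counting whitespace-separated tokens is an assumption about how the 200-word limit is measured.

```python
MAX_ABSTRACT_WORDS = 200  # limit stated in the submission instructions


def submission_filenames(liu_id):
    """Build the expected PDF and TXT file names for a given LiU-ID."""
    stem = f"732A92-2020-PRA1-{liu_id}"
    return f"{stem}.pdf", f"{stem}.txt"


def count_words(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def abstract_ok(text):
    """Check that the abstract does not exceed the word limit."""
    return count_words(text) <= MAX_ABSTRACT_WORDS


# Example with a made-up LiU-ID and a made-up abstract.
pdf_name, txt_name = submission_filenames("abcde123")
abstract = "This project compares two approaches to sentiment classification."
print(pdf_name, txt_name, count_words(abstract), abstract_ok(abstract))
```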

Due date: 2021-01-16

Feedback and examination: The examiner will assess your project report according to the assessment criteria described in the Instructions for the project report. You can read these criteria to get an idea of what the examiner will be looking for in your report. In addition, you can always get personalised feedback from the examiner. Book an appointment with the examiner now.


Page responsible: Marco Kuhlmann
Last updated: 2020-11-02