
TDDE09 Natural Language Processing


General information

Lab assignments should be done in pairs. Please contact the examiner in case you want to work on your own. Unfortunately, we do not always have the resources necessary to tutor and give feedback on one-person labs.

Instructions: Submit your labs according to the instructions below. Please also read the general rules for hand-in assignments. Before you submit your first lab, you and your lab partner need to sign up in Webreg.

Format of the subject line: TDDE09-2018 lab code your LiU-ID your partner’s LiU-ID your lab assistant’s LiU-ID

Example: TDDE09-2018 L1 marjo123 erika456 fooba99

Feedback: For each lab there are a number of scheduled hours where you can get oral feedback on your work from the lab assistants. If you submit in time for the first due date, you will also get written feedback. In addition, you can always get feedback from the examiner (office hours: Wednesdays 13-17 in Building E, Room 3G.476).

Topic 0: Text segmentation

Text segmentation is the task of segmenting a text into linguistically meaningful units, such as paragraphs, sentences, or words.

Level A

When the target units of text segmentation are words or word-like units, the process is called tokenisation. In this lab you will implement a simple tokeniser for text extracted from Wikipedia articles. The lab also gives you a chance to acquaint yourself with the general framework that we will be using for the remainder of the lab series.

Lab L0: Text segmentation (due 2017-01-20)

Concepts
  • tokenisation
  • undersegmentation, oversegmentation
  • precision, recall
Procedures
  • segment text into tokens (words and word-like units) using regular expressions
  • compare an automatic tokenisation with a gold standard
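The two procedures above, regex-based tokenisation and comparison against a gold standard, can be sketched as follows. This is a minimal illustration: the token pattern and the function names are assumptions, not the lab's reference solution.

```python
import re

# A deliberately simple token pattern: a run of word characters, or a
# single character that is neither a word character nor whitespace
# (i.e. a punctuation mark).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenise(text):
    """Segment a text into word-like tokens using a regular expression."""
    return TOKEN_RE.findall(text)

def spans(tokens, text):
    """Map a token sequence back to (start, end) character offsets."""
    result, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)
        pos = start + len(token)
        result.append((start, pos))
    return result

def precision_recall(predicted, gold):
    """Token-level precision and recall over character spans: a predicted
    token counts as correct only if the gold standard contains a token
    with exactly the same span."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    return tp / len(predicted), tp / len(gold)
```

Oversegmentation shows up as lowered precision (extra predicted tokens), undersegmentation as lowered recall (gold tokens with no exact match).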

Topic 1: Text classification

Text classification is the task of categorising text documents into predefined classes.

Level A

In this lab you will implement two simple text classifiers: the Naive Bayes classifier and the averaged perceptron classifier. You will evaluate these classifiers using accuracy, and experiment with different document representations. The concrete task that you will be working with is to classify movie reviews as either positive or negative.

Lab L1: Text classification (due 2017-01-27)

Concepts
  • accuracy
  • Naive Bayes classifier
  • averaged perceptron classifier
Procedures
  • evaluate a text classifier based on accuracy
  • learn a Naive Bayes classifier from data
  • implement an averaged perceptron classifier
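The Naive Bayes part of the lab can be illustrated by a textbook learner with add-alpha smoothing. This is a sketch under assumed function names; the lab's actual interface may differ.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate Naive Bayes parameters from (token list, label) pairs,
    using add-alpha smoothing for the word likelihoods."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    vocab = {w for doc in docs for w in doc}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        loglik[c] = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                     for w in vocab}
    return prior, loglik

def predict_nb(prior, loglik, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in doc:
            if w in loglik[c]:
                score += loglik[c][w]
        scores[c] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.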

Level B

In this lab you will implement the missing parts of a third text classifier, the maximum entropy classifier. Your main focus will be on converting the data into the matrix format required by standard gradient search optimisers. You will evaluate your implementation on the same task as in the Level A lab.

Lab L1X: Maximum entropy classification (due 2017-02-03, extended from 2017-01-27), skeleton code

Concepts
  • accuracy
  • maximum entropy classifier (advanced)
Procedures
  • evaluate a text classifier based on accuracy
  • implement the core parts of a maximum entropy classifier (advanced)
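The data conversion that this lab centres on can be sketched as building a document–term count matrix and an integer label vector, the dense format that gradient-based optimisers expect. The names below are illustrative; the skeleton code defines its own interface.

```python
import numpy as np

def to_matrices(docs, labels):
    """Convert tokenised documents and class labels into a count matrix X
    (one row per document, one column per vocabulary word) and a label
    vector y (one integer class index per document)."""
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    classes = {c: i for i, c in enumerate(sorted(set(labels)))}
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc:
            X[row, vocab[w]] += 1
    y = np.array([classes[c] for c in labels])
    return X, y, vocab, classes
```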

Topic 2: Language modelling

Language modelling is about building models of which words are more or less likely to occur in some language.

Level A

In this lab you will experiment with n-gram models. You will test various parameters that influence these models' quality and estimate models using maximum likelihood estimation with additive smoothing. The data set that you will be working with is Arthur Conan Doyle's novels about Sherlock Holmes.

Lab L2: Language modelling (due 2017-02-03)

Concepts
  • n-gram model
  • entropy
  • additive smoothing
Procedures
  • estimate n-gram probabilities using the Maximum Likelihood method
  • estimate n-gram probabilities using additive smoothing
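The two estimation procedures above can be sketched for the bigram case; the lab itself works with general n-grams, and the function name here is an assumption.

```python
from collections import Counter

def bigram_prob(tokens, alpha=0.0):
    """Return a bigram probability function P(w2 | w1). With alpha=0 this
    is the maximum likelihood estimate; with alpha > 0 it applies
    additive (add-alpha) smoothing over the vocabulary."""
    unigrams = Counter(tokens[:-1])          # histories
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(set(tokens))                     # vocabulary size
    def p(w1, w2):
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
    return p
```

Additive smoothing reserves probability mass for unseen bigrams: without it, any sentence containing an unseen bigram gets probability zero (and infinite entropy).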

Level C

Most younger users type Chinese on a standard QWERTY keyboard by entering the pronunciation of each character in Pinyin. This became possible with the advent of language models that automatically guess the intended character. In this assignment, you will write a program that reads in pronunciations (simulating what a user might type) and predicts the correct Chinese characters.

Lab L2X: Chinese character prediction (due 2017-02-03), data

Concepts
  • n-gram model
  • Witten–Bell smoothing
Procedures
  • implement an n-gram model with Witten–Bell smoothing
  • implement the core parts of an autocompletion algorithm
  • evaluate a language model in the context of an autocompletion application
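The Witten–Bell part can be sketched in its interpolated bigram form: the weight given to the lower-order (unigram) model is proportional to the number of distinct words T(h) observed after the history h. This is a sketch of the textbook formula, not the lab's skeleton.

```python
from collections import Counter

def witten_bell(tokens):
    """Return an interpolated Witten-Bell bigram estimate P(w2 | w1)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])                      # history counts c(h)
    T = Counter(w1 for (w1, _w2) in bigrams)         # distinct continuations T(h)
    uni = Counter(tokens)
    N = len(tokens)
    def p(w1, w2):
        backoff = uni[w2] / N                        # unigram estimate
        if hist[w1] == 0:
            return backoff                           # unseen history: back off fully
        return (bigrams[(w1, w2)] + T[w1] * backoff) / (hist[w1] + T[w1])
    return p
```

Unlike additive smoothing, the amount of smoothing here adapts to the data: histories followed by many different words give more mass to the backoff distribution.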

Topic 3: Part-of-speech tagging

Part-of-speech tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb.

Level A

In this lab you will implement a part-of-speech tagger based on the averaged perceptron and evaluate it on the Stockholm Umeå Corpus (SUC), a Swedish corpus containing more than 74,000 sentences (1.1 million tokens) that were manually annotated with, among other things, parts of speech.

Lab L3: Part-of-speech tagging (due 2017-02-10)

Concepts
  • sequence labelling
  • averaged perceptron classifier
Procedures
  • implement a part-of-speech tagger based on the averaged perceptron
  • evaluate a part-of-speech tagger based on accuracy
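The learning core of the tagger can be sketched as a multi-class averaged perceptron, where the final weights are the average of the weight vector over all updates, computed lazily with per-feature timestamps. This is an illustrative sketch under assumed names, not the lab skeleton.

```python
from collections import defaultdict

class AveragedPerceptron:
    """A minimal multi-class averaged perceptron over sparse binary features."""

    def __init__(self):
        self.w = defaultdict(float)      # current weights, keyed by (class, feature)
        self.acc = defaultdict(float)    # accumulated weights for averaging
        self.stamp = defaultdict(int)    # time of last update per key
        self.t = 0                       # number of update calls so far

    def score(self, features, cls):
        return sum(self.w[(cls, f)] for f in features)

    def predict(self, features, classes):
        return max(classes, key=lambda c: self.score(features, c))

    def _tick(self, key, delta):
        # Credit the old weight for the interval it was in effect, then update.
        self.acc[key] += (self.t - self.stamp[key]) * self.w[key]
        self.stamp[key] = self.t
        self.w[key] += delta

    def update(self, features, gold, pred):
        self.t += 1
        if pred != gold:
            for f in features:
                self._tick((gold, f), +1.0)
                self._tick((pred, f), -1.0)

    def average(self):
        """Return the weights averaged over all updates."""
        for key in list(self.w):
            self.acc[key] += (self.t - self.stamp[key]) * self.w[key]
            self.stamp[key] = self.t
        return {k: v / max(self.t, 1) for k, v in self.acc.items()}
```

Averaging dampens the effect of the last few updates and typically improves tagging accuracy over the plain perceptron.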

Level B

In the advanced part of this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case the part-of-speech tagger that you implemented in the Level A lab.

Lab L3X: Feature engineering for part-of-speech tagging (due 2017-02-10) (same file as for the Level A lab)

Concepts
  • averaged perceptron classifier
  • feature engineering (advanced)
Procedures
  • improve a part-of-speech tagger using feature engineering (advanced)
  • evaluate a part-of-speech tagger based on accuracy

Topic 4: Syntactic analysis

Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure.

Level A

In this lab you will implement a simple transition-based dependency parser based on the averaged perceptron and evaluate it on the English Web Treebank from the Universal Dependencies Project.

Lab L4: Syntactic analysis (due 2017-02-17)

Concepts
  • averaged perceptron classifier
  • transition-based dependency parsing
  • accuracy, attachment score
Procedures
  • implement a transition-based dependency parser based on the averaged perceptron
  • evaluate a dependency parser based on unlabelled attachment score
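The evaluation procedure above can be sketched directly. Here heads are represented as a list with one entry per word, an assumed encoding; the lab's data format may differ.

```python
def uas(pred_heads, gold_heads):
    """Unlabelled attachment score: the fraction of words whose predicted
    head matches the gold-standard head. heads[i] is the position of the
    head of word i+1, with 0 denoting the artificial root."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)
```

Labelled attachment score (LAS) is defined analogously, but also requires the dependency label on the arc to be correct.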

Level C

In this lab you will implement the Eisner algorithm for projective dependency parsing and use it to transform (possibly non-projective) dependency trees to projective trees. This transformation is necessary to be able to apply algorithms for projective dependency parsing to treebanks that may contain non-projective trees.

Lab L4X: Projectivisation (due 2017-02-17)

Concepts
  • Eisner algorithm
  • projectivisation, lifting (advanced)
Procedures
  • implement the Eisner algorithm (advanced)
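A useful companion to this lab is a test for projectivity itself: a tree is projective exactly when no two arcs cross. The quadratic check below is an illustrative helper, not part of the Eisner algorithm the lab asks for.

```python
def is_projective(heads):
    """Check whether a dependency tree is projective. heads[i] is the head
    of word i+1 (1-based positions; 0 denotes the artificial root)."""
    # Normalise each arc to an interval (left endpoint, right endpoint).
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (i, j) in arcs:
        for (k, l) in arcs:
            if i < k < j < l:    # arcs (i, j) and (k, l) cross
                return False
    return True
```

Projectivisation lifts the heads of non-projective arcs until this check succeeds, which is what makes projective parsing algorithms applicable to the full treebank.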

Topic 5: Semantic analysis

These labs focus on word space models and semantic similarity.

Level A

In this lab you will explore a word space model trained on the Swedish Wikipedia using Google’s word2vec tool. You will learn how to use the model to measure the semantic similarity between words and apply it to solve a simple word analogy task.

Lab L5: Semantic analysis (due 2017-02-24)

Concepts
  • word space model, cosine distance, semantic similarity
  • accuracy
Procedures
  • use a pre-trained word space model to measure the semantic similarity between two words
  • use a pre-trained word space model to solve word analogy tasks
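The two procedures above can be sketched with cosine similarity and the standard vector-offset analogy method ("a is to b as c is to ?"). The toy vectors in the usage below are made up for illustration; the lab uses a real word2vec model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' by returning the word whose vector is
    closest (by cosine similarity) to b - a + c. The query words themselves
    are excluded from the candidates, as is standard for this task."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```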

Level B

In this lab you will use state-of-the-art NLP libraries to train word space models on text data and evaluate them on a standard task, the synonym part of the Test of English as a Foreign Language (TOEFL).

Lab L5X: Semantic analysis (due 2017-02-24)

Concepts
  • word space model, cosine distance, semantic similarity
  • accuracy
Procedures
  • train a word space model on text data (advanced)
  • use a word space model to solve a synonym prediction task (advanced)

Reflection paper

After having completed all labs, you are asked to write an individual reflection paper. The purpose of this assignment is to give you an opportunity to think about what you have learned from the lab assignments. The paper should have three parts:

  • a summary of the content of the labs (in your own words)
  • reflections on the knowledge and skills that you have developed or trained
  • reflections on the collaboration in your lab group

Questions that you may want to discuss include the following:

  • What have you learned by doing the labs?
  • Which connections do you see between the labs and the other parts of the course?
  • Which connections do you see between the labs and other courses on the programme?
  • Which of the knowledge and skills that you have developed or trained may be most relevant for you in the future?
  • Which parts of the labs did you find the most interesting? Which were less interesting?
  • How did you and your lab partner complement each other?

Instructions: Write a paper addressing the above questions. The length of your paper should be around 1,000 words (approximately 2 pages). Submit your paper as a PDF document. Due date: 2017-03-03

Format of the subject line: TDDE09-2018 LR your LiU-ID marku61

Example: TDDE09-2018 LR marjo123 marku61

Feedback: You will get feedback on your paper from the examiner.


Page responsible: Marco Kuhlmann
Last updated: 2017-09-22