TDDE09 Natural Language Processing
Lab assignments are done in pairs. This is an explicit exception to the rules on Cheating and plagiarism. Before submitting your first lab, you and your lab partner must sign up in Webreg. If you cannot find a lab partner, let us know before the first lab session, and we will pair you up with a random student.
Submission and grading: Submit the required files through Lisam. Your lab assistant will grade your labs within 5 working days.
Feedback: For each lab, there are a number of scheduled hours where you can get feedback on your work from the lab assistants. Unless you submit late, you will also get written feedback. In addition, you can always get feedback from the examiner; to do so, book an appointment.
This course uses Jupyter notebooks for the lab assignments. To work on these notebooks, you can use the course’s lab environment, set things up on your personal computer, or use an external service such as Colab.
Using the lab environment. The course’s lab environment is available on computers connected to LiU’s Linux system, including those in the B-building computer labs. To activate the environment and start the Jupyter server, issue the following commands at the terminal prompt. This should open your web browser, where you can select the relevant notebook file.
source /courses/TDDE09/venv/bin/activate
jupyter notebook
Lab 0: Introduction to PyTorch
The purpose of this preparatory lab is to introduce you to the basics of PyTorch, the deep learning framework we use for the lab series. Many good introductions to PyTorch are available online, including the 60 Minute Blitz on the official PyTorch website. For this course we have put together a notebook that focuses on those basics that you will encounter in the labs.
Lab 1: Word representations
To process words using neural networks, we need to represent them as vectors of numerical values.
In this lab you will implement the skip-gram model with negative sampling (SGNS) from Lecture 1.4, and use it to train word embeddings on the text of the Simple English Wikipedia.
Lab L1: Word representations (due 2023-01-27)
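As a rough illustration of the objective behind skip-gram with negative sampling, the following sketch computes the loss for a single target word, one observed context word, and a handful of sampled negative words. It is a minimal sketch under stated assumptions, not the lab's reference solution: the function names (`sgns_loss`, `dot`, `sigmoid`) are hypothetical, and plain Python lists stand in for the PyTorch tensors used in the lab.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(target_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (target, context) pair.

    All names are hypothetical; vectors are plain lists of floats.
    """
    # Observed pair: push sigmoid(target . context) towards 1.
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    # Sampled negatives: push sigmoid(target . negative) towards 0.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss
```

The loss is small when the target vector is similar to the observed context vector and dissimilar to the sampled negatives.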
In Lecture 1.3 you learned about the CBOW classifier. This classifier is easy to implement in PyTorch with its automatic differentiation magic, but it is also easy to forget what is going on under the hood. Your task in this lab is to implement the CBOW classifier without any magic, using only a library for vector operations (NumPy).
Lab L1X: Under the hood of the CBOW classifier (due 2023-01-27)
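To give a feel for what "under the hood" means here, the sketch below shows one possible forward pass of a bag-of-words classifier: average the embeddings of the input words, apply a linear layer, and normalise with a softmax. This is an assumption about the general shape of such a model, not the lab's specification. The names `cbow_forward`, `embeddings`, `weights`, and `bias` are hypothetical, and plain Python lists stand in for NumPy arrays to keep the example dependency-free.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(word_ids, embeddings, weights, bias):
    """Return class probabilities for a bag of word IDs.

    embeddings: one vector per word; weights: one row per class.
    All names are hypothetical.
    """
    dim = len(embeddings[0])
    # Average the embedding vectors of the input words.
    mean = [sum(embeddings[i][d] for i in word_ids) / len(word_ids)
            for d in range(dim)]
    # Linear layer: one score per class.
    scores = [sum(w * x for w, x in zip(row, mean)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)
```

The backward pass, which the lab asks you to derive and implement yourself, is deliberately left out.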
Lab 2: Language modelling
Language modelling is about building models of what words are more or less likely to occur in some language.
In this lab you will implement and train two neural language models: the fixed-window model from Lecture 2.3 and the recurrent neural network model from Lecture 2.5. You will evaluate these models by computing their perplexity on a benchmark dataset for language modelling, the WikiText dataset.
Lab L2: Language modelling (due 2023-02-03)
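Perplexity, the evaluation measure mentioned above, is the exponential of the average negative log-probability that a model assigns to the tokens of a test text. A minimal sketch, assuming the per-token probabilities are already available as a list (the function name `perplexity` is chosen for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each test token."""
    n = len(token_probs)
    # Average negative log-likelihood per token (natural logarithm).
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)
```

A model that assigns probability 1/4 to every token has a perplexity of 4; lower values indicate a better fit to the test data.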
While the neural models that you have seen in the base lab define the state of the art in language modelling, they require substantial computational resources. Where these are not available, the older generation of probabilistic language models can make a strong baseline. Your task in this lab is to evaluate one of these models on the WikiText dataset.
Lab L2X: Interpolated n-gram models (due 2023-02-03)
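The idea of interpolation can be sketched with the simplest case, a bigram model mixed with a unigram model using a fixed weight. This is an illustrative assumption only: the lab may use a different n-gram order or a different scheme for choosing the weights, and the names `train_counts`, `interpolated_prob`, and `lam` are hypothetical.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram, history, and bigram counts from a token list."""
    unigrams = Counter(tokens)
    # Count each token as a history only where it has a successor.
    histories = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, histories, bigrams

def interpolated_prob(word, prev, unigrams, histories, bigrams, lam=0.7):
    """Linear interpolation: lam * P(word | prev) + (1 - lam) * P(word)."""
    p_uni = unigrams[word] / sum(unigrams.values())
    # Fall back to 0 for the bigram estimate if the history was never seen.
    p_bi = bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni
```

Because the unigram term is non-zero for every word seen in training, the interpolated model assigns non-zero probability to bigrams that never occurred in the training data.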
Lab 3: Large language models
The labs in this unit introduce you to the large language models that power the recent advances in natural language processing. These models are based on the Transformer architecture, whose core is formed by a mechanism called attention.
In this lab you will implement a simple encoder–decoder architecture for machine translation, including the extension of this architecture by an attention mechanism. You will evaluate this architecture on a parallel German–English dataset.
Lab L3: Attention (due 2023-02-10)
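At its core, an attention mechanism scores each encoder state against the current decoder state, turns the scores into weights with a softmax, and returns the weighted sum of the encoder states. The sketch below uses plain dot-product scoring as an assumption; the lab may use a different scoring function. The names `attention`, `query`, `keys`, and `values` are chosen for illustration, and plain Python lists stand in for PyTorch tensors.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Dot-product attention for a single query vector."""
    # One score per key: similarity between the query and that key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: attention-weighted sum of the value vectors.
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return context, weights
```

In an encoder–decoder model, `keys` and `values` would typically both be the encoder's hidden states, and `query` the decoder's current hidden state.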
One of the main selling points of pre-trained language models is that they can be applied to a wide spectrum of different tasks in natural language processing. In this lab you will test this by fine-tuning a pre-trained BERT model on a benchmark task in natural language inference.
Lab L3X: BERT for Natural Language Inference (due 2023-02-10)
Lab 4: Sequence labelling
Sequence labelling is the task of assigning a class label to each item in an input sequence. The labs in this unit will focus on the task of part-of-speech tagging.
In this lab you will implement a simple part-of-speech tagger based on the fixed-window architecture, and evaluate this tagger on the English treebank from the Universal Dependencies Project, a corpus containing more than 16,000 sentences (254,000 tokens) annotated with, among others, parts of speech.
Lab L4: Part-of-speech tagging (due 2023-02-17)
In the advanced part of this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case a part-of-speech tagger based on the averaged perceptron.
Lab L4X: Feature engineering for part-of-speech tagging (due 2023-02-17)
Lab 5: Syntactic analysis
Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure.
Lab L5: Dependency parsing (due 2023-02-24)
In this lab you will implement the Eisner algorithm for projective dependency parsing and use it to transform (possibly non-projective) dependency trees to projective trees. This transformation is necessary to be able to apply algorithms for projective dependency parsing to treebanks that may contain non-projective trees.
Lab L5X: Projectivization (due 2023-02-24)
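To make the notion of projectivity concrete, the following sketch tests whether a dependency tree contains crossing arcs. It only checks the property; it does not implement the Eisner algorithm or the transformation itself. The representation is an assumption: `heads[i]` is the position of the head of token `i`, position 0 is a root pseudo-node, and the function name `is_projective` is hypothetical.

```python
def is_projective(heads):
    """Return True if no two arcs of the tree cross.

    heads[i] is the head of token i; index 0 is the root pseudo-node,
    whose own entry is ignored.
    """
    # Represent each arc by its left and right endpoint.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if d != 0]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross if exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if a < c < b < d:
                return False
    return True
```

For example, `is_projective([0, 2, 0, 2])` holds, whereas `[0, 3, 0, 2, 2]` describes a tree in which the arc between tokens 1 and 3 crosses the arc from the root to token 2.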
Page responsible: Marco Kuhlmann
Last updated: 2022-12-31