729A27 Natural Language Processing
Lab assignments are done in pairs. Before submitting your first lab, you and your lab partner will have to sign up in Webreg. If you have not signed up at the end of the first week of the course, we will pair you up with a lab partner.
Remote labs: Labs will be supervised remotely via Teams. To this end, you and your lab partner will be assigned your own private channel in your lab assistant’s supervision team. You can use this channel to collaborate and to ask the assistant questions during the scheduled lab sessions. Please make sure that you have access to your channel before the first lab session.
Submission and grading: Submit the required files through Lisam. As your group name, use the name of your group in Teams. Example: Ehsan-12. Your labs will be graded by your lab assistant.
Feedback: For each lab there are a number of scheduled hours during which you can get feedback on your work from the lab assistants. Unless you submit late, you will also get written feedback. In addition, you can always get feedback from the examiner by booking an appointment.
This course uses Jupyter notebooks for the lab assignments. To work on these notebooks, you can either use the course’s lab environment, set things up on your own computer, or use an external service such as Colab.
Using the lab environment. The course’s lab environment is available on computers connected to LiU’s Linux system, which includes those in the computer labs in the B-building. To activate the environment and start the Jupyter server, issue the following commands at the terminal prompt. This should open your web browser, where you will be able to select the relevant notebook file.
source /courses/729A27/labs/environment/bin/activate
jupyter notebook
Using your own computer or an external service. To work on your own computer, you will need a suitable software stack. To simplify your installation, you can have a look at the files here. Alternatively, you can use an external service such as Colab, which has all of the necessary Python packages pre-installed.
Introduction to PyTorch
The purpose of this preparatory lab is to introduce you to the basics of PyTorch, the deep learning framework that we will be using for the lab series. Many good introductions to PyTorch are available online, including the 60 Minute Blitz on the official PyTorch website. This notebook is designed to put focus on those basics that you will encounter in the labs.
Unit 1: Word representations
To process words using neural networks, we need to represent them as vectors of numerical values.
In this lab you will implement the skip-gram model with negative sampling (SGNS) from Lecture 1.4, and use it to train word embeddings on the text of the Simple English Wikipedia.
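As a rough illustration of what SGNS involves, the sketch below extracts (target, context) pairs from a token list and computes the usual negative-sampling loss for one pair. It is a minimal sketch in plain Python, not the lab's reference solution: the function names `skipgram_pairs` and `sgns_loss`, the window size, and the list-based vectors are assumptions made for this example, and the formulation in Lecture 1.4 may differ in its details.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(target_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one positive pair and a list of sampled negatives."""
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

pairs = skipgram_pairs(["a", "b", "c"], window=1)
zero_loss = sgns_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
```

With all-zero vectors every sigmoid evaluates to 0.5, so the loss for one positive and two negative examples is 3 · log 2.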
Lab L1: Word representations (due 2022-01-28)
In Lecture 1.3 you learned about the CBOW classifier. This classifier is easy to implement in PyTorch with its automatic differentiation magic, but it is also easy to forget what is going on under the hood. Your task in this lab is to implement the CBOW classifier without any magic, using only a library for vector operations (NumPy).
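To give a feel for what "without any magic" means, here is a hedged sketch of a forward pass only, assuming the common formulation of CBOW as "average the word embeddings, apply a linear layer, take the softmax". The names `cbow_forward`, `E`, `W` and `b`, as well as the toy dimensions, are made up for this example; the gradients, which are the actual point of the lab, are not shown.

```python
import numpy as np

def cbow_forward(word_ids, E, W, b):
    """Return class probabilities for a text given as a list of word ids."""
    h = E[word_ids].mean(axis=0)   # average of the word embeddings
    z = h @ W + b                  # one score (logit) per class
    z = z - z.max()                # shift for a numerically stable softmax
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))   # toy setup: 10 words, 4-dimensional embeddings
W = rng.normal(size=(4, 3))    # toy setup: 3 classes
b = np.zeros(3)
probs = cbow_forward([1, 5, 7], E, W, b)
```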
Lab L1X: Under the hood of the CBOW classifier (due 2022-01-28)
Unit 2: Language modelling
Language modelling is about building models of which words are more or less likely to occur in some language.
In this lab you will implement and train two neural language models: the fixed-window model from Lecture 2.3 and the recurrent neural network model from Lecture 2.5. You will evaluate these models by computing their perplexity on a benchmark dataset for language modelling, the WikiText dataset.
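Perplexity is the exponential of the average negative log-probability that a model assigns to the tokens of a text. The following minimal sketch assumes that you already have the model's probability for each token; the function name `perplexity` is chosen for this example only.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each token of a text."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure between 4 words at every position
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that always spreads its probability uniformly over 4 words has perplexity 4, which is why perplexity is often read as an effective branching factor.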
Lab L2: Language modelling (due 2022-02-04)
While the neural models that you have seen in the base lab define the state of the art in language modelling, they require substantial computational resources. Where these are not available, the older generation of probabilistic language models can make a strong baseline. Your task in this lab is to evaluate one of these models on the WikiText dataset.
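One common probabilistic baseline interpolates n-gram estimates of different orders. The sketch below mixes a bigram and a unigram maximum-likelihood estimate with a fixed weight. It illustrates the general idea only, under stated assumptions: the names `train_counts` and `interpolated_prob` and the weight `lam=0.7` are made up for this example, and the model used in the lab may use higher orders or tuned weights.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram counts, bigram counts, and counts of bigram contexts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])   # how often each word occurs as a history
    return unigrams, bigrams, contexts

def interpolated_prob(word, prev, counts, lam=0.7):
    """P(word | prev) as lam * bigram estimate + (1 - lam) * unigram estimate."""
    unigrams, bigrams, contexts = counts
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

counts = train_counts("the cat sat on the mat".split())
p = interpolated_prob("cat", "the", counts)
```

In the toy corpus, "cat" follows "the" in one of two cases and makes up one of six tokens, so the interpolated estimate is 0.7 · 1/2 + 0.3 · 1/6 = 0.4.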
Lab L2X: Interpolated n-gram models (due 2022-02-04)
Unit 3: Sequence labelling
Sequence labelling is the task of assigning a class label to each item in an input sequence. The labs in this unit will focus on the task of part-of-speech tagging.
In this lab you will implement a simple part-of-speech tagger based on the fixed-window architecture, and evaluate this tagger on the English treebank from the Universal Dependencies Project, a corpus containing more than 16,000 sentences (254,000 tokens) annotated with, among others, parts of speech.
Lab L3: Part-of-speech tagging (due 2022-02-11)
In the advanced part of this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system – in this case a part-of-speech tagger based on the averaged perceptron.
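To make the idea of feature engineering concrete, here is a sketch of a feature extraction function for one position in a sentence. The specific features (lower-cased word, suffix, neighbouring words, previous tag, capitalization, digits) are typical choices, not the ones prescribed by the lab, and the name `extract_features` and the `<bos>`/`<eos>` markers are assumptions made for this example.

```python
def extract_features(words, i, prev_tag):
    """Return a set of string-valued features for the word at position i."""
    word = words[i]
    feats = {
        "word=" + word.lower(),
        "suffix3=" + word[-3:].lower(),
        "prev_tag=" + prev_tag,
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<bos>"),
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "<eos>"),
    }
    if word[0].isupper():
        feats.add("is_capitalized")
    if any(ch.isdigit() for ch in word):
        feats.add("has_digit")
    return feats

feats = extract_features(["She", "runs", "fast"], 1, "PRON")
```

A perceptron-based tagger would then score each candidate tag by summing the weights of the (feature, tag) combinations that are active at that position.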
Lab L3X: Feature engineering for part-of-speech tagging (due 2022-02-11)
Unit 4: Syntactic analysis
Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure.
Lab L4: Syntactic analysis (due 2022-02-18)
In this lab you will implement the Eisner algorithm for projective dependency parsing and use it to transform (possibly non-projective) dependency trees to projective trees. This transformation is necessary to be able to apply algorithms for projective dependency parsing to treebanks that may contain non-projective trees.
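The Eisner algorithm itself is too long to sketch here, but the notion of projectivity can be illustrated briefly. The check below assumes a dependency tree encoded as a list `heads`, where `heads[d]` is the head of token `d` and position 0 is an artificial root; with this encoding, a tree is projective exactly when no two arcs cross. The encoding and the name `is_projective` are assumptions made for this example.

```python
def is_projective(heads):
    """Check that no two arcs cross; heads[0] is a placeholder for the artificial root."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if d > 0]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:   # the arc (c, d) starts inside (a, b) and ends outside
                return False
    return True

projective_tree = [0, 2, 0, 2]         # token 2 is the root word, heading tokens 1 and 3
non_projective_tree = [0, 3, 0, 2, 2]  # the arc 3 -> 1 crosses the arc root -> 2
```

This quadratic check is meant for clarity rather than speed.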
Lab L4X: Projectivization (due 2022-02-18)
Unit 5: Machine translation
In this lab you will implement a simple encoder–decoder architecture for machine translation, including the extension of this architecture by an attention mechanism, and you will evaluate this architecture on a parallel German–English dataset.
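The core of an attention mechanism can be sketched in a few lines: score each encoder state against the current decoder state, normalize the scores with a softmax, and return the weighted average of the encoder states. The sketch below uses dot-product scoring on plain Python lists. This is an assumption made for illustration; the lab may use a different scoring function, and the names `softmax` and `attend` are made up for this example.

```python
import math

def softmax(scores):
    m = max(scores)                               # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return attention weights and the context vector for one decoder state."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# A zero query scores every encoder state equally, so the weights are uniform
weights, context = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```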
Lab L5: Machine translation (due 2022-02-25)
One of the main selling points of pre-trained language models is that they can be applied to a wide spectrum of different tasks in natural language processing. In this lab you will test this by fine-tuning a pre-trained BERT model on a benchmark task in natural language inference.
Lab L5X: BERT for Natural Language Inference (due 2022-02-25)
Page responsible: Marco Kuhlmann
Last updated: 2022-01-13