729A27 Natural Language Processing
The main purpose of the project is to give you an opportunity to show that you can seek, assess, and use scientific information within the area of NLP (learning objective 4). You will also have the opportunity to deepen the knowledge that you have acquired in the course.
The project should be carried out in groups of approximately 6 students and will center around a concrete task: You will build a syntactic parser that can map sequences of words to dependency trees, and train and evaluate this parser on the treebanks released by the Universal Dependencies project. The minimal project looks as follows:
- Implement a complete tagger–parser pipeline based on labs 3 and 4
- Modify this baseline system by implementing a method described in the NLP literature
- Evaluate your system on the Universal Dependencies treebanks
- Draw conclusions about the usefulness of the chosen method
Simple projects will make one or a few minor modifications to the baseline system. More advanced projects will be more varied and/or implement substantial changes, such as the use of a different machine learning method or parsing algorithm. In any case, the focus should be on the implementation of methods described in the NLP literature.
The project runs W3–W10, but most of the work is concentrated during the project week in W9. When you plan your time for the project, you should calculate approximately 48 hrs per group member, that is, 288 hrs for a group with 6 members.
When you plan your time for the project week (W9), you should calculate approximately 16 hrs per group member, that is, 96 hrs in total for a group with 6 members.
While the choice of topic is completely up to your group, the form of the project is rather rigid. In particular, throughout the project you will have to submit a number of deliverables (D1–D6), which are designed to keep you on track and to give you feedback on your progress. The rest of this page contains detailed information about these deliverables.
D1: Group contract
During W3 and W4 you will form your project group. We encourage you to form groups that include students with different backgrounds, skills, and interests, as this can improve the quality of the project. Since you will be building your system on some of the labs, it makes sense to form the project group out of lab groups.
After formation, your group is required to make a group contract that will govern your collaboration. The contract should spell out those behaviours that you expect of all group members, as well as procedures for resolving impasses in the group. Specific questions to think about include:
- How will we communicate with each other? At what times?
- How often and where will we meet?
- How will we make sure that our meetings are productive?
- What will we do if somebody does not show up at a meeting?
- What will we do if somebody breaks any rule set out in this contract?
Instructions: Make a group contract and have it signed by all members of the group. Submit the signed contract as a PDF document. (Please re-read the instructions for the submission of hand-in assignments before doing so.) Due date: 2017-01-27
Format of the subject line: NLP-2017 D1 marku61
Upon receiving your group contract, we will assign your group a group ID that you should use in future submissions (see below).
D2: Pre-project paper
Before starting with the actual project, you should try to get a good understanding of what a syntactic parser actually is, and why such a program could be useful. At this early stage you should focus on high-level aspects rather than algorithmic details; these will be covered in the sections on part-of-speech tagging and parsing. We ask you to read the following:
- Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source (Google Research Blog, 2016-05-12)
- Universal Dependencies v1: A Multilingual Treebank Collection (research article, LREC 2016)
You should also skim the homepage of the Universal Dependencies (UD) project, which contains the technical documentation for the data sets that you will be using in the project, and links to UD-related publications.
We ask you to summarise and discuss your findings in a pre-project paper. In this paper you should address the following questions:
- What is a syntactic parser, and for what applications is it useful?
- Why is parsing so hard for computers to get right?
- What role does the Universal Dependencies project play in parser development?
- What is a realistic expectation of what we can achieve with our project?
We encourage you to discuss these questions in your group in order to get feedback and align your views of your project. However, please note that your paper should be your own.
Instructions: Write a paper addressing the above questions. The length of your paper should be around 1,000 words (approximately 2 pages). Submit your paper as a PDF document. Due date: 2017-02-03
Format of the subject line: NLP-2017 D2 your LiU-ID marku61
Example: NLP-2017 D2 marjo123 marku61
Feedback and examination: You will get feedback on your paper from the examiner, who will assess it according to the criteria spelled out in the Project Rubric. This assessment will contribute to your grade for the project component of the examination.
D3: Baseline system
During W6–W8 the task for your group is to implement and evaluate the baseline system. This system should realise a simple pipeline architecture with the following components:
- a part-of-speech tagger (which you will partially implement in lab 3)
- a transition-based dependency parser (which you will partially implement in lab 4)
- code to read and output dependency trees in the CoNLL-U format
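As an illustration of the third component, here is a minimal sketch of reading and writing CoNLL-U data. It assumes the ten tab-separated columns of the CoNLL-U format and keeps only the word form, the universal part-of-speech tag, the head, and the dependency relation. The function names `read_conllu` and `format_conllu` are hypothetical and not part of the labs; your own code may need to preserve more columns.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, tag, head, deprel) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # comment lines carry metadata only
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():         # skip IDs such as "1-2" or "1.1"
            continue
        head = int(cols[6]) if cols[6] != "_" else None
        sentence.append((cols[1], cols[3], head, cols[7]))
    if sentence:                          # file may lack a final blank line
        yield sentence


def format_conllu(sentence):
    """Render one sentence in CoNLL-U, using "_" for columns we do not track."""
    rows = []
    for i, (form, tag, head, deprel) in enumerate(sentence, start=1):
        head_str = "_" if head is None else str(head)
        rows.append("\t".join(
            [str(i), form, "_", tag, "_", "_", head_str, deprel, "_", "_"]))
    return "\n".join(rows) + "\n\n"


SAMPLE = (
    "# text = I sleep\n"
    "1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(SAMPLE.splitlines()))
```

Skipping lines whose ID is not a plain integer is a simplification: it drops multiword token ranges, so the output of `format_conllu` will not reproduce them.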
You should also write code to train and evaluate your system on any given Universal Dependencies treebank. Your code should report tagging accuracy and unlabelled attachment score.
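Both evaluation measures are per-token accuracies: tagging accuracy is the share of tokens with the correct tag, and unlabelled attachment score (UAS) is the share of tokens with the correct head. The following sketch computes either one from parallel gold and predicted sequences. The helper name `token_accuracy` is hypothetical, and the sketch counts all tokens, including punctuation, which is an assumption you may want to revisit.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted value equals the gold value.

    Each argument is a list of sentences; each sentence is a list with one
    value per token (a tag for tagging accuracy, a head index for UAS).
    """
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(gold, predicted):
            correct += (g == p)
            total += 1
    return correct / total if total else 0.0


# Tagging accuracy: compare tag sequences.
gold_tags = [["PRON", "VERB"], ["DET", "NOUN"]]
pred_tags = [["PRON", "VERB"], ["DET", "VERB"]]
tagging_acc = token_accuracy(gold_tags, pred_tags)

# UAS: compare head indices (0 denotes the artificial root).
gold_heads = [[2, 0], [2, 0]]
pred_heads = [[2, 0], [0, 1]]
uas = token_accuracy(gold_heads, pred_heads)
```

In this toy example three of four tags and two of four heads are correct.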
Some of the Universal Dependencies treebanks contain non-projective trees. To train on these treebanks, you will first have to projectivize them. For this you can use the following Python script (contains usage instructions): projectivize.py
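To find out whether a treebank needs projectivization at all, you can count its non-projective trees. The sketch below uses a common characterisation: a tree is projective if no two of its arcs cross when the artificial root is placed at position 0. The function `is_projective` is a hypothetical helper and is independent of projectivize.py.

```python
def is_projective(heads):
    """Check a dependency tree for crossing arcs.

    heads[i - 1] is the head of token i; 0 denotes the artificial root,
    which is assumed to sit to the left of the first token.
    """
    # Represent every arc by its left and right end points.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross if exactly one end point of the second arc
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True


projective_tree = [2, 0, 2]         # tokens 1 and 3 attach to token 2
non_projective_tree = [3, 4, 0, 3]  # arcs 3->1 and 4->2 cross
```

The quadratic loop is fine for treebank sentences; it favours readability over speed.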
Here are some suggestions for this phase of the project:
- Work in parallel. The different components can largely be developed independently.
- Present your labs to each other and let the group that appears most confident about a certain component implement it.
- Take notes of any ideas that you come up with for how the baseline system could be improved.
- Prepare a couple of slides that present the baseline system. You can later modify them to present your final system.
Instructions: Submit an email containing the following: (a) the tagging accuracy and unlabelled attachment score for your baseline system when trained on the training sections and evaluated on the development sections of the English and the Swedish treebank, (b) a link to a GitLab repository containing your code, and (c) instructions for how to replicate your results using your code. Due date: 2017-02-24
Format of the subject line: NLP-2017 D3 your group ID marku61
Example: NLP-2017 D3 G1 marku61
D4: Modified system
During the project week (W9), your task is to try to modify your baseline system by implementing a method described in the NLP literature. There are many different things that you could try. Here are some suggestions, roughly sorted from simple to complex:
- Try to improve the accuracy of the baseline system on a specific treebank by adding new features.
- Add support for parsing to labelled trees, where each dependency arc is labelled with a grammatical function such as subject.
- Implement the arc-hybrid system and a dynamic oracle for choosing the best possible transition in a given configuration.
- Add support for parsing to non-projective trees by implementing a transition system with a swapping operation.
- Replace the greedy search in the baseline system with a beam search.
- Replace the averaged perceptron in the baseline system with a neural network.
- Replace the transition-based dependency parser with a dynamic programming parser based on the Eisner algorithm.
- Apply your parser to an extrinsic task such as information extraction, and evaluate its performance.
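To give a flavour of one of these suggestions, here is a minimal sketch of the three transitions of the arc-hybrid system: shift, left-arc (the first buffer word becomes the head of the topmost stack word), and right-arc (the second-topmost stack word becomes the head of the topmost stack word). It covers the transitions only, not the dynamic oracle. All names are hypothetical, and the sketch assumes that the artificial root 0 starts on the stack; conventions for placing the root differ between papers, so check the article you follow.

```python
SHIFT, LEFT_ARC, RIGHT_ARC = "SH", "LA", "RA"


def initial_config(n):
    """Stack holds the root 0, buffer holds words 1..n, no heads assigned."""
    return [0], list(range(1, n + 1)), [None] * (n + 1)


def valid_moves(stack, buffer):
    """Return the transitions that are permissible in this configuration."""
    moves = []
    if buffer:
        moves.append(SHIFT)
        if stack and stack[-1] != 0:   # the root must not get a head
            moves.append(LEFT_ARC)
    if len(stack) >= 2:
        moves.append(RIGHT_ARC)
    return moves


def apply_move(move, stack, buffer, heads):
    """Apply one transition in place."""
    if move == SHIFT:
        stack.append(buffer.pop(0))
    elif move == LEFT_ARC:
        dependent = stack.pop()
        heads[dependent] = buffer[0]   # head is the first buffer word
    elif move == RIGHT_ARC:
        dependent = stack.pop()
        heads[dependent] = stack[-1]   # head is the new stack top
    else:
        raise ValueError(f"unknown move: {move}")


# Parse the two-word sentence "I sleep" with a hand-written move sequence.
stack, buffer, heads = initial_config(2)
for move in [SHIFT, LEFT_ARC, SHIFT, RIGHT_ARC]:
    assert move in valid_moves(stack, buffer)
    apply_move(move, stack, buffer, heads)
```

After the four moves, word 1 is attached to word 2 and word 2 to the root. In a full parser the hand-written sequence would be replaced by the predictions of a trained classifier.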
In order to implement any of these suggestions, you will first need to find relevant articles from the NLP literature. Most of the papers in the field are available for free via the ACL Anthology.
You should evaluate your modified system with respect to tagging accuracy and attachment score and compare it with the baseline system. You may also find it useful to include other evaluation measures. However, note that the purpose of the project is not to develop a highly accurate system, but to demonstrate that you can implement and evaluate a method described in the scientific literature.
At the end of the project week, you should write a short abstract for your project. The abstract should summarise what you have done in the project, as well as your main results. The purpose of the abstract is to announce your presentation ahead of the ‘mini-conference’ that will take place in W10.
Instructions: Submit an email containing the following: (a) a short abstract of your project (max. 200 words), and (b) a link to a GitLab repository containing your code. Due date: 2017-03-03
Format of the subject line: NLP-2017 D4 your group ID marku61
Example: NLP-2017 D4 G1 marku61
Feedback and examination: You can get feedback on your project plan from the examiner, for example by coming to his office hours, or by scheduling a separate meeting with him. This feedback will give you an idea of the degree to which your project meets the project-related assessment criteria in the Project Rubric.
D5: Project presentation
In the week following the project week (W10), your group will present your project at the course’s ‘mini-conference’. You are allotted a 15-minute time slot for this presentation. You are free to choose the presentation’s content and structure, but you should bear in mind that the presentation needs to be understandable to everybody in the course.
In preparing the presentation, you may want to consider the following questions:
- What have you done in this project? What method did you evaluate?
- Why have you chosen this particular project?
- How did you find background information for your project?
- What are your experimental results?
- What are your conclusions regarding the usefulness of the implemented method?
Instructions: Present your project, following the instructions above. Date of the presentation: 2017-03-07, 10–12. The exact schedule for the mini-conference will be announced at the beginning of W10.
Feedback and examination: The examiner will assess your presentation according to the criteria spelled out in the Project Rubric. This assessment will contribute to your grade for the project component of the examination. At the same time, the feedback will be useful to you when preparing your post-project paper.
What if your group cannot present at the mini-conference? The group presentation can be replaced by a written report in which your group presents your project. Please contact the examiner for details about this assignment.
D6: Post-project paper
The final project-related assignment is an individual paper in which you reflect on the project and draw conclusions from it. The form of this paper should be a coherent essay that assesses the usefulness of the method that you have investigated in your project, taking into account the potential and the limitations of the project itself. When writing this paper, you can think about the following questions:
- What is a syntactic parser, and for what applications is it useful?
- What was your group’s project about? What method did you evaluate?
- What experimental results did you get?
- Based on your group’s project, how would you judge the usefulness of the method you evaluated?
- To what extent did your group’s project help you to accomplish learning objective 4?
- What other knowledge and skills did you develop or train during the project?
The paper should be self-contained and should not assume that the reader has read your pre-project paper. You are allowed to recycle text from your pre-project paper and your project abstract if you find it appropriate.
Instructions: Write a paper according to the above specification. Make sure to take into account the feedback that you got on both your pre-project paper and your group’s presentation. The length of your paper should be around 1,500 words (approximately 3 pages). Submit your paper as a PDF document. Due date: 2017-03-17
Format of the subject line: NLP-2017 D6 your LiU-ID marku61
Example: NLP-2017 D6 marjo123 marku61
Examination: The examiner will assess your paper according to the criteria spelled out in the Project Rubric. This assessment will contribute to your grade for the project component of the examination.
Page responsible: Marco Kuhlmann
Last updated: 2017-09-22