
Corpora

This page presents spoken and written corpora collected and maintained by NLPLAB.

GES - Gold Standard for English-Swedish Europarl data

An English-Swedish word alignment reference corpus. Sentences are taken from Europarl v.2. There is a development set with 972 sentence pairs. Its alignment contains ordinary word links and null links.

The test set is a reference word alignment with 192 sentences. Its alignment contains sure, possible and null links. Annotation was done independently by two people and the resulting alignments were combined into one.

> Contact: Maria Holmqvist
> Available for download

LinES - Linköping English-Swedish Parallel Treebank

LinES is a parallel treebank built on the basis of LTC, the Linköping Translation Corpus. LinES has been developed as part of the project "Linguistic micro- and macro analysis of a translation corpus", financed by the Swedish Research Council 2004-2005. LinES currently contains 4000 sentence pairs distributed over six different sub-corpora.

> Contact: Lars Ahrenberg

Gold standards for automatic summarisers

We have been using a version of the pyramid method to create gold standards for evaluating automatic text summarization techniques in the domain of governmental texts. The texts are in Swedish.

> Link to the gold standards (In Swedish)

> Contact: Arne Jönsson

Cars and Travel

This corpus is a collection of human-computer dialogs in natural language collected by means of Wizard of Oz techniques. They cover a number of different applications for which natural-language interfaces have been assumed to be suitable: information systems, expert systems and booking systems. The main part of the corpus consists of 60 dialogs covering two different domains (second-hand cars and tourist travel), two different systems (information system and booking system) and two different scenarios, one in which the user knew the background system was operated by a human, and one in which the user was told that she was interacting with the background system directly.

> Contact: Arne Jönsson
> Read more..

Linköping Translation Corpus

This corpus consists of five different user manuals for computer systems, each comprising the English source text and its Swedish translation. In addition, there are two pieces of fiction and some other texts. The corpus is aligned at sentence level and some 500 sentences from each sub-corpus have been tagged for parts of speech. The corpus has been collected and annotated as part of a project on translation aids.

Parts of the corpus are available for research purposes, while other parts are not available without permission from the copyright holders.

Parts of the corpus are now being integrated in the corpus of the PLUG project, a joint effort with the department of linguistics at Uppsala University and the department of Swedish language at Göteborg University.

> Contact: Magnus Merkel

BirdQuest

Human-machine dialogues between the BirdQuest dialogue system and human users. Both spoken and written corpora are available.

> Contact: Arne Jönsson

Östgötatrafiken

This is a collection of recordings and transcriptions from the information help-desk of the regional bus traffic company. The corpus is not publicly available.

> Contact: Arne Jönsson


Page responsible: Lars Ahrenberg
Last updated: 2013-02-20