Corpora
This page presents spoken and written corpora collected and maintained by NLPLAB.
GES - Gold Standard for English-Swedish Europarl data
An English-Swedish word alignment reference corpus. Sentences are taken from Europarl v.2. The development set contains 972 sentence pairs; its alignment contains ordinary word links and null links. The test set is a reference word alignment of 192 sentences; its alignment contains sure, possible and null links. Annotation was done independently by two people and the resulting alignments were combined into one.
LinES - Linköping English-Swedish Parallel Treebank
LinES is a parallel treebank built on the basis of LTC, the Linköping Translation Corpus. LinES has been developed as part of the project "Linguistic micro- and macro analysis of a translation corpus", financed by the Swedish Research Council 2004-2005. LinES currently contains 2400 sentence pairs distributed over four sub-corpora.
Cars and Travel
This corpus is a collection of human-computer dialogs in natural language collected by means of Wizard of Oz techniques. They cover a number of different applications for which natural-language interfaces have been assumed to be suitable: information systems, expert systems and booking systems. The main part of the corpus consists of 60 dialogs covering two different domains (second-hand cars and tourist travel), two different systems (information system and booking system) and two different scenarios, one in which the user knew the background system was operated by a human, and one in which the user was told that she was interacting with the background system directly.
Linköping Translation Corpus
This corpus consists of five different user manuals for computer systems, with the English source texts and their Swedish translations. In addition there are two pieces of fiction and some other texts. The corpus is aligned at sentence level and some 500 sentences from each sub-corpus have been tagged for parts of speech. The corpus has been collected and annotated as part of a project on translation aids. Parts of the corpus are available for research purposes, while other parts are not available without permission from the copyright holders. Parts of the corpus are now being integrated into the corpus of the PLUG project, a joint effort with the department of linguistics at Uppsala University and the department of Swedish language at Göteborg University.
BirdQuest
Human-machine dialogues between the BirdQuest dialogue system and human users. Both spoken and written corpora are available.
Östgötatrafiken
This is a collection of recordings and transcriptions from the information help-desk of the regional bus traffic company. The corpus is not publicly available.
Page responsible: Lars Ahrenberg
Last updated: 2012-08-07
