From parallel corpora to translation databases

The goal of the project is to develop methods and tools for the extraction of classified translation data from parallel corpora. In this way we will extend the use of parallel corpora beyond standard applications such as bilingual concordancing and generation of lexical data by word alignment programs.

The usefulness of the tools will be demonstrated on a corpus of parallel texts. Specific studies address the formal characterization of translations and the generation of data for computer-aided translation.


  • Definition of a standard XML DTD for source and target texts and link files.
  • Part-of-speech tagging of the project corpus and dependency parsing using the Functional Dependency Grammar Parsers of Connexor Oy.
  • Development, application and evaluation of an interactive tool, I*Link, for alignment at the word and phrase level.
  • Development of different sets of alignment guidelines for different applications such as lexicography and machine translation.
  • Improvement of existing automatic tools, such as the Frasse tool for finding collocations and the word alignment system LWA, by enabling them to use syntactic and morphological analyses.

Current work

    The final report

Funding agency

The Swedish Research Council 2000-2002.


Lars Ahrenberg
Mikael Andersson
Magnus Merkel

Master's theses related to the project

Maria Holmqvist: Identifying translation shifts using a dependency parser and interactive word alignment.

Michael Petterstedt: Interaktiv länkning i bitexter - I*Link. (Interactive alignment of bitexts - I*Link).


