LinES: Linköping English-Swedish Parallel Treebank
LinES is a parallel treebank built on the basis of LTC, The Linköping Translation Corpus. LinES has been developed as part of the project "Linguistic micro- and macro analysis of a translation corpus", financed by the Swedish Research Council 2004-2005. On these pages you will find information relating to LinES and a trial search interface to it.
NOTE: This page is subject to change. This version was last updated October 5th, 2007.
Table of Contents
- Versions and availability
- Contents and subcorpora
- Structure and formatting
- Work process
- Linguistic annotation
- Searching LinES
We are currently working on LinES version 1.0. We intend for this version to be searchable over the web and to be available to others for research purposes.
The version described here is termed 0.9. This numbering reflects the fact that the annotation is still incomplete and partially inconsistent, while the time remaining to make it complete and less inconsistent is estimated to be a matter of a few months.
We intend to augment LinES with new data and facilities in future projects.
The LinES parallel treebank consists of aligned and annotated sentence pairs that have been collected and analysed for the purpose of studying how common grammatical constructions and function words are translated from English into Swedish. Samples of 500 to 1500 sentence pairs have been taken from various sub-corpora of the Linköping Translation Corpus (Merkel, 1999) and from other, recently available corpora. The texts have been collected in projects such as PLUG, Corpus-Based Machine Translation (KOMA) and various smaller projects, including Master's thesis projects at NLPLab. Other texts are taken from parallel corpora that are available online, such as the Europarl corpus (Koehn, 2005). An overview of the contents of LinES 0.9 is given in Table 1.
|Subsection||Sentence pairs||No of SL words||Source|
|AccessXP||577||10,174||Microsoft Access online Help files|
|Bellow||603||10,307||Saul Bellow: To Jerusalem and Back: a personal account. New York, 1976.|
The selection of sentences from the sources is somewhat arbitrary. It has been assumed that whatever selection is made will provide typical examples of the usage of function words and grammatical constructions and of their translation. A few sentences have also been removed from the samples, for instance because a large part of the sentence consisted of programming code, a quotation from a foreign language, or words of a non-standard dialect.
A subsection of LinES consists of three files: a source file, a target file, and a link file.
Source and target files of LinES are monolingual files formatted in XML according to the document type definition liu-mono.dtd. Not all elements and attributes allowed by liu-mono.dtd are used, however, primarily because the files are viewed as sentence collections rather than as texts. The most important features of the format as used are the following:
- A monolingual file is structured in terms of segments and words. Segments are demarcated by <s>-tags and words by <w>-tags.
- A segment need not be a sentence in the grammatical sense; it may be smaller, such as a noun phrase, or larger, such as a sequence of two or more sentences. However, every segment of a source file corresponds to a segment in the target file.
- A word normally corresponds to an orthographic word of the source text. However, punctuation marks and clitics are treated as separate words, and a restricted set of unambiguous multiword units is treated as single words.
- Each segment has a unique identifier, its s-id. The first segment of a file is assigned the identifier s1, and the following segments are assigned identifiers s2, s3, and so on. Corresponding source and target segments are assigned identical segment identifiers, though they occur in different files.
- Each word has a unique identifier, its word-id. Word identifiers consist of a string that starts with a w followed by a number. The numbers are consecutive in the file, so that the first word of the first segment has word identifier w1 and the following words have identifiers w2, w3, and so on.
- Words carry a number of attributes for annotation. The most important of these attributes are listed in Table 2.
|relpos||The position of the word||1, 2, 3, ...|
|pos||Part of speech||A, N, V, ...|
|func||Grammatical function||attr, det, subj, ...|
|fa||Head word||A position, or 0|
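The structure described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the <s> and <w> tags and the attributes relpos, base, pos, msd, func and fa come from the description, while the attribute name id, the root element name and the sample sentence itself are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of liu-mono.dtd. The attribute name "id"
# and the root element <text> are assumptions, as is the sentence itself.
SAMPLE = """<text>
  <s id="s1">
    <w id="w1" relpos="1" base="news" pos="N" msd="NOM-PL" func="subj" fa="2">News</w>
    <w id="w2" relpos="2" base="travel" pos="V" func="main" fa="0">travels</w>
  </s>
</text>"""


def read_segments(xml_text):
    """Return {segment id: [word dicts]} for a monolingual file."""
    root = ET.fromstring(xml_text)
    segments = {}
    for s in root.iter("s"):
        words = []
        for w in s.iter("w"):
            entry = dict(w.attrib)      # copy all annotation attributes
            entry["form"] = w.text or ""  # the word as it occurs in the text
            words.append(entry)
        segments[s.get("id")] = words
    return segments


segments = read_segments(SAMPLE)
for word in segments["s1"]:
    print(word["id"], word["form"], word["pos"], word["func"], word["fa"])
```

Since corresponding source and target segments share their identifiers, the two dictionaries obtained from a source file and a target file could be paired simply by key.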
The link file is also an XML file formatted according to the document type definition liu-align.dtd.
Extracts from the monolingual files are shown here.
When the samples originate in earlier projects, they have also been formatted and parsed, although not checked for parsing accuracy. In these cases a link file already existed, but the alignment criteria differed from the ones used in LinES, so re-alignment has been necessary.
For parsing, the Machinese Syntax parsers for English and Swedish from Connexor Oy are used. In addition, a post-processor adds morpho-syntactic information to the Swedish analyses that is not provided by these parsers. Parsing output has been converted to linguistic annotation in the style of liu-mono.dtd.
Source and target files have been re-aligned using the interactive word alignment system I*Link (Merkel et al., 2003) as the primary tool. The automatic word aligner I*Trix has been used to speed up the process, but all I*Trix output has been manually inspected and corrected with the aid of I*Link.
During word alignment, segment pairs have sometimes been discarded because they were mis-aligned, mis-translated or judged irrelevant, e.g., when the segment consists of an SQL query or an utterance spoken in a dialect. After deletions of this kind, segments and words have been renumbered so that the identifiers are relative to the contents of the LinES file, not to the original file or the entire source document.
In the alignment process, and, in particular, after completion of alignment, the files have been searched for errors and omissions in linguistic annotation. Errors have been corrected and incomplete analyses have been completed. Also, annotation has been normalised to agree with the standards set for LinES (see below).
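The renumbering step can be sketched as follows. This is an illustration under our own assumptions about the data representation (a list of segment identifiers with their word forms), not the procedure actually used in the project; the function name and the sample data are hypothetical.

```python
def renumber(segments, discarded):
    """Drop discarded segments and assign consecutive identifiers.

    segments:  list of (segment id, [word forms]) in file order.
    discarded: set of segment ids to remove.
    Returns a list of (new segment id, [(new word id, form), ...]).
    """
    result = []
    word_no = 0  # word numbers run consecutively through the whole file
    for seg_id, words in segments:
        if seg_id in discarded:
            continue
        new_words = []
        for form in words:
            word_no += 1
            new_words.append((f"w{word_no}", form))
        result.append((f"s{len(result) + 1}", new_words))
    return result


# Hypothetical sample: the second segment is an SQL query and is discarded.
sample = [
    ("s1", ["Open", "it"]),
    ("s2", ["SELECT", "*"]),
    ("s3", ["Close", "it"]),
]
for seg in renumber(sample, {"s2"}):
    print(seg)
```

Because corresponding source and target segments carry identical identifiers, the same set of discarded identifiers would be applied to both files so that the pairs stay in step.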
The linguistic annotation is primarily carried by the attributes base, pos, msd, and func. A detailed overview of the linguistic annotation can be found in LinES 1.0 Annotation: Format, contents and guidelines.
The base form of a word is identified with one of its actually occurring forms. In the case of verbs, the base form coincides with the infinitive; in the case of nouns, it coincides with the singular, nominative form, and in the case of Swedish adjectives, it coincides with the positive, utral, singular form. For English adjectives the positive form is used.
The Connexor parsers attempt to identify the border between the different parts making up a Swedish compound. For instance, the base form of the word filformatet (Eng. file format) is given as fil#format. These borders are removed in the LinES annotation.
Multi-word units are represented with their parts being joined by a hyphen (-). This is applied also in their base form. If any part of a multi-word unit is inflected, its base form uses the standard base form of that part. This approach sometimes leads to awkward base forms but these forms nevertheless serve their purpose as unique identifiers for a set of inflectional forms.
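The two conventions for base forms can be illustrated with a short sketch. The function names are our own, and the multi-word example is hypothetical; only the fil#format example and the use of the hyphen are taken from the description above.

```python
def normalise_base(parser_base):
    """Remove the '#' compound borders that the parser inserts."""
    return parser_base.replace("#", "")


def multiword_base(part_bases):
    """Join the base forms of the parts of a multi-word unit with hyphens."""
    return "-".join(part_bases)


print(normalise_base("fil#format"))            # compound border removed
print(multiword_base(["in", "spite", "of"]))   # hypothetical multi-word unit
```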
Note that the base form generally is not a proper lemma, as words of different parts of speech, and words of the same parts of speech with different inflections, may have the same base form.
The parts of speech used in LinES include the common lexical categories noun, verb, adjective, and so on, with some extra categories for special text tokens.
The parts of speech conform closely to those used by the Connexor parsers. However, in contrast to the parsers, LinES uses a common set of parts of speech for both languages, and also makes some distinctions not made by the parsers, e.g., the distinction between proper and common nouns. Moreover, some individual words have been given different parts of speech in LinES. In particular, several words such as English 'when' and Swedish 'när' and 'som' are classified as subjunctions rather than adverbs or pronouns when they introduce subordinate clauses.
For more details on the parts-of-speech we refer the reader to the Guidelines.
While there is a single attribute for morpho-syntactic subcategorisation, the value may denote a complex of properties. For instance, the noun news is annotated as follows:
<w ... base="news" pos="N" msd="NOM-PL" ... >news</w>
The value of the msd-attribute is a concatenation of the two values NOM (for nominative case) and PL (for plural number), with a hyphen (-) as the concatenation marker. Note that the property dimensions are left implicit in the annotation. Note also that no value is required on the msd-attribute, even if the part of speech is one that is inflected or sub-categorized in other ways.
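Since the component values are simply joined by hyphens, an msd value can be taken apart by splitting on that marker. The sketch below uses a function name of our own choosing and treats a missing or empty attribute as an empty list, which is an assumption about how one would want to handle words without an msd value.

```python
def msd_values(msd):
    """Split an msd value such as 'NOM-PL' into its component values.

    A missing or empty msd attribute yields an empty list, since no
    value is required on the attribute.
    """
    return msd.split("-") if msd else []


print(msd_values("NOM-PL"))
print(msd_values(None))
```

Note that the result is a plain list of values: as the property dimensions (case, number, and so on) are implicit in the annotation, recovering them would require a separate table of values per dimension.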
Note that some values may be relevant only for one of the languages. The same holds for the categories. For instance, the category Definiteness is used only for Swedish nouns.
In agreement with the assumptions of dependency grammar we assume that the hierarchical structure of a syntactic complex is grounded in asymmetric dependency relations between words. For each segment one token is identified as the topmost governor and marked as main. The annotation distinguishes general functions such as main and cc for coordination, clause level functions for subjects, objects, adverbials and verb chain items, phrase level functions such as determiners, attributes and modifiers, and a small set of functions that are used to represent non-projective dependencies in a projective fashion.
For more details on the use of dependency analysis, see the Guidelines.
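As an illustration of how the dependency structure of a segment might be recovered from the word attributes, the sketch below assumes that fa holds the relpos of the head word and that the value 0 marks the topmost governor. Table 2 only states that fa is "a position, or 0", so this reading, like the function name and the sample sentence, is an assumption.

```python
def dependents(words):
    """Collect head-dependent relations for one segment.

    words: list of (relpos, form, func, fa) tuples, with fa given as an int.
    Returns (relpos of the topmost governor, {head relpos: [dependent relpos]}).
    """
    children = {}
    root = None
    for relpos, form, func, fa in words:
        if fa == 0:
            root = relpos  # assumed: 0 means the word has no head
        else:
            children.setdefault(fa, []).append(relpos)
    return root, children


# Hypothetical two-word segment.
segment = [
    (1, "News", "subj", 2),
    (2, "travels", "main", 0),
]
root, children = dependents(segment)
print(root, children)
```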
The basic rule for alignment in LinES is the same as the one used in the ARCADE project, namely:
Align as short segments as possible, and as long segments as necessary.
This guideline means that if we cannot find a good link for a word that we are looking at, we try to find a segment that includes that word that has a better correspondent syntactically and semantically. However, we make that segment as small as possible, i.e., bigrams are generally preferred to trigrams, and so on.
Another general rule is to treat level shifts as part of divergent or convergent alignments. Thus, English the map is aligned as a whole to Swedish kartan, and of the map to kartans. As far as possible we restrict the use of null alignments to voluntary deletions and additions. A deletion occurs, then, when the translator could have translated the source word in question but chose not to, and an addition occurs when the translator has used a word or phrase that she could have done without, and the translation would still be regarded as adequate. However, there are some deliberate exceptions to this rule. Thus, auxiliary forms of do are treated as deleted in the corresponding Swedish interrogatives and negations.
There is, though, an important exception to these two rules: a segment of an alignment must be connected. This exception is not really desirable, but we have found it necessary, since the restriction applies to the tools that we use, and so far we have had no tools available to split such alignments readily. An effect of this exception is that we cannot align 'is ... smiling' with 'ler' in the following pair, although this would be preferable.
Is he smiling ?
Ler han ?
Instead, the auxiliary is treated as deleted, and aligned with nothing. The correspondence of the English complex verb chain with the Swedish single verb can be retrieved at the phrasal level, however.
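The connectedness restriction can be stated as a simple check on word positions. The following sketch is our own formulation, not code from I*Link or I*Trix: one side of a link is represented as a set of word positions, and an empty set (one side of a null alignment) is counted as connected by assumption.

```python
def is_connected(positions):
    """True if the word positions form one unbroken run, e.g. {2, 3, 4}."""
    if not positions:
        return True  # assumed: the empty side of a null link is acceptable
    unique = set(positions)
    return max(unique) - min(unique) + 1 == len(unique)


# "Is he smiling ?" has word positions 1 to 4.
print(is_connected({1, 3}))  # 'is ... smiling' skips 'he'
print(is_connected({3}))     # 'smiling' alone
```

With this check, the discontinuous segment 'is ... smiling' is rejected, while 'smiling' on its own is accepted, which corresponds to the analysis given above.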
A search interface is under construction. A trial version can be found by following this link.
NLPLab / Human-Centered Systems
Department of Computer and Information Science
Ahrenberg, Lars (2007). LinES 1.0 Annotation: Format, contents and guidelines. Manuscript, in PDF
Ahrenberg, Lars (2007). LinES: An English-Swedish Parallel Treebank. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007).
Koehn, Philipp (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. MT Summit 2005.
Merkel, Magnus (1999). Understanding and enhancing translation by parallel text processing. Linköping Studies in Science and Technology, Dissertation No. 607, Department of Computer and Information Science, Linköpings universitet.
Merkel, Magnus, Michael Petterstedt and Lars Ahrenberg (2003). Interactive Word Alignment for Corpus Linguistics. Proceedings of Corpus Linguistics 2003. UCREL Technical Paper No 16.
Page responsible: Lars Ahrenberg
Last updated: 2007-10-05