Translation of documentation

Background

A characteristic feature of many types of manuals, in particular computer manuals, is the high frequency of recurrent (or repeated) translation units. This fact can be exploited in translation with quite simple tools, e.g. a computerised phrase-book in which recurrent units are stored with their translations. In a pre-study (Merkel 1992) we found cases where up to 43 per cent of the total text in a handbook was made up of recurring identical sentences and (translatable) phrases. The same study showed that 20 per cent of the text in another handbook was made up of sentences that had already been translated in the previous version. Recurrent phrases found in both documents comprised 15 per cent of the text. Taken together, this meant that 31 per cent of the handbook could have been translated automatically with the aid of a translation memory of sentences and phrases from the previous version of that same handbook. When the recurrent phrases and sentences within the handbook were also taken into consideration, the analysis showed that 52 per cent of the text was repetitious, either internally or externally.
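To make the notions of internal and external recurrence concrete, the following is a minimal sketch of how such percentages could be computed at the sentence level. It is not the method used in the pre-study: the function name `recurrence_profile`, the whitespace tokenisation and the case-insensitive comparison are all assumptions made for illustration, and phrase-level recurrence is not handled.

```python
from collections import Counter


def _normalise(sentence):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(sentence.lower().split())


def recurrence_profile(sentences, previous_sentences):
    """Return the share of words that fall in repeated sentences.

    external: the sentence also occurs in the previous version
    internal: the sentence occurs more than once in the new text only
    new:      everything else
    (Hypothetical helper; sentence-level exact matching only.)
    """
    previous = {_normalise(s) for s in previous_sentences}
    counts = Counter(_normalise(s) for s in sentences)

    words = {"external": 0, "internal": 0, "new": 0}
    for sentence in sentences:
        key = _normalise(sentence)
        length = len(key.split())
        if key in previous:
            words["external"] += length
        elif counts[key] > 1:
            words["internal"] += length
        else:
            words["new"] += length

    total = sum(words.values()) or 1  # avoid division by zero on empty input
    return {kind: n / total for kind, n in words.items()}


new_version = [
    "Press the Enter key.",
    "Press the Enter key.",
    "Choose Save from the File menu.",
    "This feature is new.",
]
old_version = ["Choose Save from the File menu."]
profile = recurrence_profile(new_version, old_version)
```

Note that this sketch counts every occurrence of an internally repeated sentence, including the first one, which a translator would still have to translate once.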

Aims and tasks

One basic hypothesis of the project was that an interactive memory-based translation system would substantially improve the translation process, particularly in speed and in certain aspects of quality such as terminological and stylistic consistency. By such a system we mean one that provides a terminology database and a database of previously translated units (sentences, phrases and perhaps paragraphs), coupled with a set of tools that derive internal and external recurrence profiles for a given text material. An important aim of the project was to determine the advantages and drawbacks of memory-based systems and to suggest good designs for them. Another aim was to design and develop other translation tools that fit into a translator's or editor's workbench.

The translation support tools we investigated were of four different categories:

1. Diagnostic tools that characterize texts and text-types in terms of parameters that have a direct bearing on the performance and usability of various computer-supported methods of translation; by applying such tools to a representative sample of texts for a given text-type, a set of text profiles is obtained that reveal characteristics of the text type and that can support decisions as to what kind of computer support should be used in the translation process;

2. Alignment tools that establish correspondences between source and target texts, on various levels such as chapters, divisions, headings, paragraphs, sentences, phrases and words;

3. Data acquisition tools that retrieve data from bilingual corpora, which can be exploited in the actual translation process; and

4. Evaluation tools that are used in the evaluation of translations and checking of properties such as consistency of terminology, variation in phraseology and conformity with a given style-guide.
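As an illustration of the fourth category, the sketch below checks terminological consistency in a sentence-aligned bitext. It is an assumed, simplified approach and not the project's own tool: the functions `term_variants` and `inconsistent_terms`, the glossary format and the example sentences are hypothetical, and the crude substring matching ignores inflection and word boundaries.

```python
from collections import defaultdict


def term_variants(aligned_pairs, glossary):
    """Map each source term to the candidate translations actually observed.

    aligned_pairs: iterable of (source_sentence, target_sentence)
    glossary:      dict mapping a source term to its candidate translations
    """
    found = defaultdict(set)
    for source, target in aligned_pairs:
        source_lc, target_lc = source.lower(), target.lower()
        for term, candidates in glossary.items():
            if term.lower() not in source_lc:
                continue
            for candidate in candidates:
                if candidate.lower() in target_lc:
                    found[term].add(candidate)
    return dict(found)


def inconsistent_terms(aligned_pairs, glossary):
    """Keep only the terms that were rendered in more than one way."""
    variants = term_variants(aligned_pairs, glossary)
    return {term: used for term, used in variants.items() if len(used) > 1}


# Hypothetical data: "file" is translated in two different ways.
pairs = [
    ("Save the file.", "Spara filen."),
    ("Delete the file.", "Ta bort arkivet."),
    ("Show the menu.", "Visa menyn."),
]
glossary = {"file": ["fil", "arkiv"], "menu": ["meny"]}
report = inconsistent_terms(pairs, glossary)
```

A real evaluation tool would need lemmatisation or pattern matching to cope with inflected forms, which is one reason the linguistically richer units discussed in this section matter.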

Another task of the project was to study the effects of the translation method on the target text and to compare translations made by means of memory-based systems with manual translations. This required the development of methods and tools for the evaluation of translations. To test design alternatives we needed to consider the format and content of the databases as well as the matching algorithms used in database search. In this connection we investigated different methods and tools for data acquisition, and the possibility of making database search sensitive to language-independent information encoded in the source document, e.g. semantic or functional properties of a paragraph encoded as descriptors in SGML (Standard Generalized Markup Language). The analysis tools primarily support the identification of the translation units of a text body, where strings as well as linguistically more interesting units such as lemmas, terms, phrases, patterns and constructions are considered. The analysis tools can be used diagnostically in several ways. For example, a text profile can be generated showing which parts of a text are covered, and to what extent, by recurrent items at various levels of abstraction, and the recurrent items can be checked for counterparts in an existing translation memory. Both kinds of information are relevant for deciding what effort and resources are needed for the translation of the given text body.
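The role of the matching algorithm in database search can be illustrated with a small sketch of a translation memory lookup that tries an exact match first and then falls back to an approximate one. This is only one possible design and not necessarily the one examined in the project: the function `lookup`, the word-level similarity taken from Python's standard `difflib.SequenceMatcher` and the threshold of 0.75 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def lookup(memory, sentence, threshold=0.75):
    """Return (score, stored_source, translation) for the best match, or None.

    memory: dict mapping stored source sentences to their translations
    """
    # An exact hit needs no further comparison.
    if sentence in memory:
        return 1.0, sentence, memory[sentence]

    query_tokens = sentence.lower().split()
    best = None
    for stored, translation in memory.items():
        # Word-level similarity: 2 * matching words / total words in both.
        score = SequenceMatcher(None, query_tokens, stored.lower().split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, stored, translation)
    return best


# Hypothetical memory with a single stored unit.
memory = {"Print the current document.": "Skriv ut det aktuella dokumentet."}

exact = lookup(memory, "Print the current document.")
fuzzy = lookup(memory, "Print the selected document.")  # three of four words match
miss = lookup(memory, "Restart the computer.")          # below the threshold
```

Lowering the threshold returns more candidates at the cost of more post-editing, which is the kind of trade-off that has to be weighed when design alternatives are compared.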

Results

An English-Swedish Translation Corpus

A prerequisite for the exploration and use of bi-texts is that you have one at your disposal. In a Swedish context, English and Swedish are by far the two most common source and target languages to consider. As there was no such English-Swedish translation corpus available at the start of the project, an important part of it was to create one and align the texts at least at the paragraph level.

The text corpus consists of seven different English-Swedish bi-texts, all of which are aligned at the sentence level. About 500 sentences from six of the bi-texts have been marked up with part-of-speech information. Four of the texts are computer program manuals from two companies (1 and 2), where the major difference lies in the method of translation: two manuals have been translated manually and the other two with the aid of a memory-based translation tool. In addition, there are two novels and a set of sentences from the ATIS domain that have been translated automatically.

The DAVE workbench

The DAVE workbench is a PC-based system consisting of modules for bi-text alignment, phrase retrieval, recurrency analysis, discrepancy analysis and a bilingual concordance generator.
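For readers unfamiliar with the term, a bilingual concordance lists every aligned sentence pair in which a given word occurs. The sketch below shows the idea only and says nothing about how the DAVE module is actually implemented: the function `concordance` and the sample data are hypothetical.

```python
import re


def concordance(aligned_pairs, word):
    """Return the (source, target) pairs whose source contains `word`.

    The match is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(source, target) for source, target in aligned_pairs
            if pattern.search(source)]


# Hypothetical sentence-aligned bi-text.
pairs = [
    ("Save the file.", "Spara filen."),
    ("Edit your profile.", "Redigera din profil."),
    ("Print the file list.", "Skriv ut fillistan."),
]
hits = concordance(pairs, "file")  # "profile" is not matched
```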

Effects of translation memories

(to be completed)

List of publications



Updated 980417