I*Link

Readme

version 0.9

CONTENTS

Background - What is I*Link
Versions - Current and future versions
How it works - Basic ideas
Requirements - Software and hardware
Installation - Instructions for Windows and Unix platforms
Run I*Link - Instructions about starting I*Link when installed
Bundled resources - Description of example resources
Bundled libraries - Information about libraries
References - Some publications related to I*Link.

BACKGROUND

I*Link is a graphical user interface tool for creating and storing associations between segments in a bitext (a source text with a translated target text). I*Link is aimed at word and phrase associations and requires bitexts that are prealigned at the sentence level.

I*Link is developed at the Natural Language Processing Laboratory (NLPLAB), Department of Computer and Information Science at Linköping University, Sweden, with funding from The Swedish Research Council (Vetenskapsrådet) and The Swedish Agency for Innovation Systems (VINNOVA). I*Link is developed as a general tool for creating and classifying word associations in parallel corpora, and should be useful for different kinds of research including translation studies, contrastive linguistics and machine translation.

I*Link was first thought of as an additional module for correcting mistakes in the associations made by an automatic word alignment system (Ahrenberg, Andersson, Merkel, 2000) and adding and/or correcting classifications of them. During the development of this module, however, it turned out that an interactive tool could be quite powerful with the aid of static resources such as base dictionaries and dynamic resources created by the user during sessions and exploited by the system using machine learning techniques (Petterstedt, forthcoming).

VERSIONS

The current version of I*Link is a stand-alone application with no integrated automatic word aligner. It was developed by Lars Ahrenberg, Mikael Andersson, Magnus Merkel and Michael Petterstedt. The current version was implemented by Michael Petterstedt. Future versions that integrate automatic and interactive alignment techniques are planned (Ahrenberg, Merkel and Andersson, 2002).

Important. In version 0.9, only adjacent multi-word units can be selected as tokens. The ability to select non-adjacent words as belonging to a single phrase/token will be implemented in the next version.

HOW IT WORKS

Work with I*Link is organized in terms of projects. A project is defined by two input files, the source and target files of a bitext, and a set of resource files. The latter may include bilingual lexicons, or files that define corresponding patterns, e.g. for cognates. The different resources are used to propose associations in the bitext that the user can accept or reject. The different resources are ordered by the user for best performance.

Different bilingual lexicons are used to guide I*Link when proposing the best corresponding links to the user. The user then decides which links are valid by verifying the proposed links and by specifying new links. I*Link stores this information as a resource and uses a machine learning approach to generate better links later on.
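The idea of user-ordered resources can be illustrated with a minimal sketch. It is not I*Link's actual code: the names Resource, lexicon and firstProposal are hypothetical, the lexicon entries are invented for the illustration, and the rule that the first resource with a proposal wins is an assumption about what "ordered by the user" could mean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of consulting resources in a user-defined order. */
final class ResourceOrderSketch {

    /** A resource proposes a target token for a source token, or null if it has none. */
    interface Resource {
        String propose(String sourceToken);
    }

    /** A lexicon resource backed by a simple map (assumption: one proposal per token). */
    static Resource lexicon(Map<String, String> entries) {
        return entries::get;
    }

    /** Returns the proposal of the first resource in the list that has one, else null. */
    static String firstProposal(List<Resource> orderedResources, String sourceToken) {
        for (Resource resource : orderedResources) {
            String proposal = resource.propose(sourceToken);
            if (proposal != null) {
                return proposal;
            }
        }
        return null;
    }

    /** Invented example data: a project-specific lexicon ranked above a core lexicon. */
    static List<Resource> demoResources() {
        Map<String, String> dynamic = new HashMap<>();
        dynamic.put("journey", "vandring");
        Map<String, String> core = new HashMap<>();
        core.put("journey", "resa");
        core.put("city", "stad");
        List<Resource> ordered = new ArrayList<>();
        ordered.add(lexicon(dynamic)); // consulted first
        ordered.add(lexicon(core));    // consulted only if the first has no proposal
        return ordered;
    }

    public static void main(String[] args) {
        List<Resource> resources = demoResources();
        System.out.println(firstProposal(resources, "journey")); // vandring
        System.out.println(firstProposal(resources, "city"));    // stad
        System.out.println(firstProposal(resources, "dream"));   // null
    }
}
```

In this sketch, moving the core lexicon to the front of the list would change the proposal for "journey" to "resa", which is the kind of effect the ordering is meant to control.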

The output from I*Link consists of link files denoting the verified associations within the bitext as well as different kinds of lexicons. The output can be examined by using the simple built-in tools in I*Link or by being exported to external formats and viewers. The latter option, though, is not supported by any tools in I*Link.

REQUIREMENTS

I*Link is implemented in Java and requires that you install the Java 2 Platform before running it. Since I*Link uses Java, you can run it on any platform that has the Java 2 Platform installed.

If you don't have the Java 2 platform installed, open your favorite web browser and go to

http://www.javasoft.com

and browse the download pages. At present (2-September-2002) the download page is at

http://java.sun.com/j2se/downloads.html

where you should choose the last stable version of Java 2 Platform.

I*Link has been tested successfully in

  • Microsoft Windows 98 running on a PC Pentium-2 400 MHz 128 MByte RAM,
  • Microsoft Windows NT4 running on a PC Pentium-2 450 MHz 256 MByte RAM,
  • Microsoft Windows 2000 running on a PC Pentium-4 1.7 GHz 512 MByte RAM,
  • Sun Solaris 1.4 running on an Ultra Sparc 10,
  • Red Hat Linux 7.3 running on a PC Pentium-4 1.8 GHz 1.5 GByte RAM,
  • Mac OS X (10.2) running on a Macintosh PowerBook 800 MHz 512 MByte RAM

The Java 2 Platform version has been either 1.3 or 1.4.

The minimum hardware requirements are estimated to be the equivalent of a PC Pentium 300 MHz with 64 MByte of RAM.

Your monitor should preferably be set to show 1280x1024 pixels, although 1024x768 pixels will do. At NLPLAB we usually use 1600x1200 to take full advantage of the modules in I*Link.

INSTALLATION

Make sure you have the Java 2 Platform installed on your computer, otherwise see REQUIREMENTS.

Download I*Link from

http://www.ida.liu.se/~nlplab/ILink/setup-ILink.jar

which is a Java installation file bundled as a jar-file. It works as a self-extracting file if you have the Java 2 Platform installed on your system.

Windows users

There are two ways to install I*Link on a Windows platform. If you don't have an extraction utility associated with jar-files you can execute the setup-ILink.jar in Windows Explorer by double-clicking on it. Otherwise you open a command prompt window and enter

prompt java -jar download-directory\setup-ILink.jar

for example

C:\>java -jar Z:\downloads\setup-ILink.jar

Follow the instructions on the screen to complete the installation.

Solaris and Linux users

Open a shell and enter

prompt java -jar download-directory/setup-ILink.jar

for example

snuffen02 <301> java -jar downloads/setup-ILink.jar

Follow the instructions on the screen to complete the installation.

RUN I*LINK

When you have installed I*Link, just follow the instructions below to start I*Link. Within I*Link you are able to access more information about how to use I*Link through the help menu.

Windows users

There are two ways of starting I*Link on a Windows platform. If you don't have an extraction utility associated with jar-files you can execute the ILink.jar in Windows Explorer by double-clicking on it. Otherwise you open a command prompt window, go to the installation directory and enter

prompt java -jar ILink.jar

for example

C:\Program Files\ILink>java -jar ILink.jar

Solaris and Linux users

Open a shell, go to the installation directory and enter

prompt java -jar ILink.jar

for example

snuffen02 <301> java -jar ILink.jar

BUNDLED RESOURCES

The example resources bundled with I*Link are bitexts and lexicons that are free to use.

Bitext

The provided bitext sample is an extract from John Bunyan's novel "Pilgrim's Progress", published in 1678, together with a translation into Swedish. The source text can be found at

http://www.johnbunyan.org/text/bun-pilgrim.txt

The translation into Swedish, "Kristens resa", can be found electronically at Project Runeberg's site for Nordic literature:

http://www.lysator.liu.se/runeberg/kristens/

The translation was made by G.S. Löwenhielm and published in 1903.

The sample bitext consists of 151 sentence-aligned segments which have been analysed with Conexor's FDG Analyzers for English and Swedish, by courtesy of Conexor Oy, Finland. The analysed texts are provided in XML format, consistent with the bundled DTD: LIU-MONO.DTD. More information about Conexor's tools is available at

http://www.conexoroy.com/

Lexicons

The bundled lexicons can be used when the source language is English and the target language is Swedish. Lexicons are of three types: dynamic, static and pattern. The dynamic lexicons are normally defined by you for a specific project. However, the shipped dynamic lexicons are either default resources, which are included when you create a new default project, or resources that belong to the example projects provided with I*Link. The static lexicons were constructed from empirical data from previous alignments of bitexts at the department.

The shipped static resources are:

  • core-small.bilex - Main core lexicon
  • pos-def.bilex - Corresponding part-of-speech tags
  • func-def.bilex - Corresponding function tags

The pattern lexicons are:

  • cognates.bilex - Identifies cognates by the Levenshtein algorithm
  • pos-equals.bilex - Identifies equal part-of-speech tags ("cognate test" at POS level)
  • func-equals.bilex - Identifies equal function tags ("cognate test" at functional level)
  • numbers.bilex - Identifies numbers
  • propnames.bilex - Identifies proper names
  • punctuations.bilex - Identifies periods, commas, parentheses ...

The first three pattern resources (cognates.bilex, pos-equals.bilex and func-equals.bilex) differ from the others because they need an internal relation between the source and target pattern. Currently the solution is hardcoded in I*Link but triggered and controlled by the actual resource. For example, cognates.bilex contains:

<COGNATE>#4#4#1

where <COGNATE> triggers the cognate function within I*Link. The first "4" makes I*Link look for compound cognates with at most four tokens involved. The second "4" tells I*Link to generate only translation candidates that consist of four characters or more. The cognate function uses the Levenshtein distance on strings to decide whether a pair is a cognate. In the example the distance threshold is set to "1".
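A minimal sketch of such a cognate test is given here for illustration. It is not I*Link's implementation: the class and method names are hypothetical, the minimum length is assumed to apply to both tokens, and the compound (multi-token) case controlled by the first parameter is not modelled.

```java
/** Hypothetical sketch of a Levenshtein-based cognate test. */
final class CognateSketch {

    /** Edit distance where insertion, deletion and substitution each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /**
     * Assumption: both tokens must have at least minLength characters,
     * and their edit distance must not exceed maxDistance.
     */
    static boolean isCognate(String source, String target, int minLength, int maxDistance) {
        if (source.length() < minLength || target.length() < minLength) {
            return false;
        }
        return levenshtein(source, target) <= maxDistance;
    }

    public static void main(String[] args) {
        // Parameters mirror the "#4#1" part of the example: min length 4, threshold 1.
        System.out.println(isCognate("doctor", "doktor", 4, 1)); // true
        System.out.println(isCognate("man", "man", 4, 1));       // false (too short)
        System.out.println(isCognate("water", "vatten", 4, 1));  // false (too distant)
    }
}
```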

The remaining pattern resources are true regular expressions (Perl 5.1 notation style) formatted as I*Link bilexical resource files. In contrast to the cognate pattern resources, these resources only identify patterns within the source or the target. For example, in numbers.bilex

\d+([-., ]\d+)*#\d+([-., ]\d+)*#1#0

the pattern identifies numbers like "1", "10", "10.02", "10,02" and "1000 000" in both source and target texts. There is no relation between the source and target patterns, though, which means that I*Link will generate a translation candidate between a source text token "1" and a target text token "10.02" if numbers.bilex is used. To express an internal relation within a pattern, you can use the ordinary Perl 5.1 backreference (\group-id). For example,

(").*?\1#(").*?\1#1#0

will generate translation candidates like "this is an example of a source text within quotes" related to "this is an example of a target text within quotes".
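The behaviour of the two patterns above can be checked with a short sketch. Note the assumptions: it uses the standard java.util.regex package rather than the RegExp package that I*Link bundles, it requires a pattern to match a whole token, and the class and method names are hypothetical. Only the pattern strings are taken from the examples above.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: testing source and target tokens against pattern pairs. */
final class PatternSketch {

    // Source/target pattern from numbers.bilex (both sides use the same expression).
    static final Pattern NUMBER = Pattern.compile("\\d+([-., ]\\d+)*");

    // Quoted-text pattern: \1 must repeat whatever group 1 matched (the opening quote).
    static final Pattern QUOTED = Pattern.compile("(\").*?\\1");

    /**
     * Assumption: a candidate is generated when the source token matches the
     * source pattern and the target token matches the target pattern, with no
     * further relation between the two tokens.
     */
    static boolean isCandidate(Pattern sourcePattern, Pattern targetPattern,
                               String sourceToken, String targetToken) {
        return sourcePattern.matcher(sourceToken).matches()
                && targetPattern.matcher(targetToken).matches();
    }

    public static void main(String[] args) {
        // Unrelated numbers still pair up, as described in the text.
        System.out.println(isCandidate(NUMBER, NUMBER, "1", "10.02"));     // true
        System.out.println(isCandidate(NUMBER, NUMBER, "1000 000", "ten")); // false
        System.out.println(isCandidate(QUOTED, QUOTED,
                "\"a source text\"", "\"a target text\""));                // true
        System.out.println(isCandidate(QUOTED, QUOTED,
                "\"a source text\"", "no quotes here"));                   // false
    }
}
```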

Note that dynamic and static lexicons depend on both the languages used in the bitext and the direction of the alignment process, and are therefore not reusable in other configurations. Pattern lexicons, on the other hand, may be reused. The numbers.bilex resource is probably reusable because many languages have the same representation of numbers. However, the cognates.bilex pattern will not be productive if you try to align English texts with Russian texts, because the two languages use different alphabets.

BUNDLED LIBRARIES

I*Link uses several libraries: some developed at NLPLAB (internal) and some free libraries found on the Internet (external). The internal libraries are shipped as part of I*Link and therefore share the same license. Each external library has its own license.

I*Link uses the JAXP v1.1 package (The Java API for XML Processing, by Sun Microsystems, Inc., http://java.sun.com/xml/jaxp/) for its capabilities in handling XML-encoded documents through DOM (Document Object Model) and the SAX (Simple API for XML) parser. The package contains different subpackages named javax.xml, org.w3c.dom, org.apache, org.jdom and org.xml. The software in the packages javax.xml.parsers and javax.xml.transform is covered by the JAXP Reference Implementation License. The software under the package hierarchies beginning with org.w3c.dom is covered by the W3C Software License. All of the remaining software in this distribution is covered by the Apache Software License.
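As a rough illustration of DOM-based handling through JAXP, the sketch below parses a small XML string and counts elements by name. It uses the javax.xml.parsers and org.w3c.dom classes as found in current JDKs; the element names "text", "s" and "w" are invented for the example and are not taken from LIU-MONO.DTD, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical sketch: reading an XML document into a DOM tree with JAXP. */
final class JaxpSketch {

    /** Parses the XML text and returns how many elements carry the given tag name. */
    static int countElements(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return document.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample markup: two sentence segments with word tokens.
        String xml = "<text>"
                + "<s id=\"1\"><w>pilgrim</w><w>progress</w></s>"
                + "<s id=\"2\"><w>journey</w></s>"
                + "</text>";
        System.out.println(countElements(xml, "s")); // 2
        System.out.println(countElements(xml, "w")); // 3
    }
}
```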

The pattern recognition engine in I*Link is managed by a regular expression package called RegExp v1.1.2 (http://jakarta.apache.org/regexp/). The package is covered by Apache Software Foundation - GNU Public Licence.

Finally, the look and feel of I*Link has been improved by the Kunststoff package (http://incors.org/). The package is covered by the GNU Lesser General Public Licence (LGPL).

REFERENCES

Lars Ahrenberg, Mikael Andersson & Magnus Merkel (2000) A knowledge-lite approach to word alignment. In J. Véronis (ed.) Parallel Text Processing: Alignment and Use of Parallel Corpora, pp. 97-116. Dordrecht, Kluwer, 2000.

Lars Ahrenberg, Magnus Merkel & Mikael Andersson (2002). A system for incremental and interactive word linking. In Proceedings from The Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, 2002, pp. 485-490.

Michael Petterstedt (forthc.) Interaktiv länkning i bitexter - I*Link. Master's Thesis, Department of Computer and Information Science, Linköping University.


Natural Language Processing Laboratory (NLPLAB)
Department of Computer and Information Science
Linköping University
581 83 Linköping
Sweden

ilink@ida.liu.se
http://www.ida.liu.se/~nlplab/ILink


Sun and Java are trademarks of Sun Microsystems, Inc.

Microsoft Windows 98, Microsoft Windows NT4, Microsoft Windows 2000 are trademarks of Microsoft Corporation.

Red Hat is a trademark of Red Hat, Inc.

Mac OS X, Macintosh, iBook are trademarks of Apple Computer, Inc.