I*Link Readme
version 0.9
BACKGROUND
I*Link is a graphical user interface tool for creating and storing associations between segments in a bitext (a source text with a translated target text). I*Link is aimed at word and phrase associations and requires bitexts that are prealigned at the sentence level.

I*Link is developed at the Natural Language Processing Laboratory (NLPLAB), Department of Computer and Information Science at Linköping University, Sweden, with funding from The Swedish Research Council (Vetenskapsrådet) and The Swedish Agency for Innovation Systems (VINNOVA).

I*Link is developed as a general tool for creating and classifying word associations in parallel corpora, and should be useful for different kinds of research including translation studies, contrastive linguistics and machine translation.

I*Link was first conceived as an additional module for correcting mistakes in the associations made by an automatic word alignment system (Ahrenberg, Andersson, Merkel, 2000) and for adding or correcting classifications of them. During the development of this module, however, it turned out that an interactive tool could be quite powerful with the aid of static resources such as base dictionaries and dynamic resources created by the user during sessions and exploited by the system using machine learning techniques (Petterstedt, forthcoming).

VERSIONS
The current version of I*Link is a stand-alone application with no integrated automatic word aligner. It was developed by Lars Ahrenberg, Mikael Andersson, Magnus Merkel and Michael Petterstedt. The current version was implemented by Michael Petterstedt. Future versions that integrate automatic and interactive alignment techniques are planned (Ahrenberg, Merkel and Andersson, 2002).

Important. In version 0.9, only adjacent multi-word units can be selected as tokens. The possibility of selecting non-adjacent words as belonging to a single phrase/token will be implemented in the next version.
HOW IT WORKS
Work with I*Link is organized in terms of projects. A project is defined by two input files, the source and target files of a bitext, and a set of resource files. The latter may include bilingual lexicons, or files that define corresponding patterns, e.g. for cognates. The resources are used to propose associations in the bitext that the user can accept or reject, and the user orders them for best performance. Bilingual lexicons guide I*Link when it proposes the best corresponding links to the user. The user then decides which links are valid by verifying the proposed links and by specifying new links. I*Link stores this information as a resource for generating better links later on, using a machine learning approach.

The output from I*Link consists of link files denoting the verified associations within the bitext as well as different kinds of lexicons. The output can be examined with the simple built-in tools in I*Link or by being exported to external formats and viewers. The latter option, though, is not supported by any tools in I*Link.

REQUIREMENTS
I*Link is implemented in Java and requires that you have installed the Java 2 Platform before running I*Link. Since I*Link uses Java, you can run it on any platform that has the Java 2 Platform installed. If you don't have the Java 2 Platform installed, open your favorite web browser and go to and browse the download pages. At present (2 September 2002) the download page is at where you should choose the latest stable version of the Java 2 Platform. I*Link has been tested successfully in
The Java 2 Platform version has been either 1.3 or 1.4. The minimum hardware requirements are estimated to be the equivalent of a 300 MHz Pentium PC with 64 MB of RAM. Your monitor should be able to show 1280x1024 pixels, although 1024x768 pixels will do. At NLPLAB we usually use 1600x1200 to take full advantage of the modules in I*Link.

INSTALLATION
Make sure you have the Java 2 Platform installed on your computer; otherwise see REQUIREMENTS. Download I*Link from which is a Java installation file bundled as a jar-file. It works as a self-extracting file if you have the Java 2 Platform installed on your system.

Windows users
There are two ways to install I*Link on a Windows platform. If you don't have an extraction utility associated with jar-files you can execute setup-ILink.jar in Windows Explorer by double-clicking on it. Otherwise, open a command prompt window and enter
for example
Follow the instructions on the screen to complete the installation.

Solaris and Linux users
Open a shell and enter
for example
Follow the instructions on the screen to complete the installation.

Run I*Link
When you have installed I*Link, follow the instructions below to start it. Within I*Link you can access more information about how to use the tool through the help menu.

Windows users
There are two ways of starting I*Link on a Windows platform. If you don't have an extraction utility associated with jar-files you can execute ILink.jar in Windows Explorer by double-clicking on it. Otherwise, open a command prompt window, go to the installation directory and enter
for example
Solaris and Linux users
Open a shell, go to the installation directory and enter
for example
BUNDLED RESOURCES
The example resources bundled with I*Link are bitexts and lexicons that are free to use.

Bitext
The provided bitext sample comes from an extract of John Bunyan's novel "Pilgrim's Progress", published in 1678, together with a translation into Swedish. The source text can be found at The translation into Swedish, "Kristens resa", can be found electronically at Project Runeberg's site for Nordic literature: The translation was made by G.S. Löwenhielm and published in 1903. The sample bitext consists of 151 sentence-aligned segments which have been analysed with Conexor's FDG Analyzers for English and Swedish, by courtesy of Conexor Oy, Finland. The analysed texts are provided in XML format, consistent with the bundled DTD: LIU-MONO.DTD. More information about Conexor's tools is available at

Lexicons
The bundled lexicons can be used when the source language is English and the target language is Swedish. Lexicons are of three different types: dynamic, static and pattern lexicons.

The dynamic lexicons are normally defined by you for a specific project. However, the shipped dynamic lexicons are either default resources, which are included when you create a new default project, or resources which belong to the example projects provided with I*Link.

The static lexicons were constructed from empirical data from previous alignments of bitexts at the department. The shipped static resources are:
The pattern lexicons are:
The first three pattern resources (cognates.bilex, pos-equals.bilex and func-equals.bilex) differ from the others because they need an internal relation between the source and target pattern. Currently the solution is hardcoded in I*Link but triggered and controlled by the actual resource. Example (cognates.bilex):
where <COGNATE> triggers the cognate function within I*Link. The first "4" makes I*Link look for compound cognates with at most four tokens involved. The second "4" tells I*Link to generate only translation candidates that consist of four characters or more. The cognate function uses the Levenshtein distance between strings to decide what is a cognate. In the example the distance threshold is set to "1".

The remaining pattern resources are true regular expressions (Perl 5.1 notation style) formatted as I*Link bilexical resource files. In contrast to the cognate pattern resources, these resources only identify patterns within the source or the target. For example, in numbers.bilex
the pattern identifies numbers like "1", "10", "10.02", "10,02", "1000 000" in both source and target texts. There is no relation, though, between the source and the target pattern, which means that I*Link will generate a translation candidate between a source text token "1" and a target text token "10.02" if numbers.bilex is used. If you want an internal relation within a pattern, you can use the ordinary back reference (\group-id) in Perl 5.1. For example,
will generate translation candidates like "this is an example of a source text within quotes" related to "this is an example of a target text within quotes".

Note that dynamic and static lexicons depend on both languages in the bitext and on the direction of the alignment process, and are therefore not reusable in other configurations. A pattern lexicon, on the other hand, may be reused. The numbers.bilex resource is probably reusable because many languages represent numbers in the same way. However, the cognates.bilex pattern will not be productive if you try to align English texts with Russian texts, because the languages use different alphabets.

BUNDLED LIBRARIES
I*Link uses several libraries, some developed at NLPLAB (internal) and some free libraries found on the Internet (external). The internal libraries are shipped as part of I*Link and therefore share the same license. Each external library has its own license.

I*Link uses the JAXP v1.1 package (The Java API for XML Processing, by Sun Microsystems, Inc., http://java.sun.com/xml/jaxp/) for its capabilities in handling XML-encoded documents through DOM (Document Object Model) and the SAX (Simple API for XML) parser. The package contains different subpackages named javax.xml, org.w3c.dom, org.apache, org.jdom and org.xml. The software in the packages javax.xml.parsers and javax.xml.transform is covered by the JAXP Reference Implementation License. The software under the package hierarchies beginning with org.w3c.dom is covered by the W3C Software License. All of the remaining software in this distribution is covered by the Apache Software License.

The pattern recognition engine in I*Link is managed by a regular expression package called RegExp v1.1.2 (http://jakarta.apache.org/regexp/). The package is covered by Apache Software Foundation - GNU Public Licence.

Finally, the look and feel of I*Link has been improved by the Kunststoff package (http://incors.org/).
The package is covered by the GNU Lesser General Public Licence (LGPL).

REFERENCES
Lars Ahrenberg, Mikael Andersson & Magnus Merkel (2000). A knowledge-lite approach to word alignment. In J. Véronis (ed.), Parallel Text Processing: Alignment and Use of Parallel Corpora, pp. 97-116. Dordrecht: Kluwer.

Lars Ahrenberg, Magnus Merkel & Mikael Andersson (2002). A system for incremental and interactive word linking. In Proceedings from The Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, pp. 485-490.

Michael Petterstedt (forthc.). Interaktiv länkning i bitexter - I*Link. Master's Thesis, Department of Computer and Information Science, Linköping University.

Natural Language Processing Laboratory (NLPLAB)
ilink@ida.liu.se

Sun and Java are trademarks of Sun Microsystems, Inc. Microsoft Windows 98, Microsoft Windows NT4 and Microsoft Windows 2000 are trademarks of Microsoft Corporation. Red Hat is a trademark of Red Hat, Inc. Mac OS X, Macintosh and iBook are trademarks of Apple Computer, Inc.
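The cognate test and the regular-expression patterns described under Lexicons can be illustrated with a short Python sketch. It is not I*Link's code: the function names, the assumption that the four-character minimum applies to both tokens, and the two regular expressions are hypothetical stand-ins (the actual patterns in numbers.bilex and the other resources are not reproduced in this readme). The sketch only shows the idea of a Levenshtein-based cognate check with a distance threshold of 1, a number pattern applied independently to source and target tokens, and a back reference that ties the closing quote to the opening one.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_cognate(source: str, target: str,
               min_chars: int = 4, max_distance: int = 1) -> bool:
    """Hypothetical cognate test using the example settings (4 characters, distance 1).

    Assumption: the minimum length is required of both tokens.
    """
    if len(source) < min_chars or len(target) < min_chars:
        return False
    return levenshtein(source.lower(), target.lower()) <= max_distance


# Assumed stand-in for the pattern in numbers.bilex (the real one is not shown here).
NUMBER = re.compile(r"\d+(?: \d{3})*(?:[.,]\d+)?")

# Assumed back-reference pattern: \1 forces the closing quote to equal the opening one.
QUOTED = re.compile(r"""(["'])(.+?)\1""")


def is_number(token: str) -> bool:
    """True if the whole token matches the assumed number pattern."""
    return NUMBER.fullmatch(token) is not None
```

With these definitions, is_cognate("family", "familj") holds because the two strings differ by a single substitution. Both "1" and "10.02" match the number pattern on their own, which is why an unrelated pair like that can surface as a translation candidate when source and target patterns are not related to each other.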