Multi-lingual term extraction and term structuring
Department of Computer and Information Science
Linköpings universitet, S-581 83 Linköping, SWEDEN
Tel. +46 13 28 19 64. Fax: +46 13 28 44 99
1. Specific goals
The goal of this project is to develop new methods and systems for term extraction and term structuring from multi-lingual document collections. The work of such a system can be divided into the following four major tasks: (1) Term recognition, (2) Term alignment, (3) Generation of concepts, i.e., sets of synonymous terms, (4) Recognition of semantic relations between concepts, in particular hyponymy and co-hyponymy.
Current methods, including our own, usually combine linguistic and statistical data in some way. In recent years, matrix/vector-space methods have been used extensively in document retrieval applications, and it has been demonstrated that they can improve the performance of such systems considerably. Graph-based methods have also been used successfully in language processing. We believe that the use of such numerical methods together with refined linguistic techniques may lead to similar improvements in other areas of computational linguistics. One of the aims of this proposal is to explore the potential benefits of this combination.
While the creation of multi-lingual data is important in itself, the project will test the hypothesis that multi-lingual parallel data actually offer an advantage over mono-lingual term extraction and structuring. Several works in related areas such as word alignment and word sense disambiguation give support for this hypothesis. Thus, we will also evaluate the multi-lingual methods developed in the project on mono-lingual text collections and compare their performance with state-of-the-art methods for mono-lingual data.
2. The area
Electronic texts are multiplying, on the internet as well as within companies and organizations, where they constitute a potential asset. These text masses contain not only factual information, i.e. specific pieces of information expressed in the texts; they also convey knowledge about their domains that is not expressed explicitly, namely information on domain-related terminology and concept structure.
Harvesting this knowledge is a basic goal of terminology engineering, and related areas such as ontology engineering, word sense disambiguation, information extraction, machine translation, word alignment, and semantic classification.
One strand in terminology engineering involves exploiting external semantic resources such as domain dictionaries, taxonomies or ontologies as reference data, using these resources to determine whether extracted terms share semantic relations. For instance, one could measure the similarity of two words based on words shared by their dictionary definitions. Alternatively, as in this project, the methods are primarily data-driven and bottom-up, based on the empirical evidence contained in real-world texts. The bottom-up strand can be divided into two main approaches:
1. Statistical approaches. A corpus-based approach in which termhood and semantic relatedness are treated as distributional problems. The assumption is that the more similar the distributions of two units are, the more likely it is that the units share common features, such as being synonyms or belonging to the same semantic field.
2. Linguistic approaches. Evidence for semantic relationships is found in the morphological structures of words or word groups, or through satisfaction of lexico-syntactic patterns/relational markers (e.g. Hearst 1992, Malaisé et al. 2007).
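The distributional assumption behind the statistical approach can be illustrated with a small sketch; the toy corpus, window size and word choices below are invented purely for illustration:

```python
from collections import Counter
from math import sqrt

def context_vector(target, tokens, window=2):
    """Count co-occurring words within +/-window tokens of each
    occurrence of `target` (a toy bag-of-words distributional profile)."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokens = ("the benign tumour was removed , the malignant tumour grew , "
          "the benign growth was removed , the malignant growth grew").split()
# Words appearing in similar contexts receive a high score.
print(cosine(context_vector("tumour", tokens), context_vector("growth", tokens)))
```

Real systems replace raw co-occurrence counts with association measures and much larger windows or syntactic contexts, but the decision criterion stays the same: distributional similarity as a proxy for semantic relatedness.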
Within terminology engineering and empirical knowledge representation the focus is shifting from pure term extraction to term structuring. Efforts need to be “directed towards automatic detection of semantically-related terms” from texts (Cabré Castellví et al. 2007, p. 3). There is an obvious need for language resources that contain semantics and structure, for tasks such as translation, information retrieval, question-answering and text mining. According to Cabré Castellví et al., several issues need to be resolved in order to arrive at empirically based semantic resources, among them:
1. Terms need to be identified in the texts, and bad term candidates must be filtered away.
2. Terms need to be clustered into synonym sets (synsets) that form concepts.
3. Semantic relations between concepts need to be identified.
Term extraction is mainly performed on monolingual texts, but the challenge still lies in separating standard language from domain-specific terminology. There are both statistical and linguistic approaches to term candidate identification. In an evaluation of terminology extractors done in 2001, the conclusion was that more effort has to be put into actively combining statistical and linguistic methods (Cabré Castellví et al. 2001).
Furthermore, clustering terms into synonym sets that form concepts is an essential step and has been pursued for example by analyzing dictionaries and their definitions (Blondel & Senellart 2002) and by working from monolingual texts (Grefenstette 1994, Hamon & Nazarenko 2001).
Semantic relations have been detected in ontology-building endeavours using linguistic patterns (e.g. Malaisé et al. 2007). Gillam et al. (2007) use statistical methods to form an initial conceptual hierarchy, which is then subsequently populated through linguistic analysis. Grabar & Zweigenbaum (2002) utilize lexical induction to relate terms hierarchically by performing inclusion tests with morphological normalizations. For example, the term “tumour” can be seen as a superordinate term to “benign tumour”.
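The lexical-inclusion idea of Grabar & Zweigenbaum can be sketched roughly as follows; the normalization here is a deliberately crude stand-in for real morphological analysis, and the function name is our own:

```python
def normalize(term):
    """Toy morphological normalization: lowercase the term and strip a
    trailing plural 's' from each word. (A real system would use proper
    lemmatization or stemming.)"""
    return frozenset(w.rstrip("s") for w in term.lower().split())

def is_hypernym_candidate(general, specific):
    """Lexical-inclusion test: `general` is a candidate superordinate of
    `specific` if its normalized word set is a proper subset of the
    specific term's word set, e.g. 'tumour' vs 'benign tumour'."""
    return normalize(general) < normalize(specific)  # proper subset

print(is_hypernym_candidate("tumour", "benign tumour"))
print(is_hypernym_candidate("benign tumour", "tumour"))
```

The test is asymmetric by construction, which is what lets it propose a direction for the hierarchical relation rather than mere relatedness.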
While most work is being done on monolingual corpora, there is interesting related work on multi-lingual parallel corpora, for example within the field of word sense disambiguation. Word sense disambiguation using parallel corpora has been proposed and evaluated by e.g. Resnik & Yarowsky (1997) and Ide et al. (2002).
Multilingual parallel corpora contain more solutions to semantic problems than monolingual texts, because those problems have already been resolved during translation. Translators are confronted with synonymy, lexical and structural ambiguity, vagueness, terminological consistency, and so on; they have had to resolve these challenges, and their solutions are present in the relationship between the source and target texts.
Parallel corpora and word alignment have provided another method to derive semantic resources. Helge Dyvik (2002) has suggested that it is possible to use semantic mirroring and parallel corpora to derive semantic relations similar to those provided in WordNet, e.g. synonymy, hyponymy and partitions of semantic fields. Dyvik’s hypothesis is that semantically related words should exhibit significantly overlapping translations, and that words that are more general (“semantically wide”) should have a larger number of translations than words with very specific meanings. Dyvik’s work has focused on sentence-aligned parallel corpora and manual identification of word correspondences for the words under scrutiny.
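The core mirroring operation can be sketched with a toy translation lexicon; the English-Swedish entries below are invented, and Dyvik's full method involves further steps (sense partitioning, lattice construction) that are omitted here:

```python
from collections import defaultdict

# Toy alignment lexicon (invented data): English -> Swedish translations.
en_sv = {
    "error":   {"fel", "misstag"},
    "mistake": {"misstag", "fel"},
    "defect":  {"fel", "defekt"},
    "speech":  {"tal"},
}

# Invert the lexicon to get the Swedish -> English direction.
sv_en = defaultdict(set)
for en, svs in en_sv.items():
    for sv in svs:
        sv_en[sv].add(en)

def inverse_t_image(word):
    """Dyvik-style inverse t-image: translate `word`, then translate each
    translation back; the union is the word's 'mirror' in its own language."""
    image = set()
    for t in en_sv.get(word, ()):
        image |= sv_en[t]
    return image

def mirror_overlap(w1, w2):
    """Synonymy evidence: Jaccard overlap of the two inverse t-images."""
    a, b = inverse_t_image(w1), inverse_t_image(w2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(mirror_overlap("error", "mistake"))  # overlapping mirrors: synonym evidence
print(mirror_overlap("error", "speech"))   # disjoint mirrors: no evidence
```

Here the lexicon is hand-made; in the project, the corresponding input would be the candidate translation pairs produced by word alignment.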
3. Project description
The project will compare and integrate computational methods for the following tasks: (1) mono-lingual term recognition and extraction, (2) term alignment on parallel multi-lingual documents, (3) generation of sets of synonymous terms (‘synsets’) and concepts based on term identification and term alignment, (4) generation of semantic relations such as hyponymy (is-a) and co-hyponymy.
By combining methods for these primitive tasks, different systems can be created that, given a multi-lingual document collection, will generate sets of concepts that can be organized hierarchically and provide a partial ontology for the domain of the document collection. Since the concepts will have pointers to words, the result can also be organized as a bilingual dictionary or thesaurus.
There are, however, many different ways of building a combined system, e.g. as regards the order and integration of mono-lingual and multi-lingual processing, and as regards the use and integration of different approaches. The project will explore a large number of these possibilities. In the following, we first describe our current system and then present the enhancements and alternatives to be investigated in the project.
3.1 Current process and tools
A given bilingual parallel corpus is first sentence-aligned using the GMA system (Melamed, 2001) and both halves of the corpus are parsed with the Machinese Syntax dependency parsers (Tapanainen & Järvinen, 1997). After post-processing and re-formatting, a clue-based word alignment system is applied (Tiedemann, 2003; Merkel et al., 2003; Foo & Merkel, 2007), where information from various sources such as general dictionaries, validated alignments, statistical measures of co-occurrence and word translation probabilities computed with Giza++ (Och & Ney 2003) are combined to generate candidate translation pairs.
Single word alignments are extended to alignment of multi-word units at the token level via phrase formation and linguistic filters. The filters are absolute, i.e., a unit that does not meet the specified constraints is not considered. Pairs of term candidates are generated based on linguistic filters that model “termhood” (cf. Frantzi et al., 2000).
The term candidates are then ranked on the basis of a simple measure, which we call Q-value and which is defined as follows:

Q-value = TPF / (TpS + SpT)

Here, TPF is the type pair frequency, TpS is the number of distinct target types per source type, and SpT is the number of distinct source types per target type (Merkel & Foo, 2007). The idea is that a high-frequency pair whose source and target terms are rarely used in other term pairs is likely to be a high-quality pair. The performance of the Q-value has been compared with the Dice coefficient and type frequency for the task of selecting good candidates, and we have shown that the Q-value outperforms both alternatives (Merkel & Foo 2007).
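Assuming the Q-value takes the simple form TPF / (TpS + SpT) suggested by this description, the ranking can be sketched as follows; the pair counts are invented and this is not the authors' actual implementation:

```python
from collections import Counter, defaultdict

def q_values(pair_counts):
    """Rank term-pair candidates by Q-value = TPF / (TpS + SpT):
    TPF = frequency of the (source, target) pair,
    TpS = number of distinct targets seen for the source,
    SpT = number of distinct sources seen for the target.
    High TPF with few alternative pairings gives a high score."""
    targets_per_source = defaultdict(set)
    sources_per_target = defaultdict(set)
    for s, t in pair_counts:
        targets_per_source[s].add(t)
        sources_per_target[t].add(s)
    return {
        (s, t): tpf / (len(targets_per_source[s]) + len(sources_per_target[t]))
        for (s, t), tpf in pair_counts.items()
    }

# Toy aligned term-pair counts (invented data).
pairs = Counter({
    ("printer", "skrivare"): 40,   # frequent, consistent pair
    ("printer", "printer"): 2,     # rare alternative rendering
    ("driver", "drivrutin"): 10,
})
q = q_values(pairs)
best = max(q, key=q.get)
print(best, round(q[best], 2))
```

Note how the rare alternative ("printer", "printer") is penalized twice: it has a low pair frequency and it inflates the TpS count of its source term.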
The term candidates can then be grouped, e.g. on the basis of source language lemmas. When there are different translations for a given lemma, the group can be extended with all candidate pairs that share a translation with the given lemma. These groups constitute the basis for the formation of synonym sets.
The first extension concerns more refined methods for the generation of synonym sets, or term clustering. Several alternatives will be tested.
Semantic Mirroring (Dyvik, 2002) has been applied both to finding multi-lingual synonyms in parallel corpora and to finding synonyms in dictionaries (Andersson 2004). We will investigate variations of the method and apply it to the list of candidate pairs generated from alignment.
Another alternative is given by the algebraic/graph-based methods (Blondel et al., 2004). Here synonyms are extracted by computing the similarity between one graph representing a partial dictionary and another graph representing the "synonym property" of the words. For large graphs (dictionaries) it is computationally demanding to compute the similarity measure, since a large eigenvalue problem must be solved. We will develop efficient algorithms for this problem, based on the vast body of knowledge in numerical linear algebra for handling and extracting information from large sparse matrices/graphs and the preliminary work reported below.
Vector space methods can also be used to generate candidate pairs directly. When a candidate pair is considered for inclusion in a synonym set, Q-values and vector distances may then be combined in making decisions.
In the vector-space model, the term-document matrix can be considered as a representation of a bi-partite graph, where the two sets of nodes are terms and sentences, respectively. Thus two parallel texts, each represented by its graph, can be compared using the measure of graph similarity developed by Blondel et al. (2004), which is computed by solving a certain large eigenvalue problem. This approach was also studied by Törnfeldt (2008), and we will develop it further.
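A minimal dense version of the Blondel et al. similarity iteration might look like this (toy graphs only; dictionary-scale graphs require the sparse techniques mentioned above):

```python
import numpy as np

def blondel_similarity(A, B, iters=100):
    """Node-similarity between two directed graphs in the style of
    Blondel et al. (2004): iterate S <- (B S A^T + B^T S A) / ||.||_F
    from an all-ones start. Entry S[i, j] scores how similar node i of
    graph B is to node j of graph A. An even iteration count is used,
    since the even and odd subsequences converge separately."""
    S = np.ones((B.shape[0], A.shape[0]))
    for _ in range(iters):
        S = B @ S @ A.T + B.T @ S @ A
        S /= np.linalg.norm(S)
    return S

# Toy example: the directed path graph 1 -> 2 -> 3 compared with itself;
# the middle node should come out as most similar to itself.
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])
S = blondel_similarity(A, A)
print(np.round(S, 3))
```

The dense iteration above costs two matrix products per step; for the large sparse graphs arising from real dictionaries one would instead use sparse matrix-vector products and the eigensolver machinery the proposal refers to.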
When algebraic/graph-based methods have been applied to problems in information retrieval and text mining, the preprocessing has mostly been rather standard (stemming, stop word removal). We will investigate how a combination of linguistic techniques and algebraic methods can be used to further improve the performance of the algebraic/graph-based methods.
The final extension concerns the recognition of semantic relations. This stage might be integrated with the previous one as far as processing goes, but is notionally distinct. The candidate synonym sets may often include terms that stand in a hyponymy relation rather than being synonymous. We will explore different methods to find these relations:
Lexico-syntactic patterns (Hearst 1992) are primarily focused on monolingual data, but could be extended to operate on multilingual texts as well. Semantic mirroring techniques involve set theory and operations on sets of synonym clusters to create hierarchical representations. Lexically induced methods, such as that of Grabar & Zweigenbaum (2002), will also be applied both mono-lingually and on parallel data. Dyvik’s (2002) hypothesis for determining semantic relations rests primarily on the observation that more general (“semantically wide”) terms should have a larger number of translations than words with very specific meanings. This hypothesis will be tested, and algebraic structuring techniques alternative to Dyvik’s algorithm will be implemented and evaluated.
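As a rough illustration, a few Hearst-style patterns matched over raw strings; a real implementation would operate on tagged and parsed text, and the pattern set here is a small invented sample:

```python
import re

# Classic Hearst (1992) lexico-syntactic patterns, expressed as regexes.
# Each match yields a (hypernym, hyponym) candidate pair.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) such as (\w[\w ]*)"), "such_as"),
    (re.compile(r"(\w[\w ]*?),? including (\w[\w ]*)"), "including"),
    (re.compile(r"(\w[\w ]*) and other (\w[\w ]*)"), "and_other"),
]

def extract_hyponyms(sentence):
    pairs = []
    for pattern, name in PATTERNS:
        for m in pattern.finditer(sentence):
            if name == "and_other":   # hypernym comes second in this pattern
                pairs.append((m.group(2).strip(), m.group(1).strip()))
            else:
                pairs.append((m.group(1).strip(), m.group(2).strip()))
    return pairs

print(extract_hyponyms("malignancies such as carcinoma"))
print(extract_hyponyms("carcinoma and other malignancies"))
```

Such pattern hits are sparse but precise, which is why they complement rather than replace the distributional evidence.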
There are two types of comparisons that will be performed in the project. On the one hand, we will compare different methods for the same subtasks, for example our linguistic-statistical model for word alignment against algebraic methods; the two can also be combined within the system. On the other hand, we will compare our type of system, which exploits multi-lingual data for the task of recognising terms and term structures for one language, with systems that primarily work on mono-lingual data (Grefenstette, 1994; Frantzi & Ananiadou, 2000), thus testing the hypothesis that “two languages are better than one” even for a mono-lingual task.
Monolingual termhood recognition is based on co-occurrence statistics, entropies and linguistic filtering using part-of-speech tagging, recognition of common part-of-speech patterns and word-based filtering. This may be extended with comparisons of frequencies within the collection against frequencies in other collections, including a balanced corpus for the language such as the British National Corpus (BNC). Once the term candidates have been derived, they can be clustered into sets based on distributional and linguistic criteria as described in section 2.
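One simple instance of such a frequency comparison is a relative-frequency ("weirdness") score against a reference corpus; the sketch below uses invented counts and is offered as an illustration, not as the project's actual measure:

```python
from math import log

def weirdness(term, domain_freq, domain_size, ref_freq, ref_size):
    """Log-ratio of a word's relative frequency in the domain collection
    to its relative frequency in a balanced reference corpus (e.g. the
    BNC). Add-one smoothing keeps words unseen in the reference finite.
    Large positive values indicate term-like, domain-specific words."""
    domain_rel = domain_freq / domain_size
    ref_rel = (ref_freq + 1) / (ref_size + 1)
    return log(domain_rel / ref_rel)

# Invented counts: a domain-specific word vs. the function word "the".
print(weirdness("hypervisor", domain_freq=50, domain_size=10_000,
                ref_freq=0, ref_size=100_000_000))
print(weirdness("the", domain_freq=600, domain_size=10_000,
                ref_freq=6_000_000, ref_size=100_000_000))
```

A threshold on this score is one cheap way to separate standard language from domain terminology before the more expensive clustering steps.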
We will also investigate whether filtering-before-alignment shows the same performance as alignment-before-filtering. The former kind of system is often referred to as translation spotting: terms are first proposed on mono-lingual evidence only, and then aligned based on co-occurrence and external resources such as bilingual dictionaries.
There are two kinds of evaluation that will be used in the project. Some of our subtasks, such as term recognition and word alignment, can be evaluated on gold standards, i.e., lists of correct terms (or alignments), and performance can be measured by precision and recall. Our group has previously created gold standard resources and has several available for testing. Others are available from alignment campaigns (e.g. Mihalcea and Pedersen, 2003).
The second kind of evaluation is a-posteriori evaluation of output from the systems by human experts. This is the way we will evaluate synonym sets and semantic relations, as creating gold standards for these purposes from a large document collection will simply be too costly for the project. Thus, external human evaluators with appropriate domain knowledge will be asked to judge the accuracy of synsets and semantic relations proposed by the systems. This means that evaluation will focus on precision. As the systems will assign a measure of certitude to each candidate, precision can be measured at different levels of certitude.
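Precision at a certitude threshold is straightforward to compute once the expert judgments are in; in the sketch below, `judged` stands for hypothetical (certitude, judged-correct) pairs from such an a-posteriori evaluation:

```python
def precision_at(candidates, threshold):
    """Precision over the candidates whose system certitude is at least
    `threshold`. `candidates` is a list of (certitude, judged_correct)
    pairs; returns None when no candidate clears the threshold."""
    kept = [ok for score, ok in candidates if score >= threshold]
    return sum(kept) / len(kept) if kept else None

# Invented human judgments of proposed synonym pairs.
judged = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
print(precision_at(judged, 0.8))
print(precision_at(judged, 0.5))
```

Sweeping the threshold over the certitude range yields the precision-at-certitude curves described above.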
Resources such as WordNet have been used for evaluation of similar tasks. However, WordNet does not contain the majority of terms that can be found in our data and is thus not suitable for the task.
For development and evaluation we will use a Microsoft online Help corpus available for research at our department, patent descriptions from The Swedish Patent Office (PRV) and the English and Swedish sections of the freely available JRC-Acquis multilingual parallel corpus (Steinberger et al., 2006). If available, we will also use data from international term extraction campaigns.
3.5 Year plan
· Building the infrastructure, tools and data, for sub-task evaluation.
· Improvement and evaluation of term filtering techniques.
· Improvement and evaluation of multilingual term clustering methods.
· Development and adaptation of software for linear algebra algorithms applied to problems in linguistics, based on state-of-the-art numerical program libraries.
· Building the system for mono-lingual term extraction and clustering.
· Further improvement and variation of multi-lingual term extraction.
· Evaluation of monolingual vs. multilingual term extraction.
· Integration of linguistic and linear algebra software.
· Further integration and evaluation of term structuring software.
· Evaluation of extracted semantic relations.
· Final evaluation of monolingual vs. multilingual approaches to term extraction, term clustering and semantic relations.
· Results on the implications of combining linguistic and algebraic methods.
4. Preliminary results
So far, the project group has produced a tool suite in which multilingual text collections can be taken as input and aligned at the word and phrase level (Ahrenberg et al. 2003). Preliminary results have also been obtained on filtering out non-terms and on using a simple variant of semantic mirroring (Foo & Merkel 2008). In preliminary experiments, where semantic mirroring is applied to aligned term candidates with the Q-value as a filter, we have found that sets of terms belonging to the same concept can be generated with promising results. The Q-value has been tested and shown to outperform the Dice coefficient and pure frequency as a measure for ranking term candidate pairs by quality (Merkel & Foo 2007). Currently, we are testing the Q-value against other ranking measures such as log-likelihood, mutual information and t-score.
Vector space methods for term alignment in parallel texts are based on the creation of a common approximate basis (in a linear algebra sense) for the two texts (Sahlgren and Karlgren, 2005). Their method is based on a particular probabilistic method for computing the basis. Recently, vector space methods for term alignment have been studied in a master's thesis (Törnfeldt, 2008), where a few different data compression approaches for "noise reduction" are used, based on singular value decomposition of the term-document matrix. Very promising results are reported, with 80% precision compared with the 60% of Sahlgren and Karlgren (2005).
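The SVD-based "noise reduction" step can be sketched generically; this is an LSA-style rank-k truncation on an invented matrix, not Törnfeldt's exact setup:

```python
import numpy as np

def truncated_svd_denoise(X, k):
    """Rank-k approximation of a term-document matrix via the SVD:
    keep only the k largest singular triplets before comparing term
    vectors, discarding the 'noise' directions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

# Invented 4-term x 3-document count matrix.
X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 2., 1.],
              [1., 1., 1.]])
X2 = truncated_svd_denoise(X, k=2)
# The rank-2 approximation stays close to X while smoothing the counts.
print(np.round(X2, 2))
```

By the Eckart-Young theorem this truncation is the best rank-k approximation in the Frobenius norm, which is what justifies calling the discarded directions noise.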
Term extraction, mono-lingual and multi-lingual synonym generation and concept formation are important subtasks of the general problem of learning ontologies from text. While there exist several methods and systems today that produce useful results, there is still need for improvement, in particular as regards filtering out bad term candidates, and producing accurate semantic relations. In this project we will tackle these problems from a multi-lingual perspective, combining and integrating state-of-the-art techniques in new ways, and test methods such as semantic mirroring that have not been applied to terminology engineering before.
The results of the project are also of great practical utility. Companies and organizations with international clients and customers are well aware of the importance of accurate terminologies and term use for efficient communication internally and externally. However, often only a fraction of their terminologies is coded and stored centrally, while the rest is more or less implicit in their various documents. By harvesting the terms and their relations from these document collections, the terminologies can be brought into better order. This of course requires human reviewing of candidate terms and concepts which again underlines the need for high precision in generated output.
References

Ahrenberg, L., Merkel, M., & Petterstedt, M. (2003). Interactive word alignment for language engineering. Proceedings of EACL-2003, Budapest.
Andersson, S. (2004) Semantisk spegling. En implementation för att synliggöra semantiska relationer i tvåspråkiga data. Master’s Thesis, Linköpings universitet, LIU-KOGVET-D--04/01--SE.
Berry, M. & Browne, M. (2005). Email Surveillance Using Non-negative Matrix Factorization. Computational & Mathematical Organization Theory, 11, 249-264.
Blondel, V. D., Gajardo, A., Heymans, M., Senellart, P. & Van Dooren, P. (2004). A Measure of Similarity between Graph Vertices: Applications to Synonym Extraction and Web Searching. SIAM Review, 46(4), 647-666.
Cabré Castellví, M.T., Estopà Bagot, R., Vivaldi Palatresi, J. (2001). Automatic Term Detection: A Review of Current Systems. In Recent Advances in Computational Terminology (eds. Bourigault, D., Jacquemin, C., L'Homme, M.-C.). John Benjamins Publishing Company.
Cabré Castellví, M.T., Condamines, A., Ibekwe-SanJuan, F. (2007). Introduction: Application-driven terminology engineering. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., Cabré Castellví, M.T.). John Benjamins Publishing Company.
Dyvik, H. (2002). Translations as semantic mirrors: from parallel corpora to wordnet. 23rd International Conference on English Language Research on Computerized Corpora of Modern and Medieval English (ICAME 23). Gothenburg.
Foo, J. & Merkel, M. (2008). Building standardized term bases through automated term extraction and advanced editing tools. In Proceedings of the International Conference on Terminology, November 16-17, 2006.
Frantzi, K., Ananiadou, S., and Mima, H. (2000). Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. International Journal on Digital Libraries 3(2), pp. 115-130.
Gillam, L., Tariq, M., Ahmad, K. (2007). Terminology and the construction of ontology. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., Cabré Castellví, M.T.). John Benjamins Publishing Company.
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.
Hamon, T. & Nazarenko, A. (2001). Detection of synonymy links between terms: Experiment and results. In Recent Advances in Computational Terminology (eds. Bourigault, D., Jacquemin, C., L'Homme, M.-C.). John Benjamins Publishing Company.
Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING 1992).
Ide, N. Erjavec, T. Tufis, D. (2002). Sense Discrimination with Parallel Corpora. In Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions: July 2002, pp. 54-60.
Malaisé, V., Zweigenbaum, P., Bachimont, B. (2007). Mining defining contexts to help structuring differential ontologies. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., Cabré Castellví, M.T.). John Benjamins Publishing Company.
Melamed, I.D. (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press.
Merkel, M., & Foo, J. (2007). Terminology extraction and term ranking for standardizing term banks. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA).
Merkel, M., Petterstedt, M., & Ahrenberg, L. (2003). Interactive Word Alignment for Corpus Linguistics. Proceedings of the International Conference on Corpus Linguistics.
Mihalcea, R. and Pedersen, T. (2003). An Evaluation Exercise for Word Alignment. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 1-6.
Och, F. & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), pp. 19-51.
Resnik, P. and Yarowsky, D. (1997). A perspective on word sense disambiguation methods and their evaluation. ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?
Sahlgren, M. & Karlgren, J. (2005): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Parallel Corpora. Journal of Natural Language Engineering, Special Issue on Parallel Texts, 11(3) September.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006).
Tapanainen, P., & Järvinen, T. (1997). A non-projective dependency parser. Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP'97), (pp. 64-71).
Tiedemann, J. (2003). Combining clues for word alignment. Proceedings of the 10th Conference of the EACL.
Törnfeldt, T. (2008). Graph Similarity, Parallel Texts, and Automated Bilingual Lexicon Acquisition. Master's thesis, Linköping Institute of Technology, Department of Mathematics. LiTH-MAT-Ex-08/08-SE.
Page responsible: Magnus Merkel
Last updated: 2010-09-29