Multi-lingual term extraction and term structuring
Magnus Merkel
Department of Computer and Information Science
Linköpings universitet, S-581 83 Linköping, SWEDEN
Tel. +46 13 28 19 64.
Fax: +46 13 28 44 99
magme@ida.liu.se
1. Specific goals
The goal of this project is to develop
new methods and systems for term extraction and term structuring from
multi-lingual document collections. The work of such a system can be divided
into the following four major tasks: (1) Term recognition, (2) Term alignment,
(3) Generation of concepts, i.e., sets of synonymous terms, (4) Recognition of
semantic relations between concepts, in particular hyponymy and co-hyponymy.
Current methods, including our own,
usually combine linguistic and statistical data in some way. In recent years
matrix/vector-space methods have been extensively used in several document
retrieval applications, and it has been demonstrated that they can improve the
performance of such systems considerably. Also graph-based methods have been
successfully used in language processing. We believe that the use of such
numerical methods together with refined linguistic techniques may lead to
similar improvements in other areas of computational linguistics. One of the
aims of this proposal is to explore the potential benefits of this combination.
While the creation of multi-lingual data
is important in itself, the project will test the hypothesis that multi-lingual
parallel data actually offer an advantage over mono-lingual term extraction and
structuring. Several works in related areas such as word alignment and word
sense disambiguation give support for this hypothesis. Thus, we will also
evaluate the multi-lingual methods developed in the project on mono-lingual text
collections and compare their performance with state-of-the-art methods for
mono-lingual data.
2. The area
Electronic texts are multiplying on the internet and within companies and organizations, where they constitute a potential asset. These text masses contain not only factual information, i.e.
specific pieces of information expressed in language in the texts, but they also
convey knowledge about the domains that is not expressed explicitly, namely
information on domain-related terminology and concept structure.
Harvesting this knowledge is a basic goal of terminology engineering and of related areas such as ontology engineering, word sense disambiguation, information extraction, machine translation, word alignment, and semantic classification.
One strand in terminology engineering involves exploiting external semantic resources such as domain dictionaries, taxonomies or ontologies as reference data, using these resources to determine whether extracted terms share semantic relations. For instance, one could measure the similarity of two words based on the words shared by their dictionary definitions. Alternatively, as in this project, the methods are primarily data-driven and bottom-up, based on the empirical data contained in real-world texts. This bottom-up strand can be divided into two main approaches:
1. Statistical approaches. A corpus-based approach in which termhood or semantic relatedness is treated as a distributional problem: the more similar the distributions of two units, the more likely the units are to share common features, such as being synonyms or belonging to the same semantic field.
2. Linguistic approaches. Evidence for semantic relationships is found in the morphological structures of words or word groups, or through satisfaction of lexico-syntactic patterns/relational markers (e.g. Hearst 1992, Malaisé et al. 2007).
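The distributional assumption behind the statistical approaches can be illustrated with a small sketch: build co-occurrence vectors for two words and compare them with cosine similarity. The corpus, window size and whitespace tokenization below are purely illustrative.

```python
import math
from collections import Counter

def cooc_vector(word, sentences, window=2):
    """Co-occurrence counts for 'word' within a +/-window of tokens."""
    v = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word:
                for c in toks[max(0, i - window):i + window + 1]:
                    if c != word:
                        v[c] += 1
    return v

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    num = sum(u[w] * v[w] for w in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

sents = ["the benign tumour grew", "the malignant tumour grew"]
sim = cosine(cooc_vector("benign", sents), cooc_vector("malignant", sents))
# identical contexts give a similarity of 1.0 (up to rounding)
```

Two units that occur in identical contexts receive the maximal score, which under the distributional assumption suggests they belong to the same semantic field.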
Within terminology engineering and
empirical knowledge representation the focus is shifting from pure term
extraction to term structuring. The efforts need to be “directed towards
automatic detection of semantically-related terms” from texts (Cabré Castellví
et al. 2007, p. 3). There is an obvious need for language resources that contain
semantics and structure, for tasks such as translation, information retrieval,
question-answering and text mining. According to Cabré Castellví et al., several issues need to be resolved in order to arrive at empirically based semantic resources, among them:
1. Terms need to be identified in the texts, and bad term candidates must be filtered away.
2. Terms must be clustered into synonym sets (synsets) that form concepts.
3. Semantic relations between concepts must be identified.
Term extraction is mainly performed on
monolingual texts, but the challenge still lies in separating standard language
from domain-specific terminology. There are both statistical and linguistic
approaches to term candidate identification. In an evaluation of terminology
extractors done in 2001, the conclusion was that more effort has to be put into
actively combining statistical and linguistic methods (Cabré Castellví et al.
2001).
Furthermore, clustering terms into
synonym sets that form concepts is an essential step and has been pursued for
example by analyzing dictionaries and their definitions (Blondel & Senellart 2002) and from monolingual texts (Grefenstette 1994, Hamon & Nazarenko 2001).
Semantic relations have been detected in
ontology-building endeavours using linguistic patterns (e.g. Malaisé et al.
2007). Gillam et al. (2007) use statistical methods to form an initial
conceptual hierarchy, which is then subsequently populated through linguistic
analysis. Grabar & Zweigenbaum (2002) utilize lexical induction to relate terms
hierarchically by performing inclusion tests with morphological normalizations.
For example, the term “tumour” can be seen as a superordinate term to “benign
tumour”.
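The inclusion test of Grabar & Zweigenbaum can be sketched as a simple lexical-subset check; the normalization below is plain lowercasing, whereas their method uses fuller morphological normalization.

```python
def is_superordinate(general, specific, normalize=str.lower):
    """Lexical-inclusion test: 'general' is treated as a hypernym
    candidate of 'specific' if all of its (normalized) tokens occur
    in 'specific' and 'specific' contains additional tokens."""
    g = {normalize(t) for t in general.split()}
    s = {normalize(t) for t in specific.split()}
    return g < s  # proper subset

assert is_superordinate("tumour", "benign tumour")
assert not is_superordinate("benign tumour", "tumour")
```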
While most work is being done on monolingual corpora, there is interesting related work on multi-lingual parallel corpora, for example within the field of word sense disambiguation.
Word sense disambiguation using parallel corpora has been proposed and evaluated
by e.g. Resnik & Yarowsky (1997) and Ide et al. (2002).
Multilingual parallel corpora contain more solutions to semantic problems than monolingual texts, because many of those problems were resolved during translation. Translators have been faced with synonymy, lexical and structural ambiguity, vagueness, terminological consistency, and so on; the solutions they arrived at are preserved in the relationship between the source and target texts.
Parallel corpora and word alignment have
provided another method to derive semantic resources. Helge Dyvik (2002) has suggested that it is possible to use semantic mirroring and parallel corpora to derive semantic relations similar to those provided in WordNet, e.g. synonymy, hyponymy and partitions of semantic fields. Dyvik’s hypothesis is that semantically related words should exhibit significantly overlapping translations, and that words that are more general (”semantically wide”) should have a larger number of translations than words with very specific meanings. Dyvik’s work has focused on sentence-aligned parallel corpora and manual identification of word correspondences for the words under scrutiny.
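The basic mirroring step can be sketched as follows, assuming toy translation lexicons in both directions (in real applications these are derived from word-aligned corpora): the words reached by translating a word and translating back are candidate members of the same semantic field.

```python
def mirror_candidates(word, s2t, t2s):
    """Minimal semantic-mirroring sketch (after Dyvik 2002): translate
    'word' into the target language (its t-image), then translate each
    target word back; the resulting 'inverse t-image', minus the word
    itself, contains its semantic-field candidates."""
    image = s2t.get(word, set())
    back = set()
    for t in image:
        back |= t2s.get(t, set())
    return back - {word}

# Toy English-Swedish lexicons, purely for illustration.
s2t = {"sick": {"sjuk", "dålig"}, "ill": {"sjuk"}}
t2s = {"sjuk": {"sick", "ill"}, "dålig": {"sick", "bad"}}
assert mirror_candidates("sick", s2t, t2s) == {"ill", "bad"}
```

Further mirroring iterations and overlap statistics are then used to separate true synonyms (here "ill") from looser associates (here "bad").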
3. Project description
The project will compare and integrate
computational methods for the following tasks: (1) mono-lingual term recognition
and extraction, (2) term alignment on parallel multi-lingual documents, (3)
generation of sets of synonymous terms (‘synsets’) and concepts based on term
identification and term alignment, (4) generation of semantic relations such as hyponymy (is-a) and co-hyponymy.
By combining methods for these primitive
tasks different systems can be created that, given a multi-lingual document
collection, will generate sets of concepts that can be organized hierarchically
and provide a partial ontology for the domain of the document collection. Since
the concepts will have pointers to words the result can also be organized as a
bilingual dictionary, or thesaurus.
There are, however, many different ways of building a combined system, e.g. as regards the order and integration of mono-lingual and multi-lingual processing, and as regards the use and integration of different approaches. The project will explore a large number of
these possibilities. In the following we first describe our current system and
then present the enhancements and alternatives to be investigated in the
project.
3.1 Current process and tools
A given bilingual parallel corpus is
first sentence-aligned using the GMA system (Melamed, 2001) and both halves of
the corpus are parsed with the Machinese Syntax dependency parsers (Tapanainen &
Järvinen, 1997). After post-processing and re-formatting, a clue-based word
alignment system is applied (Tiedemann, 2003; Merkel et al., 2003; Foo & Merkel,
2007), where information from various sources such as general dictionaries,
validated alignments, statistical measures of co-occurrence and word translation
probabilities computed with Giza++ (Och & Ney 2003) are combined to generate
candidate translation pairs.
Single word alignments are extended to
alignment of multi-word units at the token level via phrase formation and
linguistic filters. The filters are absolute, i.e., a unit that does not meet
the specified constraints is not considered. Pairs of term candidates are
generated based on linguistic filters that model “termhood” (cf. Frantzi et al.,
2000).
The term candidates are then ranked on the basis of a simple measure, which we call the Q-value and which is defined as follows:

Q-value = TPF / (TpS + SpT)

Here, TPF is the type pair frequency, TpS is the number of distinct target types per source type, and SpT the number of distinct source types per target type (Merkel & Foo, 2007). The idea is that a high-frequency pair whose source and target terms are rarely used in other term pairs is likely to be of high quality. The performance of the Q-value has
been compared with the Dice coefficient and type frequency for the task of
selecting good candidates and we have shown that Q-value outperforms both
alternatives (Merkel & Foo 2007).
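Assuming the Q-value is computed as TPF divided by the sum of TpS and SpT (our reading of Merkel & Foo 2007), the ranking can be sketched as:

```python
from collections import Counter, defaultdict

def q_values(pairs):
    """Score candidate term pairs by Q-value = TPF / (TpS + SpT):
    TPF is the frequency of the (source, target) pair, TpS the number
    of distinct targets seen for the source, SpT the number of
    distinct sources seen for the target."""
    tpf = Counter(pairs)
    targets = defaultdict(set)
    sources = defaultdict(set)
    for s, t in pairs:
        targets[s].add(t)
        sources[t].add(s)
    return {
        (s, t): f / (len(targets[s]) + len(sources[t]))
        for (s, t), f in tpf.items()
    }

# A frequent pair with few competing alternatives scores highest.
pairs = [("term", "term"), ("term", "term"), ("term", "begrepp")]
q = q_values(pairs)
# ("term", "term"): TPF=2, TpS=2, SpT=1 -> 2/3
```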
The term candidates can then be grouped,
e.g. on the basis of source language lemmas. When there are different
translations for a given lemma, the group can be extended with all candidate
pairs that share a translation with the given lemma. These groups constitute the
basis for the formation of synonym sets.
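The grouping by shared translations amounts to a transitive closure over the candidate pairs; a sketch on toy data (real groups would be built over lemmas):

```python
from collections import defaultdict

def synonym_groups(pairs):
    """Group source terms into candidate synonym sets: two source
    terms land in the same group when they share at least one
    target-side translation, closed transitively."""
    by_target = defaultdict(set)
    for s, t in pairs:
        by_target[t].add(s)
    # Link every pair of source terms that share a translation.
    neighbours = defaultdict(set)
    for group in by_target.values():
        for s in group:
            neighbours[s] |= group
    # Collect connected components with a simple DFS.
    seen, groups = set(), []
    for s in neighbours:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(neighbours[x] - comp)
        seen |= comp
        groups.append(comp)
    return groups

pairs = [("car", "bil"), ("automobile", "bil"),
         ("auto", "vagn"), ("automobile", "vagn")]
assert {"car", "automobile", "auto"} in synonym_groups(pairs)
```

The transitive step is what extends a lemma's group with all candidate pairs sharing a translation, as described above; it is also where over-merging can occur, which is why these groups are only the starting point for synonym-set formation.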
3.2 Extensions
The first extension concerns more refined methods for the generation of synonym sets, i.e., term clustering. Several alternatives will be tested.
Semantic Mirroring (Dyvik, 2002) has been applied both to finding multi-lingual synonyms from parallel corpora and to finding synonyms from dictionaries (Andersson 2004). We will
investigate variations of the method and apply it to the list of candidate pairs
generated from alignment.
Another alternative is given by the
algebraic/graph-based methods (Blondel et al, 2004). Here synonyms are extracted
by computing the similarity between one graph representing a partial dictionary
and another graph representing the "synonym property" of the words. For large
graphs (dictionaries) it is computationally demanding to compute the similarity
measure (a large eigenvalue problem must be solved). We will develop efficient
algorithms for this problem, based on the vast body of knowledge in numerical
linear algebra for handling and extracting information from large sparse
matrices/graphs and the preliminary work reported below.
Vector space methods can also be used to
generate candidate pairs directly. When a candidate pair is considered for
inclusion in a synonym set, Q-values and vector distances may then be combined
in making decisions.
In the vector-space model, the
term-document matrix can be considered as a representation of a bi-partite
graph, where the two sets of nodes are terms and sentences, respectively. Thus
two parallel texts, each represented by its graph, can be compared using the measure of graph similarity developed by Blondel et al. (2004), which is computed by solving a certain large eigenvalue problem. This approach was also
studied in (Törnfeldt, 2008), and we will develop this further.
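A sketch of the Blondel et al. (2004) similarity iteration on two small adjacency matrices, with a normalized power iteration standing in for the full large-scale eigenvalue computation that the project will address:

```python
import numpy as np

def graph_similarity(A, B, iters=50):
    """Node-similarity iteration between graphs with adjacency
    matrices A (m x m) and B (n x n), after Blondel et al. (2004).
    S[i, j] scores how similar node i of A is to node j of B; the
    fixed point is the dominant eigenvector of a large linear map,
    approximated here by an even number of normalized iterations."""
    m, n = A.shape[0], B.shape[0]
    S = np.ones((m, n))
    for _ in range(2 * iters):  # even step count, as in the paper
        S = A @ S @ B.T + A.T @ S @ B
        S /= np.linalg.norm(S)  # Frobenius normalization
    return S

# Comparing a 2-node path graph with itself.
A = np.array([[0., 1.], [1., 0.]])
S = graph_similarity(A, A)
```

For large sparse graphs the same fixed point would instead be computed with sparse eigensolvers, which is exactly where the numerical linear algebra expertise in the project comes in.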
When algebraic/graph-based methods have been applied to problems in information retrieval and text mining, the preprocessing has mostly been rather standard (stemming, stop-word removal). We will investigate how a combination of linguistic techniques and algebraic methods can further improve the performance of the algebraic/graph-based methods.
The final extension concerns the
recognition of semantic relations. This stage might be integrated with the
previous one as far as processing goes, but is notionally distinct. The candidate synonym sets may often include terms that stand in a hyponymy relation rather than being synonymous. We
will explore different methods to find these relations:
Lexico-syntactic patterns (Hearst 1992)
are primarily focused on monolingual data, but could be extended to operate on
multilingual texts as well. Semantic mirroring techniques involve set theory and
operations on sets of synonym clusters to create hierarchical representations.
Lexically induced methods, such as Grabar & Zweigenbaum (2002) will also be
applied both mono-lingually and on parallel data. Dyvik’s (2002) hypothesis for determining semantic relations rests primarily on the assumption that more general terms/words (”semantically wide”) should have a larger number of translations than words with very specific meanings. This hypothesis will be tested, and algebraic structuring techniques alternative to Dyvik’s algorithm will be implemented and evaluated.
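As an illustration of the lexico-syntactic route, one Hearst pattern can be approximated by a plain-text regex; real systems match over POS-tagged noun phrases, so this is only a toy extractor.

```python
import re

def hearst_hyponyms(text):
    """Toy extractor for one Hearst (1992) pattern:
    'X such as Y1, Y2 and Y3' -> (Yi is a hyponym of X)."""
    out = []
    for m in re.finditer(r"(\w+(?: \w+)?) such as ([^.;]+)", text):
        hypernym = m.group(1)
        for hyp in re.split(r",| and | or ", m.group(2)):
            if hyp.strip():
                out.append((hyp.strip(), hypernym))
    return out

pairs = hearst_hyponyms(
    "malignancies such as melanoma, lymphoma and leukemia.")
assert ("melanoma", "malignancies") in pairs
```

Extending such patterns to multilingual text, as proposed above, would mean running language-specific pattern sets on each half of the parallel corpus and intersecting the extracted relations via the term alignment.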
3.3 Comparisons
There are two types of comparisons that
will be performed in the project. On the one hand, we will compare different methods for the same subtasks, e.g. our linguistic-statistical model for word alignment against algebraic methods, where the two can also be combined within the system. On the other hand, we will compare our type of system that
exploits multi-lingual data for the task of recognising terms and term
structures for one language, with systems that primarily work on mono-lingual
data (Grefenstette, 1994; Frantzi & Ananiadou, 2000), thus testing the
hypothesis that “two languages are better than one” even for a mono-lingual
task.
Monolingual termhood recognition is
based on co-occurrence statistics, entropies and linguistic filtering using
part-of-speech tagging, recognition of common part-of-speech patterns and
word-based filtering. This may be extended with comparisons of frequencies within the collection against frequencies in other collections, including a balanced corpus for the language such as the British National Corpus (BNC). Once the term candidates have been derived, they can be
clustered into sets based on distributional and linguistic criteria as described
in section 2.
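The POS-pattern part of the linguistic filtering can be sketched as a check that a candidate phrase is an adjective/noun sequence ending in a noun; the Penn-style tags below are assumed purely for illustration.

```python
import re

# Common term-candidate shape: any run of adjectives/nouns that
# ends in a noun, expressed over a space-joined tag sequence.
TERM_PATTERN = re.compile(r"^(JJ |NN |NNS )*(NN|NNS)$")

def is_term_candidate(tags):
    """Keep a phrase as a term candidate if its POS-tag sequence
    matches the adjective/noun pattern above."""
    return bool(TERM_PATTERN.match(" ".join(tags)))

assert is_term_candidate(["JJ", "NN"])       # e.g. "benign tumour"
assert not is_term_candidate(["VB", "NN"])   # verb-initial, filtered out
```

In the actual system such patterns are combined with the statistical evidence (co-occurrence, entropies, reference-corpus frequencies) described above rather than used alone.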
Also, we will investigate whether filtering-before-alignment will show the same performance as alignment-before-filtering. The former is a kind of system often referred to as
translation spotting, where terms are first proposed on mono-lingual evidence
only, and then aligned based on co-occurrence and external resources such as
bilingual dictionaries.
3.4 Evaluation
There are two kinds of evaluation that
will be used in the project. Some of our subtasks such as term recognition and
word alignment can be evaluated on gold standards, i.e., lists of correct terms
(or alignments) and performance can be measured by precision and recall. Our
group has previously created gold standard resources and has several available
for testing. Others are available from alignment campaigns (e.g. Mihalcea and
Pedersen, 2003).
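Gold-standard evaluation reduces to set comparison between proposed and correct items; a minimal sketch:

```python
def precision_recall(proposed, gold):
    """Precision and recall of proposed terms (or alignments)
    against a gold-standard set."""
    proposed, gold = set(proposed), set(gold)
    tp = len(proposed & gold)  # true positives
    precision = tp / len(proposed) if proposed else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(
    {"term bank", "word alignment", "noise"},
    {"term bank", "word alignment", "synset"})
assert (p, r) == (2 / 3, 2 / 3)
```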
The second kind of evaluation is
a-posteriori evaluation of output from the systems by human experts. This is the
way we will evaluate synonym sets and semantic relations, as creating gold
standards for these purposes from a
large document collection will simply be too costly for the project. Thus,
external human evaluators with appropriate domain knowledge will be asked to
judge the accuracy of synsets and semantic relations proposed by the systems.
This means that evaluation will focus on precision. As the systems will assign a
measure of certitude to each candidate, precision can be measured at different
levels of certitude.
Resources such as WordNet have been used
for evaluation of similar tasks. However, WordNet does not contain the majority
of terms that can be found in our data and is thus not suitable for the task.
For development and evaluation we will
use a Microsoft online Help corpus available for research at our department,
patent descriptions from The Swedish Patent Office (PRV) and the English and
Swedish sections of the freely available JRC-Acquis multilingual parallel corpus
(Steinberger et al., 2006). If available, we will also use data from
international term extraction campaigns.
3.5 Year plan
Year 1:
· Building the infrastructure, tools and data for sub-task evaluation.
· Improvement and evaluation of term filtering techniques.
· Improvement and evaluation of term clustering methods (multilingual).
· Development and adaptation of software for linear algebra algorithms applied to problems in linguistics, based on state-of-the-art numerical program libraries.
Year 2:
· Building the system for mono-lingual term extraction and clustering.
· Further improvement and variation of multi-lingual term extraction.
· Evaluation of monolingual vs. multilingual term extraction.
· Integration of linguistic and linear algebra software.
· Further integration and evaluation of term structuring software.
Year 3:
· Evaluation of extracted semantic relations.
· Final evaluation of monolingual vs. multilingual approaches to term extraction, term clustering and semantic relations.
· Results on the implications of combining linguistic and algebraic methods.
4. Preliminary results
So far, the project group has produced a tool suite in which multilingual text collections can be taken as input and aligned at the word and phrase level (Ahrenberg et al. 2003). Preliminary results have also been obtained on filtering out non-terms and on using a simple variant of semantic mirroring (Foo & Merkel 2008). In preliminary experiments,
where semantic mirroring is used on aligned term candidates using Q-value as a
filter, we have found that sets of terms belonging to the same concept can be
generated with promising results. The Q-value has been tested and shown to
outperform the Dice coefficient and pure frequency as measures to rank term
candidate pairs by quality standards (Merkel & Foo 2007). Currently, we are
testing the Q-value against other means of ranking data sets such as
log-likelihood, mutual information and t-score.
Vector space methods for term alignment
in parallel texts are based on the creation of a common approximate basis (in a
linear algebra sense) for the two texts (Sahlgren and Karlgren, 2005). Their
method is based on a particular probabilistic method for computing the basis.
Recently, vector space methods for term alignment have been studied in a master's thesis (Törnfeldt, 2008), where a few different data compression approaches for
"noise reduction" are used, based on singular value decomposition of the
term-document matrix. Very promising results are reported, with 80% precision as
compared with the 60% of (Sahlgren and Karlgren, 2005).
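The SVD-based "noise reduction" step amounts to a rank-k truncation of the term-document matrix; a sketch:

```python
import numpy as np

def lowrank_denoise(X, k):
    """Rank-k truncated SVD of a term-document matrix X: keep only
    the k largest singular values, discarding the remaining
    dimensions as noise."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rows 1 and 3 are identical, so X has rank 2 and the rank-2
# truncation reproduces it exactly.
X = np.array([[2., 0., 1.], [0., 1., 0.], [2., 0., 1.]])
Xk = lowrank_denoise(X, 2)
assert np.allclose(Xk, X)
```

Term vectors are then compared in the truncated space, where redundant and noisy co-occurrence dimensions have been collapsed.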
5. Impact
Term
extraction, mono-lingual and multi-lingual synonym generation and concept
formation are important subtasks of the general problem of learning ontologies
from text. While there exist several methods and systems today that produce
useful results, there is still a need for improvement, in particular as regards
filtering out bad term candidates, and producing accurate semantic relations. In
this project we will tackle these problems from a multi-lingual perspective,
combining and integrating state-of-the-art techniques in new ways, and test
methods such as semantic mirroring that have not been applied to terminology
engineering before.
The
results of the project are also of great practical utility. Companies and
organizations with international clients and customers are well aware of the
importance of accurate terminologies and term use for efficient communication
internally and externally. However, often only a fraction of their terminologies
is coded and stored centrally, while the rest is more or less implicit in their
various documents. By harvesting the terms and their relations from these
document collections, the terminologies can be brought into better order. This
of course requires human reviewing of candidate terms and concepts which again
underlines the need for high precision in generated output.
6. References
Ahrenberg, L., Merkel, M., & Petterstedt, M. (2003). Interactive word alignment for language engineering. Proceedings of EACL-2003, Budapest.
Andersson, S. (2004). Semantisk spegling. En implementation för att synliggöra semantiska relationer i tvåspråkiga data [Semantic mirroring: an implementation for making semantic relations visible in bilingual data]. Master's Thesis, Linköpings universitet, LIU-KOGVET-D--04/01--SE.
Berry, M., & Browne, M. (2005). Email Surveillance Using Non-negative Matrix Factorization. Computational & Mathematical Organization Theory, 11, 249-264.
Blondel, V. D., Gajardo, A., Heymans, M., Senellart, P., & Van Dooren, P. (2004). A Measure of Similarity between Graph Vertices: Applications to Synonym Extraction and Web Searching. SIAM Review, 46(4), 647-666.
Cabré Castellví, M.T., Bagot, R.E., & Palastresi, J.V. (2001). Automatic Term Detection: A Review of Current Systems. In Recent Advances in Computational Terminology (eds. Bourigault, D., Jacquemin, C., & L'Homme, M.-C.). John Benjamins Publishing Company.
Cabré Castellví, M.T., Condamines, A., & Ibekwe-SanJuan, F. (2007). Introduction: Application-driven terminology engineering. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., & Cabré Castellví, M.T.). John Benjamins Publishing Company.
Dyvik, H. (2002). Translations as semantic mirrors: from parallel corpora to
wordnet. 23rd International Conference on English Language Research on
Computerized Corpora of Modern and Medieval English (ICAME 23). Gothenburg.
Foo, J., & Merkel, M. (2008). Building standardized term bases through automated term extraction and advanced editing tools. In Proceedings of the International Conference on Terminology, November 16-17, 2006.
Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. International Journal on Digital Libraries, 3(2), pp. 115-130.
Gillam, L., Tariq, M., & Ahmad, K. (2007). Terminology and the construction of ontology. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., & Cabré Castellví, M.T.). John Benjamins Publishing Company.
Grabar, N., & Zweigenbaum, P. (2002).
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.
Hamon, T., & Nazarenko, A. (2001). Detection of synonymy links between terms: Experiment and results. In Recent Advances in Computational Terminology (eds. Bourigault, D., Jacquemin, C., & L'Homme, M.-C.). John Benjamins Publishing Company.
Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics.
Ide, N., Erjavec, T., & Tufis, D. (2002). Sense Discrimination with Parallel Corpora. In Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, July 2002, pp. 54-60.
Malaisé, V., Zweigenbaum, P., & Bachimont, B. (2007). Mining defining contexts to help structuring differential ontologies. In Application-Driven Terminology Engineering (eds. Ibekwe-SanJuan, F., Condamines, A., & Cabré Castellví, M.T.). John Benjamins Publishing Company.
Melamed, I.D. (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press.
Merkel, M., & Foo, J. (2007). Terminology extraction and term ranking for standardizing term banks. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA).
Merkel, M., Petterstedt, M., & Ahrenberg, L. (2003). Interactive Word Alignment for Corpus Linguistics. Proceedings of the International Conference on Corpus Linguistics.
Mihalcea, R., & Pedersen, T. (2003). An Evaluation Exercise for Word Alignment. In Proceedings of the HLT-NAACL 2003 Workshop, Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 1-6.
Och, F.J., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), pp. 19-51.
Resnik, P. and Yarowsky, D. (1997). A perspective on word sense disambiguation
methods and their evaluation. ACL-SIGLEX Workshop Tagging Text with Lexical
Semantics: Why, What, and How?
Sahlgren, M., & Karlgren, J. (2005). Automatic Bilingual Lexicon Acquisition Using Random Indexing of Parallel Corpora. Journal of Natural Language Engineering, Special Issue on Parallel Texts, 11(3), September.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006).
Tapanainen, P., & Järvinen, T. (1997). A non-projective dependency parser. Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP'97), pp. 64-71.
Tiedemann, J. (2003). Combining clues for word alignment. Proceedings of the 10th Conference of the EACL.
Törnfeldt, T. (2008). Graph Similarity, Parallel Texts, and Automated Bilingual Lexicon Acquisition. Master's thesis, Linköping Institute of Technology, Department of Mathematics, LiTH-MAT-Ex-08/08-SE.
Page responsible: Magnus Merkel
Last updated: 2015-04-28