Processing multi-modal and multi-lingual information
Information applicable for processing and
access in the e-home can come from various sources, in different languages
and in multiple formats. Some of the data is structured (databases, XML
and SGML documents), but the majority of information is stored in unstructured
documents. (Bielawski and Boyle (1997) claim that over 80 per cent of
an organization's information is to be found in documents, that is, as
unstructured data.) Furthermore, the problem is not the availability of
information; rather, it is to locate the right information
for a specific need at the right time. Multi-modal communication
needs access to structured information, and as most information
is represented in unstructured formats (documents), processes that identify
different types of information at different granularities need to be developed
further. Below we list research areas that are of interest for building structured
information sources that can serve the e-home with information in multi-modal
dialogue.
-
Information retrieval (IR). Identifying
the correct information source, for example the
relevant documents, is the focus of information retrieval. Here,
various techniques from language engineering, such as stemming, lemmatizing
and shallow syntactic parsing, help provide more accurate retrieval
results.
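As an illustration of how stemming can aid retrieval, the following minimal sketch builds an inverted index over stemmed terms. The naive suffix stripper and the toy documents are invented for this example; a real system would use a proper stemming algorithm such as Porter's:

```python
def stem(word):
    """Naive suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(stem(token), set()).add(doc_id)
    return index

def retrieve(index, query):
    """Return ids of documents matching any stemmed query term."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(stem(token), set())
    return hits

docs = {
    1: "The heater is heating the living room",
    2: "Lights switched off in the kitchen",
}
index = build_index(docs)
print(retrieve(index, "heat"))  # finds document 1, since "heating" stems to "heat"
```

Without stemming, the query term "heat" would fail to match the surface form "heating"; reducing both to a common stem is what improves recall.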
-
Information extraction (IE). This
goes one step further than IR in that not only should the relevant documents
be found, but the process moves on to first single out passages,
or extracts from documents, which contain the desired pieces of information,
and second, "transforms them into information that is more readily digested
and analyzed" (Cowie & Lehnert 1996, pp. 80-81). The goal of IE is to
find and link relevant information while ignoring extraneous
and irrelevant information. IE is performed in several steps: i) identifying
and marking up instances of objects, events and relations in documents,
ii) extracting the available information based on the mark-up, and iii)
presenting the requested information in a format suitable for the user.
In NLP terms, IE requires a number of separate techniques that are used
together, e.g., POS-tagging, functional and sense disambiguation, and pronoun
resolution. In practice, many IE approaches use agents or spiders
that concentrate on a particular task, such as identifying entities in
the documents that can be classified as places, persons, organizations,
times, positions, etc. For the user, IE provides added value because
the techniques involved serve as an extra filter and help the user
focus on only the relevant information hidden in parts of the documents.
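The three IE steps above can be sketched on toy data. Here simple regular expressions stand in for trained entity recognizers (the patterns and the example sentence are invented for illustration): step (i) marks up entity instances, step (ii) extracts them, and step (iii) presents them in a structured format:

```python
import re

# Hypothetical entity patterns; real IE systems use trained taggers.
PATTERNS = {
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
    "PERSON": re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+\b"),
}

def extract_entities(text):
    """Steps (i) and (ii): find and classify entity instances."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

def present(entities):
    """Step (iii): render extracted facts in a structured format."""
    return {label: [value for l, value in entities if l == label]
            for label in PATTERNS}

text = "Dr. Smith scheduled the delivery for 14:30."
entities = extract_entities(text)
print(present(entities))
```

The filtering effect described above is visible here: the user receives only the classified entities, not the surrounding text.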
-
Text mining. Text mining, also
known as text data mining (Hearst 1999), is described as the process of
extracting interesting and non-trivial patterns or knowledge from unstructured
text documents. It is often viewed as an extension of data mining or knowledge
discovery from databases (Simoudis 1996). Text mining is a more multidisciplinary
field than IR and IE, as it involves IR, IE, text analysis, clustering,
categorization, visualization, database technology, machine learning and
data mining (Tan 1999). One factor that distinguishes text mining
from IE is that IE is user-driven, in the sense that the user explicitly
states what he or she is looking for, whereas text mining can be seen as system-driven
when the aim is to find links and relationships between entities in the document
base. In many practical cases, however, the borderline between text mining
and information extraction is rather vague. Within the e-home framework,
text mining can in the future play an important role by acting as an
intelligent personal assistant for the user. A personal miner would
be able to learn a particular user's profile and preferences, perform text
mining automatically, and present information to the user without explicit
requests (Tan 1999).
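The system-driven character of text mining can be illustrated with a small sketch: given per-document entity lists (assumed to come from an earlier IE step; the names and data are invented), the program counts which entities co-occur across documents, surfacing links that no user explicitly asked for:

```python
from collections import Counter
from itertools import combinations

# Toy entity lists, one per document, as an IE step might produce them.
doc_entities = [
    ["Alice", "Acme Corp", "Paris"],
    ["Alice", "Acme Corp"],
    ["Bob", "Paris"],
]

def cooccurring_pairs(docs):
    """Count unordered entity pairs that appear in the same document."""
    counts = Counter()
    for entities in docs:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

links = cooccurring_pairs(doc_entities)
print(links.most_common(1))  # the strongest discovered link
```

A personal miner as described above could run such pattern discovery continuously over a user's document base and report only the strongest, previously unseen links.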
-
Document summarization. The ability
to accurately compress and summarize large documents into concise descriptions
of their content is another approach to handling the increasing information
flow. Document summarization can be defined as "a reductive transformation
of source text to summary text through content reduction by selection and/or
generalization on what is important in the source" (Sparck Jones 1999).
There are many basic similarities between the information extraction paradigm
and the document summarization field, as they use similar NLP techniques.
As Sparck Jones points out, the division between summarization approaches lies
between systems that reduce content by selection or extraction on the
one hand, and systems that interpret the source document and shorten it
by generalizing its content on the other. Most systems today adopt the
extraction approach, i.e., the text is shortened by selecting the
most important passages and compiling these into one text. One example
of a commercial extraction approach to summarization is the AutoSummarize
function present in later versions of Microsoft Word. Very few systems
today can accurately interpret a document and produce a shorter, abstracted
version of it, but progress within IE and text mining will
help to promote such systems in the future, as these fields share a great deal
of basic components.
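A minimal sketch of the extraction approach: score each sentence by the document-wide frequency of its words and select the top-scoring ones. The frequency heuristic and example text are a toy illustration, not any particular published method:

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, n=1):
    """Select the n sentences whose words are most frequent overall."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(tokens(text))
    def score(sentence):
        words = tokens(sentence)
        return sum(freq[w] for w in words) / len(words)
    return sorted(sentences, key=score, reverse=True)[:n]

text = ("The oven heats the kitchen. The oven also heats water. "
        "Birds sing outside.")
print(summarize(text))
```

This is selection, not generalization: the output is a verbatim sentence from the source, which is exactly the limitation the abstraction approach aims to overcome.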
-
Document classification. Document
classification is a simpler area than the ones mentioned above, and often
serves as a subcomponent of IE, text mining and document summarization.
It involves superficial analysis of documents in order
to determine their type, genre and language.
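The superficial nature of such analysis can be seen in a minimal language identifier that counts hits against small stopword lists. The lists here are illustrative fragments, far from complete:

```python
# Illustrative stopword fragments for English and Swedish.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "sv": {"och", "det", "att", "en", "som"},
}

def classify_language(text):
    """Guess the language by counting stopword overlaps."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(classify_language("the house is full of light"))  # "en"
```

No parsing or semantic analysis is needed; a handful of high-frequency function words usually suffices to separate languages, which is why classification can run cheaply before the heavier processing stages.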
- Multi-lingual document
processing. On the Internet and in many document archives,
documents are available in multiple languages. By making information
retrieval and extraction as well as text mining and summarization
operate on multi-lingual sources, these techniques will give the user
access to information stored in several languages. The multi-lingual
perspective requires linguistic resources, such as multi-lingual
lexicons and term banks, as well as multi-lingual processing systems
like machine translation systems, to be able to arrive at
language-independent intermediate representations. NLP has seen a
number of new corpus-based approaches in recent decades, where the focus
has been on extracting multi-lingual resources from both mono-lingual
and multi-lingual documents, for example by extracting technical
terminology and by using word alignment programs to create bilingual
lexicons automatically (cf. Melamed 1998, Merkel 1999, Ahrenberg,
Andersson & Merkel 1998).
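The idea behind co-occurrence-based bilingual lexicon extraction can be sketched on a toy sentence-aligned corpus, scoring word pairs with the Dice coefficient. The corpus is invented, and real aligners such as those cited above are considerably more elaborate:

```python
from collections import Counter
from itertools import product

# Invented toy parallel corpus (English-Swedish sentence pairs).
parallel = [
    ("the house", "huset"),
    ("the car", "bilen"),
    ("a house", "ett hus"),
]

def dice_scores(corpus):
    """Score source-target word pairs by the Dice coefficient:
    2 * cooccurrences / (source frequency + target frequency)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        pair_count.update(product(s_words, t_words))
    return {pair: 2 * c / (src_count[pair[0]] + tgt_count[pair[1]])
            for pair, c in pair_count.items()}

scores = dice_scores(parallel)
print(scores[("car", "bilen")])  # 1.0: the words always co-occur
```

High-scoring pairs become candidate entries in an automatically built bilingual lexicon; spurious pairs (e.g. frequent function words) score lower because their individual frequencies inflate the denominator.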
-
Multi-modal document processing.
Documents are often regarded as "text only", but in recent years they
have increasingly been considered "containers of information or knowledge",
irrespective of modality, be it text, graphics, video, sound, etc. This
widening concept of the document makes it necessary to process multi-modal
documents from a unified perspective, which requires integrating various
techniques, such as image analysis and speech recognition, with general
text processing techniques.
Taken together, these research areas will
contribute to making information more readily available to various applications
and information appliances in the e-home.
References
Lars Ahrenberg, Mikael Andersson and Magnus Merkel. A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. In Proceedings of the 36th Annual Meeting of the Association of Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL’98, Montreal, pp. 29-35, 1998.
Jim Cowie and Wendy Lehnert. Information Extraction. In Communications of the ACM, Vol. 39, No. 1, January 1996.
Marti Hearst. Untangling text data mining. In Proceedings of ACL '99: The 37th annual meeting of the Association for Computational Linguistics, University of Maryland, 1999. Also available at http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
I. Dan Melamed. Empirical Methods for MT Lexicon Construction. In Machine Translation and the Information Soup. D. Farwell, L. Gerber and E. Hovy (eds.), Berlin, Springer Verlag, pp. 18-30, 1998.
Magnus Merkel. Understanding and enhancing translation by parallel text processing. Dissertation No. 607. Department of Computer and Information Science, Linköping University, 1999.
E. Simoudis. Reality check for data mining. In IEEE Expert, 11(5), 1996.
Karen Sparck Jones. Automatic summarizing: factors and directions. In Advances in Automatic Text Summarization. Mani, I. & Maybury M.T. (eds.), MIT Press, London, 1999.
Ah-Hwee Tan. Text Mining: The state of the art and the challenges. In Proceedings, PAKDD'99 Workshop on Knowledge discovery from Advanced Databases (KDAD'99), Beijing, pp. 71-76, April 1999.