Development of Generic Resources for Language Technology
Background
Ideally, the language technology research community should work towards one and the same goal for its software -- a shared code library that can serve as a platform for commercial efforts and applications. However, this is not a trivial task. There is a need to further develop the set of strategies that satisfy all involved parties. It is important that different research groups and companies can meet, develop and distribute their contributions in a way that suits their own particular situation.
Today much work in the field of language technology is done over and over again by different groups. Systems and frameworks tend to have a proprietary character (with a few important exceptions such as the DARPA initiative and SVENSK). We believe that the open source community is a strong candidate for a better dissemination of results from a research group such as ours, as well as from other similar groups.
Moreover, the bottom-up character of an open source library is an important complement to existing dissemination efforts that operate in a top-down fashion towards a fixed system or goal. One important benefit of the open source methodology is that running systems are displayed early in development for all interested users to share and modify. Thus, we may expect that systems spread quickly, and that the successful ones are developed more rapidly. The open source community offers an existing infrastructure and methodology well adjusted to the heterogeneous intellectual environment of applied research. The history of the open source community shows that it has managed to produce a mass of code and several really successful applications so far, and at present the community is expanding as never before.
In the proposed project we plan to contribute work directed towards an open source vision for language technology -- a shared open source library of research effort ready to be used commercially. Such a code library will, of course, have an evolutionary character and will need to be developed for quite some time before it becomes mature. Our own contributions will initially concern only parts where we have research expertise. Hopefully this will serve as an example for other sub-niches of language technology to follow at a later stage. Moreover, our efforts will be naturally connected to other parallel activities on open source research and education at IDA, the computer science department of Linköping University.
Distribution and Sharing of Research Results
Traditionally, distribution and sharing of research results is mainly done through articles, at least for more basic interface research sub-areas such as language technology. Implementations are often only developed to the point where they prove the existence of an idea, not to the point of a robust, easy-to-use, working system. Such systems are rarely ready for use as parts of other applications or by real users, as was noted, for instance, during the work on SVENSK: No matter how linguistically adequate a piece of language processing software is, without a proper API it cannot be used in conjunction with other programs.
This state of affairs has several drawbacks. It is difficult to distribute research results to industry, since the step from an article to a working implementation is often too large for particular industrial projects. It is difficult for research itself to keep in touch with the parallel development of software ideas in industry, the open source community and other parts of society. In our view, there is a risk of a growing gap between these activities that needs to be dealt with.
In our own research experience, a common problem is the lack of competent programmers who understand our problems and are available when we need them. Implementations derived from research ideas often lack channels for handing the systems over to a next phase of development. Similarly, it is hard to find users who are committed enough to provide substantial feedback on the research software.
From the industrial point of view there are, right now, similar problems with knowledge transfer from several areas of software research. The results from the research community are not directly usable in industrial activities. Feedback from industry to research also lacks channels that translate actual software needs into research issues at a suitable level.
To sum up, we find ourselves in a situation where we want to:
Towards a Distributed Language Technology Software Library
Basic research on natural language systems has in the past often aimed at identifying and experimenting with new algorithmic and system design ideas, and not so much at integrating these techniques with sources outside the research field. We see a need for further work on turning these basic ideas into more full-fledged, network-compliant software for real use in industry and society.
The new language technology research software must share the properties of the distributed software that is developed today. The software should form a library on which new commercial applications can easily be built. Exactly what such a library will look like must be worked out iteratively over time. However, it seems necessary, in any case, that the code of the library obeys the new laws of distributed software in order to be useful. That is, it seems that the new interface software should incorporate at least the following principles:
What is Open Source?
In the community, open source software is often defined through the OSI license agreement. The license agreement that OSI provides controls the use of source code covered by OSI's open source certification. The agreement does not exclude sales; in fact, it specifically mentions sales. However, many open source projects are also provided free of charge.
According to Bruce Perens, who originally wrote the draft of the OSI definition for the Debian open source project, the definition is a bill of rights for computer users. Certain rights are required in software licenses for that software to be certified as Open Source. Essentially the right to:
As definitions go, a bill of rights may state when software is open source, but it does not really describe the nature of open source software development. For instance, famous open source projects such as Linux, Mozilla and Apache have had large and organisationally independent groups contributing to the same development. This aspect is, however, not regulated under the OSI definition.
Open source is often regarded as massively parallel development. Furthermore, open source is often associated with individuals working in communities of highly able, strongly artistic developers motivated by entertainment, status, pride, engineering esthetics and peer acknowledgment rather than money. Readers seeking a more in-depth analysis of open source are referred to Feller and Fitzgerald's framework analysis of open source software development.
Still, these definitions of open source are somewhat shallow, and the phenomenon needs a more fundamental analysis. In particular, the following aspects of open source are interesting to study from the perspective of development methodology:
Much information, especially information available on the Internet, is not in structured form. Furthermore, even with powerful techniques for extracting information, it is still very hard for a user to formulate a query that reflects the information need. The requests often require collecting information from the user, and possibly also other information systems, before they are defined precisely enough for the search engine.
The overall research goal of this proposal is to incorporate document and text processing techniques in multimodal dialogue systems. The main result is, thus, a natural language information system, where multimodal interaction allows for intuitive and efficient formulation of complex requests for information that can be extracted from unstructured, distributed information sources.
Such an integrated multimodal dialogue system requires a variety of shared knowledge sources. Especially important is a common ontology that can be utilised by both the dialogue system and the information extraction components. The development of a general ontology is far beyond the research goals of this project. However, we expect to have ontologies that are useful for various types of applications and domains. Furthermore, one research issue that will be investigated concerns maintaining ontologies in a dynamic environment.
Another important goal is to understand multimodal human-computer interaction, especially issues related to control and co-operation. The expected results are, on the one hand, knowledge about multimodal interaction and, on the other, principles for the design of multimodal dialogue systems.
Last updated: 2012-05-07