
Development of Generic Resources for Language Technology

Summary

Background

Ideally, the language technology research community should work towards one and the same goal for its software -- a shared code library that can serve as a platform for commercial efforts and applications. However, this is not a trivial task. There is a need to further develop a set of strategies that satisfies all parties involved. It is important that different research groups and companies can meet, develop and distribute their contributions in a way that suits their own particular situation.

Today much work in the field of language technology is done over and over again by different groups. Systems and frameworks tend to have a proprietary character (with a few important exceptions such as the DARPA initiative and SVENSK). We believe that the open source community is a strong candidate for a better dissemination of results from a research group such as ours, as well as from other similar groups.

Moreover, the bottom-up character of an open source library is an important complement to existing dissemination efforts that operate in a top-down fashion towards a fixed system or goal. One important benefit of the open source methodology is that running systems are displayed early in development for all interested users to share and modify. Thus, we may expect that systems spread quickly, and that the successful ones are developed more rapidly. The open source community offers an existing infrastructure and methodology well adjusted to the heterogeneous intellectual environment of applied research. The history of the open source community shows that it has so far managed to produce a large body of code and several highly successful applications, and at present the community is expanding as never before.

In the proposed project we plan to contribute work directed towards an open source vision for language technology -- a shared open source library of research results ready to be used commercially. Such a code library will, of course, have an evolutionary character and will need to be developed for quite some time before it matures. Our own contributions will initially concern only the parts where we have research expertise. Hopefully this will serve as an example for other sub-niches of language technology to follow at a later stage. Moreover, our efforts will be naturally connected to other parallel activities on open source research and education at IDA, the computer science department of Linköping University.

Distribution and Sharing of Research Results

Traditionally, research results are distributed and shared mainly through articles, at least in the more basic interface research sub-areas such as language technology. Implementations are often developed only to the point where they serve as a proof of concept for an idea, not to the point of a robust, easy-to-use, working system. Such systems are rarely ready for use as parts of other applications or by real users. This was noted, for instance, during the work on SVENSK: no matter how linguistically adequate a piece of language processing software is, without a proper API it cannot be used in conjunction with other programs.

This state of affairs has several drawbacks. It is difficult to distribute research results to industry, since the step from an article to a working implementation is often too large for particular industrial projects. It is also difficult for research itself to keep in touch with the parallel development of software ideas in industry, the open source community and other parts of society. In our view, there is a risk of a growing gap between these activities that needs to be dealt with.

A common problem in our own research experience is the lack of competent programmers who understand our problems and are available when we need them. Implementations derived from research ideas often lack channels through which the systems can be handed over to a next phase of development. Similarly, it is hard to find users who are committed enough to provide substantial feedback on the research software.

From the industrial point of view, there are currently similar problems with knowledge transfer from several areas of software research. The results from the research community are not directly usable in industrial activities. Feedback from industry to research also lacks channels that translate actual software needs into research issues at a suitable level.

To sum up, we find ourselves in a situation where we want to:

hand over existing preliminary prototypes to a phase of more robust software construction
keep our research software compliant with network-based software technology so that it is useful for industry
find competent programmers who are willing to engage in our projects
use our systems in an industrial environment and find users who are truly interested in our results and therefore give us feedback that helps us direct our own research work.

At this point we turn to the open source community and see that it seems to solve several of these problems for us, if only we could join its activities. It is our hypothesis that open source communities can bridge the gap between the software research community and industry. Open source means that research results are disseminated more rapidly and can be used to further develop research ideas into useful, well-designed systems and components that can serve as a basis for new commercial applications. Furthermore, open source facilitates robustness as the various modules are further developed by those who use them. These communities can also serve as a natural meeting place for research, industry and the student body.

Towards a Distributed Language Technology Software Library

Basic research on natural language systems has in the past often aimed at identifying and experimenting with new algorithmic and system design ideas, and not so much at integrating these techniques with sources outside the research field. We see a need for further work on turning these basic ideas into more full-fledged, network-compliant software for real use in industry and society.

The new language technology research software must share the properties of the distributed software that is developed today. The software should form a library on which new commercial applications can easily be built. Exactly what such a library will look like must be worked out iteratively over time. However, it seems necessary, in any case, that the code of the library obeys the new laws of distributed software if it is to be useful. That is, it seems that the new interface software should incorporate at least the following principles:

connectivity -- open peer-to-peer module design
communication -- standard communication formats on application level
cooperativeness -- easy to adapt to new situations
co-existence -- able to function in a heterogeneous network environment.
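
The first two principles can be illustrated with a minimal sketch. It is not taken from any existing system: the `Module` class, its method names and the JSON message fields `sender` and `text` are all hypothetical and were chosen only for illustration. The sketch assumes that modules register directly with each other as peers and exchange plain JSON strings as the standard application-level format.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Module:
    """Hypothetical language technology module that talks to peers via JSON."""
    name: str
    handler: Callable[[str], str]  # the module's own processing step
    peers: Dict[str, "Module"] = field(default_factory=dict)

    def connect(self, other: "Module") -> None:
        # Connectivity: peers register with each other; no central server.
        self.peers[other.name] = other
        other.peers[self.name] = self

    def receive(self, message: str) -> str:
        # Communication: requests and replies are plain JSON strings,
        # so a peer needs no knowledge of this module's internals.
        request = json.loads(message)
        result = self.handler(request["text"])
        return json.dumps({"sender": self.name, "text": result})

    def ask(self, peer_name: str, text: str) -> str:
        # Send a request to a connected peer and unpack its JSON reply.
        request = json.dumps({"sender": self.name, "text": text})
        reply = self.peers[peer_name].receive(request)
        return json.loads(reply)["text"]


# Illustrative use: a toy whitespace tokenizer and a client module.
tokenizer = Module("tokenizer", lambda text: " | ".join(text.split()))
client = Module("client", lambda text: text)
client.connect(tokenizer)
print(client.ask("tokenizer", "open source library"))
```

Because the only contract between modules in this sketch is the message format, a peer could in principle be replaced by one written in another language or running on another machine, which is the kind of adaptability and co-existence the remaining two principles ask for.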

Existing software prototypes in language technology research, such as some of our own systems, seldom meet these requirements today. However, we believe that the existing systems, or at least their core algorithms and formats, could serve as a good starting point for the construction of such modules in an open source development initiative.

What is Open Source?

In the community, open source software is often defined through the OSI license agreement. The license agreement that OSI provides governs the use of source code covered by OSI's open source certification. The agreement does not exclude sales; in fact, it specifically mentions sales. However, many open source projects are also provided free of charge.

According to Bruce Perens, who originally wrote the draft of the OSI definition for the Debian open source project, the definition is a bill of rights for computer users. Certain rights are required in a software license for that software to be certified as Open Source. Essentially, these are the rights to:

make copies of the program, and distribute those copies
have access to the software's source code, a necessary preliminary before you can change it
make improvements to the program.

Thus, the OSI certificate protects the source code's ability to move freely through different development projects. Potentially this gives rise to critical-mass effects in software development, in which many globally distributed independent groups with different goals jointly develop software that becomes more powerful than what these groups could have built on their own. In a sense, the software takes on a life of its own and can grow in completely different directions than the original developers had in mind.

As definitions go, a bill of rights may state when software is open source, but it does not really describe the nature of open source software development. For instance, famous open source projects such as Linux, Mozilla and Apache have had large and organisationally independent groups contributing to the same development. This aspect is, however, not regulated under the OSI definition.

Open source is often regarded as massively parallel development. Furthermore, open source is often associated with individuals working in communities of highly able, strongly artistic developers motivated by entertainment, status, pride, engineering esthetics and peer acknowledgment rather than money. Readers seeking a more in-depth analysis of open source are referred to Feller and Fitzgerald's framework analysis of open source software development.

Still, these definitions of open source are somewhat shallow, and the phenomenon needs more fundamental analysis. In particular, the following aspects of open source are interesting to study from the perspective of development methodology:

publicity: development is open to public scrutiny since communication is done through open channels such as mailing lists, discussion forums and web sites.
community cooperation: development is conducted by a community of independent groups that jointly develop software as the result of pursuing their different goals within the same framework (technical or application purpose).
user-driven design: the members of the user community essentially become the developers of the content. Development tends to become just-in-time, relevance-based and bottom-up.

Moreover, open source projects are not really planned but rather grow as a result of the emerging needs of the many different users and the desires of individuals. This directional flexibility results in adaptation to changing requirements. In an open source community the ability to change is built into the individuality of development. Everyone is a change-prone individual making decisions on their own, going their own way. Taken to its extreme, open source is a non-directional development process. Though most open source projects have an application vision, total freedom of choice is present within the boundaries of that vision, and the vision may also change slightly.

Much information, especially information available on the Internet, is not in structured form. Furthermore, even with powerful techniques for extracting information, it is still very hard for a user to formulate a query that reflects the information need. The requests often require collecting information from the user, and possibly also other information systems, before they are defined precisely enough for the search engine.

The overall research goal of this proposal is to incorporate document and text processing techniques in multimodal dialogue systems. The main result is, thus, a natural language information system, where multimodal interaction allows for intuitive and efficient formulation of complex requests for information that can be extracted from unstructured, distributed information sources.

Such an integrated multimodal dialogue system requires a variety of shared knowledge sources. Especially important is a common ontology that can be utilised by both the dialogue system and the information extraction components. The development of a general ontology is far beyond the research goals of this project. However, we expect to have ontologies that are useful for various types of applications and domains. Furthermore, one research issue that will be investigated concerns maintaining ontologies in a dynamic environment.

Another important goal is to understand multimodal human-computer interaction, especially issues related to control and co-operation. The expected results are, on the one hand, knowledge about multimodal interaction and, on the other, principles for the design of multimodal dialogue systems.

Last updated: 2012-05-07

IDA - Department of Computer and Information Science
