Natural Language Dialog Systems

The design of natural language dialog systems for use as interfaces to information systems has been a long-term research area for us. We have investigated this area from several perspectives. First, we have been interested in the architecture of a "shell" for a NL dialog system that could be adapted to different background systems, each setting different requirements on the interface as regards domain knowledge and communicative behaviour. We have designed such a system, called the Linköping Natural Language Interface, or LINLIN for short. Second, we have been particularly interested in issues of discourse representation for the system, and in the knowledge and processes needed to support a coherent dialog. A third important goal has been the characterization of the sublanguage of man-machine communication in natural language, on the assumption that this sublanguage differs in many respects from the language used in dialogues between humans.

LINLIN architecture

To be of general use as an interface system, a dialog system must meet a number of requirements. Only some of these are actually connected to the system's ability to understand and produce natural language, but even if we restrict ourselves to such problems, it is unlikely that general-purpose systems can be developed. This is so because the language requirements are different in different applications. For instance, it is an advantage if meanings of a word that do not occur in the specific knowledge domain of the application are not listed as alternatives in the dictionary used. But the specific linguistic requirements are not limited to vocabulary for the expression of domain concepts, but are also concerned with syntactic constructions, the speech acts likely to occur in interactions with the system and the ways in which context is exploited.

The declarative knowledge-bases of the system, which should be changed to suit the needs of a given application, thus comprise not only the dictionary and the domain concepts, but also the grammar and the dialog objects, i.e. the possible moves (speech acts) and exchanges.

An aim of our work has been to represent all knowledge in the same structure, and in the same representation language. This would make it possible to develop the linguistic knowledge and the domain knowledge simultaneously in the same environment. It also makes it possible in principle to integrate syntactic and semantic processing. The processing modules that we have implemented so far, however, differ in the representation languages that they assume.

The central processing module of the system is the dialog manager, DM, which receives user inputs, controls the data-flow of the system and maintains the discourse representation. The discourse representation consists of three dynamic structures. The first one, the score-board, keeps information about salient objects and properties which are needed by the instantiator and generator modules. The score-board is basically an interface to the second dynamic structure, a dialog tree which represents the entire dialog as it proceeds in the interaction. The nodes of the dialog tree are instances of dialog objects, i.e. various types of moves and segments. They carry information about properties such as speaker, hearer, topic and focus, and are associated with a local plan. The plan is structured in terms of actions and is combined with similar plans of other nodes to form the third structure, the action plan stack where the actions to be performed by the DM are stored.
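The three dynamic structures can be sketched as follows. This is only an illustration of the design described above; all class, attribute and method names are our own invention, not LINLIN's actual implementation:

```python
# Illustrative sketch of the Dialogue Manager's three dynamic structures:
# the dialog tree, the score-board, and the action plan stack.
# All names here are hypothetical, chosen for exposition only.

class DialogueNode:
    """A node in the dialog tree: an instance of a dialog object
    (a move or a segment), carrying local context and a local plan."""
    def __init__(self, obj_type, speaker, hearer, topic=None):
        self.obj_type = obj_type          # e.g. 'IR-segment', 'question'
        self.speaker = speaker
        self.hearer = hearer
        self.topic = topic
        self.focus = None
        self.children = []                # sub-moves and sub-segments
        self.open = True                  # still accepting contributions?
        self.plan = []                    # local plan: a list of actions

class DialogueManager:
    def __init__(self):
        self.dialog_tree = DialogueNode('dialogue', 'system', 'user')
        self.score_board = {}             # salient objects and properties,
                                          # read by instantiator/generator
        self.action_stack = []            # actions pending execution

    def push_plan(self, node):
        """Combine a node's local plan into the global action plan stack."""
        for action in reversed(node.plan):
            self.action_stack.append((node, action))

    def step(self):
        """Execute the next pending action of the DM, if any."""
        if self.action_stack:
            node, action = self.action_stack.pop()
            return action(node)
```

An action here is simply a callable applied to its node; in this sketch the distributed control described below would amount to each open node pushing its own local plan, e.g. a clarification segment pushing prompt-and-interpret actions.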

The interaction is interpreted directly from the information conveyed in the speech act; no reasoning about users' intentions or goals is utilized. Speech act information is assembled into Initiative-Response units which form a basis for interpreting the segment structure. A simple context-free grammar can model the interaction, and the rules are selected based on information about properties of objects describing the information provided by the system. Referring expressions are handled by copying information from the previous segment to the current segment, which is in turn updated with information from the background system.
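A toy version of such a context-free interaction grammar over Initiative-Response units might look like this. The rules and function names are our own illustration of the approach, not the grammar actually used in LINLIN:

```python
# Toy context-free grammar over Initiative-Response (IR) units:
#   IR -> Initiative Response
#   IR -> Initiative IR Response    (an embedded sub-dialogue, e.g. a
#                                    clarification, nested between them)
# These rules are hypothetical illustrations, not the project's own.

def parse_ir(moves, pos=0):
    """Check that the move sequence starting at pos forms one IR unit.
    Return the position just after the unit, or None on failure."""
    if pos < len(moves) and moves[pos] == 'Initiative':
        pos += 1
        # optional embedded IR unit (e.g. a clarification sub-dialogue)
        nested = parse_ir(moves, pos)
        if nested is not None:
            pos = nested
        if pos < len(moves) and moves[pos] == 'Response':
            return pos + 1
    return None

def is_ir_segment(moves):
    """True if the whole move sequence is exactly one IR unit."""
    return parse_ir(moves) == len(moves)
```

For example, `['Initiative', 'Initiative', 'Response', 'Response']` models a user question answered only after an embedded clarification exchange.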

The DM is thus characterized by its distributed control. The actions of the action plan stack are distributed over the nodes of the dialog tree that are still open. This means that if, say, the parser fails while a certain node is current, that node creates an instance of a clarification request segment, which will control the dialog during the clarification. This segment consists of two parts, one for prompting the user with a clarification request and another for interpreting the user input. Finally the user response is integrated into the dialog tree. The distributed design has the advantage that we can use quite simple, local plans. Detailed descriptions of the dialog manager can be found in Ahrenberg, Jönsson & Dahlbäck (1990), Dahlbäck & Jönsson (1992) and Jönsson (1991, 1993a, 1993b).

The principles for dialogue management utilized in the Dialogue Manager are applied to written natural language interaction for simple service systems. We are currently investigating its applicability for multi-modal interaction. There are indications that the principles also apply to multi-modal communication for simple service systems (Jönsson, 1995; Stein & Thiel, 1993).

Wizard-of-Oz studies and NL-dialog characteristics

An important part of the work on dialog systems has been to find characteristics of the sublanguage of man-machine communication in NL, which would be useful for the design of NL-interfaces. Empirical studies of this kind of dialogue have been undertaken for some time now in our group using so-called Wizard of Oz experiments, i.e. by letting users communicate with a background system through an interface which they have been told is a natural-language interface, but which in reality is a person simulating such a device (Dahlbäck, Jönsson & Ahrenberg, 1993).

We have previously studied a number of different real or simulated background systems, to provide an empirical basis for the development of the LINLIN system described above. This work is described in a number of publications, e.g. Dahlbäck & Jönsson (1989, 1992), Dahlbäck (1991a, b) and Dahlbäck, Jönsson & Ahrenberg (1993).

This work, as well as similar studies by others, indicates that dialogues with computers in written natural language differ from dialogues between people. It is still, however, an open question to what extent these differences are due to assumed and real differences between people and computers as dialog partners, or due to the qualities of the communication channel. In an on-going project we have collected a corpus of 60 dialogues to study these questions. Three different scenarios were used, two of which concerned querying a data base for information, but on different domains. The third scenario involved both ordering and data base querying. For each scenario, 10 subjects were told that they were interacting with a computer system directly, and 10 were told that they were interacting via terminal with a person having such a system on his desk. The analysis of this corpus is continuing, but the results obtained thus far indicate that there are few or no differences between the dialogues with people and those with computers. Consequently, the characteristics of so-called 'computerese', i.e. the sublanguage used when interacting with a computer, seem to stem more from the characteristics of the communication channel and the task situation than from the believed characteristics of the communication partner.

Empirical studies of computational models of discourse

Research on computational models of discourse can be motivated from two different standpoints (cf. Dahlbäck & Jönsson, 1992). One is to develop general models and theories of discourse for all kinds of agents and situations. The other is to develop a computational model of discourse for a specific application, say a natural language interface. It is not obvious that the two approaches should arrive at similar computational theories of discourse. Instead, these different motivations and approaches should be kept in mind when presenting theories of dialogue management for natural language interfaces.

There are two general classes of theories on dialogue management in the natural language community. One is the plan-based approach. Here the linguistic structure is used to identify the intentional state in terms of the user's goals and intentions. These are then modelled in plans describing the actions which may possibly lead to their fulfillment.

The other approach to dialogue management is to use only the information in the linguistic structure to model the dialogue expectations, i.e. utterances are interpreted based on their functional relation to the surrounding interaction. The idea is that these constraints on what can be uttered allow us to write a dialogue grammar to manage the dialogue.

The plan-based approach is not only a model for dialogue in natural language interfaces but also aims to account for general discourse. The dialogue grammar approach, however, is more limited (though there are researchers who claim that this method could also be used as a general model of discourse, both within computational approaches (e.g. Reichman, 1985) and in other areas of discourse analysis (e.g. Stubbs, 1983)).

Several theories of discourse that are relevant for NLP make central use of some notion of a discourse segment. A problem with all of them, however, is that they do not provide a definition of a segment which is both general and precise enough for computer applications. In these circumstances we found it necessary in our dialog system project to adopt a sublanguage approach to discourse representation and processing, using simulation data as the primary source of data for development of a model.

A basic finding of the studies was that almost all input from users (and output from the systems) could be classified as either an initiative or a response, and that initiatives typically introduce a single goal in the form of a single question or request. Nestings could occur, however, so that an initiative from the system could be countered by an initiative from the user, e.g. requesting some clarification from the system. Still, the overall structure of the dialogue can be given a simple tree structure in terms of segments defined by initial initiatives and closing responses. Moreover, this segment structure correlated strongly with the range of anaphoric references (Dahlbäck, 1991a, 1992), and it seemed possible to keep track of the focused information in each segment by means of a small list of attributes holding the items that are likely to be referenced by a pronoun or be implicit in a following utterance (Ahrenberg, Jönsson & Dahlbäck, 1990; Jönsson, 1993a). These results can be summarized by saying that a grammar-based approach to discourse representation seems sufficient for many important application areas, so that the complexity associated with the more general plan-based approaches can be avoided (Jönsson, 1991, 1993a).
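The focus-tracking idea, a small list of attributes copied from segment to segment, can be sketched as follows. The class, its attribute names, and the car-database example are all invented for illustration; they do not reproduce the actual systems studied:

```python
# Sketch of per-segment focus tracking: a small attribute list holding
# the items most likely to be picked up by a pronoun or left implicit
# in the next utterance. All names are hypothetical.

class SegmentFocus:
    def __init__(self, objects=None, properties=None):
        # inherited by copying from the previous segment, then updated
        # with information from the background system's answer
        self.objects = list(objects or [])        # salient domain objects
        self.properties = list(properties or [])  # salient attributes

    def update(self, new_objects=(), new_properties=()):
        """New information from the background system overrides
        the inherited focus; unmentioned slots persist."""
        if new_objects:
            self.objects = list(new_objects)
        if new_properties:
            self.properties = list(new_properties)

    def resolve_pronoun(self):
        """Resolve a pronoun to the most salient object in focus."""
        return self.objects[0] if self.objects else None

# Hypothetical exchange: "What does the Volvo 850 cost?" -- "And the Saab 900?"
seg1 = SegmentFocus(objects=['Volvo 850'], properties=['price'])
seg2 = SegmentFocus(seg1.objects, seg1.properties)  # copy to new segment
seg2.update(new_objects=['Saab 900'])               # elliptical follow-up
```

In the elliptical follow-up, the queried property ('price') persists from the previous segment while the object in focus is replaced, which is enough to interpret the second question without any reasoning about the user's plans.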

One problem with comparing the two approaches to dialogue management is that they have been developed using different empirical bases. To overcome this, we are currently engaged in a project whose aim is to compare the two approaches empirically, by analysing a set of dialogues using both models. We will collect a corpus of dialogues from human-computer interaction, both written and spoken, and analyze the dialogues both with a coding scheme for our dialogue grammar model and with a scheme for a plan- or intention-based model, similar to the one used by Grosz and Hirschberg in their empirical work on discourse structure (Hirschberg and Grosz, 1992; Grosz and Hirschberg, 1992). The dialogues will come both from our own corpora and from other researchers in Sweden and abroad. We are interested both in issues such as coding reliability and applicability for the different approaches, and in the usefulness of the assigned structures for anaphora resolution and answer generation. This work is still in progress, and will continue until the summer of 1996. Parts of the work will be presented at the 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation (Ahrenberg, Dahlbäck and Jönsson, 1995).

