Robustness in Speech Based Interfaces:

Sharing the Tricks of the Trade
A CHI 2002 Workshop

Sunday April 21

Introduction

Speech technology has the potential to fundamentally change, in the long term, the way we interact with computers. While currently speech cannot be used universally to replace a keyboard, it is being used successfully and productively in a variety of applications. The key to using speech effectively is that there be a compelling reason to use it in the first place.

Speech is as innate to humans as breathing. We speak to our cats, our dogs and some of us even speak to our plants. While speech is a very natural form of communication for people, its adoption rate has been slow in major commercial computer systems. Part of this may be because insufficient usability expertise has gone into the design of many speech systems, many of which were created by speech developers rather than HCI experts. Given that speech recognition is not 100% accurate, and will not be in the foreseeable future, a special set of skills and knowledge must be applied to the design of speech interfaces to make them usable. Part of the goal of the workshop is to expand this knowledge base by sharing experience.

Goals of the Workshop

This one-day workshop will focus on the kind of knowledge gained through the experience of developing robust and user-friendly speech-based interfaces, which rarely finds its way to journal and conference publications. We want the participants to bring to the table one or more of their "tricks of the trade" – a technique that worked well in increasing the interactional robustness of their system.

While it is always interesting to hear about and learn from success stories, perhaps even more can be learned from failure stories. By sharing knowledge about approaches tried without success, other workers and project teams need not walk down the same cul-de-sac. Another important issue in learning from experience is knowing in which contexts the lessons learned apply and in which contexts they cannot be applied.

Even though some attempts have been made at creating a taxonomy for different kinds of dialogue situations, e.g. [1], more effort is needed here. Therefore, the workshop will also address this issue by discussing the application range of the shared experiences, both positive and negative.

What we hope will emerge from the workshop is not only a list of 'this works' and 'this doesn't work', but also qualifications of these statements with respect to different conditions.

Among the issues the workshop will take on are:

What are the techniques that can be used to increase robustness in speech-based interfaces? Do they apply if used in different domains and different user groups?
What are the dimensions for distinguishing between different classes of speech-based interface types and user situations, and how do these dimensions impact the techniques discussed?

Regarding the CHI 2002 theme of "Changing the World, Changing Ourselves," the development of truly intuitive natural language interfaces has the potential to fundamentally change the way we work. While it seems unlikely that this would lead to changing the world, it certainly would contribute to making computers more accessible to all through non-traditional computational means.

Detailed Description of the Topic

The focus of the workshop is on speech-based dialogue systems. We are interested in both speech-only systems and multi-modal systems where speech is an important aspect of the interaction. Of special interest are issues related to speech recognition (as opposed to speech synthesis), since variability in recognition accuracy creates robustness problems in the dialogue.

Robustness in interactive systems comprises two interrelated aspects: the internal and the interactional. The internal concerns the quality of the software from a systems point of view (e.g. the accuracy of the speech recognition algorithms). The interactional concerns the robustness of the dialogue between the user and the system. In this case, robustness makes it possible for the interlocutors to communicate in a way that avoids mistakes and misunderstanding. This can be accomplished either by getting it right the first time, or by making it possible to recover easily from mistakes, in a flexible and non-disturbing way.

In this workshop we will concentrate on the interactional aspects of robustness.
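
To make the two routes to interactional robustness described above more concrete (getting it right the first time versus recovering easily from mistakes), here is a minimal sketch of one possible strategy: choosing the next dialogue move from the recognizer's confidence in its hypothesis. This is an illustration only, not a technique proposed by the organizers. The names `Hypothesis` and `next_action`, the threshold values, and the assumption that the recognizer reports a confidence score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration; a real system
# would tune these against data from its own users and domain.
ACCEPT_THRESHOLD = 0.85
CONFIRM_THRESHOLD = 0.50


@dataclass
class Hypothesis:
    """A single recognition result (assumed structure, not a real API)."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def next_action(hyp: Hypothesis) -> str:
    """Pick a dialogue move based on recognition confidence."""
    if hyp.confidence >= ACCEPT_THRESHOLD:
        # High confidence: act on the input without interrupting the user.
        return f"ACCEPT: {hyp.text}"
    if hyp.confidence >= CONFIRM_THRESHOLD:
        # Medium confidence: offer a cheap way to catch an error early.
        return f"CONFIRM: Did you say '{hyp.text}'?"
    # Low confidence: ask again rather than act on a likely error.
    return "REPROMPT: Sorry, I did not catch that. Please repeat."


if __name__ == "__main__":
    for hyp in (Hypothesis("call home", 0.92),
                Hypothesis("call Rome", 0.61),
                Hypothesis("cold foam", 0.20)):
        print(next_action(hyp))
```

Whether such a policy helps is likely to depend on the domain and user group, which is exactly the kind of qualification the workshop aims to collect.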

Plan for the Workshop

Prior to the workshop we will distribute all participants' descriptions of their "tricks of the trade". We will also ask participants to provide information (e.g. website or white papers) on the project(s) from which their examples are taken. We will ask all participants to familiarize themselves with these, thereby reducing the need to provide background and context information during the workshop session. This will allow more time for discussion of the different methods described. It will also provide a common ground for the work we hope to do together on how to distinguish different classes of speech-based systems, and how to qualify suggestions of different methods with respect to different conditions.

We will also ask a small set of participants to act as commentators for selected presentations, to help focus the discussions and get them off the ground faster. We will select approximately three commentators, and group the presentations into potential discussion topics. For example, one group might consist of papers that relate similar experiences in different situations, or different experiences in seemingly similar situations. (The exact form of this grouping cannot be described in detail until we know which techniques the participants will bring to the workshop.)

Workshop Structure

Participant presentations – each participant will be asked to present his or her favorite trick(s) for increasing robustness in the user interaction of a speech-based system. Ideally the participant will have one or more success stories to share as well as one or more failure stories. These will be followed by short comments by other selected participants.
Group examination and discussion – a facilitated discussion will ensue to examine the various techniques presented and attempt to find common threads as well as define circumstances for success.
Summarization of techniques.
Discussion and summarization of suggested dimensions of speech-based systems and user situations.
Wrap up – the concluding activities for the workshop will be the creation of a poster and a definition of follow-on activities (e.g. publications).

Desired Number of Participants

For this workshop, the most fruitful discussions will be obtained by gathering together practitioners and researchers with significant experience designing speech-based systems and prototypes. No more than 15 participants will be selected to ensure the novelty of the suggested techniques and to avoid repetition.

Participant Selection Criteria

We are looking for participants who have lessons to share (either good or bad) from their experiences with testing or using their speech-based applications with real-world users. Our goal is to construct a list of attendees with a wide diversity of experiences, including varied platforms, contexts of use, and user populations. We will select participants based on the quality of their proposed "trick of the trade" and the value of sharing a description of a failed attempt.

Given the dynamic nature of speech technology, preference will be given to the participants with more recent experience using speech-based interactive technology. For example, tricks of the trade that worked well with discrete speech recognition would have little applicability to today's technologies.

Method of Interaction

Initially, we will ask each participant to briefly describe his or her proposed technique along with his or her experience with it. The focus here will be on understanding what the techniques are and the context in which each is applied.

After each presentation, the group will react to the technique and success/failure story presented. The commentators will give their thoughts (only a couple of minutes long). Others will then share their experience with the technique as well as the particular circumstances in which they used it. If no one has previous experience with the technique, there will be a discussion of the potential strengths of the proposal and its applicability to a variety of circumstances. Interlacing each ten-minute presentation with questions and group discussion will ensure that the format of the workshop remains interactive.

At the end of the formal presentations, the organizers will facilitate a group discussion of the common threads in the techniques tried and how these relate to the dimensions of the particular speech interfaces on which they were tried. We will then summarize the group's proposed techniques. We will also allow time for a general discussion of how to evaluate our output and how to continue the work in a distributed fashion.

Schedule for the Workshop

Morning: Preliminary introductions; agenda; presentations by participants (about 10 minutes each); group reaction and commentary to each technique presented.

Afternoon: Facilitated discussion of common threads in proposed techniques. Definition of dimensions for classes of speech-based interfaces and applicability of the various techniques for each dimension. Synthesis by group of findings into a poster. Discussion of ways to publish the results from the workshop.

Pre-workshop Activities

We will require participants to read all the position papers and be prepared to discuss their experience (if any) with the "trick" proposed in the position papers of other workshop participants. If a participant does not have any experience with the particular technique presented, then thought should be given to its applicability to a wide range of systems.

We will set up a website for the workshop to facilitate the distribution of the position papers as well as pre-workshop discussions and interactions. Through these discussions we will select a small number of commentators who will prepare comments or questions on the position papers.

Dissemination of the Results

We will create a poster for the CHI conference.

Given that the workshop will focus on topics that are usually difficult to address in traditional journal and conference papers, selecting contributions from the workshop for a journal special issue might not be the best way to distribute the lessons learned to a wider audience. However, one of our goals is to summarize the workshop in one or more papers related to the two central themes of the workshop. The exact format for this will be discussed at the workshop, and possibly in post-workshop email discussions. In addition to these publications, we will prepare a report for publication in the SIGCHI Bulletin.

Organizers' Backgrounds

Jennifer Lai is a Senior Designer who has been working in the field of Speech Recognition at the IBM T.J. Watson Research Center for 12 years. During the first 6 years she worked on the IBM Speech Recognition engine, creating language models and tools to facilitate model creation, while for the last 6 years she has directed her attention to developing speech applications for users. Jennifer specializes in requirements gathering and analysis, interaction design, and usability testing for speech systems, including the IBM product MedSpeak for Radiology. As the lead for User Centered Design, Jennifer has focused more closely over the past few years on managing relationships with the user/client development partners. She has published numerous papers on the use of speech in multi-modal systems, the comprehensibility of text-to-speech, and the development of statistical language models, and holds multiple patents in Natural Language Translation and Speech Interfaces. She has taught a full-day tutorial on the design of speech interfaces at CHI every year for the past four years, has presented papers at CHI'97, 2000 and 2001, and has participated in several CHI workshops. She has also led both workshops and focus groups as part of her work at IBM.

Nils Dahlbäck is Assistant Professor in Cognitive Science at the Natural Language Processing Laboratory at the Computer and Information Science Department, Linköping University, Sweden. He earned Bachelor's degrees in Psychology (1973) and in Speech Therapy (1997) from Lund University, and his Ph.D. in Communication Studies from Linköping University (1992). His current research interests are: dialogue managers for natural language and multi-modal interaction; hypermedia navigation, especially the influence of individual differences in cognitive and cultural factors; and the effects of using speech-based interactive technology for non-native speakers. He has also worked on iterative methods for the development of Natural Language interfaces, especially the use of Wizard of Oz techniques for this purpose, and on dimensions for describing NL-dialogue systems. He was co-organizer of the CHI 2000 workshop on natural language interfaces and the CHI 2001 SIG on Natural Language, as well as other workshops on NL-dialogue and multi-modal interaction. He has also participated in a number of previous CHI workshops.

Arne Jönsson is Associate Professor in Computer Science, Department of Computer and Information Science, Linköping University, Sweden. He holds a Ph.D. in Computer Science (1993), a Tech. Lic. in Computer Systems (1984), a BA in Education Psychology (1984), and an MSc in Computer Technology (1980). He has been conducting research on natural language interfaces, especially dialogue managers and empirical studies of human-computer interaction, for almost twenty years. He has organized various scientific workshops on Dialogue Systems, most recently two IJCAI workshops on "Knowledge and Reasoning in Practical Dialogue Systems". He is project leader for research projects on multimodal interaction, dialogue models and development of Swedish dialogue systems. He has published a number of papers on dialogue systems, methods for empirical investigations and methods for development of dialogue systems.

References

[1] Dahlbäck, Nils (1997) Towards a Dialogue Taxonomy. In Elisabeth Maier, Marion Mast, Susann LuperFoy (Eds.), Dialogue Processing in Spoken Language Systems. Springer Verlag Series LNAI – Lecture Notes in Artificial Intelligence, 1236.