Papers at the workshop will be refereed for a special issue of the Machine Intelligence area of the ETAI (Electronic Transactions on Artificial Intelligence) Journal. Referee comments for inclusion of papers will be posted on the MI18 ETAI web page.
We would like to acknowledge generous support of the Machine Intelligence 18 workshop by EPSRC Grant GR/R69655/01.
Programme for Machine
Intelligence 18 (PDF - includes information about breaks and session
chairs)
Contributors
Accommodation Information - Please contact Jane White (jane@cs.york.ac.uk) concerning booking of
accommodation at the Dean
Court and Minster
Hotels.
Speaker: Kiyoshi Asai
Title: Finding Genes from DNA Sequences by Using Multi-stream Hidden
Markov Models
Abstract: It is necessary to integrate various
types of information in order to recognize the genes in DNA sequences. The
information might be the four letters of the DNA sequence, the stochastic frequencies of the corresponding codons, homology search scores, and splice site signal strengths.
We have developed a software system based on multi-stream Hidden Markov Models (HMMs) with which this information can be easily integrated under a consistent measure of probability. The output symbols of HMMs are the signals that we can observe.
In the field of bioinformatics, the output symbols of the HMMs are mostly the
four letters of nucleic acids or the twenty letters of amino acids. However, the
output symbols can be anything, as long as we can attach probability distributions to them. They can be discrete symbols, real values, real-valued vectors,
and multiple streams of those values. We propose gene annotation using HMMs with
multiple streams, which combine the sequence and other pre-processed
information. The important feature of multi-stream HMMs is that the weights of
the streams can be optimized for each model. Multi-stream HMMs with adjustable weights permit very flexible design of gene-finding systems.
(URL: http://www.genedecoder.org/)
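As an illustration of the stream-weighting idea above, the following minimal sketch (Python; it is not the GeneDecoder implementation, and the state, stream names and weights are invented) shows how one state's emission score can combine a discrete nucleotide stream with a real-valued splice-score stream through weighted log-probabilities.

```python
# Illustrative sketch (not the GeneDecoder code): a multi-stream HMM state
# combines per-stream emission probabilities with adjustable stream weights.
import math

def emission_logprob(state, observation, weights):
    """Weighted sum of per-stream log-probabilities for one time step.

    `state` maps stream names to probability functions/densities,
    `observation` maps the same stream names to observed values,
    `weights` maps stream names to non-negative stream weights.
    """
    return sum(weights[s] * math.log(state[s](observation[s]))
               for s in observation)

# Example with two streams: nucleotide letters and a real-valued splice score.
coding_state = {
    "dna":    lambda x: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}[x],
    "splice": lambda x: math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2 * math.pi),
}
obs = {"dna": "G", "splice": 0.8}
print(emission_logprob(coding_state, obs, {"dna": 1.0, "splice": 0.5}))
```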
TBA
Speaker: Michael Ashburner
Title: On the representation of "gene function and process" in genome
databases: The Gene Ontology Consortium
Abstract:
Genes are
expressed in temporally and spatially characteristic patterns. Their products
are (often) located in specific cellular compartments and may be components of
complexes. Gene products possess one or more functions which may play a role in
one or more biological processes. We are developing a method to describe these
attributes in a rigorous way that will allow researchers to consistently
annotate and explore the universe of genomes. This is a collaborative project
between several model organism databases: FlyBase for the fruit fly, the
Saccharomyces Genome Database for yeast, Mouse Genome Database for mouse, TAIR
for Arabidopsis and WormBase for C. elegans. GO is now being used quite
extensively in industry. The three organizing principles of GO are function,
process and cellular component. A gene product can have one or more functions
and can be used in one or more processes; a gene product may be a component of
one or more supra-molecular complexes. Function is what something does. It
describes only what it can do without specifying where or when this usage
actually occurs. Process is a biological objective. A process is accomplished
via one or more ordered assemblies of functions. The model organism database
curators assign controlled vocabulary terms to gene products based on current
knowledge. This information describes the specific function(s) of a gene
product, the role(s) it plays in cellular processes and its localization and
associations. The structure of the "Gene Ontology" Database allows both
attribution and querying at different levels of granularity. The Gene Ontology
database provides annotation, definitions, and unique identifiers for all terms.
The cooperation between the model organism databases ensures the use of a common
ontology and, therefore, facilitates queries across all cooperating model
organism databases. Gene Ontology, therefore, allows users to make queries
exemplified by: "What mouse genes encode products playing a role in RAS signal transduction?"; "What fly genes encode products in this biological process?"; "What genes in flies and yeast have products that are components of the mitochondrial inner membrane and function as enzymes?". GO is modelled as a
directed acyclic graph, with two classes of relationship between parent and
child - instance (isa) and part-of. The semantics of these relationships are not
as well defined as we would wish. In principle, all GO terms have formal
definitions (in practice only 11% as I write). GO is also being instantiated in
DAML+OIL by the Manchester group. The GO database is open to all without
restraint. It, together with a GO browser and editor, is available from http://www.geneontology.org/. We thank AstraZeneca & the NIH for financial
support.
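The following small sketch (Python; the terms, edge labels and gene names are invented for illustration and do not come from the GO database) shows the kind of structure the abstract describes: terms connected by isa and part_of edges in a directed acyclic graph, with annotation queries answered at any level of granularity by walking up to ancestor terms.

```python
# Illustrative sketch of GO-style annotation: terms form a directed acyclic
# graph with "isa" and "part_of" edges, and queries can be asked at any level
# of granularity by climbing to ancestor terms. All names here are made up.
go_edges = {
    "mitochondrial inner membrane": [("part_of", "mitochondrial membrane")],
    "mitochondrial membrane":       [("part_of", "mitochondrion")],
    "mitochondrion":                [("isa", "intracellular organelle")],
}
annotations = {"geneX": ["mitochondrial inner membrane"]}

def ancestors(term):
    """All terms reachable from `term` via isa/part_of edges."""
    found, stack = set(), [term]
    while stack:
        for _, parent in go_edges.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def genes_annotated_to(term):
    """Genes annotated to `term` itself or to any more specific descendant."""
    return [g for g, terms in annotations.items()
            if any(term == t or term in ancestors(t) for t in terms)]

print(genes_annotated_to("mitochondrion"))   # -> ['geneX']
```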
Speaker: Christopher H. Bryant
Title: Combining Inductive Logic Programming, Active Learning and
Robotics to Discover the Function of Genes
Abstract:
We aim
to partially automate some aspects of scientific work, namely the processes of
forming hypotheses, devising trials to discriminate between these competing
hypotheses, physically performing these trials and then using the results of
these trials to converge upon an accurate hypothesis. We have developed
ASE-Progol, an Active Learning system which uses Inductive Logic Programming to
construct hypothesised first-order theories and uses a CART-like algorithm to
select trials for eliminating ILP derived hypotheses. We have developed a novel
form of learning curve, which in contrast to the form of learning curve normally
used in Active Learning, allows one to compare the costs incurred by different
learning strategies. We plan to combine ASE-Progol with a standard laboratory
robot to create a general automated approach to Functional Genomics. As a first
step towards this goal, we are using ASE-Progol to rediscover how genes
participate in the aromatic amino acid pathway of Saccharomyces cerevisiae. Our
approach involves auxotrophic mutant trials. To date, ASE-Progol has conducted
such trials in silico. However we describe how they will be performed
automatically in vitro by a standard laboratory robot designed for these sorts
of liquid handling tasks, namely the Beckman/Coulter Biomek 2000.
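A schematic version of the closed loop described above is sketched below (Python; this is an illustration of active hypothesis elimination, not the ASE-Progol code, and the trial-selection heuristic is a simple stand-in for the CART-like procedure).

```python
# Sketch of an active-learning loop: keep a set of competing hypotheses,
# pick the trial that best discriminates between them, run it, and discard
# the hypotheses the outcome refutes, while tracking the cost incurred.
def active_learning(hypotheses, candidate_trials, run_trial, cost):
    total_cost = 0.0
    while len(hypotheses) > 1 and candidate_trials:
        def split_score(trial):
            # Prefer trials whose predicted outcomes split the hypotheses evenly.
            predicted = [h(trial) for h in hypotheses]
            return abs(predicted.count(True) - predicted.count(False))
        trial = min(candidate_trials, key=split_score)
        candidate_trials.remove(trial)

        outcome = run_trial(trial)          # in silico, or later by a robot
        total_cost += cost(trial)
        hypotheses = [h for h in hypotheses if h(trial) == outcome]
    return hypotheses, total_cost

# Toy usage: two hypotheses about which gene knockout prevents growth.
hyps = [lambda t: t != "knockout_g1", lambda t: t != "knockout_g2"]
trials = ["knockout_g1", "knockout_g2"]
truth = lambda t: t != "knockout_g1"        # pretend g1 is the essential gene
remaining, spent = active_learning(hyps, trials, truth, cost=lambda t: 1.0)
print(len(remaining), "hypothesis left after spending", spent)
```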
Speaker: Jacques Cohen
Title:
Classification of approaches used to study cell regulation: Search for a unified
view using constraints and machine learning
Abstract:
Historically, the first attempts to study cell regulation were based on
establishing systems of nonlinear differential equations that "simulate" the
behavior of the variables involved as a function of time. In that context the
variables correspond to genes capable of producing compounds (mRNA and proteins)
that in turn influence the production capacity of other genes. This problem has
been thoroughly studied in the mathematical theory of dynamic systems. Graphs
expressing the interactions among genes are often used as convenient
abstractions enabling the formulation of the appropriate systems of differential
equations. The nodes of those graphs correspond to the variables (genes) and the
directed edges -- labeled by plus or minus signs -- express the positive or
negative interactions among the variables. In the case of cell regulation
thousands of genes may interact with each other. The advent of micro-arrays
urgently motivates the solution of the inverse problem: given the curves
expressing the capacity of each gene of producing known compounds as time
progresses, attempt to infer the original system of equations or equivalently
the graph expressing the interactions among genes. Presently biologists are
capable of generating huge sets of data expressing the amount of production of
proteins or mRNA as a function of time. One then wishes to establish the
corresponding gene interaction graph. The inverse problem may be called
"modelization" in contrast with "simulation", the term used to describe the
direct problem of determining the behavior of the variables in a system of
differential equations. There have been many attempts to discretize (render
discrete) the cell regulation problem. By considering Boolean variables instead
of continuous variables one can greatly reduce the complexity of the
original problem. Several papers are now available that describe solutions of
both the direct and inverse problem using the Boolean approach. An important
aspect of the inverse problem is that the data obtained by biologists is not
reliable in the sense that many experimental errors occur when dealing with
thousands of genes. It then becomes imperative to adopt a probabilistic approach to the inverse problem. In other words, given imperfect
data expressing the behavior of variables with time, can one infer the
corresponding gene interaction graph? More recently this last problem has been
simplified by having biologists conjecture the interaction graph and propose the
simpler question: how likely is it that the hypothesized graph generated the given data set? The approach used in this paper is to first attempt to classify
the existing methods proposed in the literature to solve the direct and inverse
problems. An initial classification involves three parameters and thus eight
subclasses: probabilistic versus nonprobabilistic; direct versus inverse; continuous versus discrete. In a recent article, de Jong and Page propose a
Direct-Discrete-Nonprobabilistic type of approach; that amounts to a qualitative
(a la Kuipers) simulation of the systems of differential equations representing
a given influence graph. Even though their algorithms are described in a
standard programming language, the authors make extensive use of constraints in
formulating the solution to that problem. In this paper I first propose the use
of constraint logic programming as a paradigm for solving the discrete versions
of the cell regulation problem. Basically one considers regions in an
n-dimensional space, and transitions that specify possible paths between the
regions. Both regions and transitions are specified by sets of constraints that
restrict the values of the variables being studied. A path linking an initially
specified region, to subsequent regions via the transitions, expresses the
behavior of the system. (That description is inherent to the approach taken by
de Jong and Page.) The inverse problem amounts to determining regions and
transitions when given the paths that may be derived from the data expressing
the behavior of the variables. From those regions and transitions one should be
able to determine the desired gene interaction graph. In the particular case of
Boolean constraints -- a special case among the discrete approaches -- the
determination of the regions and transitions amounts to the problem of
synthesizing Boolean circuits from input and output data. The regions correspond
to subsets of the input data and the transitions correspond to Boolean formulas
that allow transforming the input data into output. Therefore the problem using
Boolean variables can be expressed in terms of program synthesis or machine
learning. Similarly, it is hoped that the synthesis of constraints from data
should play a role in the discrete solution of the inverse problem. An
interesting payoff of the use of constraints is that they may well pave the way
for probabilistic solutions to both the direct and inverse versions of the cell
regulation problem. The approach reinforces the importance of developing
probabilistic machine learning algorithms expressed in logic programming and in
constraint logic programming.
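To make the direct/inverse distinction concrete, the sketch below (Python; the three-gene repression cycle and the synchronous update rule are illustrative choices, not taken from the paper) simulates the Boolean, discrete version of the direct problem; the inverse problem would be to recover the signed interaction graph from such trajectories.

```python
# A minimal sketch of the discrete (Boolean) direct problem: given a signed
# gene interaction graph, simulate each gene's on/off state over time under
# a synchronous update rule. The three-gene repression cycle is made up.
interactions = {            # target gene -> (regulator, sign)
    "g1": ("g3", "-"),      # g3 represses g1
    "g2": ("g1", "-"),      # g1 represses g2
    "g3": ("g2", "-"),      # g2 represses g3
}

def step(state):
    """One synchronous update: '+' copies the regulator, '-' negates it."""
    return {g: state[r] if s == "+" else not state[r]
            for g, (r, s) in interactions.items()}

state = {"g1": True, "g2": False, "g3": False}
for t in range(6):          # the direct problem: a state trajectory over time
    print(t, state)
    state = step(state)
# The inverse problem asks the opposite: recover `interactions` from such
# trajectories, possibly corrupted by experimental noise.
```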
Speaker: Adrian Cootes
Title: Automatic
determination of protein fold signatures from structural superpositions
Abstract:
It remains unclear what principles underlie a protein
sequence/structure adopting a given fold. Local properties such as the
arrangement of secondary structure elements adjacent in sequence or global
properties such as the total number of secondary structure elements may act as a
constraint on the type of fold that a protein can adopt. Such constraints might
be considered "signatures" of a given fold and their identification would be
useful for the classification of protein structure. Inductive Logic Programming
(ILP) has been applied to the problem of automatic identification of structural
signatures. The signatures generated by ILP can then be both readily interpreted
by a protein structure expert and tested for their accuracy. A previous
application of ILP to this problem indicated that large insertions/deletions in
proteins are an obstacle to learning rules that effectively discriminate between
positive and negative examples of a given fold. Here, we apply an ILP learning
scheme that reduces this problem by employing the structural superposition of
protein domains with similar folds. This was done in three basic steps. Firstly,
a multiple alignment of domains was generated for each type of fold studied.
Secondly, the alignment was used to determine the secondary structure elements
in each of those domains that can be considered equivalent to one another (the
"core" elements of that fold). Thirdly, an ILP learning experiment was conducted
to learn rules defining a fold in terms of those core elements.
Speaker: Luc De Raedt
Title: Molecular Feature Mining in HIV Data
Abstract:
We present the application of Feature Mining techniques to the Developmental
Therapeutics Program's AIDS antiviral screen database. The database consists of
43576 compounds, which were measured for their capability to protect human cells
from HIV-1 infection. According to these measurements, the compounds were
classified as either active, moderately active or inactive. The distribution of
classes is extremely skewed: only 1.3% of the molecules are known to be active, and 2.7% are known to be moderately active. Given this database, we were
interested in molecular substructures (i.e., features) that are frequent in the
active molecules, and infrequent in the inactives. In data mining terms, we
focused on features with a minimum support in active compounds and a maximum
support in inactive compounds. We analyzed the database using the level-wise
version space algorithm that forms the basis of the inductive query and database
system MolFea (Molecular Feature Miner). Within this framework, it is possible
to declaratively specify the features of interest through constraints on the frequency of features in (possibly different) datasets, as well as on their generality and syntax. Assuming that the detected substructures are causally related to
biochemical mechanisms, it should in principle be possible to facilitate the
development of new pharmaceuticals with improved activities.
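The frequency constraint at the heart of the query can be stated very simply; the sketch below (Python; not the MolFea system, and the substructure labels and compound identifiers are made up) keeps exactly those features with sufficient support among the actives and at most a small support among the inactives.

```python
# Sketch of the frequency constraint described above: keep substructure
# features that reach a minimum support among active compounds and stay below
# a maximum support among inactive ones. A feature is represented here simply
# as the set of compound identifiers that contain it.
def interesting_features(feature_occurrences, actives, inactives,
                         min_active_support=0.05, max_inactive_support=0.01):
    selected = []
    for feature, compounds in feature_occurrences.items():
        active_freq = len(compounds & actives) / len(actives)
        inactive_freq = len(compounds & inactives) / len(inactives)
        if active_freq >= min_active_support and inactive_freq <= max_inactive_support:
            selected.append(feature)
    return selected

# Toy usage with made-up substructure labels and compound ids.
occ = {"C-N=O fragment": {"m1", "m2"}, "benzene ring": {"m1", "m5", "m7", "m9"}}
print(interesting_features(occ, actives={"m1", "m2", "m3"},
                           inactives={"m5", "m6", "m7", "m8", "m9"}))
# -> ['C-N=O fragment']
```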
Speaker: Koichi Furukawa
Title:
A Computational Model for Children's Language Acquisition using Inductive Logic
Programming
Abstract:
This paper describes our research
activity on developing a computational model for children's word acquisition
using inductive logic programming. Modeling human language acquisition is a very
hard problem. The reason why we focused on children's word acquisition is that it
is an initial phase of the entire human language acquisition and therefore the
problem is rather easy compared to the entire learning activity. We tried to
incorporate cognitive biases developed recently to explain the efficiency of
children's language acquisition. We also designed a co-evolution mechanism of
acquiring concept definitions for words and developing concept hierarchy.
The concept hierarchy plays an important role in defining contexts for later word learning processes. A context-switching mechanism is used to select a relevant set of attributes for learning a word, depending on the category to which it belongs. Conversely, while definitions for words are being acquired, the concept hierarchy is developed. This mechanism is supposed to solve the selection problem of relevant
sensory inputs for each word learning task. We developed an experimental
language acquisition system called WISDOM (Word Induction System for Deriving
Object Model) and conducted virtual experiments or simulations on acquisition of
words in two different categories. The experiments showed the feasibility of our
approach.
Speaker: Jean
Hayes-Michie
Title: Real Time Skill: the role of semantic
information in learning, performance and report
Abstract:
The
literature on human skills has seen the rise of two successive models described
below in simplified form: (1) BEHAVIOURIST: skill is formed by pure practice.
Trial-and-error experience builds a black box of stimulus-response bonds,
impervious to both instruction and introspection. Subjects' post hoc reports are
confabulations. (2) COGNITIVIST: skill is first acquired in 'declarative' form,
though either or both of (a) construction of a conscious model from the task's
perceived behavioural semantics; (b) tutorial exposition of the task's structure
and of applicable heuristics. Declarative forms are then internally remoulded
into procedural black boxes, but also retained for purposes of post hoc report.
EXPERIMENTAL FINDINGS The learning by 51 male and female subjects of a computer
simulation of a well-known real-time control problem supplied some useful
observations. Under certain conditions of 'cognitive overload' an otherwise
unlearnable task became learnable only when the screen presentation was
descriptive of the task's behavioural semantics. The cognitive overload
condition was the combination of a substantially visual task with female gender.
It is well-attested that females are in general at a disadvantage when dealing
with highly visual information, especially if it involves motion. When a
meaningful animated cartoon was substituted for an abstract representation,
female subjects showed a capacity to learn, a capacity quite absent when they
were asked to use the simple 'instrument panel' variant. This corroborates
elements of the cognitivist model, and is not explicable in terms of pure
behaviourism. But discovery by male subjects of key heuristics by trial and
error, regardless of representation, contradicts certain elements of the
cognitivist position. A need is apparent for a cognitivist-behaviourist
synthesis. A new model is suggested.
Speaker: Ross King
Title: Representing
molecules for induction using logic and quantum mechanics
Abstract:
The key to applying machine inference to scientific
problems is to have an appropriate representation of the problem. The most
successful applications of inductive logic programming (ILP) have been in the
field of learning molecular quantitative structure activity relationships
(QSARs). The basic representation used in these applications has been based on
atoms linked together by bonds. This representation, although successful and
intuitive, ignores the most basic discovery of 20th century chemistry, that
molecules are quantum mechanical objects. Here we present a new representation
using quantum topology (StruQT) for machine inference in chemistry which is
firmly based on quantum mechanics. The new representation is based on Richard
F.W. Bader's quantum topological atoms in molecules (AIM) theory. Central to
this representation is the use of critical points in the electron density
distribution of the molecule. Other gradient fields such as the Laplacian of the
electron density can also be used. We have demonstrated the utility of the use
of AIM theory (in a propositional form) in the regression problem of predicting
the lowest UV transition for a system of 18 anthocyanidins. The critical point
representation of fields can be easily mapped to a logic programming
representation, and is therefore well suited for ILP. We have developed an
ILP-StruQT method and are currently evaluating it for QSAR problems.
Speaker: Jan Komorowski
Title:
On Discovering Gene Function with Approximate Methods
Abstract:
Functional Genomics is the study of gene function on a large scale that is done
by conducting parallel analysis of gene expression for thousands of genes. This
research is still in an early phase but has been recently sped up by the advent
of high-throughput technologies such as, for instance, microarrays. With the
completion of several genomes, the attention is shifting towards discovering
function of new genes. Critical to any scientific discovery is the ability to
re-use the existing knowledge. Somewhat surprisingly, most if not all work
currently reporting on the analysis of gene expression applies unsupervised
approaches that do not really use the huge repositories of biological knowledge.
In our recent work, we have taken a supervised approach to predicting gene
function from gene expression time-series profiles and from background
biological knowledge. To this end we annotated known genes with terms from Gene
Ontology. Using rough sets, as implemented in the publicly available ROSETTA
toolkit, we then generated a predictive model of high quality (AUC value 0.83)
that provided, among other things, function hypotheses for uncharacterized genes. The
method is robust in the following sense. Despite missing annotations of some
known genes, the model appeared to be able to classify and re-classify several
of the known genes in a way that was new to biologists. In other words, the
model discovered biological knowledge that was not present in the learning
examples. These discoveries, represented as false positive classifications, were
initially dismissed. After a literature-based re-examination by biologists, a
significant portion of the false positives appeared to be correct. A
considerable number of the literature confirmations of false positive
classifications could be visualized by using PubGene (http://www.pubgene.org/),
a tool developed by us for effective retrieval of literature-based knowledge
about relations between genes. Our recent publication in Nature Genetics 28,
21-28 (2001) gives details about that work. Our work demonstrates that Gene
Ontology annotations emulate deep biological knowledge that may be functionally
linked to gene expression profiles thereby aiding the discovery of new gene
functions. We conclude that methods of approximate reasoning that are able to
deal with large amounts of conflicting and missing knowledge will be instrumental
to the further development of Functional Genomics.
Speaker: Donald Michie
Title:
Return of the "imitation game"
Abstract:
Very recently there
has been an unexpected rebirth of Turing's imitation game in the context of
commercial demand. To meet the new requirements the following properties of
human conversation constitute a minimal list of what must be simulated. Real
chat utterances are concerned with associative exchange of mental images. They
are constrained by contextual relevance rather than by logical or linguistic
laws. Time-bounds do not allow real-time construction of reasoned arguments, but
only the retrieval of stock lines and rebuttals, assembled Lego-like on the fly.
A human agent has a place of birth, age, sex, nationality, job, family, friends,
partners, hobbies etc. On meeting again with the same conversational partner, a
human agent is expected to recall the gist of previous chats, as well as what
has passed in the present conversation so far. A consistent personality emerges
from a human's expression of likes, dislikes, pet theories, humour, stock
arguments, superstitions, hopes, fears, aspirations, etc. A human agent typically
has a main goal, of fact-provision, fact elicitation, wooing, selling, "conning"
etc. at each stage. A human agent remains ever-ready to maintain or re-establish
rapport by switching from goal mode to chat mode. Some mechanisms and tactics
for simulating these things will be illustrated from work in progress.
Speaker: Satoru
Miyano (represented by Hideo Bannai)
Title: HypothesisCreator: Concepts for Accelerating the Computational
Knowledge Discovery Process
Abstract:
We summarize and
discuss the work accomplished through the HypothesisCreator (HC) project, an
ongoing effort whose aim is to develop systematic methods to accelerate the
computational knowledge discovery process. A novel approach to describe the
knowledge discovery process, focusing on a generalized form of attribute called
view, is presented. It is observed that the process of knowledge discovery can,
in fact, be modeled as the design, generation, use, and evaluation of views,
asserting that views are the fundamental building blocks in the discovery
process. Also, we note that the trial-and-error cycle inherent in any knowledge discovery process can be formulated as the composing and decomposing of views.
To assist this cycle, a programming language called VML (View Modeling Language)
was designed, providing facilities for this purpose. VML is designed as an
extension to the functional language ML with the extensions `view' and `vmatch',
which are used for defining and decomposing views. We describe, as VML programs,
successful knowledge discovery tasks which we have actually experienced, and
show that computational knowledge discovery experiments can be efficiently
developed and conducted using this language. The original idea for VML, however,
was only informally described and therefore contained subtle problems both in
theory and practice. This was settled by Sumii and Bannai (2001) who gave the
formal foundation of VML by defining VMlambda, an extension to the standard
simply typed call-by-value lambda-calculus.
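As an informal illustration of views as composable, generalized attributes (this is a Python sketch, not VML, and the record format and threshold are invented), a discovery experiment can be phrased as composing small view functions into larger ones:

```python
# Informal sketch of the 'view' idea: a view is a generalized attribute,
# i.e. a function from raw objects to feature values, and experiments proceed
# by composing and decomposing such views.
def compose(*views):
    """Compose views left-to-right into a single view."""
    def composed(x):
        for v in views:
            x = v(x)
        return x
    return composed

# Made-up example: from a raw sequence record to a Boolean view.
get_sequence = lambda record: record["seq"]
gc_content   = lambda seq: (seq.count("G") + seq.count("C")) / len(seq)
is_gc_rich   = lambda gc: gc > 0.6

gc_rich_view = compose(get_sequence, gc_content, is_gc_rich)
print(gc_rich_view({"seq": "GGCCGCAT"}))   # -> True
```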
Speaker: Shinichi Morishita
Title: Enumerating Alternative Spliced Transcripts for Understanding
Complexity of the Human Genome
Abstract:
According to several recent reports, the number of human genes is estimated to be fewer than 40,000, which, for example, is only about double the 19,000 of Caenorhabditis elegans. This fact indicates
that the correlation between complexity of a species and its gene number appears
to be loose, motivating studies on other sources of complexity such as
alternative splicing and cis-regulatory machinery. Machine learning would be
useful to uncover relevance between those mechanisms and biological
consequences, but the first step towards this research direction would be to
list candidates of alternatively spliced transcripts and promoters. Because of
the release and updates of the human genome, this enumeration could be achieved
by aligning millions of expressed sequence tags (ESTs, for short) to a draft
human genome and then organizing alignments into related groups, which is
however computationally costly. We therefore have developed an efficient
algorithm for this purpose, taking only one day to align millions of ESTs to a
newly revised draft genome from scratch. Analysis of 2.2 million alignments
identifies about 8,000 groups of alternatively spliced transcripts. A typical
group contains tens of alignments, demanding a method of selecting
representatives. All the results are accessible through the WWW at our Gene
Resource Locator web site "http://grl.gi.k.u-tokyo.ac.jp/."
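The grouping step can be pictured with the toy sketch below (Python; this is not the Gene Resource Locator algorithm, and the coordinates are invented): alignments that overlap on the same chromosome are clustered into one group, and groups whose members imply different exon boundaries become candidates for alternative splicing.

```python
# Sketch of grouping EST-to-genome alignments: alignments that overlap on the
# same chromosome are merged into the same group.
def group_alignments(alignments):
    """alignments: list of (chromosome, start, end); returns merged groups."""
    groups = []
    for chrom, start, end in sorted(alignments):
        if groups and groups[-1][0] == chrom and start <= groups[-1][2]:
            members = groups[-1][3] + [(start, end)]
            groups[-1] = (chrom, groups[-1][1], max(groups[-1][2], end), members)
        else:
            groups.append((chrom, start, end, [(start, end)]))
    return groups

ests = [("chr1", 100, 500), ("chr1", 450, 900), ("chr1", 5000, 5600)]
print(len(group_alignments(ests)))   # -> 2 groups on this toy input
```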
Speaker: Stephen Muggleton
Title:
Are grammatical representations useful for learning from biological sequence
data?- a case study
Abstract:
This paper investigates whether
Chomsky-like grammar representations are useful for learning cost-effective,
comprehensible predictors of members of biological sequence families. The
Inductive Logic Programming (ILP) Bayesian approach to learning from positive
examples is used to generate a grammar for recognising a class of proteins known
as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating an NPP grammar was that none of the existing
bioinformatics methods could provide sufficient cost-savings during the search
for new NPPs. Prior to this project experienced specialists at SmithKline
Beecham had tried for many months to hand-code such a grammar but without
success. Our best predictor makes the search for novel NPPs more than 100 times
more efficient than randomly selecting proteins for synthesis and testing them
for biological activity. As far as these authors are aware, this is both the
first biological grammar learnt using ILP and the first real-world scientific
application of the ILP Bayesian approach to learning from positive examples. A
group of features is derived from this grammar. Other groups of features of NPPs
are derived using other learning strategies. Amalgams of these groups are
formed. A recognition model is generated for each amalgam using C4.5 and
C4.5rules and its performance is measured using both predictive accuracy and a
new cost function, Relative Advantage (RA). The highest RA was achieved by a
model which includes grammar-derived features. This RA is significantly higher
than the best RA achieved without the use of the grammar-derived features.
Predictive accuracy is not a good measure of performance for this domain because
it does not discriminate well between NPP recognition models: despite covering
varying numbers of (the rare) positives, all the models are awarded a similar
(high) score by predictive accuracy because they all exclude most of the
abundant negatives.
Speaker: Nils Nilsson
Title: Teleo-Reactive Programs and the Triple-Tower Architecture
Abstract:
I describe how a new programming format is used in a
proposed architecture for linking perception and action in a robot. Along the
way, I mention some analogies with control mechanisms in animals. The
programming format is called 'teleo-reactive' because it allows programs to
react dynamically to changing situations in ways that lead inexorably toward
their goals. The proposed architecture consists of three "towers" of layered
components. The "perception tower" consists of rules that create ever-higher
descriptions of the environmental situation starting with the primitive
predicates produced by the robot's sensory apparatus. These descriptions,
themselves high-level predicates, are deposited in a "model tower" which is
continuously kept faithful to the current environmental situation by a
"truth-maintenance" system. The predicates in the model tower, in turn, evoke
appropriate action-producing programs in the "action tower." The action tower
consists of teleo-reactive programs that are organized more-or-less
hierarchically. Execution of the lowest-level action programs causes the robot
to take primitive actions in its environment. The effects of the actions are
sensed by the robot's sensory mechanism, completing a sense-model-act cycle that
is quiet only at those times when the robot's goal is satisfied. I demonstrate
the operation of the architecture using a simple block-stacking task.
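The teleo-reactive format itself is simple to convey; the sketch below (Python; an illustration in the spirit of the abstract rather than Nilsson's own formulation, with an invented model and rule set) scans an ordered list of condition-action rules on every sense-model-act cycle and fires the first rule whose condition holds.

```python
# A small teleo-reactive program sketch: an ordered list of condition -> action
# rules, scanned on every sense-model-act cycle; the first rule whose condition
# is satisfied by the current model fires.
def tr_step(rules, model):
    """Return the action of the first rule whose condition is true."""
    for condition, action in rules:
        if condition(model):
            return action
    return "do_nothing"

# Toy block-stacking goal: get block A onto block B.
rules = [
    (lambda m: m["A_on_B"],                      "goal_achieved"),
    (lambda m: m["holding_A"],                   "put_A_on_B"),
    (lambda m: m["A_clear"] and m["hand_empty"], "pick_up_A"),
    (lambda m: True,                             "clear_A"),
]
model = {"A_on_B": False, "holding_A": False, "A_clear": True, "hand_empty": True}
print(tr_step(rules, model))   # -> 'pick_up_A'
```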
Speaker: Philip
Reiser
Title: Developing a Logical Model of Yeast Metabolism
Abstract: With the completion of the sequencing of genomes of an
increasing number of organisms, the focus of biology is moving to determining
the role of these genes (functional genomics). To this end it is useful to view
the cell as a biochemical machine: it consumes simple molecules to manufacture
more complex ones by chaining together biochemical reactions into long sequences
referred to as metabolic pathways. Such metabolic pathways are not linear but
often intersect to form a complex network. Genes play a fundamental role in this
network by synthesising the enzymes that catalyse biochemical reactions.
Although developing a complete model of metabolism is of fundamental importance
to biology and medicine, the size and complexity of the network has proven
beyond the capacity of human reasoning. This paper presents intermediate results
in the Robot Scientist research programme that aims to discover the function of
genes in the metabolism of the yeast Saccharomyces cerevisiae. Results include:
(1) the first logical model of metabolism; (2) a method to predict phenotype by
deductive inference; and (3) a method to infer reactions and gene function by
abductive inference. We describe the in vivo experimental set-up which will
allow these in silico inferences to be automatically tested by a laboratory
robot.
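A toy version of phenotype prediction by deduction is sketched below (Python; this is not the logical model of the paper, and the gene and metabolite names are placeholders): a strain is predicted to grow if the required metabolite is reachable from the growth medium through reactions whose catalysing genes have not been knocked out.

```python
# Toy phenotype prediction by deduction over a metabolic network: a metabolite
# is producible if all substrates of some reaction catalysed by an intact gene
# are already available. All names below are made up for illustration.
reactions = [           # (gene, substrates, product)
    ("ARO4", {"E4P", "PEP"}, "DAHP"),
    ("ARO1", {"DAHP"},       "chorismate"),
    ("TRP2", {"chorismate"}, "tryptophan"),
]

def reachable(medium, knocked_out=frozenset()):
    """Deduce all metabolites producible from the medium by active genes."""
    have = set(medium)
    changed = True
    while changed:
        changed = False
        for gene, subs, prod in reactions:
            if gene not in knocked_out and subs <= have and prod not in have:
                have.add(prod)
                changed = True
    return have

medium = {"E4P", "PEP"}
print("tryptophan" in reachable(medium))                        # True: growth
print("tryptophan" in reachable(medium, knocked_out={"ARO1"}))  # False: auxotroph
```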
Speaker: Zhanna Reznikova
Title: How to study ants' numerical competence using their own
communicative means and ideas of information theory
Abstract:
The main point of the proposed approach to studying ants' cognitive abilities is
that our experiments provide a situation in which insects have to transmit
information quantitatively known to the experimentalist in order to obtain food.
One may estimate some properties of ant intelligence by measuring the complexity of the tasks they solve in order to pass definite pieces of information from scouts to foragers. Our previous experiments, based on ideas of Information Theory, have shown that ants are able to memorize and transmit messages concerning a sequence of turns toward a trough of syrup and to use the simplest regularities to compress the information. To reveal counting and number-related skills, we had red wood ants Formica polyctena transmit information on the number and
coordinates of objects. One of the experimental set-ups consisted of a «tree
trunk» with branches that ended in empty troughs, except for one which was
filled with syrup. Another set-up consisted of a lattice which simulated
Cartesian coordinates. The foragers of F. polyctena were separated into teams of
5-8 individuals, each with one scout. All laboratory ants were marked with
coloured labels. To start the experiment, an ant scout was placed at the
randomly numbered trough containing food and then returned to the nest on its
own. The duration of the contact between foragers and the scout was measured.
Then we removed the scout and the foragers had to search for the food by
themselves. The experiments were so devised as to eliminate all possible ways
that may help to find food, except for distant homing. It turns out that the
ants are able to count within several tens, and transmit this information to
their nestmates. The analysis of the duration of the ants' contacts enables us to
create a hypothesis of how they use numbers and coordinates in their
communication. We suppose that only a few highly social ant species use such a
complex communication system based on cognitive processes. At the same time, we
believe that the experimental schemes described can be used to study the
communication systems and numerical competence of other animals.
Speaker: Boris Ryabko
Title: The
nonprobabilistic approach to learning the best prediction
Abstract:
The problem of predicting a sequence x_1, x_2,....
where each x_i belongs to a finite alphabet A is considered. Each letter x_{t+1}
is predicted using information on the word x_1, x_2, ...., x_t only. We use the
game-theoretical interpretation, which can be traced back to Laplace, in which a gambler tries to estimate probabilities for the letter x_{t+1} in order to maximize his capital. The optimal method of prediction is described
for the case when the sequence x_1, x_2,.... is generated by a stationary and
ergodic source. It turns out that the optimal method is based only on
estimations of conditional probabilities. In particular, it means that if we
work in the framework of the ergodic and stationary source model, we cannot
consider pattern recognition and other complex and interesting tools, because
even the optimal method does not need them. That is why we suggest a so-called
nonprobabilistic approach which is not based on the stationary and ergodic
source model and show that complex algorithms of prediction can be considered in
the framework of this approach. The new approach is to consider a set of all
infinite sequences (over a given finite alphabet) and estimate the size of sets
of predictable sequences with the help of the Hausdorff dimension. This approach
enables us, first, to show that there exist large sets of well-predictable
sequences which have zero measure for each stationary and ergodic measure. (In
fact, it means that such sets are invisible in the framework of the ergodic and
stationary source model and shows the necessity of the new approach.) Second, it
is shown that there exist quite large sets of such sequences that can be
predicted well by complex algorithms which are not based only on estimations of conditional probabilities.
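For concreteness, the sketch below (Python; an illustration of prediction from estimated conditional probabilities with add-one smoothing, not the optimal method analysed in the paper) plays the Laplace-style gambler over a fixed-length context.

```python
# Sketch of prediction from estimated conditional probabilities: estimate
# P(next letter | last k letters) from the counts observed so far, with
# add-one (Laplace) smoothing.
from collections import defaultdict

def predict(history, alphabet, k=2):
    """Return estimated probabilities for the next letter given `history`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(history)):
        counts[history[i - k:i]][history[i]] += 1
    context = history[-k:]
    total = sum(counts[context].values())
    return {a: (counts[context][a] + 1) / (total + len(alphabet))
            for a in alphabet}

print(predict("abababababab", alphabet="ab"))   # 'a' gets high probability
```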
Speaker: Claude Sammut
Title:
Managing Context in a Conversational Agent
Abstract:
Traditional methods for Natural Language Processing have failed to deliver
the expected performance required in a Conversational Agent. This is mainly due
to the fact that speakers rarely use grammatically complete and correct sentences
in conversation. Furthermore, speakers make significant assumptions about the
background knowledge of the listener. Thus, pragmatic knowledge about the
context of the conversation turns out to be a much more important factor in
understanding an utterance than traditional linguistic analysis. For this
reason, the majority of conversational systems being implemented favour
shallow parsing techniques. The challenge for such systems is in managing
information about the context of the conversation. This paper describes a
conversational agent, called "ProBot", that uses a novel structure for handling
context. The ProBot is implemented as a rule-based system embedded in a Prolog
interpreter. The rules consist of patterns and responses, where each pattern
matches a user's input sentence and the response is an output sentence. Both
patterns and responses may have attached Prolog expressions that act as
constraints in the patterns and can invoke some action when used in the
response. Rules are grouped into sets, called contexts. There is always a
"current" context. This is the set of rules that is considered most applicable
to the current state of the conversation. A user's input is processed by
attempting to match the sentence with the pattern in each rule. The first match
causes that corresponding rule's response to be invoked. In addition to
outputting an appropriate response, the rule can also invoke a Prolog program.
One such program switches the current context. We can represent the course of a
conversation as the traversal of a graph in which each node is a context. The
rules of the current context are used to carry on the conversation for some
period; a transition to a new context corresponds to the traversal of an arc
from one node in the graph to another. While this scheme works well for a highly
structured conversation, it fails to take into account sentences that either
change the topic of the conversation or those that contain expressions that the
rules of the current context are not able to handle. To avoid these problems, we
introduce the "filters" and "backups". A filter is a set of rules that is
prepended to the current context. That is, the filter rules are examined for a
match before any of the rules in the current context. Typically, filter rules
contain keywords that assist the system in guessing that a user is changing the
topic of the conversation. When a filter rule matches, the usual response is to
switch the current context. Backup rules are appended to the current context.
Their function is to catch user inputs that do not match any rules in the
current context. Usually backup rules produce responses that temporise,
provoking the user into providing further input that may be recognised by the
system. In this paper, we give a detailed discussion of the architecture of the
ProBot and explain the origins of this design. We also give some indication of
future work. In particular, writing scripts for conversational agents is
extremely laborious, but automating the construction of scripts through Machine
Learning is very challenging. We discuss some of these challenges.
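The context mechanism can be illustrated with the stripped-down sketch below (Python; this is not the ProBot implementation, and the contexts, patterns and responses are invented): filter rules are tried before the current context's rules, backup rules after them, and a matching filter can switch the current context.

```python
# Stripped-down illustration of rules grouped into contexts, with filter rules
# checked first and backup rules last; a rule is (pattern, (response, context)).
import re

contexts = {
    "weather": {
        "filters": [(r"\b(football|soccer)\b", ("Shall we talk sport instead?", "sport"))],
        "rules":   [(r"\brain\b", ("Take an umbrella.", None))],
        "backups": [(r".*",       ("Tell me more about the weather.", None))],
    },
    "sport": {
        "filters": [],
        "rules":   [(r"\bscore\b", ("It ended two-nil.", None))],
        "backups": [(r".*",        ("Which team do you follow?", None))],
    },
}

def respond(sentence, current):
    """Match the sentence against filters, then rules, then backups."""
    c = contexts[current]
    for pattern, (response, new_context) in c["filters"] + c["rules"] + c["backups"]:
        if re.search(pattern, sentence, re.IGNORECASE):
            return response, new_context or current
    return "Pardon?", current

print(respond("Will it rain today?", "weather"))        # rule in current context
print(respond("Did you see the football?", "weather"))  # filter switches context
```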
Speaker: Erik Sandewall
Title: On the Design of Software Individuals
Abstract:
We use the term 'software individuals' to designate aggregates of programs
and data that have the following properties: - They exist in a population of
similar, but not identical individuals. - Individuals are able to interact with
their surrounding environment, with each other, and/or with people. While doing
so they may modify their internal state. - Each individual contains the
safeguards that may be required in order to select which influences to
accommodate and which ones to ignore. - The aggregate of programs and data that
define an individual, and that in particular define its behavior, is a part of
its internal state and can therefore be modified as the result of the
interactions where the individual is involved. - Individuals or small groups of
individuals are able to create new individuals that inherit the features of
their parent(s). - The program/data aggregate that defines an individual is
symbolic in character. The ability for knowledge representation is designed into
individuals from the start. The last item distinguishes software individuals
from the software that is evolved by genetic programming. In this article we
address the question of design principles for software individuals, and approach
it as a software design issue. This problem is of course an unconventional one
from the point of view of ordinary software engineering. The emphasis on
software self-modification runs counter to conventional software design
principles, where the inherent ability of the von Neumann machine to modify its
own programs is strongly restricted in particular through the design of
conventional programming languages. In our approach, program self-modification
is viewed as an important possibility, and not as an aberration. Our proposed
Software Individual Architecture is characterized by the following principles: -
The use of a Software Individual Kernel (SIK) which is a minimal individual
having the basic properties of self-replication, self-modification, interaction,
and ability to administrate its own state. - The use of a Knowledge Object
Repository that is accessible by, but external to software individuals, and that
can be shared by several individuals in a population. This repository is also
the basis for several of the user services that are built using the
architecture. - Each Software Individual Kernel is set up as a directory
structure (directory and all its sub-directories) in the file system of a
particular computer host. A population may 'live' within one host, or in a
network of several hosts. Each individual is self-contained in this structure
and each carries with itself (i.e., contains) a full set of the required
programs and data defining it. - An individual can acquire additional features,
besides those present in its Kernel, by adopting additional program and data
modules, which either are sent to it by other individuals, or by importing them
from the Repository. A number of previously developed knowledge management
services have been ported so that they can be acquired by the new Software
Individual Kernel. - We have reworked the Software Individual Kernel so as to
achieve maximal simplicity and generality. The full paper will describe the
design of the Software Individual Kernel. It will also discuss how the design
and use of this kind of software differs from the design of conventional
software. Finally it will discuss the relevance of Software Individuals as the
basis for implementing A.I. systems.
Speaker: Steffen Schulze-Kremer
Title: Ontologies for Molecular Biology
Abstract:
Molecular biology has a communication problem. There are many databases
using their own labels and categories for storing data objects and others using
identical labels and categories but with a different meaning. One prominent
example is the concept "gene" which is used with different semantics by major
international genomic databases. Ontologies are one means to provide a semantic
repository to systematically order relevant concepts in molecular biology and to
bridge the different notions in various databases by explicitly specifying the
meaning of and relation between the fundamental concepts in an application
domain. Here, design principles for ontology building are presented, together with the upper level and a database branch of a prospective ontology for molecular biology, which is compared to other ontologies with respect to their suitability for molecular biology
(http://igd.rz-berlin.mpg.de/~www/oe/mbo.html).