What connectionist models can (and cannot) tell us

Sally Andrews
School of Psychology
University of New South Wales

Latimer argues that the benefits of computational modelling lie in the precision of specification that is required to implement a verbal model in a computational form and implies that this precision avoids ambiguities that are inherent in verbal models. In particular, he suggests that computational modelling prevents the reification of abstract hypothetical constructs (Maze, 1954): that the act of formally implementing these constructs as variables and algorithms removes surplus meanings and demonstrates a level of description that is sufficient to explain the task that is modelled. Latimer cites the example of schema theory which has shaped cognitive psychological explanations of a wide variety of phenomena. He argues that attempts to develop computational models of schema theory reveal the construct of schema to be redundant in the sense that, at the level of implementation, schema are simply sets of features.

This interpretation appears to deny the distinction between the details of the implementation of a program and either the algorithm that is being implemented or the goals that the computations are designed to achieve (Marr, 1982). This Realist approach to the interpretation of the outcomes of computational modelling seems to me to run the risk of replacing reification of hypothetical constructs with reification of computational models themselves. In my view, the models do occupy the "middle kingdom" in which Latimer feels unsatisfied - they are metaphors rather than realistic descriptions of cognitive operations. They are no more inherently accurate or realistic descriptions of processing by virtue of being realised in the physical form of a program or by being based on elements derived from a neuronal metaphor. The ultimately correct description of cognitive processes will have to be implementable in neurons, but a model is not inherently accurate because it is neurally plausible. There are, moreover, dangers in attributing reality to a model by virtue of neural plausibility because it encourages a literal interpretation that reduces awareness of the assumptions - both theoretically relevant and arbitrary - that necessarily contribute to any computational model.

I will illustrate these issues with an example from my own research area of visual word recognition that parallels the schema/feature issue discussed by Latimer.

Models of reading aloud

The example concerns the role of alphabetic rules in "reading aloud". There are a number of different verbal models of this task (see Humphreys & Evett, 1985; Patterson & Coltheart, 1987 for reviews) and examples of the major classes of models have also been implemented in computational form (Coltheart, Curtis, Atkins & Haller, in press; Jacobs & Grainger, 1993; Seidenberg & McClelland, 1989). Despite intensive empirical research and the precision of specification required for formal implementation, there is still no clear consensus on how the task is achieved (e.g., Coltheart et al., in press; Besner, Twilley, McCann & Seergobin, 1990; Seidenberg & McClelland, 1990). Much of the controversy concerns the form in which knowledge about orthographic-phonological correspondences is expressed and how this knowledge is related to knowledge of specific word forms.

The prototypical symbolic model is the "dual route model" which assumes that reading aloud reflects the outcome of a "race" between a lexical retrieval process and a procedure that assembles pronunciations from abstract spelling-sound rules (Coltheart, 1980). This model epitomises the "sacred cows" of verbal models of visual word recognition: the assumption that word identification reflects independent bodies of knowledge that are operated on by independent procedures (Besner et al., 1990).

PDP models are argued to challenge this "central dogma" of symbolic models by showing that the behaviours assumed to demonstrate the existence of separate lexical and rule-based procedures can be simulated by a standard three-layer back-propagation network that contains neither a lexicon of entries corresponding to individual words nor a set of pronunciation rules (Seidenberg & McClelland, 1989). This class of models is claimed to "offer an alternative that dispenses with the two-route view" (p. 564) in favour of "a single uniform procedure that learns to process ...letter strings through experience with the spelling-sound correspondences implicit in the set of words from which it learns" (Seidenberg & McClelland, 1989, p. 525).

Thus, the knowledge learned by a PDP network reflects systematic relationships between orthographic and phonological word forms, but these relationships are not expressed as a system of rules. Indeed, according to Seidenberg and McClelland (1989) this is critical to the model's success because "English orthography...is not well captured by mechanisms involving systems of rules" (p. 564). But what does it mean to say that the PDP model does not contain knowledge of rules? To evaluate the validity of this claim it is necessary to specify precisely what does and does not constitute "a rule".

What is a rule?

The clearest empirical demonstration of knowledge of spelling-sound rules is the ability to produce a plausible pronunciation of a novel stimulus: to read a nonword. At one level, this capability for generalisation is, in itself, a demonstration of knowledge of rules: a system that produces rule-governed output can be argued to contain knowledge of rules at this level regardless of the form of realisation at lower levels (Davies, 1990).

Tests of nonword performance by the Seidenberg and McClelland (1989) model produced very poor generalisation and led to the claim that, if anything, the model implemented the lexical component of a dual route model and that its poor performance on nonwords was proof of the need for a second route (Besner et al., 1990). But a revised version of the model (Plaut & McClelland, 1993) shows good generalisation to nonwords while maintaining accurate performance for words. So, PDP models can show knowledge of rules at the level of generalisation performance. However, this is claimed to be an emergent property of the system which arises from a knowledge base that is fundamentally different from that of symbolic models (Seidenberg & McClelland, 1989). Paralleling Latimer's claims about schema and features, the construct of rules is claimed to be both unnecessary and inaccurate: an "imperfect generalization about the nature of the input and ...what is learned" (p. 549).

Explicit rules vs common components

Traditional dual route models (e.g., Coltheart, 1980) contain rules in the sense that the system explicitly contains the logical processes required to convert from an input to an output. Thus Coltheart et al.'s (in press) computational implementation of this model contains rules in the sense that there is both a body of knowledge that is independent of specific lexical forms and a set of procedures for parsing the input and matching it with appropriate rules. However, it has proved impossible to empirically distinguish between models embodying this classic sense of explicit rules and "interactive activation" or "multiple level" models of reading aloud (e.g., McClelland & Rumelhart, 1981; Shallice & McCarthy, 1985). These models do not contain rules in the sense of explicit procedures but rather consist of hierarchically organised layers of hard-wired localised units corresponding to words and to a variety of sublexical units. Different versions of the model assume different specific sublexical levels (e.g., Dell, 1986; McClelland & Rumelhart, 1981) but they share the assumption that all instances containing particular sublexical components elicit activation of the same constituent units and that apparently rule-governed behaviour reflects activation of units corresponding to constituent features of the input.
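The strong sense of "rule" described above, a body of knowledge independent of specific words plus a procedure for parsing the input and matching it with rules, can be illustrated with a minimal sketch. The rule table, the phoneme symbols and the longest-match parsing strategy below are all hypothetical choices made for illustration; they are not taken from Coltheart et al.'s implementation.

```python
# Toy grapheme-to-phoneme rules. The entries and phoneme symbols are
# hypothetical and nowhere near a complete description of English.
RULES = {
    "sh": "S",
    "ee": "i",
    "s": "s",
    "h": "h",
    "e": "E",
    "l": "l",
    "p": "p",
}


def assemble(letters, rules=RULES):
    """Parse left to right, preferring the longest grapheme that has a rule."""
    longest = max(len(grapheme) for grapheme in rules)
    phonemes = []
    i = 0
    while i < len(letters):
        for size in range(min(longest, len(letters) - i), 0, -1):
            grapheme = letters[i:i + size]
            if grapheme in rules:
                phonemes.append(rules[grapheme])
                i += size
                break
        else:
            # No rule of any length matched at this position.
            raise ValueError(f"no rule covers {letters[i]!r}")
    return phonemes


print(assemble("sheep"))  # a word
print(assemble("shelp"))  # a nonword, handled by exactly the same rules
```

The point of the sketch is only that nothing in the table or the parser refers to a particular word, so the nonword is pronounced by the same procedure as the word.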

Though these models do not contain rules in the strong sense of explicit rule systems, they do demonstrate knowledge of rules in the weaker sense that the relationship between inputs and outputs can be fully specified in terms of component features of the input (i.e., given input feature P infer output feature Q) so that components of the input have a causal status in determining the output (Davies, 1990). Both formalisms, then, describe input-output relationships in terms of causal connections between the components of orthographic and phonological word forms. In explicit rule systems causality resides in the procedures that abstract, select and apply the rule knowledge while in the multiple level models - and connectionist implementations in general - knowledge and procedures are intertwined:

"Using knowledge in processing is [not] a matter of finding the relevant information in memory and bringing it to bear; it is part and parcel of the processing itself." (McClelland, Rumelhart & Hinton, 1986, p. 32)

But the two models are functionally equivalent not only at the higher-order level of demonstrating rule-governed behaviour, but also at the lower level of the components into which words are partitioned and which determine the similarity relationships between words. Particular instantiations of each class of model vary in exactly how these components are defined - in the context-sensitivity of the rules; or the number of sublexical levels - but they show rule-governed behaviour for fundamentally the same reason: that orthographic and phonological word forms are functionally partitioned into a common set of constituent features.

It is on this dimension that PDP models are argued to differ from symbolic models. Once the model has been trained, repeated exposures to a particular word will give rise to the same pattern of activity and similar words will elicit similar patterns, but there is no sense in which the pattern for a word can be subdivided into constituents nor any necessary relationship between the pattern elicited by the same constituent when it occurs in different words. These models do not, then, contain rules in the sense of a causal relationship between input and output elements. "The output that the model produces for a particular letter string is determined by the properties of all the words presented during training" (Seidenberg & McClelland, 1989, p. 549) and cannot be predicted from the relationship between a fixed set of components that apply to all words. Thus, though PDP models might show apparently rule-governed behaviour, they do not contain rules in the same sense as symbolic models because the distributed patterns of activity elicited by words cannot be decomposed into a set of constituent units that are common across all words and that predict orthographic-phonological relationships.

Symbolic vs subsymbolic features

In principle, then, PDP models represent items at a subsymbolic level (Smolensky, 1988): the system does not employ a fixed set of features to define representations at the hidden unit level but capitalises on whatever relationships between input and output patterns serve to optimise overall performance. However, the relationships that are available to any specific implementation depend on the choice of input and output coding scheme. Seidenberg and McClelland (1989) used a variant of Wickelgren's (1969) "triples scheme" that resulted in a coarsely coded distributed representation (Hinton, 1986) that allowed sensitivity to local context and to the similarity of the same letter occurring in different positions, but which did not ensure even that the same letters were coded using the same input elements, let alone that they elicited similar patterns at hidden or output layers. Seidenberg and McClelland (1989) claimed no commitment to this scheme apart from the fact that it involved "a minimum of built-in knowledge of orthographic and phonological structure" (pp. 528-9) and therefore provided a strong test of the degree to which "theory-relevant" aspects of the simulation, such as the use of distributed representations and an error correction learning algorithm, could approximate critical empirical data. But the choice of coding scheme is critical to interpreting the outcome of the simulation, particularly by comparison with symbolic models, which I have argued to be principally defined by their assumptions about the units into which words are coded.
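The consequence of context-sensitive coding can be seen in a minimal sketch of raw letter triples. This is a deliberate simplification: it shows only the underlying idea of coding each letter together with its neighbours, not the coarse-coded Wickelfeature units of the Seidenberg and McClelland simulation, and the function names and the "_" boundary marker are assumptions introduced here.

```python
def letter_triples(word):
    """Return the set of letter triples in `word`, using '_' as a boundary marker."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def units_for(letter, word):
    """Triples of `word` in which `letter` is the central element."""
    return {triple for triple in letter_triples(word) if triple[1] == letter}


print(sorted(letter_triples("make")))
# The 'a' of "make" and the 'a' of "save" are carried by different elements,
# so their intersection is empty.
print(units_for("a", "make") & units_for("a", "save"))
```

Under such a scheme nothing in the input guarantees that the same letter is represented by the same element in two different words, which is the property at issue in the paragraph above.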

In particular, the choice of coding scheme might account for the poor generalization shown by the model. Seidenberg and McClelland (1989) claimed successful generalization because error scores derived from the model discriminated between nonwords that people pronounce with differential speed and accuracy (Glushko, 1979), but Besner et al. (1990) showed that generalization was very poor when assessed by the stricter criterion of activation of appropriate output units. Seidenberg and McClelland (1990) claim that these limitations in the model's generalization performance reflect theory-irrelevant limitations of the implemented model - particularly the restricted training corpus and inadequacies in the Wickelfeature coding schemes.

It is unlikely that the limited training vocabulary provides sufficient explanation of the poor generalization because computational implementations of both the explicit rule system of the dual route model (Coltheart et al., in press) and of a PDP model (Plaut & McClelland, 1993) are successful at producing rule-governed pronunciations of nonwords after training on essentially the same vocabulary.

The Plaut and McClelland (1993) simulation is particularly relevant to evaluating the determinants of effective nonword generalization because it uses the distributed representations and learning algorithm that Seidenberg and McClelland claim to be the theoretically critical aspects of their model. There are two major differences between the two implementations. First, rather than Wickelfeatures, Plaut and McClelland (1993) used a localised input and output coding scheme in which each input and output unit corresponds to a single position-specific grapheme or phoneme. Second, the architecture of the revised model includes interactivity at the hidden and output layers allowing the network to develop "attractors" for recurring patterns of activity such as words. Plaut and McClelland imply that the improved performance for nonwords is primarily due to the inclusion of interactivity and the consequent development of stable attractors: "[the] results demonstrate that attractors can support effective generalization, challenging dual-route assumptions that multiple independent mechanisms are required for quasiregular tasks". However, given the earlier discussion of the sense in which rule-governed behaviour in symbolic models is due to the built-in constituent structure of words, it is equally plausible that credit for the improved generalization performance is entirely due to the coding scheme rather than to the architectural modifications.
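A minimal sketch of a localised, position-specific code makes the contrast with context-sensitive coding concrete. It assumes single letters in place of graphemes and a fixed number of left-aligned positions (the hypothetical constant MAX_LEN); the actual unit inventory of the Plaut and McClelland model is not reproduced here.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 6  # hypothetical number of letter positions


def encode(word):
    """One unit per (position, letter) pair; one active unit per filled position."""
    if len(word) > MAX_LEN:
        raise ValueError("word longer than the available positions")
    vector = [0] * (MAX_LEN * len(ALPHABET))
    for pos, letter in enumerate(word):
        vector[pos * len(ALPHABET) + ALPHABET.index(letter)] = 1
    return vector


def shared_units(a, b):
    """Number of input units that are active for both strings."""
    return sum(x * y for x, y in zip(encode(a), encode(b)))


print(shared_units("make", "take"))  # two words
print(shared_units("make", "mave"))  # a word and a nonword
```

With this kind of code a nonword automatically shares input units with every word that has the same letter in the same position, so the constituent structure is built into the input before any learning takes place.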

By using localised input features corresponding to the symbolic components of written and spoken words, Plaut and McClelland have created what can be seen as a hybrid of multiple level symbolic models and the PDP framework. Functionally, the model simulates development of the word level of a multiple level model containing hard-wired lower level units corresponding to graphemes and phonemes. The model has the opportunity to create distributed representations of words at the hidden unit level, but the structure of the input and output coding schemes encourages development of patterns that reflect the set of common constituent symbols by which words have been coded. Such "componentiality" does appear to be characteristic of what the model learns, particularly for regular words (Plaut & McClelland, 1993).

So the generalisation capability of the Plaut and McClelland (1993) model can be seen as deriving from the same general characteristic as symbolic models: that words are coded in terms of a common set of constituent units. The modelling exercise establishes that a neural net model can learn to respond appropriately to both word and nonword stimuli, but only if it is assumed that words are coded as independent abstract graphemes and phonemes that have equivalent identities across different cases, fonts etc.

I am not trying to imply that Plaut and McClelland (1993) have in some sense "cheated" by using graphemic input units - although it is worth pointing out that attributing the improved generalization performance to the architectural rather than the coding assumptions smacks of the "vagueness and legerdemain" that Latimer attributes to verbal models. The assumption of abstract grapheme and phoneme units may be perfectly valid. It underlies the majority of symbolic models of word recognition and appears to receive independent support from recent applications of brain imaging technologies such as positron emission tomography (PET) to the study of visual word recognition (e.g., Petersen et al., 1990). But the point I am attempting to draw out here is that computational models are no less reliant on such assumptions than verbal models. The elements of a connectionist model provide no more direct insight into the neurophysiological basis of behaviour than the constructs described in verbal models, and the validity of the interpretation derived from a PDP simulation requires just as careful an evaluation of the assumptions built into the implementation as is required for a verbal model.

Contrary to the implications of Latimer's discussion, I think that the level of the hardware implementation of a model can and must be separated from the representational and computational levels (Marr, 1982). Failing to recognise these distinctions may lead us to grossly inaccurate conclusions about where credit for the success of a simulation should be assigned and therefore how the performance of the model should be related to brain function. We need to maintain awareness that the computational model is simply a metaphor and use convergent evidence concerning the relationship between the model and empirical data on the one hand, and between the model and brain structure and organisation on the other, to evaluate the validity and implications of the metaphor.

So what are rules then?

If we take successful nonword pronunciation as the critical evidence for knowledge of rules, the implication of the evidence I have briefly summarised may be that, regardless of the details of the architecture or learning assumptions of the computational model, rule-governed generalization is evidenced by models that code words in terms of a common set of abstract constituent units and not by models that code words in a clearly non-symbolic form. There is not, as yet, a sufficiently comprehensive or systematic body of data comparing the consequences of different coding systems with the same set of architectural and algorithmic assumptions to justify this conclusion. There are a number of problems with the Wickelfeature system which may contribute to poor generalization independently of its failure to provide an input scheme that respects words' constituent structure. However if the conclusion that successful simulation of nonword reading requires that representations contain abstract constituents is validated by systematic comparisons of different coding schemes, it will have important implications for determining the critical distinctions between different models.

For example, one source of evidence that appears to unequivocally implicate knowledge of rules independently of lexical knowledge is provided by the selective preservation of rule-governed reading aloud in patients with "surface dyslexia" (Patterson, Coltheart & Marshall, 1985). This pattern of performance is claimed to provide incontrovertible evidence against PDP models that do not separate rule and lexical knowledge (Coltheart et al., in press). If knowledge of rules is equated with the constituent structure of word representations, then PDP models are functionally equivalent to multiple level models in this regard, and the critical issue for PDP models becomes whether they are sensitive to the number of words containing a particular component (type versatility) independently of the frequency of the word as a whole (token frequency) (Norris, submitted).
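The distinction between type versatility and token frequency can be made concrete with a small sketch. The lexicon and its frequency counts are invented for illustration, and the function names are assumptions introduced here rather than measures defined in the sources cited.

```python
# Hypothetical lexicon: word -> occurrences per million (invented figures).
LEXICON = {
    "have": 500,
    "gave": 40,
    "save": 30,
    "wave": 20,
    "mint": 5,
    "pint": 15,
}


def type_versatility(component, lexicon=LEXICON):
    """Number of distinct words that contain the component."""
    return sum(1 for word in lexicon if component in word)


def token_frequency(word, lexicon=LEXICON):
    """Frequency of the word as a whole."""
    return lexicon[word]


print(type_versatility("ave"), type_versatility("int"))
print(token_frequency("have"), token_frequency("pint"))
```

The two measures can dissociate: a component may occur in many different words, each of which is rare, or in a single word that is very frequent, and it is sensitivity to the former independently of the latter that is at issue.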

Conclusions

Let me stress that I am not in any way denying the utility of computational modelling in shedding light on psychological processes. Nor am I denying that these procedures provide insight into issues that are very difficult to address at the level of verbal models. The non-linear consequences of fully interactive systems can often only be appreciated through explicit simulation. Moreover, there may be inherent regularities within an input domain that are obscured by verbal conceptualisations of the input structure but can be revealed by training a network. Realising these regularities may provide the basis for developing alternative models of task performance.

However, it is critical to maintain awareness of the fact that the models are simply metaphors that are inherently no better or worse than verbal models. They are useful because the act of modelling requires refinement of the metaphor and precise specification of how the elements of the task map onto the elements of the metaphor embodied in the model. But these mappings do not acquire reality by virtue of their precision.

Unlike Latimer, I do not feel in the least uncomfortable in acknowledging that cognitive models, whether verbal or computational, lie in a metaphorical middle ground. In fact, I think that this is one of their major virtues in contributing to understanding brain-behaviour relationships. Essentially, cognitive models provide a "common metric" for relating knowledge of behaviour to knowledge of neurophysiology: a level of description that is appropriate and essential for the task of functional analysis of brain-behaviour relationships.

REFERENCES

Besner, D., Twilley, L., McCann, R. & Seergobin, K. (1990) On the association between connectionism and data: Are a few words necessary? Psychological Review, 97, 432-446.

Coltheart, M. (1980) Lexical access in simple reading tasks. In G. Underwood (Ed.), Strategies for information processing. NY: Academic Press.

Coltheart, M., Curtis, B., Atkins, P. & Haller, M. (in press) Models of reading aloud: Dual route and parallel distributed processing approaches. Psychological Review.

Davies, M. (1990) Knowledge of rules in connectionist networks. Intellectica, 9-10, 81-126.

Dell, G. (1986) A spreading activation theory of retrieval in sentence production. Psychological Review, 93, 283-321.

Hinton, G.E. (1986) Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Humphreys, G. & Evett, L. (1985) Are there independent lexical and nonlexical routes in word processing? An evaluation of the dual-route model of reading. Behavioral and Brain Sciences, 8, 689-740.

Marr, D. (1982) Vision. San Francisco: W.H. Freeman.

Maze, J.R. (1954) Do intervening variables intervene? Psychological Review, 61, 226-234.

McClelland, J.L. & Rumelhart, D.E. (1981) An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375-407.

McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland & G.E. Hinton and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, MA: MIT Press.

Patterson, K. & Coltheart, V. (1987) Phonological processes in reading: A tutorial review. In M. Coltheart (Ed.), Attention and Performance XII. London: Erlbaum.

Patterson, K., Coltheart, M. & Marshall, J. (1985) Surface Dyslexia. London: Erlbaum.

Petersen, S.E., Fox, P.T., Snyder, A.Z. & Raichle, M. (1990) Activation of extrastriate and frontal cortical areas by visual words and word-like stimuli. Science, 249, 1041-1044.

Plaut, D. & McClelland, J.L. (1993) Generalization with componential attractors: Word and nonword reading in an attractor network. Proceedings of the 15th Annual Conference of the Cognitive Science Society.

Seidenberg, M. & McClelland, J. (1989) A distributed developmental model of word recognition and naming. Psychological Review, 96, 523-568.

Seidenberg, M. & McClelland, J. (1990) More words but still no lexicon: Reply to Besner et al. (1990). Psychological Review, 97, 447-452.

Smolensky, P. (1988) On the proper treatment of connectionism. Behavioral and Brain Sciences, 11, 1-74.