Connectionist Modelling and Cognitive Psychology
Max Coltheart
School of Behavioural Sciences
Macquarie University
For more than thirty years, computer simulation has been one of the
techniques used by cognitive psychologists for evaluating, extending or
developing theories about cognition. Danny Latimer begins his paper by
describing two reasons why those cognitive psychologists have favoured
computer simulation - two benefits that arise when a successful
simulation has been achieved. These concern (a) theory sufficiency and
(b) theory completeness.
Theory sufficiency
If one's theory about how human beings carry out some basic cognitive
task such as object recognition, retrieval of information from
long-term memory, reading aloud or attending selectively can be turned
into a computer program which actually succeeds in carrying out the
task in question, that demonstrates that the theory is a sufficient
one. Of course, that does not mean that the theory is a correct
description of how people perform this task. All agree that any
interesting cognitive task, such as any of those I have described, can
be successfully executed in a variety of different ways. One of these
is the way that people do the task. The way the computer simulation
does it might be one of the other ways. But at least if the simulation
is successful, the processing procedure upon which it is based is
thereby shown to be one of the candidates for the correct theory about
people. If no such successful simulations have been achieved, it
remains possible for opponents of the theory to claim that the
procedure proposed by the theory is unworkable in principle; from which
of course it follows that this theory cannot be a correct account of
human cognition. Latimer offers this kind of critique of
global-precedence theories of pattern recognition. He points out that
no one has successfully simulated a procedure for extracting global
perceptual properties of objects directly from the raw stimulus
information (without an intermediate stage of local feature
processing). I understand him therefore to be insinuating that this is
impossible in principle, from which it follows that global-precedence
theories are wrong as accounts of human pattern recognition.
Of course, this isn't a completely knock-down argument, because the
fact that no one has produced this particular kind of simulation so far
does not mean that such a simulation will never be produced. But this
kind of argument is still valuable, since if one keeps repeating the
argument gadfly-style and the relevant simulation still does not
emerge, one's confidence in the claim that it can never emerge will
increase; if, on the other hand, the argument makes global-precedence
theorists realise that they need to produce such a simulation in order
to defend their type of theory, and they succeed in doing so, then we
would know that global-precedence theory is a viable form of
theory of pattern recognition.
I have been involved in just such a dispute (see e.g. Coltheart, 1978,
1985), claiming that it is impossible in principle to devise a single
processing procedure that can accurately read aloud both pronounceable
nonwords such as zint and exception/irregular words such as pint, and
that therefore any theory about the cognitive processes used when
people read aloud which postulates just such a single procedure must be
wrong in principle. This claim has provoked current connectionist
modellers (Plaut & McClelland, 1993) to design and train a neural
network for reading aloud in which, they assert, a single processing
procedure can be learned via back-propagation which does read exception
words and nonwords correctly. If they are correct, then the dispute
turns into an empirical one: if both dual-route and single-route
procedures can be demonstrated, by computer simulation, to be capable
of performing the cognitive task in question, then we will need to
discover which of these procedures (if either) is the one that people
actually use, and this has to be done by deducing conflicting
predictions from the two theories and determining which theory's
predictions agree with what actually happens in appropriate experiments
with human readers.
Theory completeness
The first thing that any theorist learns when attempting to translate a
theory of cognition into a computational model is that the theory is
far less complete and far less explicit than the theorist had realised.
There are always a host of details which were not even thought about
until programming began, but which must be decided upon before there is
any chance at all that the program will even run (let alone
successfully simulate); this is what Latimer refers to in his
distinction between theory-relevant and theory-irrelevant components of
a model. The impatient modeller will often brush aside such details,
making some decisions not on the basis of psychological reality but on
the basis of computational tractability: hacks rather than facts. Such
a modeller makes two assumptions here: that the eventual success of the
simulation is due not to the hacks but to the theory-relevant part of
the computational model, and that the success of the simulation could
be preserved even if the hacks came to be replaced by psychologically
realistic procedures. Perusal of the connectionist literature suggests
that quite often modellers move on to other domains at this stage,
leaving behind these impatiently-constructed models rather than
attending to the theory-irrelevant details - that is, leaving behind a
model that, even though explicit, is incomplete because it possesses
parts which the modeller acknowledges cannot be correct descriptions of
human cognitive processing.
The patient computational modeller, however, can, as Latimer points
out, benefit greatly: not only from discovering unsuspected lacunae in the
theory, but also because the discipline imposed by patient modelling
frequently leads to completely new ideas.
Now, none of the above is specifically about connectionist modelling;
it applies to any kind of computer simulation of cognition. I'd like to
go on now to refer specifically to connectionism, and indeed to just
one particular kind of connectionist modelling: the construction of
neural-net models of cognition using back-propagation or other
supervised learning procedures. What benefits are there for cognitive
psychologists in doing this particular kind of connectionist
modelling?
An example
I'll use the literature on back-propagation neural-net modelling of
reading aloud (Seidenberg & McClelland, 1989; Plaut & McClelland, 1993)
as a specific example, from which I will want to draw some general
conclusions. As I mentioned above, before these authors began to build
computational models of reading aloud, there existed a noncomputational
model - the dual-route model as described e.g. by Coltheart (1985) -
which was widely accepted. This model is based on the idea that
exception words like yacht and colonel, since they contain
word-specific correspondences between spelling and sound, could only be
read aloud by accessing word-specific information (local
representations of individual words in a mental lexicon), plus the idea
that pronounceable nonwords such as slint or bleck could only be read
aloud by use of a set of nonlexical generally-applicable
spelling-to-sound rules (since nonwords would not be present in a
mental lexicon). Thus the claim was that the functional architecture of
the human reading system involved two routes from print to speech: a
lexical route and a nonlexical route.
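The two routes can be illustrated with a toy sketch. This is my own
illustration rather than an implementation from the literature: the
lexicon entries, the rule set, and the phoneme notation below are all
hypothetical simplifications.

```python
# Toy illustration of a dual-route reading-aloud architecture.
# Lexicon entries, rules, and phoneme codes are made up for this sketch.

# Lexical route: whole-word lookup of stored pronunciations.
LEXICON = {
    "pint": "/paInt/",  # exception word: violates the general 'int' rule
    "mint": "/mInt/",   # regular word (also derivable by rule)
}

# Nonlexical route: general spelling-to-sound rules, applied left to right.
RULES = [("int", "Int"), ("z", "z"), ("m", "m"), ("p", "p")]

def nonlexical_route(letters):
    """Assemble a pronunciation from general rules (works for nonwords)."""
    phonemes, i = [], 0
    while i < len(letters):
        for graph, phon in RULES:
            if letters.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:
            i += 1  # no rule matches this letter; skip it
    return "/" + "".join(phonemes) + "/"

def read_aloud(letters):
    """Lexical route answers for known words; nonwords fall through to rules."""
    return LEXICON.get(letters) or nonlexical_route(letters)
```

In this sketch read_aloud("pint") succeeds via the lexicon and
read_aloud("zint") via the rules, whereas the rule route alone would
regularise pint to "/pInt/" — the pattern of errors the dual-route
account was designed to avoid.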
Seidenberg and McClelland (1989) rejected all this, claiming that
there's no lexicon, that there are no rules, and that a single
processing route can do the job. A subsequent model (Plaut &
McClelland, 1993) uses different input and output representations but
makes essentially the same claims. Both models are three-layer networks
trained by backpropagation and then tested for their ability to create
the correct phonological output representations from orthographic input
representations: that is, to read aloud.
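The general shape of such a network — an input layer fully connected to
a hidden layer, fully connected in turn to an output layer, trained by
back-propagation of error — can be sketched minimally. The layer sizes
and the toy copy task below are invented for illustration; the actual
models used specific orthographic and phonological representations.

```python
import numpy as np

# A generic three-layer feedforward network trained by back-propagation.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 10, 8
W1 = rng.normal(0, 0.1, (n_in, n_hid))   # input -> hidden (fully connected)
W2 = rng.normal(0, 0.1, (n_hid, n_out))  # hidden -> output (fully connected)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training patterns: the "phonological" target is just a copy of the
# "orthographic" input, standing in for a spelling-to-sound mapping.
X = rng.integers(0, 2, (20, n_in)).astype(float)
T = X.copy()

lr = 0.5
losses = []
for epoch in range(2000):
    H = sigmoid(X @ W1)              # hidden-layer activations
    Y = sigmoid(H @ W2)              # output-layer activations
    losses.append(np.mean((Y - T) ** 2))
    dY = (Y - T) * Y * (1 - Y)       # output error signal
    dH = (dY @ W2.T) * H * (1 - H)   # error back-propagated to hidden layer
    W2 -= lr * H.T @ dY / len(X)
    W1 -= lr * X.T @ dH / len(X)
```

After training, everything the network "knows" about the mapping is
implicit in W1 and W2 — which is exactly why, as discussed below, its
functional architecture is not obvious from inspection.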
Plaut and McClelland (1993) raise the possibility that their one-route
model might actually be a dual-route model: that is, since it is up to
back-propagation to figure out how to do the task of reading aloud, it
is possible that the trained network ends up using some of its hidden
units (and the connections to and from them) as word-specific
representations, to solve the problem of reading exception words, and
using the rest of the hidden units (and the connections to and from
them) to represent general facts about spelling-to-sound
correspondences, to solve the problem of reading letter-strings that
had not been presented during training (i.e. nonwords). They therefore
carried out some analyses of the behaviour of the trained network to
try to discover whether it was in fact behaving in a dual-route
fashion.
What's the best way of describing what Plaut and McClelland were
engaged in when they were trying to find the answer to this question? I
think a good description is this: they were trying to discover the
functional architecture of their trained network. Of course, in one
sense they knew the architecture of the network, because they designed
it: it is a layer of input units, fully connected to a layer of hidden
units, fully connected to a layer of output units. But this isn't the
functional architecture: let's call it the network architecture. The
network architecture is determined by the modellers, and is obvious.
The functional architecture is grown via the operation of the
backpropagation learning algorithm, and can be very difficult to
discover in realistically large networks.
What are cognitive psychologists interested in? Well, obviously, the
functional architecture. They want to know, for example, whether people
read via one route or two, and that is a question about functional
architecture. If so, and if discovering this property of a large
backpropagation-trained network is so hard, one begins to wonder about
the contributions that this particular kind of neural-net modelling
could make to cognitive psychology.
I argued above that all agree that any interesting cognitive task can
be performed in several different ways. This, in connectionist terms,
amounts to saying that there will be several different points in
multidimensional weight space that represent solutions to the problem.
Backpropagation will discover such points. It is likely that one of
these points is the solution that people use. All the others are
solutions that work, but which people happen not to use. So what
cognitive psychologists will want is for backpropagation to discover
the same solution as people. Here is one place where the distinction
between network architecture and functional architecture is useful. By
"same solution" here I don't mean "same network architecture", I mean
"same functional architecture".
Of course, it is conceivable that the solution people use is so
radically different in principle from anything that could be described
as a neural network trainable by backpropagation that none of these
points is the solution people use. It is also conceivable, at least in
some cognitive domains, that individual differences are substantial so
that different points in weight space correspond to the solutions used
by different people.
This raises a variety of difficulties vis-a-vis the use of
backpropagation-trained networks for studying human cognition. I will
briefly identify six of these problems.
- The analysis of large back-propagation-trained networks is
technically very difficult and it is by no means obvious how, or
even whether, it is possible to determine the functional architecture
of such networks.
- If there are several different points in weight space that
constitute successful solutions to the problem the network is being
trained on, how do we know that back-propagation will find the one that
corresponds to that which people use? Here it might be argued that
individual differences are important; perhaps each of these solutions
is used by some people? What's being claimed here is that there are
qualitative differences between people in the functional architecture
of the systems they use when performing some particular cognitive task.
I know of no evidence for such qualitative differences in the way
people perform basic cognitive activities such as recognising objects
or reading aloud or retrieving information from long-term memory. Of
course, the study of individual differences in cognition is an
important aspect of cognitive psychology; but what is studied here are
quantitative differences defined across a single common functional
architecture.
- Since the back-propagation algorithm will not learn when the
initial weights are all the same (e.g. all zero), it is necessary, if
learning is to occur, to start off with a configuration where the
weights are not all the same. Typically, this is done by assigning
random initial values to the weights. This amounts, of course, to
starting off the network at some random point in weight space. When
there are several points in weight space which constitute solutions,
the particular solution attained by back-propagation may be constrained
by this arbitrarily-chosen starting-point. To the extent that the
solution back-propagation finds is constrained by the initial arbitrary
choice of weights, back-propagation will be making an arbitrary choice
amongst the possible solutions - rather than choosing the particular
solution that is the one people use.
- In practice, when networks are to be trained by back-propagation,
the initial weights are often constrained to be small, and to be random
within this constraint; here the network will be starting off somewhere
near the origin of weight space. So such networks may be constrained to
find solutions which are close to the origin. Do we have any reason to
believe that the human solutions are typically close to the origin of
weight space? If not, then sometimes it may be impossible, rather than
being just a matter of random choice, for back-propagation to reach the
human solution.
- Just as the initial weight configuration may be such that the human
solution will not be found by back-propagation, so it is the case that
the choice of network architecture, which is also made arbitrarily, may
make it impossible for the network to reach the solution that people
reach. Take the matter of choosing the number of hidden units, for
example. Plaut and McClelland (1993) used an arbitrarily-chosen number
of hidden units (one hundred) in their three-layer network. After the
network had been trained, they then investigated whether some of these
hidden units might be dedicated to the task of reading the exception
words in the training corpus, and found no evidence that this was the
case. Now, in this corpus of about 3000 words, there were about 700
exception words (Coltheart et al., 1993) - that is, far more exception
words than hidden units. So the picture might have been completely
different if the number of hidden units had been closer to the number
of exception words in the training corpus - t hat is, the particular
functional architecture attained by the network after training might
have depended upon the decision about how many hidden units to have,
and this is an arbitrary decision.
- Imagine constructing a training set in which there were
demonstrably enough hidden units to allow the network to solve the
problem using a functional architecture that involved using local
representations (such as the dual-route architecture). The set could
contain, say, 100 exception words plus some regular words, and the
network could contain enough hidden units (more than 100) so that one
hidden unit could be assigned to each exception word and the remaining
units assigned to the task of encoding general (rather than
word-specific) information about letter-to-sound relationships. The
network could be initially hardwired to do this, and its success
demonstrated by showing that with this hardwiring all the exception
words are read correctly but so also are nonwords, which had not been
considered when the hardwiring was determined. Then the weights could
be initialised in the typical way (small and random) and the network
trained by backpropagation. With repeated training from scratch, would
the network ever find the demonstrably existent local-representation
solution? In my (extremely limited) experience of such situations,
networks end up with distributed solutions even when local solutions
exist. Now, whether the human solution in any cognitive domain involves
local or distributed representations is a purely empirical question. If
the human solution might be a local one, and if networks trained by
backpropagation do not find local solutions, the use of
back-propagation trained networks for studying human cognition will be
extremely problematic.
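Two of the problems above — that all-equal initial weights block
learning entirely, and that the random starting point influences which
solution is reached — can be demonstrated on even a tiny network. The
XOR problem below is a standard toy example, nothing to do with the
reading models themselves.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: a tiny problem whose weight space contains many distinct solutions.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

def train(seed=None, zero_init=False, epochs=5000, lr=1.0):
    rng = np.random.default_rng(seed)
    if zero_init:
        # All-equal starting configuration: every weight is zero.
        W1, W2 = np.zeros((2, 3)), np.zeros((3, 1))
    else:
        # Typical practice: small random initial weights (near the origin).
        W1, W2 = rng.normal(0, 0.5, (2, 3)), rng.normal(0, 0.5, (3, 1))
    b1, b2 = np.zeros(3), np.zeros(1)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(0)
        W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(0)
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    return W1, float(np.mean((Y - T) ** 2))
```

With zero_init=True every hidden unit receives identical error signals,
so no weight ever changes and the error stays at its starting value;
with different random seeds the network learns, but each seed arrives at
a different point in weight space — a different solution, chosen
arbitrarily rather than for its resemblance to the human one.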
Conclusions
The value of connectionist models to cognitive psychologists is this:
once the psychologist has defined a functional architecture which is a
theory about how people perform some cognitive task, formulating this
as a connectionist model offers the advantages referred to above under
the rubrics "Theory sufficiency" and "Theory completeness". But the
functional architecture must precede the construction of the model. If
the functional architecture is instead the creature of backpropagation,
rather than of the cognitive psychologist after reflection upon the
relevant empirical data, I have argued firstly that it is very
difficult to discover what the trained model's functional architecture
actually is and secondly that there is no reason to expect that this
will correspond to the functional architecture of the relevant human
cognitive processing system. This is why my enthusiasm for
connectionist modelling is limited to an enthusiasm for hard-wired
models, and does not extend to models that train themselves.
References
Coltheart, M. (1978). Lexical access in simple reading tasks. In
Underwood, G. (ed). Strategies of Information-Processing. London:
Academic Press.
Coltheart, M. (1985). Cognitive neuropsychology and the study of
reading. In Posner, M.I., and Marin, O.S.M. (eds). Attention and
Performance XI. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Coltheart, M., Curtis, B., Atkins, P. & Haller, M. (1993). Models of
reading aloud: Dual-route and parallel-distributed-processing
approaches. Psychological Review (in press).
Plaut, D. & McClelland, J.L. (1993) Generalization with componential
attractors: Word and nonword reading in an attractor network.
Proceedings of the 15th Annual Conference of the Cognitive Science
Society.
Seidenberg, M. & McClelland, J.L. (1989) A distributed developmental
model of word recognition and reading. Psychological Review, 97,
523-568.