Connectionist Modelling and Cognitive Psychology
Max Coltheart
School of Behavioural Sciences
Macquarie University
For more than thirty years, computer simulation has been one of the
techniques used by cognitive psychologists for evaluating, extending or
developing theories about cognition. Danny Latimer begins his paper by
describing two reasons why those cognitive psychologists have favoured
computer simulation - two benefits that arise when a successful
simulation has been achieved. These concern (a) theory sufficiency and
(b) theory completeness.
Theory sufficiency
If one's theory about how human beings carry out some basic cognitive
task such as object recognition, retrieval of information from
long-term memory, reading aloud or attending selectively can be turned
into a computer program which actually succeeds in carrying out the
task in question, that demonstrates that the theory is a sufficient
one. Of course, that does not mean that the theory is a correct
description of how people perform this task. All agree that any
interesting cognitive task, such as any of those I have described, can
be successfully executed in a variety of different ways. One of these
is the way that people do the task. The way the computer simulation
does it might be one of the other ways. But at least if the simulation
is successful, the processing procedure upon which it is based is
thereby shown to be one of the candidates for the correct theory about
people. If no such successful simulations have been achieved, it
remains possible for opponents of the theory to claim that the
procedure proposed by the theory is unworkable in principle; from which
of course it follows that this theory cannot be a correct account of
human cognition. Latimer offers this kind of critique of
global-precedence theories of pattern recognition. He points out that
no one has successfully simulated a procedure for extracting global
perceptual properties of objects directly from the raw stimulus
information (without an intermediate stage of local feature
processing). I understand him therefore to be insinuating that this is
impossible in principle, from which it follows that global-precedence
theories are wrong as accounts of human pattern recognition.
Of course, this isn't a completely knock-down argument, because the
fact that no one has produced this particular kind of simulation so far
does not mean that such a simulation will never be produced. But this
kind of argument is still valuable, since if one keeps repeating the
argument gadfly-style and the relevant simulation still does not
emerge, one's confidence in the claim that it can never emerge will
increase; if, on the other hand, the argument makes global-precedence
theorists realise that they need to produce such a simulation in order
to defend their type of theory, and they succeed in doing so, then we
would know that global-precedence theory is a viable form of
theory of pattern recognition.
I have been involved in just such a dispute (see e.g. Coltheart, 1978,
1985), claiming that it is impossible in principle to devise a single
processing procedure that can accurately read aloud both pronounceable
nonwords such as zint and exception/irregular words such as pint, and
that therefore any theory about the cognitive processes used when
people read aloud which postulates just such a single procedure must be
wrong in principle. This claim has provoked current connectionist
modellers (Plaut & McClelland, 1993) to design and train a neural
network for reading aloud in which, they assert, a single processing
procedure can be learned via back-propagation which does read exception
words and nonwords correctly. If they are correct, then the dispute
turns into an empirical one: if both dual-route and single-route
procedures can be demonstrated, by computer simulation, to be capable
of performing the cognitive task in question, then we will need to
discover which of these procedures (if either) is the one that people
actually use, and this has to be done by deducing conflicting
predictions from the two theories and determining which theory's
predictions agree with what actually happens in appropriate experiments
with human readers.
Theory completeness
The first thing that any theorist learns when attempting to translate a
theory of cognition into a computational model is that the theory is
far less complete and far less explicit than the theorist had realised.
There are always a host of details which were not even thought about
until programming began, but which must be decided upon before there is
any chance at all that the program will even run (let alone
successfully simulate); this is what Latimer refers to in his
distinction between theory-relevant and theory-irrelevant components of
a model. The impatient modeller will often brush aside such details,
making some decisions not on the basis of psychological reality but on
the basis of computational tractability: hacks rather than facts. Such
a modeller makes two assumptions here: that the eventual success of the
simulation is due not to the hacks but to the theory-relevant part of
the computational model, and that the success of the simulation could
be preserved even if the hacks came to be replaced by psychologically
realistic procedures. Perusal of the connectionist literature suggests
that quite often modellers move on to other domains at this stage,
leaving behind these impatiently-constructed models rather than
attending to the theory-irrelevant details - that is, leaving behind a
model that, even though explicit, is incomplete because it possesses
parts which the modeller acknowledges cannot be correct descriptions of
human cognitive processing.
The patient computational modeller, however, can, as Latimer points
out, benefit greatly: not only from discovering unsuspected lacunae in the
theory, but also because the discipline imposed by patient modelling
frequently leads to completely new ideas.
Now, none of the above is specifically about connectionist modelling;
it applies to any kind of computer simulation of cognition. I'd like to
go on now to refer specifically to connectionism, and indeed to just
one particular kind of connectionist modelling: the construction of
neural-net models of cognition using back-propagation or other
supervised learning procedures. What benefits are there for cognitive
psychologists in doing this particular kind of connectionist
modelling?
An example
I'll use the literature on back-propagation neural-net modelling of
reading aloud (Seidenberg & McClelland, 1989; Plaut & McClelland, 1993)
as a specific example, from which I will want to draw some general
conclusions. As I mentioned above, before these authors began to build
computational models of reading aloud, there existed a noncomputational
model - the dual-route model as described e.g. by Coltheart (1985) -
which was widely accepted. This model is based on the idea that
exception words like yacht and colonel, since they contain
word-specific correspondences between spelling and sound, could only be
read aloud by accessing word-specific information (local
representations of individual words in a mental lexicon), plus the idea
that pronounceable nonwords such as slint or bleck could only be read
aloud by use of a set of nonlexical generally-applicable
spelling-to-sound rules (since nonwords would not be present in a
mental lexicon). Thus the claim was that the functional architecture of
the human reading system involved two routes from print to speech: a
lexical route and a nonlexical route.
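The two routes can be illustrated with a toy sketch. This is my own
illustration rather than an implementation from the literature: the
lexicon entries, the rule set, and the phoneme notation below are all
hypothetical simplifications.

```python
# Toy illustration of a dual-route reading-aloud architecture.
# Lexicon entries, rules, and phoneme codes are made up for this sketch.

# Lexical route: whole-word lookup of stored pronunciations.
LEXICON = {
    "pint": "/paInt/",  # exception word: violates the general 'int' rule
    "mint": "/mInt/",   # regular word (also derivable by rule)
}

# Nonlexical route: general spelling-to-sound rules, applied left to right.
RULES = [("int", "Int"), ("z", "z"), ("m", "m"), ("p", "p")]

def nonlexical_route(letters):
    """Assemble a pronunciation from general rules (works for nonwords)."""
    phonemes, i = [], 0
    while i < len(letters):
        for graph, phon in RULES:
            if letters.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:
            i += 1  # no rule matches this letter; skip it
    return "/" + "".join(phonemes) + "/"

def read_aloud(letters):
    """Lexical route answers for known words; nonwords fall through to rules."""
    return LEXICON.get(letters) or nonlexical_route(letters)
```

In this sketch read_aloud("pint") succeeds via the lexicon and
read_aloud("zint") via the rules, whereas the rule route alone would
regularise pint to "/pInt/" — the pattern of errors the dual-route
account was designed to avoid.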
Seidenberg and McClelland (1989) rejected all this, claiming that
there's no lexicon, that there are no rules, and that a single
processing route can do the job. A subsequent model (Plaut &
McClelland, 1993) uses different input and output representations but
makes essentially the same claims. Both models are three-layer networks
trained by backpropagation and then tested for their ability to create
the correct phonological output representations from orthographic input
representations: that is, to read aloud.
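The general shape of such a network — an input layer fully connected to
a hidden layer, fully connected in turn to an output layer, trained by
back-propagation of error — can be sketched minimally. The layer sizes
and the toy copy task below are invented for illustration; the actual
models used specific orthographic and phonological representations.

```python
import numpy as np

# A generic three-layer feedforward network trained by back-propagation.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 10, 8
W1 = rng.normal(0, 0.1, (n_in, n_hid))   # input -> hidden (fully connected)
W2 = rng.normal(0, 0.1, (n_hid, n_out))  # hidden -> output (fully connected)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training patterns: the "phonological" target is just a copy of the
# "orthographic" input, standing in for a spelling-to-sound mapping.
X = rng.integers(0, 2, (20, n_in)).astype(float)
T = X.copy()

lr = 0.5
losses = []
for epoch in range(2000):
    H = sigmoid(X @ W1)              # hidden-layer activations
    Y = sigmoid(H @ W2)              # output-layer activations
    losses.append(np.mean((Y - T) ** 2))
    dY = (Y - T) * Y * (1 - Y)       # output error signal
    dH = (dY @ W2.T) * H * (1 - H)   # error back-propagated to hidden layer
    W2 -= lr * H.T @ dY / len(X)
    W1 -= lr * X.T @ dH / len(X)
```

After training, everything the network "knows" about the mapping is
implicit in W1 and W2 — which is exactly why, as discussed below, its
functional architecture is not obvious from inspection.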
Plaut and McClelland (1993) raise the possibility that their one-route
model might actually be a dual-route model: that is, since it is up to
back-propagation to figure out how to do the task of reading aloud, it
is possible that the trained network ends up using some of its hidden
units (and the connections to and from them) as word-specific
representations, to solve the problem of reading exception words, and
using the rest of the hidden units (and the connections to and from
them) to represent general facts about spelling-to-sound
correspondences, to solve the problem of reading letter-strings that
had not been presented during training (i.e. nonwords). They therefore
carried out some analyses of the behaviour of the trained network to
try to discover whether it was in fact behaving in a dual-route
fashion.
What's the best way of describing what Plaut and McClelland were
engaged in when they were trying to find the answer to this question? I
think a good description is this: they were trying to discover the
functional architecture of their trained network. Of course, in one
sense they knew the architecture of the network, because they designed
it: it is a layer of input units, fully connected to a layer of hidden
units, fully connected to a layer of output units. But this isn't the
functional architecture: let's call it the network architecture. The
network architecture is determined by the modellers, and is obvious.
The functional architecture is grown via the operation of the
backpropagation learning algorithm, and can be very difficult to
discover in realistically large networks.
What are cognitive psychologists interested in? Well, obviously, the
functional architecture. They want to know, for example, whether people
read via one route or two, and that is a question about functional
architecture. If so, and if discovering this property of a large
backpropagation-trained network is so hard, one begins to wonder about
the contributions that this particular kind of neural-net modelling
could make to cognitive psychology.
I argued above that all agree that any interesting cognitive task can
be performed in several different ways. This, in connectionist terms,
amounts to saying that there will be several different points in
multidimensional weight space that represent solutions to the problem.
Backpropagation will discover such points. It is likely that one of
these points is the solution that people use. All the others are
solutions that work, but which people happen not to use. So what
cognitive psychologists will want is for backpropagation to discover
the same solution as people. Here is one place where the distinction
between network architecture and functional architecture is useful. By
"same solution" here I don't mean "same network architecture", I mean
"same functional architecture".
Of course, it is conceivable that the solution people use is so
radically different in principle from anything that could be described
as a neural network trainable by backpropagation that none of these
points is the solution people use. It is also conceivable, at least in
some cognitive domains, that individual differences are substantial so
that different points in weight space correspond to the solutions used
by different people.
This raises a variety of difficulties vis-a-vis the use of
backpropagation-trained networks for studying human cognition. I will
briefly identify six of these problems.
- The analysis of large back-propagation-trained networks is
technically very difficult and it is by no means obvious how, or
even whether, it is possible to determine the functional architecture
of such networks.
- If there are several different points in weight space that
constitute successful solutions to the problem the network is being
trained on, how do we know that back-propagation will find the one that
corresponds to that which people use? Here it might be argued that
individual differences are important; perhaps each of these solutions
is used by some people? What's being claimed here is that there are
qualitative differences between people in the functional architecture
of the systems they use when performing some particular cognitive task.
I know of no evidence for such qualitative differences in the way
people perform basic cognitive activities such as recognising objects
or reading aloud or retrieving information from long-term memory. Of
course, the study of individual differences in cognition is an
important aspect of cognitive psychology; but what is studied here are
quantitative differences defined across a single common functional
architecture.
- Since the back-propagation algorithm will not learn when the
initial weights are all the same (e.g. all zero), it is necessary, if
learning is to occur, to start off with a configuration where the
weights are not all the same. Typically, this is done by assigning
random initial values to the weights. This amounts, of course, to
starting off the network at some random point in weight space. When
there are several points in weight space which constitute solutions,
the particular solution attained by back-propagation may be constrained
by this arbitrarily-chosen starting-point. To the extent that the
solution back-propagation finds is constrained by the initial arbitrary
choice of weights, back-propagation will be making an arbitrary choice
amongst the possible solutions - rather than choosing the particular
solution that is the one people use.
- In practice, when networks are to be trained by back-propagation,
the initial weights are often constrained to be small, and to be random
within this constraint; here the network will be starting off somewhere
near the origin of weight space. So such networks may be constrained to
find solutions which are close to the origin. Do we have any reason to
believe that the human solutions are typically close to the origin of
weight space? If not, then sometimes it may be impossible, rather than
being just a matter of random choice, for back-propagation to reach the
human solution.
- Just as the initial weight configuration may be such that the human
solution will not be found by back-propagation, so it is the case that
the choice of network architecture, which is also made arbitrarily, may
make it impossible for the network to reach the solution that people
reach. Take the matter of choosing the number of hidden units, for
example. Plaut and McClelland (1993) used an arbitrarily-chosen number
of hidden units (one hundred) in their three-layer network. After the
network had been trained, they then investigated whether some of these
hidden units might be dedicated to the task of reading the exception
words in the training corpus, and found no evidence that this was the
case. Now, in this corpus of about 3000 words, there were about 700
exception words (Coltheart et al., 1993) - that is, far more exception
words than hidden units. So the picture might have been completely
different if the number of hidden units had been closer to the number
of exception words in the training corpus - t hat is, the particular
functional architecture attained by the network after training might
have depended upon the decision about how many hidden units to have,
and this is an arbitrary decision.
- Imagine constructing a training set in which there were
demonstrably enough hidden units to allow the network to solve the
problem using a functional architecture that involved using local
representations (such as the dual-route architecture). The set could
contain, say, 100 exception words plus some regular words, and the
network could contain enough hidden units (more than 100) so that one
hidden unit could be assigned to each exception word and the remaining
units assigned to the task of encoding general (rather than
word-specific) information about letter-to-sound relationships. The
network could be initially hardwired to do this, and its success
demonstrated by showing that with this hardwiring all the exception
words are read correctly but so also are nonwords, which had not been
considered when the hardwiring was determined. Then the weights could
be initialised in the typical way (small and random) and the network
trained by backpropagation. With repeated training from scratch, would
the network ever find the demonstrably existent local-representation
solution? In my (extremely limited) experience of such situations,
networks end up with distributed solutions even when local solutions
exist. Now, whether the human solution in any cognitive domain involves
local or distributed representations is a purely empirical question. If
the human solution might be a local one, and if networks trained by
backpropagation do not find local solutions, the use of
back-propagation trained networks for studying human cognition will be
extremely problematic.
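Two of the problems above — that all-equal initial weights block
learning entirely, and that the random starting point influences which
solution is reached — can be demonstrated on even a tiny network. The
XOR problem below is a standard toy example, nothing to do with the
reading models themselves.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: a tiny problem whose weight space contains many distinct solutions.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

def train(seed=None, zero_init=False, epochs=5000, lr=1.0):
    rng = np.random.default_rng(seed)
    if zero_init:
        # All-equal starting configuration: every weight is zero.
        W1, W2 = np.zeros((2, 3)), np.zeros((3, 1))
    else:
        # Typical practice: small random initial weights (near the origin).
        W1, W2 = rng.normal(0, 0.5, (2, 3)), rng.normal(0, 0.5, (3, 1))
    b1, b2 = np.zeros(3), np.zeros(1)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(0)
        W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(0)
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    return W1, float(np.mean((Y - T) ** 2))
```

With zero_init=True every hidden unit receives identical error signals,
so no weight ever changes and the error stays at its starting value;
with different random seeds the network learns, but each seed arrives at
a different point in weight space — a different solution, chosen
arbitrarily rather than for its resemblance to the human one.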
Conclusions
The value of connectionist models to cognitive psychologists is this:
once the psychologist has defined a functional architecture which is a
theory about how people perform some cognitive task, formulating this
as a connectionist model offers the advantages referred to above under
the rubrics "Theory sufficiency" and "Theory completeness". But the
functional architecture must precede the construction of the model. If
the functional architecture is instead the creature of backpropagation,
rather than of the cognitive psychologist after reflection upon the
relevant empirical data, I have argued firstly that it is very
difficult to discover what the trained model's functional architecture
actually is and secondly that there is no reason to expect that this
will correspond to the functional architecture of the relevant human
cognitive processing system. This is why my enthusiasm for
connectionist modelling is limited to an enthusiasm for hard-wired
models, and does not extend to models that train themselves.
References
Coltheart, M. (1978). Lexical access in simple reading tasks. In
Underwood, G. (ed). Strategies of Information-Processing. London:
Academic Press.
Coltheart, M. (1985). Cognitive neuropsychology and the study of
reading. In Posner, M.I., and Marin, O.S.M. (eds). Attention and
Performance XI. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Coltheart, M., Curtis, B., Atkins, P. & Haller, M. (1993). Models of
reading aloud: Dual-route and parallel-distributed-processing
approaches. Psychological Review (in press).
Plaut, D. & McClelland, J.L. (1993) Generalization with componential
attractors: Word and nonword reading in an attractor network.
Proceedings of the 15th Annual Conference of the Cognitive Science
Society.
Seidenberg, M. & McClelland, J.L. (1989) A distributed developmental
model of word recognition and reading. Psychological Review, 97,
523-568.