Structure of Intelligence -- Copyright Springer-Verlag © 1993
Twenty years ago, Marr (1969) and Albus (1971) suggested that the circuitry of the cerebellum resembles the learning machine known as the "perceptron." A perceptron learns how to assign an appropriate output to each input by obeying the suggestions of its "teacher". The teacher provides encouragement when the perceptron is successful, and discouragement otherwise. Marr and Albus proposed that the climbing fibers in the cerebellum play the role of the teacher, and the mossy fibers play the role of the input to which the perceptron is supposed to assign output.
Perceptrons are no longer in vogue. However, the general view of the cerebellum as a learning machine has received a significant amount of experimental support. For instance, Ito (1984) has studied the way the brain learns the vestibulo-ocular reflex -- the reflex which keeps the gaze of the eye at a fixed point, regardless of head movement. This reflex relies on a highly detailed program, but it is also situation-dependent in certain respects; and it is now clear that the cerebellum can change the gain of the vestibulo-ocular reflex in an adaptive way.
The cerebellum, in itself, is not capable of coordinating complex movements. However, Fabre and Buser (1980) have suggested that similar learning takes place in the motor cortex -- the part of the cortex that is directly connected to the cerebellum. In order to learn a complex movement, one must do more than just change a few numerical values in a previous motion (e.g. the gain of a reflex arc, the speed of a muscle movement). Sakamoto, Porter and Asanuma (1987) have obtained experimental evidence that the sensory cortex of a cat can "teach" its motor cortex how to retrieve food from a moving beaker.
Asanuma (1989) has proposed that "aggregates of neurons constitute the basic modules of motor function", an hypothesis which is in agreement with Edelman's theory of Neural Darwinism. He goes on to observe that "each module has multiple loop circuits with many other modules located in various areas of the brain" -- a situation illustrated roughly by Figure 10. In this view, the motor cortex is a network of "schemes" or "programs", each one interacting with many others; and the most interesting question is: how is this network structured?
Consider an algorithm y=A(f,x) which takes in a guess x at the solution to a certain problem f and outputs a (hopefully better) guess y at the solution. Assume that it is easy to compute and compare the quality Q(x) of guess x and the quality Q(y) of guess y. Assume also that A contains some parameter p (which may be a numerical value, a vector of numerical values, etc.), so that we may write y=A(f,x,p). Then, for a given set S of problems f whose solutions all lie in some set R, there may be some value p which maximizes the average over all f in S of the average over all x in R of Q(A(f,x,p)) - Q(x). Such a value of p will be called optimal for S.
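To make the definition concrete, here is a minimal sketch, assuming that S, R and the set of candidate values of p are all small and finite, so that the double average can be computed by brute force. The function names and the toy problem are illustrative assumptions, not part of the theory. The text writes Q(x) with the problem f implicit; the code passes f explicitly.

```python
from statistics import mean

def average_improvement(A, Q, S, R, p):
    """Average over f in S of the average over x in R of Q(A(f,x,p)) - Q(x)."""
    return mean([
        mean([Q(f, A(f, x, p)) - Q(f, x) for x in R])
        for f in S
    ])

def optimal_parameter(A, Q, S, R, candidates):
    """Return the candidate p with the largest average improvement over S and R."""
    return max(candidates, key=lambda p: average_improvement(A, Q, S, R, p))

# Toy instance (hypothetical): each problem f is a curvature, and the task
# is to minimise f * x**2, so the quality of a guess is Q = -f * x**2.
def Q(f, x):
    return -f * x * x

def A(f, x, p):
    # One gradient step of size p on f * x**2.
    return x - p * 2.0 * f * x

S = [1.0, 2.0]          # set of problems
R = [-1.0, 1.0, 2.0]    # set of guesses
best_p = optimal_parameter(A, Q, S, R, candidates=[0.1, 0.25, 0.5])
print(best_p)           # 0.25 is optimal for this S among the three candidates
```

Note that p=0.5 is the best step size for f=1 alone but makes no progress at all on f=2; the optimum for S is a compromise between the problems in S.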
The determination of the optimal value of p for a given S can be a formidable optimization problem, even in the case where S has only one element. In practice, since one rarely possesses a priori information as to the performance of an algorithm under different parameter values, one is required to assess the performance of an algorithm with respect to different parameter values in a real-time fashion, as the algorithm operates. For instance, a common technique in numerical analysis is to try p=a for (say) fifty passes of A, then p=b for fifty passes of A, and then adopt the value that seems to be more effective on a semi-permanent basis. Our goal here is a more general approach.
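The fifty-pass technique just mentioned can be sketched as follows. This is only an illustration under a simplifying assumption: each pass of A is applied to a fresh (problem, guess) pair taken from a list of trials, so that the two batches are comparable. The names compare_two_settings and trials, and the toy A and Q, are hypothetical.

```python
def compare_two_settings(A, Q, trials, a, b, passes=50):
    """Use p=a on the first `passes` trials and p=b on the next `passes`;
    return the setting with the larger total gain, plus both totals."""
    def total_gain(p, batch):
        return sum(Q(f, A(f, x, p)) - Q(f, x) for f, x in batch)

    gain_a = total_gain(a, trials[:passes])
    gain_b = total_gain(b, trials[passes:2 * passes])
    return (a if gain_a >= gain_b else b), gain_a, gain_b

# Hypothetical toy: minimise f * x**2 by one gradient step of size p.
def Q(f, x):
    return -f * x * x

def A(f, x, p):
    return x - p * 2.0 * f * x

trials = [(1.0, 1.0 + (i % 5)) for i in range(100)]   # (problem, guess) pairs
chosen, gain_a, gain_b = compare_two_settings(A, Q, trials, a=0.1, b=0.25)
```

The comparison is only as fair as the two batches are alike. If the second fifty passes happen to meet easier problems, the wrong value may be adopted, which is one reason to look for a more general approach.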
Assume that A has been applied to various members of S from various guesses x, with various values of p. Let U denote the n x 2 matrix whose i'th row is (fi,xi), and let P denote the n x 1 vector whose i'th entry is pi, where fi, xi and pi are the values of f, x and p to which the i'th pass of A was applied. Let I denote the n x 1 vector whose i'th entry is Q(A(fi,xi,pi))-Q(xi). The crux of adaptation is finding a connection between parameter values and performance; in terms of these matrices this implies that what one seeks is a function C(X,Y) such that ||C(U,P)-I|| is small, for some norm || ||.
So: once one has by some means determined C which thus relates U and I, then what? The overall object of the adaptation (and of A itself) is to maximize the size of I (specifically, the most relevant measure of size would seem to be the l1 norm, according to which the norm of a vector is the sum of the absolute values of its entries). Thus one seeks to maximize the function C(X,Y) with respect to Y.
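As a deliberately crude illustration of these two steps, the sketch below takes C(X,Y) to be the mean recorded improvement for parameter value Y, ignoring X altogether. This choice of C is an assumption made for brevity, not a recommendation; any regression method could stand in its place. The residual is measured in the l1 norm, and the records U, P and I are made-up numbers.

```python
from collections import defaultdict

def fit_mean_model(P, I):
    """Fit C(X, Y) = mean recorded improvement under parameter value Y."""
    sums, counts = defaultdict(float), defaultdict(int)
    for p, gain in zip(P, I):
        sums[p] += gain
        counts[p] += 1
    means = {p: sums[p] / counts[p] for p in sums}

    def C(X, Y):
        return means[Y]          # X is ignored by this very simple model

    return C, means

def l1_residual(C, U, P, I):
    """The l1 norm of C(U,P) - I: sum of absolute prediction errors."""
    return sum(abs(C(u, p) - gain) for u, p, gain in zip(U, P, I))

# Hypothetical records of four passes of A.
U = [("f1", 2.0), ("f1", 1.0), ("f2", 2.0), ("f2", 1.0)]   # rows (fi, xi)
P = [0.1, 0.1, 0.25, 0.25]                                 # entries pi
I = [1.0, 3.0, 4.0, 6.0]                                   # observed improvements

C, means = fit_mean_model(P, I)
best_p = max(means, key=means.get)     # maximise C with respect to Y
residual = l1_residual(C, U, P, I)
```

Because this C cannot see X, its residual stays large whenever the best parameter value depends on the problem or on the current guess; a more accurate C would have to use both arguments.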
PARAMETER ADAPTATION AS A BANDIT PROBLEM
The problem here is that one must balance three tasks: experimenting with p so as to locate an accurate C, experimenting with P so as to locate a maximum of C with respect to Y, and at each stage implementing what seems, on the basis of current knowledge, to be the most appropriate p, so as to get the best answer out of A. This sort of predicament, in which one must balance experimental variation with use of the best results found through past experimentation, is known as a "bandit problem" (Gittins, 1989). The reason for the name is the following question: given a "two-armed bandit", a slot machine with two handles such that pulling each handle gives a possibly different payoff, according to what strategy should one distribute pulls among the two handles? If after a hundred pulls, the first handle seems to pay off twice as well, how much more should one pull the second handle just in case this observation is a fluke?
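One simple and well-known heuristic for the two-armed bandit is the "epsilon-greedy" rule: with small probability epsilon pull a handle at random, and otherwise pull the handle with the best average payoff so far. It is sketched here only to make the trade-off tangible; it is not the optimal strategy, and it is not the method proposed in this chapter. The payoff probabilities and the value of epsilon are arbitrary assumptions.

```python
import random

def epsilon_greedy(payoffs, pulls, epsilon=0.1, seed=0):
    """Distribute `pulls` among the handles; return pull counts and mean payoffs."""
    rng = random.Random(seed)
    counts = [0] * len(payoffs)
    estimates = [0.0] * len(payoffs)
    for t in range(pulls):
        if t < len(payoffs):
            arm = t                                   # try every handle once
        elif rng.random() < epsilon:
            arm = rng.randrange(len(payoffs))         # explore
        else:
            arm = max(range(len(payoffs)), key=estimates.__getitem__)  # exploit
        reward = payoffs[arm](rng)
        counts[arm] += 1
        # Running mean of the payoffs observed on this handle.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Hypothetical machine: the second handle pays off twice as often as the first.
payoffs = [
    lambda rng: 1.0 if rng.random() < 0.3 else 0.0,
    lambda rng: 1.0 if rng.random() < 0.6 else 0.0,
]
counts, estimates = epsilon_greedy(payoffs, pulls=1000)
```

The fixed epsilon is exactly the kind of arbitrary choice the question above is asking about: it decides in advance how much to keep pulling the apparently worse handle, without regard to how strong the evidence has become.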
To be more precise, the bandit problem associated with adaptation of parameters is as follows. In practice, one would seek to optimize C(X,Y) with respect to Y by varying Y about the current optimal value according to some probability distribution. The problem is: what probability distribution? One could, of course, seek to determine this adaptively, but this leads to a regress: how does one solve the bandit problem associated with this adaptation?
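The difficulty shows up as soon as one tries to write the procedure down. The sketch below varies the parameter about the current best value using a Gaussian distribution, which is purely an assumption; the standard deviation sigma is precisely the quantity for which the text offers no principled choice. The function score stands in for C(X,Y) with X held fixed, and is hypothetical.

```python
import random

def perturb_and_keep_best(score, p0, sigma=0.1, steps=200, seed=0):
    """Sample candidates around the current best p; keep any that score higher."""
    rng = random.Random(seed)
    best_p, best_score = p0, score(p0)
    for _ in range(steps):
        candidate = rng.gauss(best_p, sigma)   # vary Y about the current optimum
        s = score(candidate)
        if s > best_score:
            best_p, best_score = candidate, s
    return best_p, best_score

# Hypothetical performance curve with its maximum at p = 0.3.
def score(p):
    return -(p - 0.3) ** 2

best_p, best_score = perturb_and_keep_best(score, p0=1.0)
```

A small sigma exploits the current estimate but explores slowly; a large sigma explores widely but wastes passes of A. Adapting sigma itself would require a further distribution over values of sigma, which is the regress described above.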
I propose a motor control hierarchy which is closely analogous to the perceptual hierarchy, but works in the opposite direction. In the motor control hierarchy, the lower levels deal directly with muscle movements, with bodily functions; whereas the higher levels deal with patterns in bodily movements, with schemes for arranging bodily movements. This much is similar to the perceptual hierarchy. But in the motor control hierarchy, the primary function of a processor on level n is to instruct processors on level n-1 as to what they should do next. The most crucial information transmission is top-down. Bottom-up information transmission is highly simplistic: it is of the form "I can do what you told me to do with estimated effectiveness E".
Let us be more precise. When we say a processor on the n'th level tells a processor on the (n-1)'th level what to do, we mean that the higher processor gives the lower one a certain goal and tells it to fulfill that goal. That is, it poses the lower processor a certain optimization problem: do something which produces a result as near to this goal as possible. The processor on the (n-1)'th level must then implement some scheme for solving this problem, for approximating the desired goal. And its scheme will, in general, involve giving instructions to certain processors on the (n-2)'th level. The important point is that each level need know nothing about the operation of processors 2 or 3 levels down from it. Each processor supplies its subordinates with ends, and the subordinates must conceive their own means. As with the perceptual hierarchy, consciousness plays a role only on certain relatively high levels. So, from the point of view of consciousness, the motor control hierarchy has no definite end. But, from the point of view of external reality, there is an indisputable bottom level: physical actions. The lowest level of the motor control hierarchy therefore has no subordinates except for physical, nonintelligent systems. It must therefore prescribe means, not merely ends.
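The division of labor between levels can be illustrated with a toy program. Everything in it is an assumption made for the sake of a runnable example: goals are single numbers, a processor splits its goal equally among its subordinates, and a bottom-level processor is a "muscle" with a fixed physical limit. What the sketch preserves is the information flow: goals travel down one level at a time, and only an effectiveness estimate E travels up.

```python
class Processor:
    """One node of a toy motor control hierarchy (illustrative names only)."""

    def __init__(self, name, subordinates=(), max_output=1.0):
        self.name = name
        self.subordinates = list(subordinates)
        self.max_output = max_output   # physical limit, used on the bottom level

    def pursue(self, goal):
        """Try to achieve `goal`; return estimated effectiveness E in [0, 1]."""
        if not self.subordinates:
            # Bottom level: prescribe means directly (here, a clamped output).
            achieved = max(-self.max_output, min(self.max_output, goal))
            return 1.0 if goal == 0 else 1.0 - abs(goal - achieved) / abs(goal)
        # Higher level: hand each subordinate an end, not a means.
        share = goal / len(self.subordinates)
        reports = [sub.pursue(share) for sub in self.subordinates]
        return sum(reports) / len(reports)

strong = Processor("strong muscle", max_output=1.0)
weak = Processor("weak muscle", max_output=0.5)
limb = Processor("limb", [strong, weak])
arm = Processor("arm", [limb])
E = arm.pursue(2.0)   # 0.75: the weak muscle reaches only half of its share
```

The processor named "arm" never refers to the muscles: it sees only the single number reported by "limb".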
Now, where do these "schemes" for optimization come from? Some are certainly preprogrammed -- e.g. a human infant appears to have an inborn "sucking reflex". But -- as observed above -- even a cursory examination of motor development indicates that a great deal of learning is involved.
Let us assume that each processor is not entirely free to compute any function within its capacity; that it has some sort of general "algorithm scheme", which may be made more precise by the specification of certain "parameter values". Then there is first of all the problem of parameter adaptation: given an optimization problem and a method of solution which contains a number of parameter values, which parameter values are best? In order to approximately solve this problem according to the scheme given above, all that is required is an estimate of how "effective" each parameter value tends to be. In the motor control hierarchy, a processor on level n must obtain this estimate from the processors on level n-1 which it has instructed. The subordinate processors must tell their controlling processor how well they have achieved their goal. The effectiveness with which they have achieved their goal is a rough indication of how effective the parameter values involved are for that particular problem.
So, on every level but the lowest, each processor in the hierarchy tells certain subordinate lower-level processors what to do. If they can do it well, they do it and are not modified. But if they cannot do their assigned tasks well, they are experimentally modified until they can do a satisfactory job. The only loose end here is the nature of this experimental modification. Parameter adaptation is only part of the story.
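In its barest form, the rule "leave a subordinate alone if it performs well, otherwise modify it until it does" might look like the following sketch. The threshold for "satisfactory", the list of candidate parameter values and the function try_with (which stands for instructing a subordinate and receiving its effectiveness report) are all hypothetical.

```python
def adapt_until_satisfactory(try_with, candidates, threshold=0.9):
    """Try parameter values in order until the reported effectiveness is
    satisfactory; return the best value, its effectiveness, and trials used."""
    best_p, best_e, trials = None, float("-inf"), 0
    for p in candidates:
        e = try_with(p)            # subordinate reports how well it did
        trials += 1
        if e > best_e:
            best_p, best_e = p, e
        if e >= threshold:
            break                  # good enough: no further modification
    return best_p, best_e, trials

# Hypothetical subordinate whose effectiveness peaks at p = 0.6.
def try_with(p):
    return 1.0 - abs(p - 0.6)

result = adapt_until_satisfactory(try_with, [0.2, 0.4, 0.6, 0.8])
```

Trying candidates in a fixed order is the weakest possible form of experimental modification; it says nothing about where the candidates come from.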
MOTOR CONTROL AND ASSOCIATIVE MEMORY
Knowing how effective each vector of parameter values is for each particular problem is useful, but not adequate for general motor control. After all, what happens when some new action is required, some action for which optimal parameter values have not already been estimated? It would be highly inefficient to begin the parameter optimization algorithm from some random set of values. Rather, some sort of educated guess is in order. This means something very similar to analogical reasoning is required. Presented with a new task, a motor control processor must ask: what parameter values have worked for similar tasks?
So, each motor control processor must first of all have access to the structurally associative memory, from which it can obtain information as to which tasks are similar to which tasks. And it must also have access to a memory bank storing estimates of optimal parameter values for given tasks. In this way it can select appropriate schemes for regulating action.
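A minimal sketch of such an educated guess follows. It replaces the structurally associative memory with something far simpler, a nearest-neighbor lookup over numerical task descriptions, so it should be read as an illustration of the idea rather than a model of the memory itself. The feature vectors and parameter values are made up.

```python
import math

def educated_guess(new_task, memory):
    """Return the parameter values stored for the most similar known task."""
    nearest = min(memory, key=lambda task: math.dist(task, new_task))
    return memory[nearest]

# Hypothetical memory bank: task features -> parameter values found to work.
# Features are (weight, width); parameters are (hand speed, tilt angle).
memory = {
    (0.2, 0.3): (12.0, 10.0),   # ordinary frisbee
    (7.0, 0.1): (4.0, 40.0),    # shot-put
}
start = educated_guess((0.15, 0.3), memory)   # frisbee with a hole in the middle
```

The value returned is only a starting point: parameter adaptation would then proceed from these values rather than from a random set.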
Based on the biological facts reviewed above, it is clear that this aspect of motor control is native to the motor cortex rather than the cerebellum. To learn a complex action, the brain must invoke the greater plasticity of the cortex.
LEARNING TO THROW
Introspectively speaking, all this is little more than common sense. To figure out how to throw a certain object, we start out with the motions familiar to us from throwing similar objects. Then, partly consciously but mainly unconsciously, we modify the "parameters" of the motions: we change the speed of our hand or the angle at which the object is tilted. Based on trial-and-error experimentation with various parameters, guided by intuition, we arrive at an optimal, or at least adequate, set of motions.
This process may be simple or sophisticated. For instance, when first throwing a frisbee with a hole in the middle, one throws it as if it were an ordinary frisbee; but then one learns the subtle differences. In this case the major problem is fine-tuning the parameters. But when learning to throw a shot-put, or a football, the only useful item to be obtained from memory is the general scheme of "throwing" -- all the rest must be determined by conscious thought or, primarily, experiment.
And when learning to juggle, or when learning to throw for the first time, the mind must synthesize whole new patterns of timing and coordination: there is not even any "scheme" which can be applied. Fragments of known programs must be pieced together and augmented to form a new program, which then must be fine-tuned.
More and more difficult tasks require higher and higher levels of the motor control hierarchy -- both for learning and for execution. Even the very low levels of the motor control hierarchy are often connected to the perceptual hierarchy; but the higher levels involve a great deal of interaction with yet other parts of the mind.
In Chapter 6 we used Edelman's theory of Neural Darwinism to explore the nature of neural analogy. However, we did not suggest how the "lower-to-intermediate-level" details discussed there might fit into a theory of higher-level brain function. It is possible to give a partial Neural-Darwinist analysis of the perceptual and motor hierarchies. This entails wandering rather far from the biological data; however, given the current state of neuroscience, there is little choice.
Assume that the inputs of certain neural clusters are connected to sensory input, and that the outputs of certain clusters are connected to motor controls. The purpose of the brain is to give "appropriate" instructions to the motor controls, and the determination of appropriateness at any given time is in large part dependent upon the effects of past instructions to the motor controls -- i.e. on using sensory input to recognize patterns between motor control instructions and desirability of ensuing situation.
In order to make sense of this picture, we must specify exactly how appropriateness is to be determined. Toward this end I will delineate a hierarchy of maps. Maps which are connected to both sensory inputs and motor controls (as well as, naturally, other clusters) we shall call level 1 maps. These maps are potentially able to "sense" the effect of motor control instructions, and formulate future motor control instructions accordingly.
One question immediately arises: How are appropriate level 1 maps arrived at and then maintained? In the simple Hebb rule discussed in Chapter 6, we have a mechanism by which any map, once repeatedly used, will be reinforced and hence maintained; but this says nothing about the problem of arriving at an appropriate map in the first place. Rather than proposing a specific formula, let us dodge the issue by asserting that the appropriateness of level 1 maps should be determined on the basis of the degree to which the levels of certain chemical substances in their vicinity are maintained within biologically specified "appropriate" bounds. This cowardly recourse to biological detail can serve as the "ground floor" of an interesting general definition of appropriateness.
Define a map which is not a level-1 map to be appropriate to the extent that the maps or motor controls to which it outputs are appropriate. The idea is that a map is appropriate to the extent that the entities over which it has (partial) control are appropriate. The appropriateness of a level 1 map is partially determined by the extent to which it directs motor controls to make appropriate actions. And in the long run -- barring statistical fluctuations -- this is roughly equivalent to the extent to which it represents an emergent pattern between 1) results of motor control and 2) appropriateness as measured by sensory data and otherwise. This is the crucial observation. In general, the appropriateness of a map is determined by the extent to which it directs other maps to effect appropriate actions, either directly on motor controls or on other maps. And, barring statistical fluctuations, it is plain that this is roughly equivalent to the extent to which it represents an emergent pattern between 1) results of outputs to other maps and 2) inputs it obtains from various sources.
It is important to remember that we are not hypothesizing the brain to contain distinct "pattern recognition processors" or "analogical reasoning processors" or "memory cells": our model of mind is schematic and logical, not necessarily physical; it is a model of patterns and processes. We have hypothesized a neural mechanism which tends to act much like a pattern recognition processor, and that is all that can reasonably be expected.
Now, let us go beyond the level 1 maps. Define a degree 2 map as a map which outputs to level 1 maps (as well as possibly inputting from level 1 maps and other maps and outputting to other maps). Define a degree 3 map as one which outputs to degree 2 maps (as well as possibly level 1 maps, etc.). One may define maps of degree 4, 5, 6, ... in a similar way. The level of a map is then defined as the highest degree which it possesses. If a level k map accepted inputs only from maps of level k-1 or lower, the network of maps would have a strictly hierarchical structure. There would be a "top" level n, and our definition of appropriateness would define appropriateness on all levels less than n in terms of top-level appropriateness, but say nothing about the appropriateness of a map on level n. But in fact, the maps of the brain are arranged in a far less orderly fashion. Although there is a bottom level -- the level of perception and action -- there is no distinct top level.
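The definition of level can be checked on small examples. The sketch below represents a network as a dictionary from each map to the maps it outputs to, and computes the level of a map as one more than the highest level among its targets. It assumes that level 1 maps are given in advance and stay at level 1, and that every other map outputs to at least one map; the map names are hypothetical. When maps output to each other in a loop, degrees grow without bound, and the function reports an infinite level.

```python
import math

def map_levels(outputs, level_one):
    """outputs: map -> list of maps it outputs to. Returns map -> level."""
    levels = {m: 1 for m in level_one}
    in_progress = set()

    def level(m):
        if m in levels:
            return levels[m]
        if m in in_progress:
            return math.inf        # re-entrant loop: no highest degree exists
        targets = outputs.get(m, [])
        if not targets:
            raise ValueError(f"{m!r} outputs to no map, so it has no degree")
        in_progress.add(m)
        result = 1 + max(level(t) for t in targets)
        in_progress.remove(m)
        levels[m] = result
        return result

    for m in outputs:
        level(m)
    return levels

# A strictly hierarchical network.
hierarchy = {"reach": ["grip", "lift"], "lift": ["grip"], "grip": ["m1"]}
tidy = map_levels(hierarchy, level_one={"m1"})

# Two maps which output to each other: neither has a finite level.
loop = {"A": ["B"], "B": ["A", "m1"]}
tangled = map_levels(loop, level_one={"m1"})
```

The second example is the situation of the brain in miniature: there is a bottom level, but the loop leaves no meaningful top level.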
The nonhierarchical interconnection of the maps of the brain implies that the evaluation of appropriateness is a very tricky matter. If A and B both input and output to each other, then the appropriateness of A is determined in part as an increasing function of the appropriateness of B, and vice versa. The hierarchy of maps does bottom out at level 1, but it also re-enters itself multiply. In a very rough way, this explains a lot about human behavior: our internal definition of "appropriateness" is determined not only by low-level biological factors but also by a subtle, probably unstable dynamical system of circularly reinforcing and inhibiting patterns.
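To see what such a circular definition involves, consider the following toy computation. It assumes, purely for illustration, that the appropriateness of a map is the average appropriateness of the maps to which it outputs, and that level 1 maps have fixed scores supplied from outside (standing in for the chemical criterion). Scores are then recomputed repeatedly until they settle. The map names and numbers are hypothetical, and every target must be either a level 1 map or a key of outputs.

```python
def settle_appropriateness(outputs, ground, sweeps=100):
    """outputs: map -> maps it outputs to; ground: fixed level 1 scores."""
    score = dict(ground)
    for m in outputs:
        score.setdefault(m, 0.5)                  # neutral starting value
    for _ in range(sweeps):
        for m, targets in outputs.items():
            if m not in ground:
                # A map is as appropriate as what it controls, on average.
                score[m] = sum(score[t] for t in targets) / len(targets)
    return score

# A and B output to each other, and each also outputs to one level 1 map.
outputs = {"A": ["B", "touch"], "B": ["A", "grasp"]}
ground = {"touch": 0.8, "grasp": 0.4}
score = settle_appropriateness(outputs, ground)
```

With plain averaging the scores settle to a fixed point (here 2/3 for A and 8/15 for B). That stability is a consequence of the averaging assumption only; with reinforcing and inhibiting interactions of a less tame kind, nothing guarantees convergence.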
We have not yet specified what exactly happens to inappropriate maps. Clearly, an inappropriate map should be dissolved, so that it will no longer direct behavior, and so that a new and hopefully better map can take its place. The easiest way to effect this would be to inhibit the connections between clusters of the map -- to decrease their conductances (roughly speaking, proportionally to the lack of appropriateness). Naturally, if a connection belonged to more than one map, this decrease would be mitigated by the increase afforded by membership in other maps, but this effect need not rob inhibition of its effectiveness.
Biologically, how might such a system of inhibition work? It is known that if a once-frequently-used connection between clusters is unused for a time, its conductance will gradually revert to the level of other infrequently-used connections. Naturally, the presence of inhibitory connections between individual neurons plays a role in this. However, it is not presently known whether this effect is sufficient for the suppression of inappropriate maps.
At this point, just as we are beginning to move toward the higher levels of mental process, we must abandon our discussion of the brain. We have already left our starting point, Neural Darwinism, too far behind. On the positive side, we have constructed a hierarchy of neural maps which is, intuitively, a combination of the perceptual and motor control hierarchies: it recognizes patterns and it controls actions, using a simple form of analogy, in an interconnected way. However, we have accounted for only a few of the simpler aspects of the master network to be described in Chapter 12 -- we have spoken only of pattern recognition and a simple form of structural analogy. We have said nothing of induction, deduction, Bayesian inference, or modeling or contextual analogy, or even more general forms of structural analogy. It is difficult to see how these subtler aspects of mentality could be integrated into the Neural Darwinist framework without altering it beyond recognition. It seems to me that they may require somewhat more structure than the self-organizing network of maps which Neural Darwinism proposes.