Chaotic Logic -- Copyright Plenum Press © 1994 |
Hundreds of thousands of pages have been written on the question: what is mind? Here I will dispense with the question immediately. In good mathematical form, I will define it away. A mind is the structure of an intelligent system.
This definition has its plusses and minuses. One may endlessly debate whether it captures every nuance of the intuitive concept of mind. But it does situate mind in the right place: neither within the physical world, nor totally disconnected from the physical world. If a mind is the structure of a certain physical system, then mind is made of relations between physical entities.
The question, then, is whether this system of relations that is mind has any characteristic structure. Are all minds somehow alike? Locomotion can be achieved by mechanisms as different as legs and the wheel -- is this kind of variety possible for the mechanisms of intelligence? I suggest that it is not. There is much room for variety, but the logic of intelligence dictates a certain uniformity of overall structure. The goal of this chapter is to outline what this uniform global structure is.
Of course, one cannot reasonably define mind in terms of intelligence unless one has a definition of intelligence at hand. So, let us say that intelligence is the ability to optimize complex functions of complex environments. By a "complex environment," I mean an environment which is unpredictable on the level of details, but somewhat structurally predictable. And by a "complex function," I mean a function whose graph is unpredictable on the level of details, but somewhat structurally predictable.
The "complex function" involved in the definition of intelligence may be anything from finding a mate to getting something to eat to building a transistor or browsing through a library. When executing any of these tasks, a person has a certain goal, and wants to know what set of actions to take in order to achieve it. There are many different possible sets of actions -- each one, call it X, has a certain effectiveness at achieving the goal.
This effectiveness depends on the environment E, thus yielding an "effectiveness function" f(X,E). Given an environment E, the person wants to find X that maximizes f -- that is maximally effective at achieving the goal. But in reality, one is never given complete information about the environment E, either at present or in the future (or in the past, for that matter). So there are two interrelated problems: one must estimate E, and then find the optimal X based on this estimate.
If you have to optimize a function that depends on a changing environment, you'd better be able to predict at least roughly what that environment is going to do in the future. But on the other hand, if the environment is too predictable, it doesn't take much to optimize functions that depend on it. The interesting kind of environment is the kind that couples unpredictability on the level of state with rough predictability on the level of structure. That is: one cannot predict the future state well even from a good approximation to the present and recent past states, but one can predict the future structure well from a good approximation to the present and recent past structure.
This is the type of partial unpredictability meant in the formulation "Intelligence is the ability to optimize complex functions of partially unpredictable environments." In environments displaying this kind of unpredictability, prediction must proceed according to pattern recognition. An intelligent system must recognize patterns in the past, store them in memory, and construct a model of the future based on the assumption that some of these patterns will approximately continue into the future.
Is there only one type of structure capable of doing this? I claim the answer is yes.
My hypothesis is a simple one: every mind is a superposition of two structures: a structurallyassociative memory (also called "heterarchical network") and a multilevel control hierarchy ("perceptual-motor hierarchy" or "hierarchical network"). Both of these structures are defined in terms of their action on certain patterns. By superposing these two distinct structures, the mind combines memory, perception and control in a creative an effective way.
Let us begin with multilevel control. To solve a problem by the multilevel methodology, one divides one's resources into a number of levels -- say, levels ...,3,2,1,0. Level 0 is the "bottom level", which contains a number of problem-solving algorithms. Each process on level N contains a number of subsidiary processes on levels k = 1, 2, ..., N-1 -- it tells them what to do, and in return they give it feedback as to the efficacy of its instructions.
This is a simple idea of very broad applicability. One clear-cut example is the hierarchical power structure of the large corporation. Level 0 consists of those employees who actually produce goods or provide services for individuals outside the company. Level 1 consists of foremen and other low-level supervisors. And so on. The highest level comprises the corporate president and the board of directors.
A vivid example is the problem of perception. One has a visual image P, and one has a large memory consisting of various images z1, z2,..., zM. One wants to represent the perceived image in terms of the stored images. This is a pattern recognition problem: one wants to find a pair of the form (y,z), where y*z=P and z is a subset of {z1,...,zM}. In this case, the multilevel methodology takes the form of a hierarchy of subroutines. Subroutines on the bottom level -- level 0 -- output simple patterns recognized in the input image P. And, for i>0, subroutines on level i output patterns recognized in the output of level i-1 subroutines. In some instances a subroutine may also instruct the subroutines on the level below it as to what sort of patterns to look for.
It appears that this is one of the key strategies of the human visual system. Two decades ago, Hubel and Wiesel (Hubel, 1988) demonstrated that the brain possesses specific neural clusters which behave as subroutines for judging the orientation of line segments. Since that time, many other neural clusters executing equally specific visual "subroutines" have been found. As well as perhaps being organized in other ways, these clusters appear to be organized in levels.
At the lowest level, in the retina, gradients are enhanced and spots are extracted -- simple mechanical processes. Next come simple moving edge detectors. The next level up, the second level up from the retina, extracts more sophisticated information from the first level up from the retina -- and so on. Admittedly, little is known about the processes two or more levels above the retina. It is clear, however, that there is a very prominent hierarchical structure, although it may be supplemented by more complex forms of parallel information processing (Ruse and Dubose, 1985).
To be extremely rough about it, one might suppose that level 1 corresponds to lines. Then level 2 might correspond to simple geometrical shapes, level 3 might correspond to complex geometrical shapes, level 4 might correspond to simple recognizable objects or parts of recognizable objects, level 5 might correspond to complex recognizable objects, and level 6 might correspond to whole scenes. To say that level 4 processes recognize patterns in the output of level 3 processes is to say that simple recognizable objects are constructed out of complex geometrical shapes, rather than directly out of lines or simple geometrical shapes. Each level 4 process is the parent, the controller, of those level 3 nodes that correspond to those complex geometrical shapes which make up the simple object which it represents. And it is the child, the controlee, of at least one of the level 5 nodes that corresponds to a complex object of which it is a part (or perhaps even of one of the level 6 nodes describing a scene of which it is a part -- level crossing like this can happen, so long as it is not the rule).
My favorite way of illustrating this multilevel control structure is to mention the three-level "pyramidal" vision processing parallel computer developed by Levitan and his colleages at the University of Massachusetts. The bottom level deals with sensory data and with low-level processing such as segmentation into components. The intermediate level takes care of grouping, shape detection, and so forth; and the top level processes this information "symbolically", constructing an overall interpretation of the scene. The base level is a 512X512 square array of processors each doing exactly the same thing to different parts of the image; and the middle level is composed of a 64X64 square array of relatively powerful processors, each doing exactly the same thing to different parts of the base-level array. Finally, the top level contains 64 very powerful processors, each one operating independently according to LISP programs. The intermediate level may also be augmented by additional connections. This three-level perceptual hierarchy appears be be an extremely effective approach to computer vision.
That orders are passed down the perceptual hierarchy was one of the biggest insights of the Gestalt psychologists. Their experiments (Kohler, 1975) showed that we look for certain configurations in our visual input. We look for those objects that we expect to see, and we look for those shapes that we are used to seeing. If a level 5 process corresponds to an expected object, then it will tell its children to look for the parts corresponding to that object, and its children will tell their children to look for the complex geometrical forms making up the parts to which they refer, et cetera.
In its motor control aspect, this multilevel control network serves to send actions from the abstract level to the concrete level. Again extremely roughly, say level 1 represents muscle movements, level 2 represents simple combinations of muscle movements, level 3 represents medium-complexity combinations of muscle movements, and level 4 represents complex combinations of movements such as raising an arm or kicking a ball. Then when a level 4 process gives an instruction to raise an arm, it gives instructions to its subservient level 3 processes, which then give instructions to their subservient level 2 processes, which given instructions to level 1 processes, which finally instruct the muscles on what to do in order to kick the ball. This sort of control moves down the network, but of course all complex motions involve feedback, so that level k processes are monitoring how well their level k-1 processes are doing their jobs and adjusting their instructions accordingly. Feedback corresponds to control moving up the network.
In a less abstract, more practically-oriented language, Bernstein (see Whiting, 1984) has given a closely related analysis of motor control. And a very similar hierarchical model of perception and motor control has been given by Jason Brown (1988), under the name of "microgenesis." His idea is that lower levels of the hierarchy correspond to older, evolutionarily prior forms of perception and control.
Let us sum up. The multilevel control methodology, in itself, has nothing to do with patterns. It is a verysimple and general way of structuring perception and action: subprocesses within subprocesses within subprocesses, each subprocess giving orders to and receiving feedback from its subsidiaries. In this general sense, the idea that the mind contains a multilevel control hierarchy is extremely noncontroversial. But psychological multilevel control networks have one important additional property. They are postulated to deal with questions of pattern. As in the visual system, the processes on level N are hypothesized to recognize patterns in the output of the processes on level N-1, and to instruct these processes in certain patterns of behavior. It is pattern which is passed between the different levels of the hierarchy.
Finally, there is the question of how an effective multilevel control network could ever come about. As there is no "master programmer" determining which control networks will work better for which tasks, the only way for a control network to emerge is via directed trial and error. And in this context, the only natural method of trial and error is the one known as genetic optimization or genetic programming. These fancy words mean simply that
1) subnetworks of the control network which seem to be working ineffectively are randomly varied
2) subnetworks of the control network which seem to be working ineffectively are a) swapped with one another, or b) replaced with other subnetworks.
This substitution may perhaps be subject to a kind of "speciation," in which the probability of substituting subnetwork A for subnetwork B is roughly proportional to the distance between A and B in the network.
Preliminary computer simulations indicate that, under appropriate conditions, this sort of process can indeed converge on efficient programs for executing various perceptual and motor tasks. However, a complete empirical study of this sort of process remains to be undertaken.
So much for the multilevel control network. Let us now turn to long-term memory. What I call "structurally associative memory" is nothing but a long-term memory model which the connections between processes aredetermined not by control structures, nor by any arbitrary classification system, but by patterned relations.
The idea of associative memory has a long psychological history. Hundreds, perhaps thousands of experiments on priming indicate that verbal, visual and other types of memory display associativity of access. For instance, if one has just heard the word "cat," and one is shown the picture of a dog, one will identify it as a "dog" very quickly. If, on the other hand, one has just heard the word "car" and one is shown the picture of a dog, identification of the dog as a "dog" will take a little bit longer.
Associative memory has also proved very useful in AI. What could be more natural than to suppose that the brain stores related entities near to each other? There are dozens of different associative memory designs in the engineering and computer science literature. Kohonen's (1984) associative memory model was one of the landmark achievements of early neural network theory; and Kanerva's (1988) sparse distributed memory, based on the peculiar statistics of the Hamming distance, has yielded many striking insights into the nature of recall.
Psychological studies of associative memory tend to deal with words or images, where the notion of "association" is intuitively obvious. Engineering associative memories use specialized mathematical definitions of association, based on inner products, bit string comparisons, etc. Neither of these paradigms seems to have a reasonably general method of defining association, or "relatedness."
The idea at the core of the structurally associative memory is that relatedness should be defined in terms of pattern. In the structurally associative memroy, an entity y is connected to another entity x if x is a pattern in y. Thus, if w and x have common patterns, there will be many nodes connected to both w and x. In general, if there are many short paths from w to x in the structurally associative memory, that means that w and x are closely related; that their structures probably intersect.
On the other hand, if y is a pattern emergent between w and x, y will not necessarily connect to w or x, but it will connect to the node z = w U x, if there is such a node. One might expect that, as a rough rule, z would be higher on the multilevel control network than w or z, thus interconnecting the two networks in a very fundamental way.
The memory of a real person (or computer) can never be truly associative -- sometimes two dissimilar things will be stored right next to each other, just by mistake. But it can be approximately structurally associative, and it can continually reorganize itself so as to maintain a high degree of structural associativity despite a continual influx of new information.
In The Evolving Mind this reorganization is shown to imply that structurally associative memories evolve by natural selection -- an entity stored in structurally associative memory is likely to "survive" (not be moved) if it fits in well with (has patterns in common with, generates emergent pattern cooperatively with, etc.) its environment, with those entities that immediately surround it.
More specifically, this reorganization must be understood to take place on many different levels. There is no "memory supervisor" ruling over the entire long term memory store, mathematically determining the optimal "location" for each entity. So, logically, the only form which reorganization can take is that of directed, locally governed trial and error.
How might this trial and error work? The most plausible hypothesis, as pointed out in The Structure of Intelligence, is as follows: one subnetwork is swapped with another; or else subnetwork A is merely copied into the place of subnetwork B. All else equal, substitution will tend to take place in those regions where associativity is worse; but there may also be certain subnetworks that are protected against having their sub-subnetworks removed or replaced.
If the substitution(s) obtained by swapping or copying are successful, in the sense of improving associativity, then the new networks formed will tend not to be broken up. If the substitutions are unsuccessful, then more swapping or copying will be done.
Finally, these substitutions may take place in a multilevel manner: large networks may be moved around, and at the same time the small networks which make them up may be internally rearranged. The multilevel process will work best if, after a large network is moved, a reasonable time period is left for its subnetworks to rearrange among themselves and arrive at a "locally optimal" configuration. This same "waiting" procedure may be applied recursively: after a subnetwork is moved,it should not be moved again until its sub-subnetworks have had a chance to adequately rearrange themselves. Note that this reorganization scheme relies on the existence of certain "barriers." For instance, suppose network A contains network B, which contains network C. C should have more chance of being moved to a given position inside B than to a given position out of B. It should have more chance of moving to a given position inside A-B, than to a given position outside A (here A-B means the portion of A that is not in B). And so on -- if A is contained in Z, C should have more chance of being moved to a position in Z-A than outside Z.
In some cases these restrictions may be so strong as to prohibit any rearrangement at all: in later chapters, this sort of comprehensive rearrangement protection will be identified with the more familiar concept of reality. In other cases the restrictions may be very weak, allowing the memory to spontaneously direct itself through a free-floating, never-ending search for perfect associativity.
In this context, I will discuss the psychological classification of people into thin-boundaried and thick-boundaried personality types. These types would seem to tie in naturally with the notion of rearrangement barriers in the structurally associative memory. A thick-boundaried person tends to have generally stronger rearrangement barriers, and hence tends to reify things more, to be more resistant to mental change. A thin-boundaried person, on the other hand, has generally weaker rearrangement barriers, and thus tends to permit even fixed ideas to shift, to display a weaker grasp on "reality."
The strength and placement of these "rearrangement barriers" might seem to be a sticky issue. But the conceptual difficulty is greatly reduced if one assumes that the memory network is "fractally" structured -- structured in clusters within clusters ... within clusters, or equivalently networks within networks ... within networks. If this is the case, then one may simply assume that a certain "degree of restriction" comes along with each cluster, each network of networks of ... networks. Larger clusters, larger networks, have larger degrees of restriction.
The only real question remaining is who assigns this degree. Are there perhaps mental processes which exist mainly to adjust the degrees of restriction imposed by other processes? This is a large question, and a complete resolution will have to wait till later. Partof the answer, however, will be found in the following section, in the concept of the dual network.
Neither a structurally associative memory nor a multilevel control network can, in itself, lead to intelligence. What is necessary is to put the two together: to take a single set of entities/processes, and by drawing a single set of collections between them, structure them both according to structural associativity and according to multilevel control. This does not mean just drawing two different graphs on the same set of edges: it means that the same connections must serve as part of a structurally associative memory and part of a multilevel control network. Entities which are connected via multilevel control must, on the whole, also be connected via structural associativity, and vice versa.
A moment's reflection shows that it is not possible to superpose an arbitrary associative memory structure with a multilevel control hierarchy in this way. In fact, such superposition is only possible if the entities stored in the associative memory are distributed in an approximately "fractal" way (Barnsley, 1988; Edgar, 1990).
In a fractally distributed structurally associative memory, on the "smallest" scale, each process is contained in a densely connected subgraph of "neighbors," each of which is very closely related to it. On the next highest scale, each such neighborhood is connected to a collection of "neighboring neighborhoods," so that the elements of a neighboring neighborhood are fairly closely related to its elements. Such a neighborhood of neighborhoods may be called a 2'nd-level neighborhood, and in an analogous manner one may define k'th-level neighborhoods. Of course, this structure need not be strict: may be breaks in it on every level, and each process may appear at several different vertices.
A good way to understand the fractal structure of the heterarchical network is to think about the distribution of subjects in a large library. One has disciplines, sub-disciplines, sub-sub-disciplines, and so forth -- clusters within clusters within clusters, rather than a uniformly distributed field of subjects. And a good way to visualize the superposition of a hierarchical network on this structure is to postulate a head librarian dealing with each discipline, an assistant librarian dealing with each sub-sub-discipline, anassistant assistant librarian dealing with each sub-sub-sub-discipline, and so on. If one imagines that each librarian, assistant librarian, etc., gives her subsidiaries general goals and lets them work out their own strategies, then one has a control hierarchy that works approximately according to the multilevel methodology. The hierarchy of control is lined up perfectly with the fractal heterarchy of conceptual commonality.
A dual network, then, is a collection of processes which are arranged simultaneously in an hierarchical network and an heterarchical network. Those processes with close parents in the hierarchical network are, on the whole, correspondingly closely related in the heterarchical network.
This brings us back to the problem of rearrangement barriers. The rearrangement barriers of the associative memory network may be set up by the heterarchical network, the multilevel control network. And, strikingly, in the dual network architecture, substituting of subnetworks of the memory network is equivalent to genetic optimization of the control network. The same operation serves two different functions; the quest for associativity and the quest for efficient control are carried out in exactly the same way. This synergy between structure and dynamics is immensely satisfying.
But, important and elegant as this is, this is not the only significant interaction between the two networks. A structurally associative memory is specifically configured so as to support analogical reasoning. Roughly speaking, analogy works by relating one entity to another entity with which it shares common patterns, and the structurally associative memory stores an entity near those entities with which it shares common patterns. And the hierarchical network, the perceptual-motor hierarchy, requires analogical reasoning in order to do its job. The purpose of each cluster in the dual network is to instruct its subservient clusters in the way that it estimates will best fulfill the task given to it by its master cluster -- and this estimation is based on reasoning analogically with respect to the information stored in its memory bank.
Let's get a little more concrete. The brain is modeled as a dual network of neural networks. It is considered to consist of "level k clusters" of autonomous neural networks, each one of which consists of 1) a number of level k-1 clusters, all related to each other, 2) some networks that monitor and control these level k-1clusters. The degree of control involved here may be highly variable. However, the neurological evidence shows that entire knowledge bases may be outright moved from one part of the brain to another (Blakeslee, 1991), so that in some cases the degree of control is very high.
For example, a level 2 cluster might consist of processes that recognize shapes of various sorts in visual inputs, together with a network regulating these processes. This cluster of shape recognition processes would be organized according to the principle of structurally associative memory, so that e.g. the circle process and the ellipse process would be closer to each other than to the square process. This organization would permit the regulating process to execute systematic analogical search for a given shape: if in a given situation the circle process were seen to be fairly successful, but the square process not at all successful, then the next step would be to try out those processes near to the circle process.
After hinting at the dual network model in The Structure of Intelligence, and presenting it fully in The Evolving Mind, I came across two other models of mind which mirror many of its aspects. First of all, I learned that many cognitive scientists are interested in analyzing thought as a network of interconnected " schema" (Arbib and Hesse, 1986). This term is not always well defined -- often a "schema" is nothing more than a process, an algorithm. But Arbib equates "schema" with Charles S. Peirce's "habit," bringing it very close to the concept of pattern. The global architecture of this network of schema is not discussed, but the connection is there nonetheless.
Also, I encountered the paper "Outline for a Theory of Intelligence" by James S. Albus (1991), Chief of the Robot Systems Division of the National Institute of Standards and Technology. I was pleased to find therein a model of mind strikingly similar to the dual network, complete with diagrams such as Figure 6. Albus's focus is rather different than mine -- he is concerned with the differential equations of control theory rather than the algorithmic structure of reasoning and memory processes. But the connection between the fractal structure of memory and the hierarchical structure of control, which is perhaps the most essential component in the dual network, is virtually implicit in his theory.
Putting the schema theory developed by cognitive scientists together with the global structure identified by Albus through his robotics work, one comes rather close to a crude version of the dual network model. This is not how the dual network model was conceived, but it is a rather satisfying connection. For the dual network structure is, after all, a rather straightforward idea. What is less obvious, and what has not emerged from cognitive science or engineering, is the dynamics of the dual network. The way the dual network unifies memory reorganization with genetic optimization has not previously been discussed; nor has the dynamics of barrier formation and its relationship with consciousness, language and perception (to be explored in Chapter Six).
The dual network model, as just presented, dismisses the problem of predicting the future rather cursorily. But this is not entirely justified. Prediction of the behavior of a complex system is an incredibly very difficult task, and one which lies at the very foundation of intelligence. The dual network model has no problem incorporating this sort of prediction, but something should be said about how its prediction processes work, rather than just about how they are interconnected.
One way to predict the future of a system, given certain assumptions about its present and about the laws governing its behavior, is to simply simulate the system. But this is inefficient, for a very simple physicalistic reason. Unlike most contemporary digital computers, the brain works in parallel -- there are a hundred billion neurons working at once, plus an unknown multitude of chemical reactions interacting with and underlying this neural behavior. And each neuron is a fairly complicated biochemical system, a far cry from the on-off switch in a digital computer. But when one simulates a system, one goes one step at a time. To a certain extent, this wastes the massive parallelism of the brain.
So, the question is, is simulation the best a mind can do, or are there short-cuts? This question ties in with some pressing problems of modern mathematics and theoretical computer science. One of the biggest trends in modern practical computer science is the development of parallel-processing computers, and it is of great interest to know when these computers can outperform conventional serial computers, and by what margin.
For a simple mathematical example, let us look to the theory of finite fields. A finite field is a way of doing arithmetic on a bounded set of integers. For instance, suppose one takes the field of size 13 (the size must be a prime or a prime raised to some power). Then, in this field the largest number is 12. One has, for example, 12 + 1 = 0, 10 + 5 = 2, 3 x 5 = 2, and 8 x 3 = 12. One can do division in a finite field as well, although the results are often counterintuitive -- for instance, 12/8 = 3, and 2/3 = 5 (to see why, just multiply both sides by the denominator).
In finite field theory there is something called the "discrete logarithm" of a number, written dlogb(n). The discrete logarithm is defined just like the ordinary logarithm, as the inverse of exponentiation. But in a finite field, exponentiation must be defined in terms of the "wrap-around" arithmetic illustrated in the previous paragraph. For instance, in the field of size 7, 34 = 4. Thus one has dlog3(4) = 4. But how could one compute the log base 3 of 4, without knowing what it was? The powers of 3 can wrap around the value 7 again and again -- they could wrap around many times before hitting on the correct value, 4.
The problem of finding the discrete logarithm of a number is theoretically easy, in the sense that there are only finitely many possibilities. In our simple example, all one has to do is take 3 to higher and higher powers, until all possibilities are covered. But in practice, if the size of the field is not 7 but some much larger number, this finite number of possibilities can become prohibitively large.
So, what if one defines the dynamical system nk = dlogb(nk-1)? Suppose one is given n1, then how can one predict n1000? So far as we know today, there is better way than to proceed in order: first get n2, then n3, then n4, and so on up to n999 and n1000. Working on n3 before one knows n2 is essentially useless, because a slight change in the answer for n2 can totally chagne the answer for n3. The only way to do all 1000 steps in parallel, it seems, would be to first compute a table of all possible powers that one might possibly need to know in the course of calculation. But this would require an immense number of processors; at least the square of the size of the field.
This example is, incidentally, of more than academic interest. Many cryptosystems in current use are reliant on discrete logarithms. If one could devise a quickmethod for computing them, one could crack all manner of codes; and the coding theorists would have to come up with something better.
More physicalistic dynamical systems appear to have the same behavior. The classic example is the "logistic" iteration xk = cxk-1(1-xk-1), where c=4 or c assumes certain values between 3.8 and 4, and the xk are discrete approximations of real numbers. This equation models the dynamics of certain biological populations, and it also approximates the equations of fluid dynamics under certain conditions.
It seems very, very likely that there is no way to compute xn from x1 on an ordinary serial computer, except to proceed one step at a time. Even if one adds a dozen or a thousand or a million processors, the same conclusion seems to hold. Only if one adds a number of processors roughly proportional to 2n can one obtain a significant advantage from parallelism.
In general, all systems of equations called chaotic possess similar properties. These include equations modeling the weather, the flow of blood through the body, the motions of planets in solar systems, and the flow of electricity in the brain. The mathematics of these systems is still in a phase of rapid development. But the intuitive picture is clear. To figure out what the weather will be ninety days from now, one must run an incredibly accurate day-by-day simulation -- even with highly parallel processing, there is no viable alternate strategy.
A mind is the structure of an intelligent system, and intelligence relies on prediction, memory and optimization. Given the assumption that some past patterns will persist, a mind must always explore several different hypotheses as to which ones will persist. It must explore several different possible futures, by a process of predictive extrapolation. Therefore, intelligence requires the prediction of the future behavior of partially unpredictable systems.
If these systems were as chaotic as xk = 4xk(1-xk), all hope would be lost. But the weather system is a better example. It is chaotic in its particular details -- there is no practical way, today in 1992, to determine the temperature on July 4 1999 in Las Vegas. But there are certain persistent patterns that allow one to predict its behavior in a qualitative way. After all, the temperature on July 4 1999 in Las Vegas will probably be around 95-110 Fahrenheit. One can make probabilistic, approximate predictions -- one can recognize patterns in the past and hope/assume that they will continue.
Our definition of intelligence conceals the presupposition that most of the prediction which the mind has to do is analogous to this trivial weather prediction example. No step-by-step simulation is required, only inductive/analogical reasoning, supported by memory search. However, the fact remains that sometimes the mind will run across obstinate situations -- prediction problemss that are not effectively tackled using intuitive memory or using parallel-processing shortcuts. In these cases, the mind has no choice but to resort to direct simulation (on some level of abstraction).
The brain is a massively parallel processor. But when it runs a direct simulation of some process, it is acting like a serial processor. In computerese, it is running a virtual serial machine. The idea that the parallel brain runs virtual serial machines is not a new one -- in Consciousness Explained Daniel Dennett proposes that consciousness is a virtual serial machine run on the parallel processor of the brain. As will be seen in Chapter Six, although I cannot accept Dennett's reductionist analysis of consciousness, I find a great deal of merit in this idea.
To proceed further with my formal theory of intelligence, I must now introduce some slightly technical definitions. The concept of a structured transformation system will be absolutely essential to the theory of language and belief to be given in later chapters. But before I can say what a structured transformation system is, I must define a plain old transformation system.
In words, a transformation system consists of a set I of initials, combined with a set of T transformation rules. The initials are the "given information"; the transformation rules are methods for combining andaltering the initials into new statements. The deductive system itself, I will call D(I,T).
For instance, in elementary algebra one has transformation rules such as
X = Y implies X+Z = Y+Z, and XZ = YZ
(X + Y) + Z = X + (Y+Z)
X - X = 0
X + 0 = X
X + Y = Y + X
If one is given the initial
2q - r = 1
one can use these transformation rules to obtain
q = (1 + r)/2.
The latter formula has the same content as the initial, but its form is different.
If one had a table of numbers, say
r q
1 1
2 3/2
3 2
4 5/2
5 3
...
99 50
then the "q=(1+r)/2" would be a slightly more intense pattern in one's table than "2q+r=1." For the work involved in computing the table from "2q+r=1" is a little greater -- one must solve for q each time r is plugged in, or else transform the equation into "q=(1+r)/2."
Thus, although in a sense transformation systems add no content to their initials, they are capable of producing new patterns. For a list of length 100, as given above, both are clearly patterns. But what if the list were of length 4? Then perhaps "2q + r=1" would not be a pattern: the trouble involved in using it might be judged to exceed the difficulty of using the list itself. But perhaps q = (1+r)/2 would still be a pattern. It all depends on who's doing the judging of complexities -- but for any judge there is likely to be some list length for which one formula is a pattern and the other is not.
This is, of course, a trivial example. A better example is Kepler's observation that planets move in ellipses. This is a nice compact statement, which can be logically derived from Newton's Three Laws of Motion. But the derivation is fairly lengthy and time-consuming. So if one has a brief list of data regarding planetary position, it is quite possible that Kepler's observation will be a significant pattern, but Newton's Three Laws will not. What is involved here is the complexity of producing x from the process y. If this complexity is too great, then no matter how simple the process y, y will not be a pattern in x.
In this section I will give a brief formal treatment of "transformation systems." Let W be any set, let A be a subset of W, called the set of "expressions"; and let I = {W1, W2, ..., Wn} be a subset of W, called the set of initials. Let W* denote the set {W,WxW,WxWxW,...). And let T = {F1,F2,...,Fn} be a set of transformations; that is, a set of functions each of which maps some elements of W* into elements of A. For instance, if W were a set of propositions, one might have F1(x,y)= x and y, and F2(x) = not x.
Let us now define the set D(I,T) of all elements of S which are derivable from the assumptions I via the transformations T. First of all, it is clear that I should be a subset of D(I,T). Let us call the elements of I the depth-zero elements of D(I,T). Next, what about elements of the form x = Fi(A1,...,Am), for some i, where each Ak=Ij for some j? Obviously, these elements are simple transformations of the assumptions; they should be elements of D(I,T) as well. Let us call these the depth-one elements of D(I,T). Similarly, one may define an element x of S to be a depth-n element of D(I,T) if x=Fi(A1,...,Am), for some i, where each of the Ak is a depth-p element of D(I,T), for some p<n. Finally, D(I,T) may then be defined as the set of all x which are depth-n elements of D(I,T) for some n.
For example, if the T are rules of logic and the I are some propositions about the world, then D(I,T) is the set of all propositions which are logically equivalent to some subset of I. In this case deduction is a matter of finding the logical consequences of I, which are presumably a small subset of the total set S of all propositions. This is the general form of deduction. Boolean logic consists of a specific choice of T; andpredicate calculus consists of an addition onto the set T provided by Boolean logic.
It is worth noting that, in this approach to deduction, truth is inessential. In formal logic it is conventional to assume that one's assumptions are "true" and one's transformations are "truth-preserving." However, this is just an interpretation foisted on the deductive system after the fact.
The set (I,T) constructed above might be called a transformation system. It may be likened to a workshop. The initials I are the materials at hand, and the transformations T are the tools. D(I,T) is the set of all things that can be built, using the tools, from the materials.
What is lacking? First of all, blueprints. In order to apply a transformation system to real problem, one must have some idea of which transformations should be applied in which situations.
But if an intelligence is going to apply a transformation system, it will need to apply it in a variety of different contexts. It will not know exactly which contexts are going to arise in future. It cannot retain a stack of blueprints for every possible contingency. What it needs is not merely a stack of blueprints, but a mechanism for generating blueprints to fit situations.
But, of course, it already has such a mechanism -- its innate intelligence, its ability to induce, to reason by analogy, to search through its associative memory. What intelligence needs is a transformation system structured in such a way that ordinary mental processes can serve as its blueprint-generating machine.
In SI this sort of transformation system is called a "useful deductive system." Here, however, I am thinking more generally, and I will use the phrase structured transformation system instead. A structured transformation system is a transformation system with the property that, if a mind wants to make a "blueprint" telling it how to construct something from the initials using the transformations, it can often approximately do so by reasoning analogically with respect to the blueprints from other construction projects.
Another way to put it is: a structured transformation system, or STS, is transformation system with the property that the proximity between x and y in an ideal structurally associative memory is correlatedwith the similarity between the blueprint sets corresponding to x and y. A transformation system is structured if the analogically reasoning mind can use it, in practice, to construct things to order. This construction need not be infallible -- it is required only that it work approximately, much of the time.
One formal definition goes as follows. Let x and y be two elements of D(I,T), and let GI,T(x) and GI,T(y) denote the set of all proofs in the system (I,T) of x and y respectively. Let U equal the minimum over all functions v of the sum a|v| + B, where B is the average, over all pairs (x,y) so that x and y are both in D(I,T), of the correlation coefficient between
d#[St(x union v)-St(v), St(y union v) - St(v)]
and
d*[GI,T(x),GI,T(y)].
Then (I,T) is structured to degree U.
Here d#(A,B) is the structural complexity of the symmetric difference of A and B. And d* is a metric on the space of "set of blueprints," so that the d*[GI,T(x),GI,T(y)] denotes of the distance between the set of proofs of x and the set of proofs of y.
If the function v were omitted, then the degree of structuredness of U would be a measure of how true it is that structurally similar constructions have similar blueprint sets. But the inclusion of the function v broadens the definition. It need not be the case that similar x and y have similar blueprint sets. If x and y display similar emergent patterns on conjunction with some entity v, and x and y have similar blueprint sets, then this counts as structuredness too.
What do STS's have to do with prediction? To make this connection, it suffices to interpret the depth index of an element of D(I,T) as a time index. In other words, one may assume that to apply each transformation in T takes some integer number of "time steps," and consider the construction of an element in D(I,T) as a process of actual temporal construction. This is a natural extension of the "materials, tools and blueprints" metaphor introduced above.
A simulation of some process, then, begins with an initial condition (an element of I) and proceeds to apply dynamical rules (elements of T), one after the other. In the case of a simple iteration like xk = cxk-1(1-xk-1), the initial condition is an approximation of a real number, and there is only one transformation involved, namely the function f(x) = cx(1-x) or some approximation thereof. But in more complex simulations there may be a variety of different transformations.
For instance, a numerical iteration of the form xk = f(k,xk-1) rather than xk = f(xk-1) requires a different iteration at each time step. This is precisely the kind of iteration used to generate fractals by the iterated function system method (Barnsley, 1988). In this context, oddly enough, a random or chaotic choice of k leads to a more intricately structured trajectory than an orderly choice of k.
So, the process of simulating a dynamical system and the process of making a logical deduction are, on the broadest level, the same. They both involve transformation systems. But what about the structured part? What would it mean for a family of simulations to be executed according to a structured transformation system?
It would mean, quite simply, that the class of dynamical rule sequences that lead up to a situation is correlated with the structure of the situation. With logical deduction, one often knows what one wants to prove, and has to find out how to prove it -- so it is useful to know what worked to prove similar results. But with simulation, it is exactly the reverse. One often wants to know what the steps in one's transformation sequence will lead to, because one would like to avoid running the whole transformation sequence through, one step at a time. So it is useful to know what has resulted from running through similar transformation sequences. The same correlation is useful for simulation as for deduction -- but for a different reason.
Actually, this is an overstatement. Simulation makes some use of reasoning from similarity of results to similarity transformation sequences -- because one may be able to guess what the results of a certain transformation sequence will be, and then one will want to know what similar transformation sequences have led to, in order to assess the plausibility of one's guess. And deduction makes some use of reasoning from similarity of transformation sequences to similarity of results -- one may have an idea for a "proof strategy," and use analogical reasoning to make a guess at whether this strategy will lead to anything interesting. There is adistinction between the two processes, but it is not precisely drawn.
In conclusion, I propose that most psychological simulation and deduction is done by structured transformation systems. Some short simulations and deductions may be done without the aid of structure -- but this is the exception that proves the rule. Long chains of deductive transformations cannot randomly produce useful results. And long chains of dynamical iterations, if unmonitored by "common sense", are likely to produce errors -- this is true even of digital computer simulations, which are much more meticulous than any program the human brain has ever been known to run.
Psychologically, structured transformation systems are only effective if run in parallel. Running one transformation after another is very slow. Some simulations, and some logical deductions, will require this. But the mind will do its utmost to avoid it. One demonstration of this is the extreme difficulty of doing long mathematical proofs in one's head. Even the greatest mathematicians used pencil and paper, to record the details of the last five steps while they filled up their minds with the details of the next five.