
Chapter 6

Analogy

6.0 The Structure-Mapping Theory of Analogy

    Induction, as we have analyzed it, requires a store of patterns on which to operate. We have not said how these patterns are to be obtained. Any general global optimization algorithm could be applied to the problem of recognizing patterns in an environment. But pattern recognition is a difficult problem, and a mind needs rapid, reasonably accurate solutions. Not just any algorithm will do.

    One might propose to determine, by induction, an effective pattern recognition algorithm. But although this might be possible, it would be very slow. Such a process probably occurred during the evolution of intelligent species, but within the lifespan of one organism there is simply not time.

    In Chapter 9, I will propose that intelligent entities solve pattern recognition problems with a "perceptual hierarchy" that applies the multilevel philosophy of global optimization sketched in Chapter 2. Among other things, this perceptual hierarchy makes continual use of two processes: analogy and deduction. And deduction, it will be argued in Chapter 8, is also essentially dependent on analogy. Hence analogical reasoning is an essential part of the picture of intelligence drawn at the end of the previous chapter.

WHAT IS ANALOGY

    What I mean by analogy is, roughly, reasoning of the form "A is similar to B in respect X, therefore A is also similar to B in respect Y." As with induction, it is difficult to say exactly why analogy works as well as it does. But there is no disputing its effectiveness. Bronowski (1956) has driven the point home so forcefully that an extended quote seems appropriate:

            Man has only one means to discovery, and that is to find likeness between things. To him, two trees are like two shouts and like two parents, and on this likeness he has built all mathematics. A lizard is like a bat and like a man, and on such likenesses he has built the theory of evolution and all biology. A gas behaves like a jostle of billiard balls, and on this and kindred likenesses rests much of our atomic picture of matter.

            In looking for intelligibility in the world, we look for unity; and we find this (in the arts as well as in science) in its unexpected likenesses. This indeed is man's creative gift, to find or make a likeness where none was seen before -- a likeness between mass and energy, a link between time and space, an echo of all our fears in the passion of Othello.

            So, when we say that we can explain a process, we mean that we have mapped it in the likeness of another process which we know to work. We say that a metal crystal stretches because its layers slide over one another like cards in a pack, and then that some polyester yarns stretch and harden like a metal crystal. That is, we take from the world round us a few models of structure and process (the particle, the wave, and so on), and when we research into nature, we try to fit her with these models.

Even more intriguingly, Bronowski goes on to relate analogy with structure:

            Yet one powerful procedure in research, we know, is to break down complex events into simpler parts. Are we not looking for the understanding of nature in these? When we probe below the surface of things, are we not trying, step by step, to reach her ultimate and fundamental constituents?

            We do indeed find it helpful to work piecemeal. We take a sequence of events or an assembly to pieces: we look for the steps in a chemical reaction, we carve up the study of an animal into organs and cells and smaller units within a cell. This is our atomic approach, which tries always to see in the variety of nature different assemblies from a few basic units. Our search is for simplicity, in that the distinct units shall be few, and all units of one kind identical.

            And what distinguishes one assembly of these units from another? the elephant from the giraffe, or the right-handed molecule of sugar from the left-handed? The difference is in the organization of the units into the whole; the difference is in the structure. And the likenesses for which we look are also likenesses of structure.

            This is the true purpose of the analytic method in science: to shift our gaze from the thing or event to its structure. We understand a process, we explain it, when we lay bare in it a structure which is like one we have met elsewhere.

    What Bronowski observed in the history and psychology of science, Gentner and Gentner (1983) have phrased in a more precise and general way. They speak of the "Generative Analogy Hypothesis" -- the hypothesis that analogies are used in generating inferences. And in order to test this hypothesis, they set forth a specific theoretical framework for analogical processing, called "structure-mapping." According to this framework, analogical reasoning is concerned with deriving statements about a target domain T from statements about a base domain B. Each domain is understood to consist of a number of "nodes" and a collection of relations between these nodes. Essentially, a node may be any sort of entity -- an object, a color, etc. A structure-mapping begins with a relation which takes certain base nodes into certain target nodes: if the base nodes are (b_1,...,b_n) and the target nodes are (t_1,...,t_n), it is a map M(b_i) = t_j, where i ranges over some subset of {1,...,n}. Analogy occurs when it is assumed that a relation which holds between b_i and b_k also holds between M(b_i) and M(b_k).
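
The skeleton of this framework is easy to render in code. The following minimal Python sketch is mine, not Gentner and Gentner's; the water/electricity vocabulary, the relation names, and the particular map M are all illustrative assumptions:

```python
# Minimal sketch of structure-mapping: relations over base nodes are
# carried across a node map M and proposed as hypotheses about the target.

# Base domain: water flow. Relations are (name, node_1, node_2) triples.
base_relations = {
    ("flows_from_to", "reservoir", "pipe"),
    ("pressure_drives", "reservoir", "flow_rate"),
    ("narrowness_limits", "pipe", "flow_rate"),
}

# The structure-mapping M: base nodes -> target (electricity) nodes.
M = {
    "reservoir": "battery",
    "pipe": "resistor",
    "flow_rate": "current",
}

def analogical_inferences(relations, M):
    """Assume each base relation R(b_i, b_k) also holds of (M(b_i), M(b_k))."""
    return {(name, M[a], M[b])
            for (name, a, b) in relations
            if a in M and b in M}

for hypothesis in sorted(analogical_inferences(base_relations, M)):
    print(hypothesis)
# e.g. ('pressure_drives', 'battery', 'current') -- voltage drives current
```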

    The theory of structure-mapping analogy states that reasoning of this form is both common and useful. This hypothesis has been verified empirically -- e.g. by studying the way people reason about electricity by analogy to water and other familiar "base" domains. Furthermore, the evidence indicates that, as Gentner and Gentner put it, relations "are more likely to be imported into the target if they belong to a system of coherent, mutually constraining relationships, the others of which map into the target." If a relation is part of a larger pattern of relationships which have led to useful analogies, people estimate that it is likely to lead to a useful analogy.

    The structure-mapping theory of analogy -- sketched by Bronowski and many others and formalized by Gentner and Gentner -- clearly captures the essence of analogical reasoning. But it is not sufficiently explanatory -- it does not tell us, except in a very sketchy way, why certain relations are better candidates for analogy than others. One may approach this difficult problem by augmenting the structure-mapping concept with a more finely-grained pattern-theoretic approach.

INDUCTION, DEDUCTION, ANALOGY

    Peirce proclaimed the tendency to take habits to be the "one law of mind", and he divided this law into three parts: deduction, induction, and abduction or analogy. The approach of computer science and mathematical logic, on the other hand, is to take deductive reasoning as primary, and then analyze induction and analogy as deviations from deduction. The subtext is that deduction, being infallible, is the best of the three. The present approach is closer to Peirce than it is to the standard contemporary point of view. When speaking in terms of pattern, induction and analogy are more elementary than deduction. And I will argue that deduction cannot be understood except in the context of a comprehensive understanding of induction and analogy.

STRUCTURAL SIMILARITY

    As in Chapter 3, define the distance between two sequences f and g as d#(f,g) = |(P(f)-P(g)) ∪ (P(g)-P(f))|#. And define the approximation to d#(f,g) with respect to a given set of functions S as dS(f,g) = |[(S∩P(f)) - (S∩P(g))] ∪ [(S∩P(g)) - (S∩P(f))]|#. This definition is the key to our analysis of analogy.

    A metric is conventionally defined as a function d which satisfies the following axioms:

1) d(f,g) ≤ d(g,h) + d(f,h)

2) d(f,g) = d(g,f)

3) d(f,g) ≥ 0

4) d(f,g)=0 if and only if f=g.

Note that d# is not a metric, because it would be possible for P(f) to equal P(g) even if f and g were nonidentical. And it would not be wise to consider equivalence classes such that f and g are in the same class if and only if d#(f,g)=0, because even if d#(f,g)=0, there might exist some h such that d#(Em(f,h),Em(g,h)) is nonzero. That is, just because d#(f,g)=0, f and g are not for all practical purposes equivalent. And the same argument holds for dS -- for dS(f,g)=0 does not in general imply dS(Em(f,h),Em(g,h))=0, and hence there is little sense in identifying f and g. A function d which satisfies the first three axioms given above might be called a pseudometric; that is how d# and dS should be considered.
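
To make the pseudometric concrete, here is a small Python sketch in which pattern sets are plain finite sets and |·|# is approximated by cardinality -- a deliberate simplification, since the text's measure weights patterns by complexity:

```python
# Sketch of the pattern-distance pseudometric, with pattern sets P(f)
# represented as plain Python sets and |.|# approximated by cardinality.

def d_sharp(Pf, Pg):
    """d#(f,g) = |(P(f)-P(g)) U (P(g)-P(f))|#: size of the symmetric difference."""
    return len(Pf ^ Pg)

def d_S(S, Pf, Pg):
    """dS(f,g): the same distance, but seen only through the pattern set S."""
    return len((S & Pf) ^ (S & Pg))

# Two distinct entities with identical pattern sets: d# = 0 yet f != g,
# which is why d# is a pseudometric rather than a metric.
Pf = {"p1", "p2", "p3"}
Pg = {"p1", "p2", "p3"}
print(d_sharp(Pf, Pg))                 # 0
print(d_S({"p1", "p4"}, Pf, {"p4"}))   # 2: S sees p1 in f only, p4 in g only
```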

TRANSITIVE REASONING

    To understand the significance of this pseudometric, let us pause to consider a "toy version" of analogy that might be called transitive reasoning. Namely, if we reason that "f is similar to g, and g is similar to h, so f is similar to h," then we are reasoning that "d(f,g) is small, and d(g,h) is small, so d(f,h) is small." Obviously, the accuracy of this reasoning is circumstance-dependent. Speaking intuitively, in the following situation it works very well:

                g f

                        h

But, of course, one may also concoct an opposite case:

        f        g h

Since our measure of distance, d#, satisfies the triangle inequality, it is always the case that d#(f,h) ≤ d#(g,h) + d#(f,g). This puts a theoretical bound on the possible error of the associated form of transitive reasoning. In actual circumstances where transitive reasoning is utilized, some approximation to d# will usually be assumed, and thus the relevant observation is that dS(f,h) ≤ dS(g,h) + dS(f,g) for any S. The fact that dS(f,h) is small, however, may say as much about S as about the relation between f and h. The triangle inequality is merely the final phase of transitive reasoning; equally essential to the process is the pattern recognition involved in approximating d#.
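
A toy numerical illustration of this bound, under the same set-based simplification as above (cardinality standing in for |·|#):

```python
# Transitive reasoning's error bound, using sets of patterns and
# cardinality as a stand-in for |.|# (an illustrative simplification).

def d(Pa, Pb):
    return len(Pa ^ Pb)   # symmetric-difference pseudometric

Pf, Pg, Ph = {1, 2, 3, 4}, {1, 2, 3}, {1, 2, 9}

# "f is like g, g is like h, so f is like h" is safe only up to the bound:
assert d(Pf, Ph) <= d(Pf, Pg) + d(Pg, Ph)
print(d(Pf, Pg), d(Pg, Ph), d(Pf, Ph))   # 1 2 3: here the bound is tight
```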

6.1 A Typology of Analogy

    Analogy is far more powerful than transitive reasoning; nonetheless, according to the present analysis it is nothing more than a subtler way of manipulating the pattern distance. I will introduce three forms of analogical reasoning -- structural analogy, modeling, and contextual analogy -- and propose a unified structure for analogical reasoning which encompasses all of them. It is perhaps not obvious that all conceivable cases of analogical reasoning are included in this formulation -- but I have been unable to find a counterexample. What I am attempting here is vaguely similar to Aristotle's list of the seventeen forms of syllogistic logic. He did not prove his list exhaustive, but it nonetheless served well for a long time.

STRUCTURAL ANALOGY

    Consider the following approach to recognizing patterns in an entity x.

1. dS(x',x) is small.

2. (y,z) is, approximately, a pattern in x'.

3. Thus something near (y,z) is, approximately, a pattern in x.

The justification of the process is obvious: dS is an approximation of d#, and d# is a measure of difference in pattern. In general, if d#(A,B) is small, then most of the patterns in A are also patterns in B, and vice versa. And as this distance gets smaller, the reasoning gets more certain.

    This process I will call structural analogy; it is the simplest of the three forms. Quite simply, x is structurally analogous to x' if x and x' have similar structure. Roughly speaking, structural analogy is the assumption that if x and x' have some similar structures, then they have other similar structures.
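
Here is a minimal sketch of the three-step process, under the same simplifications as before; the threshold and the reduction of "something near (y,z)" to (y,z) itself are assumptions of the sketch, not of the theory:

```python
# The three steps of structural analogy, sketched over pattern sets.
# "Something near (y,z)" is simplified here to (y,z) itself; a fuller
# implementation would search a neighborhood in pattern space.

def structural_analogy(P_xprime, P_x, S, threshold):
    """If dS(x', x) is small, propose x's missing patterns from x'."""
    dS = len((S & P_xprime) ^ (S & P_x))   # Step 1
    if dS > threshold:
        return set()                        # too dissimilar; no inference
    return P_xprime - P_x                   # Steps 2-3: candidate patterns for x

P_x      = {"p1", "p2", "p3"}
P_xprime = {"p1", "p2", "p3", "p4"}
print(structural_analogy(P_xprime, P_x, S={"p1", "p2"}, threshold=1))
# {'p4'}: x and x' agree on the observed patterns, so p4 is conjectured in x
```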

    For a less formal example, consider the following experiment, described by Ornstein (1986, p.159):

        A man comes in to give a lecture to students. Afterward, the students are told that he is either a student or a professor and are asked to rate the lecture. Not surprisingly, they rate the lecture as being better if a professor has given it, but what is surprising is that they rate the lecturer as taller if he was identified as a professor.

It would appear that what is happening here is an unconscious, erroneous structural analogy. The students observe that the lecturer has something in common with the loose collection of patterns that might be called "high status" or "superiority". Therefore they assume, unconsciously, that he is likely to have other patterns in common with this same collection. This assumption is not overpoweringly strong, but it is enough to bias their perceptions.    

MODELING

    Let us begin with a rough example. Suppose one seeks a method of making housing less expensive (i.e. one seeks a certain pattern in the domain of knowledge about housing, which domain we may call x). Then one might decide to reason by analogy to the automobile industry. Automobiles bear some similarity to houses -- both are designed to contain humans, both are manufactured in great number throughout the developed world, both represent fairly major investments for the average person, etc. There are obviously significant differences also, but the association is by no means arbitrary. One might reasonably ask: how does one make inexpensive cars?

    In reasoning thus, one would be beginning (Step 1 in the process of structural analogy) with the observation that the entity "housing" is in some measure similar to the entity "cars". In the notation given above, if the domain of knowledge about housing is x, then the domain of knowledge about cars is x'.

    Cars, one might then observe, are much less expensive when mass-produced than when custom-built and custom-designed. And cars are being made cheaper and cheaper by more and more efficient methods of mass production. This is Step 2 in the process of structural analogy: one is recognizing a pattern in x'.

    Finally, one might map this reasoning back to the domain of houses, searching in the vicinity of "cars are made less expensive through mass production" and finding "houses are made less expensive through mass production" (Step 3). The validity of this hypothesis could then be tested by exploring whether it is in fact possible to make the production of housing less expensive through mass production -- by exploring the feasibility of trucking or air-lifting a pre-assembled house to a given location, et cetera.

    This may be interpreted as an illustration of structural analogy: the structure of the housing industry is somewhat similar to the structure of the auto industry, so further similarities are assumed to exist. But it also suggests a slightly different form of analogical reasoning, which I shall call modeling. For in this example the relation between x' and x was exactly the same as the relation between (y,z) and the pattern finally found in x. That is, if we define a function f by f(g(cars)) = g(housing), then f(inexpensive cars) = inexpensive housing, and also f(mass-produced cars) = mass-produced housing. This suggests the following general form of reasoning:

1. dS(f(x),x) is small.

2. (y,z) is, approximately, a pattern in f(x).

3. Thus something near (f^{-1}(y),f^{-1}(z)) is, approximately, a pattern in x.

In general, f^{-1} may be multivalued; it may be a relation rather than a function. This poses a moderate computational difficulty (Grandy, 1985), which may be dealt with by maximum-entropy methods like those discussed in Chapter 9.
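
A small sketch of modeling with a multivalued inverse; the car/housing vocabulary and the substitution dictionary f are invented for illustration:

```python
# Modeling analogy sketched as a symbol substitution f with a possibly
# multivalued inverse. The car/housing vocabulary is illustrative only.

f = {"housing": "cars", "dwelling": "cars"}   # f maps the problem domain into the model

def f_inverse(word):
    """f^{-1} may be multivalued: return every pre-image of a model-domain word."""
    return {k for k, v in f.items() if v == word} or {word}

def map_pattern_back(pattern):
    """Translate a pattern found in the model domain back to the problem domain."""
    return [" / ".join(sorted(f_inverse(w))) for w in pattern.split()]

# Step 2: a pattern recognized in the model domain f(x).
model_pattern = "cars cheaper when mass-produced"
# Step 3: its pre-image is a candidate pattern in the original domain x.
print(" ".join(map_pattern_back(model_pattern)))
# "dwelling / housing cheaper when mass-produced" -- two pre-images, hence multivalued
```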

BRAINSTORMING

    Why would modeling analogy work? If f(w) is always structurally similar to w, for any w, then it is just a special form of structural analogy, and it is justified whenever structural analogy is justified. Otherwise, however, it is a question of whether the function f "preserves patterns". There might be some functions f for which Steps 2 and 3, executed without Step 1, would be a plausible form of reasoning. In such cases it would follow from "(y,z) is a pattern in x" that "something near (f(y),f(z)) is probably a pattern in f(x)." But, on the other hand, there are certainly many functions which do not preserve patterns in this sense.

    In some cases, the choice of a function f which does not generally tend to preserve patterns may be highly valuable. This is one way of understanding the process of "brainstorming": one transforms a problem x into a context f(x) which is fairly unlikely to have much to do with the original context, seeks an answer in this context, and then tries to map the answer back. For instance, in thinking about international combat, one might say to oneself "now, pretend each country is a little boy, and the various parties of the war are just little boys squabbling." In this case we would be selecting f so that f(g(nations)) = g(boys).

    Let us assume that one has mapped the particularities of a war into the particularities of a squabble, and then thought of a way to resolve the squabble. Then in order to get anything useful one would have to map one's answer back into the domain of nations, i.e. one would have to take f^{-1} of one's answer. In this case it would seem that f is only moderately pattern-preserving, and such an experiment would have a fair but not excellent chance of yielding a relevant suggestion.

    Mathematically this form of reasoning may be viewed as the attempt to find a function f, mapping a given domain D into a given range R, that is approximately "topologically semiconjugate" to the structure operator S(x). That is, f maps D to R and, ideally, S(f(x))=f(S(x)). Often R may be understood as a model of D, in the sense that each element x of D may be associated with an element of R in such a way that the relation of x to other elements y of D is similar to the relation of f(x) to the elements f(y) of R.
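
The tent map/logistic map pair is the standard textbook instance of such a semiconjugacy, and makes the equation S(f(x)) = f(S(x)) concrete; this example is imported from dynamical systems theory, not from the text:

```python
# Numerical check of topological semiconjugacy: h(T(x)) = L(h(x)), where
# T is the tent map, L the logistic map, and h plays the role of the
# function f in the text (a classical dynamics example, used only to
# illustrate what "semiconjugate to the structure operator" means).

import math

T = lambda x: 2 * x if x <= 0.5 else 2 - 2 * x     # tent map on [0,1]
L = lambda y: 4 * y * (1 - y)                      # logistic map on [0,1]
h = lambda x: math.sin(math.pi * x / 2) ** 2       # the intertwining map

for x in [0.1, 0.37, 0.62, 0.95]:
    assert abs(h(T(x)) - L(h(x))) < 1e-12
print("h(T(x)) = L(h(x)) verified at sample points")
```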

CONTEXTUAL ANALOGY

    Only one form of analogy remains. What if x and x' have no inherent structural commonality, but are related to other patterns in similar ways? One might still reason analogically that x and x' are similar. Or one might reason analogically from the observation that x and x' are related to different patterns in similar ways. Let us call this contextual analogy. Such reasoning is not encompassed under structural analogy or modeling -- a new formula is required.

    To say that x and x' are related to other patterns in similar ways, is to say that d#(Em(x',w'),Em(x,w)) is small. Thus contextual analogy rests on those aspects of x which do not manifest themselves in the internal structure of x, but which nonetheless emerge when x is conjoined with other entities.

    For instance, suppose w=w' is a codebook which contains several different codes, including Code A and Code B. Suppose x is a message in Code A, and x' is a message in Code B -- and suppose x and x' both convey the same message. Then x and x' may not appear similar; they may have virtually no patterns in common.

    To see this, observe that the meaning conveyed by a coded message, say x, is not generally a pattern in that coded message. If the meaning of the message x is called Mx, and the function (contained in the codebook) which translates x into Mx is called F, then we may write F(x)=Mx. But (F,Mx) is a pattern in x only if a|F| + b|Mx| + cC(F,Mx) < |x|, and this is not at all inevitable. Perhaps |Mx| is less than |x|, but if the codebook is large or difficult to use, then |F| or C(F,Mx) may be sizeable. And the same could be said for x'.

    So in this example, d#(x,x') may be large, but what about d#[Em(x',w),Em(x,w)]? Clearly, Mx is an element of Em(x,w), and Mx' is an element of Em(x',w), so that if Mx and Mx' are equal, this distance is small. Contextual analogy would occur, for instance, if one reasoned that, because the messages x and x' have similar meanings, they may have come from the same person.

    This example is not at all artificial; in fact it may turn out to be central to mental function. The mind works in terms of symbol systems, which may be considered as highly sophisticated codebooks. Street signs are coded messages, as are sentences. Suppose one reasons that sentence A has a similar meaning to sentence B, and therefore that the person uttering sentence A may have a similar attitude to the person uttering sentence B. This is contextual analogy, because the meaning of a sentence is not actually a pattern in that sentence, but rather a pattern emergent between that sentence and the "codebook" of language.

A GENERAL ANALOGY ALGORITHM

    One may incorporate all three forms of analogy in one general process:

1. dS(St(f(x')∪w') - St(w'), St(x∪w) - St(w)) is small.

2. (y,z) is, approximately, an element of St(f(x')∪v') - St(v').

3. Thus something near (f^{-1}(y),f^{-1}(z)) is, approximately, an element of St(x∪v) - St(v), where v = f^{-1}(v') is perhaps a good guess.

    In the case of structural analogy, f is the identity mapping, and w', v and v' are the empty set, but x' is different from x.

    In the case of modeling, x=x' and w', v and v' are the empty set, but f is not the identity mapping.

    In the case of contextual analogy, f is the identity mapping but w' is not the empty set; v and v' may or may not be the empty set. If v and v' are the empty set, then one has the form of reasoning "x and x' have similar relations to y and y' respectively; therefore x and x' may be structurally similar." And if neither v nor v' is the empty set, one has the form of reasoning "x and x' have similar relations to y and y'; therefore they may also have similar relations to v and v'."

    The reason for expressions such as St(x∪v) - St(v) is to specify that, for instance, either patterns in x or patterns in Em(x,v) are desirable, but not patterns in v alone.

     The general algorithm -- which I will call, simply, analogy -- hence includes the three forms of analogy discussed above. It also encompasses hybrid forms. For instance, if x=x', but f is not the identity and none of v, v' or w' is the empty set, then one is mapping x to a model range via f and executing contextual rather than structural analogy in this range.
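
The following toy Python rendering of the general form treats St as a simple set-valued operator and exercises only the structural special case (f the identity; w', v, v' empty); the threshold and the set representation are assumptions of the sketch:

```python
# One parameterization covering the three forms of analogy. St is
# idealized as a set-valued "structure" operator; everything here is a
# schematic stand-in for the pattern-theoretic definitions in the text.

def St(*entities):
    """Toy structure operator: the 'patterns' of the conjoined entities."""
    out = set()
    for e in entities:
        out |= set(e)
    return out

def general_analogy(x, xprime, f, w, wprime, v, vprime):
    # Step 1: compare St(f(x') U w') - St(w') with St(x U w) - St(w).
    left  = St(f(xprime), wprime) - St(wprime)
    right = St(x, w) - St(w)
    if len(left ^ right) > 1:       # dS proxy; the threshold is arbitrary
        return set()
    # Steps 2-3: patterns emergent with v' in the model, proposed for x.
    # (Inverting f is elided here, since the demo uses f = identity.)
    candidates = St(f(xprime), vprime) - St(vprime)
    return candidates - St(x, v)

identity = lambda e: e
# Structural analogy: f = identity, w' = v = v' = empty, x' differs from x.
print(general_analogy({"p1", "p2"}, {"p1", "p2", "p3"},
                      identity, set(), set(), set(), set()))
# {'p3'}: conjectured to be (near) a pattern in x
```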

    This general idea could be implemented in many different ways. For instance, consider the following hybridization of structural and modeling analogy:

1. dS(x',x) is small.

2. Select (y,z) which is, approximately, a pattern in f(x).

3. Check if something near (f^{-1}(y),f^{-1}(z)) is, approximately, a pattern in x'.

     If this doesn't turn out to be the case, and f is reasonably near the identity, then check if something near (y,z) is, approximately, a pattern in x'.

In this implementation, modeling is considered partly as a tool for searching the space near (y,z).

    Each step of the process of analogy involves a significant effort: first, to pick x' or f; second, to recognize a pattern in f(x'); third, to search in a certain vicinity for a pattern in x; and fourth, to invert f. It would be possible to execute the latter three tasks by some general optimization algorithm, and to pick x' or f at random. But in order to reap the full power of analogical reasoning, it is necessary to have access to an intelligently structured database. In the next section, we shall consider the hypothesis that memory is structured specifically so as to efficiently supply the process of analogy with the data it requires.

ANALOGIES AMONG DIGRAPHS

    It should be noted that an entity which characteristically seeks certain types of patterns will therefore characteristically make certain types of analogies. In other words, different analogies may occur to different entities, depending on the idiosyncrasies of the pattern recognition algorithms involved. This point of view may be helpful in relating the present approach to other analyses of analogy.

    For instance, Poetzsche's (1988) theory of analogy deals only with structural analogy and applies to precisely those entities which:

1) recognize patterns only in digraphs.

2) compute structural similarity using only patterns (x,y) of the form y={v,w},

where v = "construct x by connecting node n_1 of z to node m_1 of w, node n_2 of z to node m_2 of w, ..." and * is defined accordingly.

In other words, according to Poetzsche's definitions, two digraphs are structurally similar if they have a common subgraph.

    This scheme deals well with processes that can be easily decomposed into subprocesses. In particular, Poetzsche is concerned with robot motion: his goal is to construct a robot which can figure out how to complete a task based on knowledge of how to complete similar tasks. Toward this end, he has designed a robot which internally expresses each task as a digraph of "elementary tasks". Given a new task, the robot searches its memory for a task with a structurally similar digraph and then (by structural analogy) uses this as the starting point for figuring out how to do the task at hand.
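
A crude sketch of this retrieval scheme, with invented task names and shared directed edges standing in for the common subgraph:

```python
# Structural similarity of digraphs via common subgraphs, in the spirit
# of the robot-task retrieval described above (the task names are invented).

def common_edges(g1, g2):
    """A crude common-subgraph measure: shared directed edges."""
    return g1 & g2

# Each task is a digraph over elementary subtasks, stored as an edge set.
memory = {
    "fetch_cup": {("locate", "approach"), ("approach", "grasp"), ("grasp", "lift")},
    "open_door": {("locate", "approach"), ("approach", "grasp"), ("grasp", "turn")},
}
new_task = {("locate", "approach"), ("approach", "grasp"), ("grasp", "pull")}

# Retrieve the stored task with the largest common subgraph as the analogue.
best = max(memory, key=lambda name: len(common_edges(memory[name], new_task)))
print(best, common_edges(memory[best], new_task))
# Both stored tasks share two edges here; max() returns the first encountered.
```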

6.2 Analogy and Induction

    Induction and analogy are obviously closely related. In induction one assumes the future will be similar to the past, and tries to guess which of a set of past patterns will continue into the future. In analogy one assumes that similar entities will have similar patterns, and directs pattern recognition on this basis. The difference is that in analogy, one is merely trying to locate patterns in some entity x, by analogy to some entity x%. In induction, on the other hand, one assumes that a complete catalogue of patterns has already been recognized, and one tries to make a coherent model out of these patterns. The two processes complement each other.

BACKGROUND KNOWLEDGE

    Many AI researchers view analogical reasoning with a great deal of skepticism. For instance, Bipin Indurkhya has argued that

            One naturally asks: Given some state of background knowledge, what justification is there, if any, that similarities in certain respects determine similarities in other respects? ...an inference from analogy cannot be justified on the basis of existing similarity between the source and the target alone. The justification, if any, must come from the background knowledge in some other form.... Once it is realized that an inference based only on some existing similarity between the source and the target -- and nothing else -- is about as justified as a random inference, one learns to exercise extreme caution in deriving an inference from analogy. One seeks justification in other places; it must be some other piece of knowledge, some piece of fact, which "justifies" why the existing similarities determine the inferred ones. And if no such justification can be found, the so-called analogical inference is to be properly discarded. (1988, pp. 224-25)

In the notation introduced above, the "source" is x' and the "target" is x. The point is that, in general, there is no reason to believe that the existence of some similarities between x and x' implies the existence of other similarities. Indurkhya thinks this implication must be drawn from "facts" and "background information."

    Really, there are two questions here. The first is: when is it justifiable to assume that the existence of a degree D1 of similarity between x and x' implies a degree D2 of similarity between x and x'? In other words, how can one tell when reasoning by analogy is justifiable at all; and when it is justifiable, how can one tell to what extent? The other is: how is it possible to tell, based on the type of similarity between x and x', what kinds of further similarities to look for? That is: in Step 2 of the analogy algorithm, is there some way to determine what sorts of patterns (y,z) should be looked at on the basis of the similarities used for the computation? Indurkhya, expressing a view which is widespread in the AI community, states that both of these questions should be answered by reference to "background knowledge".

    Does this requisite background knowledge perhaps come from deduction? This may be true in certain instances. But, as we shall see, deductive reasoning is only useful when it is paired with analogical reasoning. The analogies used in executing deductive reasoning cannot all be executed on the basis of background information obtained by deduction.

    Does the background knowledge come from experience? If so, it does not come directly from raw, unprocessed sense perception; it comes out of the process of induction. Induction requires pattern recognition. And I suggest that effective general pattern recognition requires analogy. From this it follows that the background information Indurkhya requires cannot come entirely from experience: either it comes from some other source or it is not always necessary.

    However, although induction may not be used to justify every particular analogy, it may still be used to justify analogy in general. This may sound contradictory, but the paradox is only apparent. Not all the analogies used in inductive reasoning can be justified inductively. However, if analogies that are not justified by induction are used anyway, and they happen to work, then the tendency to take habits implies that analogy can be expected to work in the future.

I find this argument rather convincing. On the basis of ample real-world experience, we know that analogy has often worked in the past. We have obtained useful results from it so often that its effectiveness must be considered a very intense pattern in the past. Therefore, by the tendency to take habits -- by induction -- it is relatively likely that analogy will work in the future. Hence it is advisable to reason by analogy.

    Another way to phrase this is to postulate a "spatial" tendency to take habits, to the effect that analogy does in fact tend to work more often than it would in a randomly selected universe obeying the temporal tendency to take habits. This strategy seems much less elegant than the temporal, inductive tendency to take habits; it has more of an ad hoc flavor. But it does yield the desired result.

6.3 Hierarchical Analogy

    We have not yet discussed the possibility that analogies might be justified by analogy. Obviously, analogy as a general mode of thought cannot be justified by analogy; that would be circular reasoning. But, just as particular analogies can be justified by inductive or deductive background knowledge, so can particular analogies be justified by analogical background knowledge. In some cases, the mind may use analogical reasoning to determine how probable it is that similarity A between x and x' will imply similarity B between x and x', by observing which similarities have, in similar situations, led to which other similarities. Actually, this sort of analogical background information is a special kind of inductive background information, but it is worth distinguishing.

    Let us be more precise. Assume that processor P1 executes a long sequence of analogical reasoning processes, and processor P2 observes this sequence, recording for each instance a vector of the form (x,x',f,w,w',v,v',R,r), where r is a number measuring the total prominence of the patterns recognized in that instance, and R is the set of patterns located.

    The prominence of a pattern may -- as in the previous chapter -- be defined as the product of its intensity with its importance. The prominence of a set of patterns S may be crudely defined as |S|K, where |S| is the structural complexity of the set and K is some number representing the prominence of the set. A very crude way to define K is as the average over all (y,z) in S of the importance of (y,z). A more accurate definition could be formulated by a procedure similar to Algorithm 3.1.

    Then processor P2 can seek to recognize patterns (y,z) with the property that when (y,z) is a pattern in (x,x',f,w,w',v,v',R), r tends to be large. This is an optimization problem: maximize the correlation of r with the intensity of (y,z), over the space of patterns in the first six components of the vector. But it is a particularly difficult optimization problem in the following sense: determining what entities lie in the space over which optimization is taking place is, in itself, a very difficult optimization problem. In other words, it is a constrained optimization problem with very unpleasant constraints. One very simple approach would be the following:

1. Using straightforward optimization or analogy, seek to recognize patterns in (x,x',f,w,w',v,v',R).

2. Over the space of patterns recognized, see which ones correlate best with large r.

3. Seek to recognize new patterns in (x,x',f,w,w',v,v',R) in the vicinity of the answer(s) obtained in Step 2.

Perhaps some other approach would be superior, but the difficulty is that one cannot expect to find patterns in a given narrow vicinity merely because functions in that region correlate well with r. The focus must be on the location of patterns, not the search for large correlation with r.
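
As a toy illustration of Step 2 of this approach, one can correlate hand-picked features of the recorded vectors with r; the features and the history below are fabricated for the example:

```python
# Second-level analogy, sketched: P2 records one vector per analogy that
# P1 performs, then asks which features of those vectors predict a large
# prominence r. Features stand in for "patterns in the first components".

from statistics import correlation  # available in Python 3.10+

# Each record: (features of (x, x', f, w, w', v, v', R), prominence r).
records = [
    ({"f_is_identity": 1, "context_used": 0}, 0.9),
    ({"f_is_identity": 1, "context_used": 1}, 0.8),
    ({"f_is_identity": 0, "context_used": 0}, 0.2),
    ({"f_is_identity": 0, "context_used": 1}, 0.3),
]

r_values = [r for _, r in records]
for feature in ["f_is_identity", "context_used"]:
    xs = [feats[feature] for feats, _ in records]
    print(feature, round(correlation(xs, r_values), 2))
# f_is_identity correlates strongly with r in this (made-up) history:
# structural analogies have tended to pay off, so prefer them in Step 1.
```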

    In this way analogy could be used to determine which analogies are likely to pay off. This might be called second-level analogy, or "learning by analogy about how to learn by analogy." And the same approach could be applied to the analogies involved in analyzing analogies, yielding third-level analogy, or "learning by analogy how to learn by analogy how to learn by analogy." Et cetera. These are tremendously difficult optimization problems, so that learning on these levels is likely to be rather slow. On the other hand, each insight on such a high level will probably have a great impact on the effectiveness of lower-level analogies.

    Let us be more precise about these higher levels of learning. A processor which learns by second-level analogy must be connected to a processor which learns by analogy, in such a way that it has access to the inputs and the outputs of this processor. Similarly, a processor which learns by third-level analogy must be connected to a processor which learns on the second level in such a way that it has access to the inputs and the outputs of this second-level processor -- and the inputs and outputs of this second-level processor include all the inputs and outputs of at least one first-level processor. In general, the absolute minimum number of inputs required for an n'th-level analogy processor is proportional to n: this is the case, for instance, if every n'th-level processor is connected to exactly one (n-1)'th-level processor. If each n'th-level processor is connected to k (n-1)'th-level processors for some k>1, then the number of inputs required for an n'th-level processor is (1-k^{n+1})/(1-k).

    In general, if a set of analogical reasoning processors -- N_k learning on level k, k ≤ n -- is arranged such that each processor learning on level k is connected to all the inputs and outputs of some set of n_k processors on level k-1, then the question of network architecture is the question of the relation between the N_k and the n_k. For instance, if N_k n_k = 8N_{k-1}, then each (k-1)-level processor is being analyzed by eight different k-level processors; but if N_k n_k = N_{k-1}, then each (k-1)-level processor is being analyzed by only one k-level processor.
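
A few lines of Python make the input-count arithmetic explicit; the counting convention for the k=1 chain is an assumption of the sketch:

```python
# Minimum number of inputs feeding an n'th-level analogy processor when
# each level-j processor watches k processors on level j-1 (k >= 1).

def min_inputs(n, k):
    if k == 1:
        return n + 1   # a simple chain: one processor per level below
    # geometric series 1 + k + ... + k^n = (1 - k^(n+1)) / (1 - k)
    return (1 - k ** (n + 1)) // (1 - k)

for n in range(4):
    print(n, min_inputs(n, 2))   # 1, 3, 7, 15: doubling fan-in grows fast
```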

    This is a hierarchical analogy network: a hierarchy of processors learning by analogy how to best learn by analogy how to best learn by analogy how to best learn by analogy... how to best learn by analogy. As will be explained in later chapters, in order to be effective it must be coupled with a structurally associative memory network, which provides a knowledge base according to which the process of analogy can be executed.

6.4 Structural Analogy in the Brain

    The neurons of the cortex are organized in clusters, each containing 50 to 10,000 neurons. The neurons of each cluster are connected primarily to other neurons in the same cluster. Edelman (1988) has proposed that it makes sense to think of connections between clusters, not just individual neurons, as being reinforced or inhibited; and he has backed this up with a detailed mathematical model of neural behavior. Following this line of thought, it is the nature of the interaction between neural clusters which interests us here. We shall show that, according to a simple model inspired by Edelman's ideas, the interaction of neural clusters gives rise to a simple form of structural analogy.

    Edelman's theory of "Neural Darwinism" divides the evolution of the brain into two phases. The first, which takes place during fetal development, is the phase of cluster formation. And the second, occurring throughout the remainder of life, is the phase of repermutation: certain arrangements of clusters are selected from the set of possible arrangements. This is not an outlandish hypothesis; Changeux (1985), among others, has made a similar suggestion. Edelman, however, has formulated the theory as a sequence of specific biochemical hypotheses, each of which is supported by experimental results.

    Roughly speaking, a set of clusters which is habitually activated in a certain order is called a map. Mental process, according to Edelman's theory, consists of the selection of maps, and the actual mapping of input -- each map receiving input from sensory sources and/or other maps, and mapping its output to motor control centers or other maps. Mathematically speaking, a map is not necessarily a function, since on different occasions it may conceivably give different outputs for the same input. A map is, rather, a dynamical system in which the output y_t at time t and the internal state S_t at time t are determined by y_t = f(x_{t-1}, S_{t-1}), S_t = g(x_{t-1}, S_{t-1}), where x_t is the input at time t. And it is a dynamical system with a special structure: each map may be approximately represented as a serial/parallel composition of simple maps which are composed of single clusters. One key axiom of Edelman's theory is the existence of numerous "degenerate" clusters, clusters which are different in internal structure but, in many situations, act the same (i.e. often produce the same output given the same input). This implies that each map is in fact a serial/parallel combination of simple component maps which are drawn from a fairly small set, or at least extremely similar to elements of a fairly small set.

    There is much more to Neural Darwinism than what I have outlined here -- for instance, I have not even mentioned Edelman's intriguing hypotheses as to the role of adhesion molecules in the development of mentality. But there are nevertheless a number of mysteries at the core of the theory. Most importantly, it is not known exactly how the behavior of a cluster varies with its strengths of interconnection and its internal state.

    Despite this limitation, however, Neural Darwinism is in my opinion the only existing effective theory of low-to-intermediate-level brain structure and function. Philosophically, it concords well with our theory of mind: it conceives the brain as a network of "maps", or "functions". In previous chapters we spoke of a network of "programs," but the terms "program" and "map" are for all practical purposes interchangeable. The only difference is that a program is discrete whereas the brain is considered "continuous," but since any continuous system may be approximated arbitrarily closely by some discrete system, this is inessential.

    So, consider a set of n processors P_i, each one possessing an internal state S_{i,t} at each time t. Assume, as above, that the output y_{i,t} at time t and the internal state S_{i,t} at time t are determined by y_{i,t} = f_i(x_{i,t-1}, S_{i,t-1}), S_{i,t} = g_i(x_{i,t-1}, S_{i,t-1}), where x_{i,t} is the input at time t. Define the "design" of each processor as the set of all its possible states. Assume the processors are connected in such a way that the input of each processor is composed of subsets of the output of some set of k processors (where k is small compared to n, say O(log n)). This is a basic "network of processors".

    Note that we have defined f and g to vary with i. Strictly speaking, this means that the different processors are different dynamical systems. In this case, what it really means, intuitively, is that f_i and g_i may vary with the pattern of flow of the dynamical system. In fact, from here on we shall assume g_i = g for all i; we shall not consider the possibility of varying the g_i. However, we shall be concerned with minor variations in the f_i; in particular, with "strengthening" and "weakening" the connections between one processor and another.

    For this purpose, we may as well assume that the space of outputs which y_{i,t} admits is R^n or a discrete approximation of some subset thereof. Let s_{i,j} denote the scalar strength of connection from P_i to P_j. For each P_i, let f_{i,j,t} denote the portion of the graph of f_i which is connected to P_j at time t. Assume that if j and l are unequal, then f_{i,j,r} and f_{i,l,t} are disjoint for any t and r; and that f_{i,j,r} = s_{i,j} f_{i,j,t} for all t and r. According to all this, then, the only possible variations of f_i over time are merely variations of strength of connectivity, or "conductance".
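
A toy simulation of such a network; the linear conductance-weighted inputs, the state update, and the tanh output function are arbitrary illustrative choices, not part of the model:

```python
# A toy "network of processors" in which the only time-variation allowed
# in f_i is the conductance s_{i,j} scaling what P_i sends toward P_j.

import math
import random

n = 4
random.seed(1)
s = [[random.random() for _ in range(n)] for _ in range(n)]   # s[i][j]: P_i -> P_j
state = [0.0] * n            # S_{i,t}
out = [1.0, 0.0, 0.0, 0.0]   # y_{i,t}

def step(out, state):
    """One synchronous update: y_{i,t} = f_i(x_{i,t-1}, S_{i,t-1}), likewise S."""
    new_out, new_state = [], []
    for i in range(n):
        x_i = sum(s[j][i] * out[j] for j in range(n))   # conductance-weighted input
        S_i = 0.5 * state[i] + x_i                      # g, shared by all processors
        new_state.append(S_i)
        new_out.append(math.tanh(S_i))                  # f_i, fixed up to s_{j,i}
    return new_out, new_state

for _ in range(3):
    out, state = step(out, state)
print([round(y, 3) for y in out])
```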

    As observed above, our model of mind is expressed as a network of processors, and Edelman's theory expresses brain function as a result of the dynamics of a network of neuronal clusters, which are specialized processors. In the context of neuronal clusters, the "design" as defined above is naturally associated with the graph of interconnection of the neurons in the cluster; a particular state is then an assignation of charge levels to the neurons in the cluster, and a specification of the levels of various chemicals, most importantly those concerned with the modification of synaptic strength. According to Edelman's theory, the designs of the million or so processors in the brain's network are highly repetitive; they do not vary much from a much smaller set of fundamental designs. And it is clear that all the functions f_i regulating connection between neuronal clusters are essentially the same, except for variations in conductance.

THE NOISY HEBB RULE

    D.O. Hebb, in his classic Organization of Behavior (1949), sought to explain mental process in terms of one very simple neural rule: when a synaptic connection between two neurons is used, its "conductance" is temporarily increased. Thus a connection which has proved somehow useful in the past will be reinforced. This provides an elegant answer to the question: where does the learning take place? There is now physiological evidence that Hebbian learning does indeed occur. Two types of changes in synaptic strength have been observed (Bliss, 1979): "post-tetanic potentiation", which lasts for at most a few minutes, and "enhancement", which can last for hours or days.

    Hebb proceeded from this simple rule to the crucial concepts of the cell-assembly and the phase-sequence:

        Any frequently repeated, particular stimulation will lead to the slow development of a "cell-assembly", a diffuse structure comprising cells in the cortex and diencephalon... capable of acting briefly as a closed system, delivering facilitation to other such systems and usually having a specific motor facilitation. A series of such events constitutes a "phase sequence" -- the thought process. Each assembly action may be aroused by a preceding assembly, by a sensory event, or -- normally -- by both.

This theory has been criticized on the physiological level; but this is really irrelevant. As Hebb himself said, "it is... on a class of theory that I recommend you to put your money, rather than any specific formulation that now exists" (1963, p.16). The more serious criticism is that Hebb's ideas do not really explain much about the most interesting aspects of mental process. Simple stimulus-response learning is a long way from analogy, associative memory, deduction, and the other aspects of thought which Hebb hypothesizes to be special types of "phase sequences".

    Edelman's ideas mirror Hebb's on the level of neural clusters rather than neurons. In the notation given above, Edelman has proposed that if the connection from P_1 to P_2 is used often over a certain interval of time, then its "conductance" s_{1,2} is temporarily increased. Physiologically, this is a direct consequence of Hebb's neuron-level principle; it is simply more specific. It provides a basis for the formation of maps: sets of clusters through which information very often flows according to a certain set of paths. Without this Hebbian assumption, lasting maps would occur only by chance; with it, their emergence from the chaos of neural flow is virtually guaranteed. At bottom, what the assumption amounts to is a neural version of the principle of induction. It says: if a pathway has been useful in the past, we shall assume it will be useful in the future, and hence make it more effective.

    Unfortunately, there is no reason to believe that the cluster-level interpretation of Hebb's theory is sufficient for the explanation of higher mental processes. By constructing a simulation machine, Edelman has shown that a network of entities much like neural clusters, interacting according to the Hebbian rule, can learn to perceive certain visual phenomena with reasonable accuracy. But this work -- like most perceptual biology -- has not proceeded past the lower levels of the perceptual hierarchy.

    In order to make a bridge between these neural considerations and the theory of mind, I would like to propose a substantially more general hypothesis: that if the connection between P_1 and P_3 is used often over a certain interval of time, and the network is structured so that P_2 can potentially output into P_3, then the conductance s_{2,3} is likely to be temporarily increased.

    As we shall see, this "noisy Hebb rule" leads immediately to a simple explanation of the emergence of analogy from a network of neural clusters. Although I know of no evidence either supporting or falsifying the noisy Hebb rule, it is certainly not biologically unreasonable. One way of fulfilling it would be a spatial imprecision in the execution of Hebbian neural induction. That is, if the inductive increase in conductance following repeated use of a connection were by some means spread liberally around the vicinity of the connection, this would account for the rule to within a high degree of approximation. This is yet another case in which imprecision may lead to positive results.
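
The rule is easy to simulate. In the sketch below, the boost size, the spread, and the probability of spillover are all arbitrary choices; only the shape of the rule -- direct reinforcement plus imprecise spreading -- comes from the text:

```python
# The noisy Hebb rule, sketched: using the connection P_i -> P_j boosts
# s_{i,j}, and the boost spreads imprecisely to other connections into P_j.

import random

n = 6
s = [[0.1] * n for _ in range(n)]   # s[i][j]: conductance from P_i to P_j

def noisy_hebb(i, j, boost=0.5, spread=0.2):
    """Reinforce the used connection, and leak reinforcement to neighbors."""
    s[i][j] += boost
    for other in range(n):
        if other != i and random.random() < 0.5:   # spatial imprecision
            s[other][j] += spread                   # P_other can output into P_j

random.seed(0)
for _ in range(10):
    noisy_hebb(1, 3)   # the connection P_1 -> P_3 is habitually used

print(round(s[1][3], 1))   # strongly reinforced by direct Hebbian induction
print(round(s[2][3], 1))   # also raised, though never used: the "noise"
```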

THE NOISY HEBB RULE AND STRUCTURAL ANALOGY

    Consider the situation in which two maps, A and B, share a common set of clusters. This should not be thought an uncommon occurrence; on the contrary, it is probably very rare for a cluster to belong to one map only. Let B-A (not necessarily a map) denote the set of clusters in B but not A. The activation of map A will cause the activation of some of those clusters in map B-A which are adjacent to clusters in A. And the activation of these clusters may cause the activation of some of the clusters in B-A which are adjacent to them -- and so on. Depending on what is going on in the rest of B, this process might peter out with little effect, or it might result in the activation of B. In the latter case, what has occurred is the most primitive form of structural analogy.

    Structural analogy, as defined earlier, may be very roughly described as reasoning of the form: A and B share a common pattern, so if A is useful, B may also be useful. The noisy Hebb rule involves only the simplest kind of common pattern: the common subgraph. But it is worth remembering that analogy based on common subgraphs also came up in the context of Poetzsche's approach to analogical robot learning. Analogy by common subgraphs works. There is indeed a connection between neural analogy and conceptual analogy.

    And -- looking ahead to Chapter 7 -- it is also worth noting that, in this simple case, analogy and structurally associative memory are inextricably intertwined: A and B have a common pattern and are consequently stored near each other (in fact, interpenetrating each other); and it is this associative storage which permits neural analogy to lead to conceptual analogy.