Structure of Intelligence -- Copyright Springer-Verlag 1993


Chapter 9


9.0 The Perceptual Hierarchy

    In accordance with the philosophy outlined in Chapter 5, I define perception as pattern recognition. Pattern recognition is, of course, an extremely difficult optimization problem. In fact, the task of recognizing all the patterns in an arbitrary entity is so hard that no algorithm can solve it exactly -- this is implied by Chaitin's (1987) algorithmic-information-theoretic proof of Gödel's Theorem. As usual, though, exact solutions are not necessary in practice. One is, rather, concerned with finding a reasonably rapid and reliable method for getting fairly decent approximations.

    I propose that minds recognize patterns according to a multilevel strategy. Toward this end, I hypothesize a hierarchy of perceptual levels, each level recognizing patterns in the output of the level below it, and governed by the level immediately above it. Schematically, the hierarchy may be understood to extend indefinitely in two directions (Fig. 4). It will often be convenient to, somewhat arbitrarily, pick a certain level and call it the zero level. Then, for n = ..., -3, -2, -1, 0, 1, 2, 3, ..., the idea is that level n recognizes patterns in the output of level n-1, and also manipulates the pattern-recognition algorithm of level n-1.
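This multilevel scheme can be sketched in code. The following toy Python fragment is my illustration, not part of the book's formalism: it chains a stack of pattern recognizers so that each level consumes the output of the level below. The downward "governing" connections, by which level n adjusts the algorithm of level n-1, are omitted for brevity, and the three sample levels are purely hypothetical placeholders.

```python
def make_hierarchy(recognizers):
    """Chain recognizers so that level n consumes the output of level n-1."""
    def perceive(raw_input):
        outputs = [raw_input]
        for recognize in recognizers:       # the first recognizer is the lowest level
            outputs.append(recognize(outputs[-1]))
        return outputs                      # one output per level
    return perceive

# Hypothetical levels: "enhance gradients", "detect edges", "count shapes".
levels = [
    lambda xs: [x * 2 for x in xs],                  # crude "gradient enhancement"
    lambda xs: [b - a for a, b in zip(xs, xs[1:])],  # crude "edge detection"
    lambda xs: sum(1 for x in xs if x > 0),          # crude "shape count"
]

perceive = make_hierarchy(levels)
print(perceive([1, 3, 2, 5]))
```

Each call returns the outputs of every level at once, so one can inspect what any given level "sees" in the output of the level below it.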


    Physically speaking, any particular mind can deal only with a finite segment of this hierarchy. Phenomenologically speaking, a mind can never know exactly how far the hierarchy extends in either direction.

    One may analyze consciousness as a process which moves from level to level of the perceptual hierarchy, but only within a certain restricted range. If the zero level is taken to represent the "average" level of consciousness, and consciousness resides primarily on levels from -L to U, then the levels below -L represent perceptions which are generally below conscious perception. And, on the other hand, the levels above U represent perceptions that are in some sense beyond conscious perception: too abstract or general for consciousness to encompass.

    Consciousness can never know how far the hierarchy extends, either up or down. Thus it can never encounter an ultimate physical reality: it can never know whether a perception comes from ultimate reality or just from the next level down.     

    Perception and motor control might be defined as the link between mind and reality. But this is a one-sided definition. Earlier we defined intelligence by dividing the universe into an organism and an environment. From this "God's-eye" point of view an organism's perceptual and motor systems are the link between that organism and its environment. But from the internal point of view, from the point of view of the conscious organism, there can be no true or ultimate reality, but only the results of perception.

    Therefore, in a sense, the result of perception is reality; and the study of perception is the study of the construction of external reality. One of the aims of this chapter and the next is to give a model of perception and motor control that makes sense from both points of view -- the objective and the subjective, the God's-eye and the mind's-eye, the biological and the phenomenological.


    Fodor (1983) has proposed that, as a general rule, there are a number of significant structural differences between input systems and central processing systems. He has listed a set of properties which are supposed to be common to all the input systems of the human brain: the visual processing system, the auditory processing system, the olfactory and tactile processing systems, etc.:

1. Input systems are domain specific: each one deals only with a certain specific type of problem.

2. Input systems operate regardless of conscious desires; their operation is mandatory.

3. The central processing systems have only limited access to the representations which input systems compute.

4. Input systems work rapidly.

5. Input systems do most of their work without reference to what is going on in the central processing systems, or in other input systems.

6. Input systems have "shallow" output, output which is easily grasped by central processing systems.

7. Input systems are associated with fixed neural architecture.

8. The development of input systems follows a certain characteristic pace and sequence.

    I think these properties are a very good characterization of the lower levels of the perceptual hierarchy. In other words, it appears that the lower levels of the perceptual hierarchy are strictly modularized. Roughly speaking, say, levels -12 to -6 might be as depicted in Figure 5, with the modular structure playing as great a role as the hierarchical structure.

    If, say, consciousness extended from levels -3 to 3, then it might be that the modules of levels -12 to -6 melded together below the level of consciousness. In this case the results of, say, visual and auditory perception would not present themselves to consciousness in an entirely independent way. What you saw might depend upon what you heard.


    A decade and a half ago, Hubel and Wiesel (1988) demonstrated that the brain possesses specific neural clusters which behave as processors for judging the orientation of line segments. Since then many other equally specific visual processors have been found. It appears that Area 17 of the brain, the primary visual cortex, which deals with relatively low-level vision processing, is composed of various types of neuronal clusters, each type corresponding to a certain kind of processing, e.g. line orientation processing.

    And, as well as perhaps being organized in other ways, these clusters do appear to be organized in levels. At the lowest level, in the retina, gradients are enhanced and spots are extracted -- simple mechanical processes. Next come simple moving edge detectors. The next level, the second level up from the retina, extracts more sophisticated information from the first level up -- and so on. Admittedly, little is known about the processes two or more levels above the retina. It is clear (Uhr, 1987), however, that there is a very prominent hierarchical structure, perhaps supplemented by more complex forms of parallel information processing. For instance, most neuroscientists would agree that there are indeed "line processing" neural clusters, and "shape processing" neural clusters, and that while the former pass their results to the latter, the latter sometimes direct the former (Rose and Dobson, 1985).

    And there is also recent evidence that certain features of the retinal image are processed in "sets of channels" which proceed several levels up the perceptual hierarchy without intersecting each other -- e.g. a set of channels for color, a set of channels for stereoposition, etc. This is modular perception at a level lower than that considered by Fodor. For instance, Mishkin et al (1983) have concluded from a large amount of physiological data that two major pathways pass through the visual cortex and then diverge in the subsequent visual areas: one pathway for color, shape and object recognition; the other for motion and spatial interrelations. The first winds up in the inferior temporal areas; the second leads to the inferior parietal areas.

    And, on a more detailed level, Regan (1990) reviews evidence for three color channels in the fovea, around six spatial frequency channels from each retinal point, around eight orientation channels and eight stereomotion channels, two or three stereoposition channels, three flicker channels, two changing-size channels, etc. He investigates multiple sclerosis by looking at the level of the hierarchy -- well below consciousness -- at which the various sets of channels intersect.


     If one needs to compute the local properties of a visual scene, the best strategy is to hook up a large parallel array of simple processors. One can simply assign each processor to a small part of the picture; and connect each processor to those processors dealing with immediately neighboring regions. However, if one needs to compute the overall global properties of visual information, it seems best to supplement this arrangement with some sort of additional network structure. The pyramidal architecture (Fig. 6) is one way of doing this.

    A pyramidal multicomputer is composed of a number of levels, each one connected to the levels immediately above and below it. Each level consists of a parallel array of processors, each one connected to 1) a few neighboring processors on the same level, 2) one or possibly a few processors on the level immediately above, 3) many processors on the level immediately below. Each level has many fewer processors than the one immediately below it. Often, for instance, the number of processors per level decreases exponentially.

    Usually the bottom layer is vaguely retina-like, collecting raw physical data. Then, for instance, images of different resolution can be obtained by averaging up the pyramid: assigning each processor on level n a distinct set of processors on level n-1, and instructing it to average the values contained in these processors.
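The averaging scheme just described can be sketched directly. In this minimal Python illustration (the 2x2, four-to-one fan-in is my assumption, chosen for simplicity), each level-n cell holds the mean of a distinct block of level n-1 cells:

```python
def pyramid_levels(base):
    """base: a 2^k x 2^k grid of values; returns the list of all levels,
    each obtained by averaging distinct 2x2 blocks of the level below."""
    levels = [base]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        size = len(prev) // 2
        nxt = [[(prev[2*i][2*j] + prev[2*i][2*j+1] +
                 prev[2*i+1][2*j] + prev[2*i+1][2*j+1]) / 4.0
                for j in range(size)] for i in range(size)]
        levels.append(nxt)
    return levels

# A toy 4x4 "retina": a bright patch, then a uniform region.
image = [[0, 0, 8, 8],
         [0, 0, 8, 8],
         [4, 4, 4, 4],
         [4, 4, 4, 4]]
levels = pyramid_levels(image)
```

Each successive level is a coarser-resolution image of the one below, which is exactly the "images of different resolution" construction described above.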

    Or, say, the second level could be used to recognize edges; the third level to recognize shapes; the fourth level to group elementary shapes into complex forms; and the fifth level to compare these complex forms with memory.

    Stout (1986) has proved that there are certain problems -- such as rotating a scene by pi radians -- for which the pyramidal architecture will perform little better than its base level would all by itself. He considers each processor on level n to connect to 4 other processors on level n, 4 processors on level n-1, and one processor on level n+1. The problem is that, in this arrangement, if two processors on the bottom level need to communicate, they may have to do so by either 1) passing a message step by step across the bottom level, or 2) passing a message all the way up to the highest level and back down.
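The two routes can be compared with a little arithmetic. In the sketch below (my numbers, chosen only to illustrate Stout's point), two diagonally opposite base processors of a 512x512 array are over a thousand hops apart across the mesh, but only eighteen hops apart via the apex; the trouble is that when many pairs must communicate at once, the apex becomes a bottleneck.

```python
import math

def mesh_hops(x1, y1, x2, y2):
    """Steps across the base level (4-neighbor mesh): Manhattan distance."""
    return abs(x1 - x2) + abs(y1 - y2)

def pyramid_hops(s):
    """Worst case via the apex: up log2(s) levels and back down."""
    return 2 * int(math.log2(s))

s = 512  # base level of size s x s
print(mesh_hops(0, 0, s - 1, s - 1))  # route 1: across the bottom level
print(pyramid_hops(s))                # route 2: up to the apex and back down
```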

    However, Stout also shows that this pyramidal architecture is optimal for so-called "perimeter-bound" problems -- problems with nontrivial communication requirements, but for which each square of s^2 processors on the base level needs to exchange only O(s) bits of information with processors outside that square. Examples of perimeter-bound problems include labeling all the connected components of an image, and finding the minimum distance between one component and another.

    In sum, it seems that strict pyramidal architectures are very good at solving problems which require processing that is global, but not too global. When a task requires an extreme amount of global communications, a parallel architecture with greater interconnection is called for -- e.g. a "hypercube" architecture.

    Thinking more generally, Levitan et al (1987) have constructed a three-level "pyramidal" parallel computer for vision processing. As shown in Figure 7, the bottom level deals with sensory data and with low-level processing such as segmentation into components. The intermediate level takes care of grouping, shape detection, and so forth; and the top level processes this information "symbolically", constructing an overall interpretation of the scene. The base level is a 512x512 square array of processors each doing exactly the same thing to different parts of the image; and the middle level is composed of a 64x64 square array of relatively powerful processors, each doing exactly the same thing to different parts of the base-level array. Finally, the top level contains 64 very powerful processors, each one operating independently according to programs written in LISP (the standard AI programming language). The intermediate level may also be augmented by additional connections, e.g. a hypercube architecture.

    This three-level perceptual hierarchy appears to be an extremely effective approach to computer vision. It is not a strict pyramidal architecture of the sort considered by Stout, but it retains the basic pyramidal structure despite the presence of other processes and interconnections.


    In sum, it is fairly clear that human perception works according to a "perceptual hierarchy" of some sort. And it is also plain that the perceptual hierarchy is a highly effective way of doing computer vision. However, there is no general understanding of the operation of this hierarchy. Many theorists, such as Uttal (1988), suspect that such a general understanding may be impossible -- that perception is nothing more than a largely unstructured assortment of very clever tricks. In 1965, Hurvich et al made the following remark, and it is still apt: "the reader familiar with the visual literature knows that this is an area of many laws and little order" (p.101).

    I suggest that there is indeed an overall structure to the process. This does not rule out the possibility that a huge variety of idiosyncratic tricks are involved; it just implies that these tricks are not 100% of the story. The structure which I will propose is abstract and extremely general; and I am aware that this can be a limitation. As Uttal has observed,

        Perceptual psychophysics has long been characterized by experiments specific to a microscopically oriented theory and by theories that either deal with a narrowly defined data set at one extreme or, to the contrary, a global breadth that is so great that data are virtually irrelevant to their construction. Theories of this kind are more points of view than analyses. (p.290)

Uttal would certainly put the theory given here in the "more point of view than analysis" category. However, it seems to me that, if the gap between psychophysical theory and data is ever to be bridged, the first step is a better point of view. And similarly, if the gap between biological vision and computer vision is ever to be closed, we will need more than just superior technology -- we will need new, insightful general ideas. Therefore I feel that, at this stage, it is absolutely necessary to study the abstract logic of perception -- even if, in doing so, one is guided as much by mathematical and conceptual considerations as by psychophysical or other data.

9.1 Probability Theory

    The branch of mathematics known as probability theory provides one way of making inferences regarding uncertain propositions. But it is not a priori clear that it is the only reasonable way to go about making such inferences. This is important for psychology because it would be nice to assume, as a working hypothesis, that the mind uses the rules of probability theory to process its perceptions. But if the rules of probability theory were just an arbitrary selection from among a disparate set of possible schemes for uncertain inference, then there would be little reason to place faith in this hypothesis.

    Historically, most attempts to derive general laws of probability have been "frequentist" in nature. According to this approach, in order to say what the statement "the probability of X occurring in situation E is 1/3" means, one must invoke a whole "ensemble" of situations. One must ask: if I selected a situation from among an ensemble of n situations "identical" to E, what proportion of the time would X be true? If, as n tended toward infinity, this proportion tended toward 1/3, then it would be valid to say that the probability of X occurring in situation E is 1/3.

    In some cases this approach is impressively direct. For instance, consider the proposition: "The face showing on the fair six-sided die I am about to toss will be either a two or a three". Common sense indicates that this proposition has probability 1/3. And if one looked at a large number of similar situations -- i.e. a large number of tosses of the same die or "identical" dice -- then one would indeed find that, in the long run, a two or a three came up 1/3 of the time.
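This long-run convergence is easy to check by simulation. The following Python snippet (a toy illustration, not an argument) tosses a simulated fair die many times and tracks the frequency of "two or three":

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
hits = sum(1 for _ in range(n) if random.randint(1, 6) in (2, 3))
frequency = hits / n
print(frequency)  # close to 1/3, as the frequentist reading predicts
```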

    But often it is necessary to assign probabilities to unique events. In such cases, the frequency interpretation has no meaning. This occurs particularly often in geology and ecology: one wishes to know the relative probabilities of various outcomes in a situation which is unlikely ever to recur. When the problem has to do with a bounded region of space, say a forest, it is possible to justify this sort of probabilistic reasoning using complicated manipulations of integral calculus. But what is really required, in order to justify the general application of probability theory, is some sort of proof that the rules of probability theory are uniquely well-suited for probable inference.

    Richard Cox (1961) has provided such a proof. First of all, he assumes that any possible rule for assigning a "probability" to a proposition must obey the following two rules:

        The probability of an inference on given evidence determines the probability of its contradictory on the same evidence (p.3)

        The probability on given evidence that both of two inferences are true is determined by their separate probabilities, one on the given evidence, the other on this evidence with the additional assumption that the first inference is true (p.4)

The probability of a proposition on certain evidence is the probability that logically should be assigned to that proposition by someone who is aware only of this evidence and no other evidence. In Boolean notation, the first of Cox's rules says simply that if one knows the probability of X on certain evidence, then one can deduce the probability of -X on that same evidence without using knowledge about anything else. The second rule says that if one knows the probability of X given certain evidence E, and the probability of Y given XE, then one can deduce the probability that X and Y are both true on the evidence E, without using knowledge about anything else.

    These requirements are hard to dispute; in fact, they don't seem to say very much. But their simplicity is misleading. In mathematical notation, the first requirement says that P(-X%E)=f[P(X%E)], and the second requirement says that P(XY%E)=F[P(X%E),P(Y%XE)], where f and F are unspecified functions. What is remarkable is that these functions need not remain unspecified. Cox has shown that the laws of Boolean algebra dictate specific forms for these functions.

    For instance, they imply that G[P(XY%E)] = CG[P(X%E)]G[P(Y%XE)], where C is some constant and G is some function. This is almost a proof that for any measure of probability P, P(XY%E)=P(X%E)P(Y%XE). For if one sets G(x)=x, this rule is immediate. And, as Cox points out, if P(X%E) measures probability, then so does G[P(X%E)] -- at least, according to the two axioms given above. The constant C may be understood by setting X=Y and recalling that XX=X according to the axioms of Boolean algebra. It follows by simple algebra that C = G[P(X%XE)] -- i.e., C is the probability of X on the evidence X, the numerical value of certainty. Typically, in probability theory, C=1. But this is a convention, not a logical requirement.

    As for negation, Cox has shown that if P(X)=f[P(-X)], Boolean algebra leads to the formula X^r + [f(X)]^r = 1, where X is here shorthand for P(X) and r is a constant. Given this, we could leave r unspecified and use P(X)^r as the symbol of probability; but, following Cox, let us take r=1.

    Cox's analysis tells us in exactly what sense the laws of probability theory are arbitrary. All the laws of probability theory can be derived from the rules P(X%E)=1-P(-X%E) and P(XY%E)=P(X%E)P(Y%XE). And these rules are essentially the only ways of dealing with negation and conjunction that Boolean algebra allows. So, if we accept Boolean algebra and Cox's two axioms, we accept probability theory.
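These two rules can be verified concretely on a finite sample space. In the Python sketch below (the sample space and the events X, Y are my own construction, for illustration only), probabilities are computed by counting outcomes, with the conditioning evidence passed as a second argument to P:

```python
from fractions import Fraction

outcomes = set(range(12))
E = set(range(12))            # the evidence: all outcomes admissible
X = {0, 1, 2, 3, 4, 5}        # some event
Y = {3, 4, 5, 6, 7}           # another event

def P(A, given):
    """P(A%given): exact conditional probability by counting."""
    return Fraction(len(A & given), len(given))

# Negation rule: P(X%E) = 1 - P(-X%E)
assert P(X, E) == 1 - P(outcomes - X, E)

# Product rule: P(XY%E) = P(X%E) * P(Y%XE)
assert P(X & Y, E) == P(X, E) * P(Y, X & E)

print(P(X & Y, E))  # 3/12 = 1/4
```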

    Finally, for a more concrete perspective on these issues, let us turn to the work of Krebs, Kacelnik and Taylor (1978). These biologists studied the behavior of birds (great tits) placed in an aviary containing two machines, each consisting of a perch and a food dispenser. One of the machines dispenses food p% of the times that its perch is landed on, and the other one dispenses food q% of the times that its perch is landed on. They observed that the birds generally visit the two machines according to the optimal strategy dictated by Bayes' rule and Laplace's Principle of Indifference -- a strategy which is not particularly obvious. This is a strong rebuttal to those who raise philosophical objections against the psychological use of probability theory. After all, if a bird's brain can use Bayesian statistics, why not a human brain?


    Assume that one knows that one of the propositions Y1,Y2,...,Yn is true, and that only one of these propositions can possibly be true. In mathematical language, this means that the collection {Y1,...,Yn} is exhaustive and mutually exclusive. Then, Bayes' rule says that


        P(Yi%X) = P(X%Yi)P(Yi) / [P(X%Y1)P(Y1) + P(X%Y2)P(Y2) + ... + P(X%Yn)P(Yn)]


In itself this rule is unproblematic; it is a simple consequence of the two rules of probable inference derived in the previous section. But it lends itself to controversial applications.
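Computing with the rule is straightforward once the priors and the likelihoods are in hand. The following Python sketch uses invented numbers purely for illustration; nothing here resolves the question, taken up below, of where the priors come from:

```python
def bayes(priors, likelihoods):
    """posterior[i] = P(Yi%X) for an exhaustive, mutually exclusive {Yi}.
    priors[i] = P(Yi); likelihoods[i] = P(X%Yi)."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

priors      = [0.3, 0.2, 0.5]   # P(Y1), P(Y2), P(Y3) -- hypothetical values
likelihoods = [0.1, 0.4, 0.05]  # P(X%Y1), P(X%Y2), P(X%Y3) -- hypothetical

posterior = bayes(priors, likelihoods)
print(posterior)
```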

    For instance, suppose Y1 is the event that a certain star system harbors intelligent life which is fundamentally dissimilar from us, Y2 is the event that it harbors intelligent life which is fundamentally similar to us, and Y3 is the event that it harbors no intelligent life at all. Assume these events have somehow been precisely defined. Suppose that X is a certain sequence of radio waves which we have received from that star system, and that one wants to compute P(Y2%X): the probability, based on the message X, that the system has intelligent life which is fundamentally similar to us. Then Bayes' rule applies: {Y1,Y2,Y3} is exhaustive and mutually exclusive. Suppose that we have a good estimate of P(X%Y1), P(X%Y2), and P(X%Y3): the probability that an intelligence dissimilar to us would send out message X, the probability that an intelligence similar to us would send out message X, and the probability that an unintelligent star system would somehow emit message X. But how do we know P(Y1), P(Y2) and P(Y3)?

    We cannot deduce these probabilities directly from the nature of messages received from star systems. So where do the P(Yi) come from? This problem, at least in theory, makes the business of identifying extraterrestrial life extremely tricky. One might argue that it makes it impossible, because the only things we know about stars are derived from electromagnetic "messages" of one kind or another -- light waves, radio waves, etc. But it seems reasonable to assume that spectroscopic information, thermodynamic knowledge and so forth are separate from the kind of message-interpretation we are talking about. In this case there might be some kind of a priori physicochemical estimate of the probability of intelligent life, similar intelligent life, and so forth. Carl Sagan, among others, has attempted to estimate such probabilities. The point is that we need some kind of prior estimate for the P(Yi), or Bayes' rule is useless here.

    This example is not atypical. In general, suppose that X is an effect, and {Yi} is the set of possible causes. Then to estimate P(Y1%X) is to estimate the probability that Y1, and none of the other Yi, is the true cause of X. But in order to estimate this using Bayes' rule, it is not enough to know how likely X is to follow from Yi, for each i. One needs to know the probabilities P(Yi) -- one needs to know how likely each possible cause is, in general.

    One might suppose these problems to be a shortcoming of Bayes' rule, or of probability theory in general. But this is where Cox's demonstration proves invaluable. Any set of rules for uncertain reasoning which satisfies his two simple, self-evident axioms must necessarily lead to Bayes' rule, or something essentially equivalent with a few G's and r's floating around. Any reasonable set of rules for uncertain reasoning must be essentially identical to probability theory, and must therefore have no other method of deducing causes from effects than Bayes' rule.

    The perceptive reader might, at this point, accuse me of inconsistency. After all, it was observed above that quantum events may be interpreted to obey a different sort of logic. And in Chapter 8 I raised the possibility that the mind employs a weaker "paraconsistent" logic rather than Boolean logic. How then can I simply assume that Boolean algebra is applicable?

    However, the inconsistency is only apparent. Quantum logic and paraconsistent logic are both weaker than Boolean logic, and they therefore cannot lead to any formulas which are not also formulas of Boolean logic: they cannot improve on Bayes' rule.

    So how do we assign prior probabilities, in practice? It is not enough to say that the assignment comes down to instinct, to biological programming. It is possible to say something about how this programming works.


    Laplace's "Principle of Indifference" states that if a question is known to have exactly n possible answers, and these answers are mutually exclusive, then in the absence of any other knowledge one should assume each of these answers to have probability 1/n of being correct.

    For instance, suppose you were told that on the planet Uxmylarqg, the predominant intelligent life form is either blue, green, or orange. Then, according to the Principle of Indifference, if this were the only thing you knew about Uxmylarqg, you would assign a probability of 1/3 to the statement that it is blue, a probability of 1/3 to the statement that it is green, and a probability of 1/3 to the statement that it is orange. In general, according to the Principle of Indifference, if one had no specific knowledge about the n causes {Y1,...,Yn} which appear in the above formulation of Bayes' rule, one would assign a probability P(Yi)=1/n to each of them.

    Cox himself appears to oppose the Principle of Indifference, arguing that "the knowledge of a probability, though it is knowledge of a particular and limited kind, is still knowledge, and it would be surprising if it could be derived from... complete ignorance, asserting nothing". And in general, that is exactly what the Principle of Indifference does: supplies knowledge from ignorance. In certain specific cases, it may be proved to be mathematically correct. But, as a general rule of uncertain inference, it is nothing more or less than a way of getting something out of nothing. Unlike Cox, however, I do not find this surprising or undesirable, but rather exactly what the situation calls for.

9.2 The Maximum Entropy Principle

    If the Principle of Indifference tells us what probabilities to assign given no background knowledge, what is the corresponding principle for the case when one does have some background knowledge? Seeking to answer this question, E.T. Jaynes studied the writings of J. Willard Gibbs and drew therefrom a rule called the maximum entropy principle. Like the Principle of Indifference, the maximum entropy principle is provably correct in certain special cases, but in the general case, justifying it or applying it requires ad hoc, something-out-of-nothing assumptions.

    The starting point of the maximum entropy principle is the entropy function

    H(p1,...,pn) = - [p1logp1 + p2logp2 + ... + pnlogpn],

where {Yi} is an exhaustive, mutually exclusive collection of events and pi=P(Yi). This function first emerged in the work of Boltzmann, Gibbs and other founders of thermodynamics, but its true significance was not comprehended until Claude Shannon published The Mathematical Theory of Communication (1949). It is a measure of the uncertainty involved in the distribution {pi}.

    The entropy is always nonnegative. If, say, (p1,...,pn)=(0,0,1,0,...,0), then the entropy H(p1,...,pn) is zero -- because this sort of distribution has the minimum possible uncertainty. It is known which of the Yi is the case, with absolute certainty. On the other hand, if (p1,...,pn)=(1/n,1/n,...,1/n), then H(p1,...,pn)=log n, which is the maximum possible value. This represents the maximum possible uncertainty: each possibility is equally likely.
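Both extreme cases are easy to confirm numerically. A minimal Python check, using natural logarithms:

```python
import math

def entropy(ps):
    """H(p1,...,pn) = -sum pi log pi, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in ps if p > 0)

n = 6
certain = [0, 0, 1, 0, 0, 0]   # one outcome known with certainty
uniform = [1 / n] * n          # every outcome equally likely

print(entropy(certain))        # zero: minimum uncertainty
print(entropy(uniform))        # log n: maximum uncertainty
```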

    The maximum entropy principle states that, for any exhaustive, mutually exclusive set of events (Y1,...,Yn), the most likely probability distribution (p1,...,pn) with respect to a given set of constraints on the Yi is that distribution which, among all those that satisfy the constraints, has maximum entropy. The "constraints" represent particular knowledge about the situation in question; they are what distinguishes one problem from another.

    For instance, what if one has absolutely no knowledge about the various possibilities Yi? Then, where pi=P(Yi), can we determine the "most likely" distribution (p1,...,pn) by finding the distribution that maximizes H(p1,...,pn)? It is easy to see that, given no additional constraints, the maximum of H(p1,...,pn) occurs for the distribution (p1,...,pn)=(1/n,1/n,...,1/n). In other words, when there is no knowledge whatsoever about the Yi, the maximum entropy principle reduces to the Principle of Indifference.


    In thermodynamics the Yi represent, roughly speaking, the different possible regions of space in which a molecule can be; pi is the probability that a randomly chosen molecule is in region Yi. Each vector of probabilities (p1,...,pn) is a certain distribution of molecules amongst regions. The question is, what is the most likely way for the molecules to be distributed? One assumes that one knows the energy of the distribution, which is of the form E(p1,...,pn)=c1p1+...+cnpn, where the {ci} are constants obtained from basic physical theory. That is, one assumes that one knows an equation E(p1,...,pn)=K. Under this assumption, the answer to the question is: the most likely (p1,...,pn) is the one which, among all those possibilities that satisfy the equation E(p1,...,pn)=K, maximizes the entropy H(p1,...,pn). There are several other methods of obtaining the most likely distribution, but this is by far the easiest.

    What is remarkable is that this is not just an elegant mathematical feature of classical thermodynamics. In order to do the maximum entropy principle justice, we should now consider its application to quantum density matrices, or radio astronomy, or numerical linear algebra. But this would take us too far afield. Instead, let us consider Jaynes's "Brandeis dice problem", a puzzle both simple and profound.

    Consider a six-sided die, each side of which may have any number of spots between 1 and 6. The problem is (Jaynes, 1978):

        suppose [this] die has been tossed N times, and we are told only that the average number of spots up was not 3.5, as we might expect from an 'honest' die but 4.5. Given this information, and nothing else, what probability should we assign to i spots on the next toss? (p.49)

Let Yi denote the event that the next toss yields i spots; let pi=P(Yi). The information we have may be expressed as an equation of the form A(p1,...,p6)=4.5, where A(p1,...,p6) = 1p1 + 2p2 + ... + 6p6 is the expected number of spots. This equation says: whatever the most likely distribution of probabilities is, it must yield an average of 4.5, which is what we know the average to be.

    The maximum entropy principle says: given that the average number of spots up is 4.5, the most likely distribution (p1,...,pn) is the one that, among all those satisfying the constraint A(p1,...,pn)=4.5, maximizes the entropy H(p1,...,pn). This optimization problem is easily solved using Lagrange multipliers, and it has the approximate solution (p1,...,pn) = (.05435, .07877, .11416, .16545, .23977, .34749). If one had A(p1,...,pn)=3.5, the maximum entropy principle would yield the solution (p1,...,pn)=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6); but, as one would expect, knowing that the average is 4.5 makes the higher numbers more likely and the lower numbers less likely.
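The solution quoted above can be reproduced numerically. Under a mean constraint the maximum entropy distribution takes the exponential form pi proportional to exp(lam*i), and the multiplier lam can be found by bisection; the sketch below is my implementation, not Jaynes's, but it recovers his figures:

```python
import math

def maxent_dice(target_mean):
    """Maximum entropy distribution on {1,...,6} with the given mean."""
    def mean(lam):
        ws = [math.exp(lam * i) for i in range(1, 7)]
        z = sum(ws)
        return sum(i * w for i, w in zip(range(1, 7), ws)) / z
    lo, hi = -10.0, 10.0
    for _ in range(200):              # bisection on the Lagrange multiplier
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    ws = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(ws)
    return [w / z for w in ws]

p = maxent_dice(4.5)
print([round(x, 5) for x in p])  # Jaynes: .05435 .07877 .11416 .16545 .23977 .34749
```

With target mean 3.5 the same routine returns the uniform distribution (1/6,...,1/6), as the text notes.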

    For the Brandeis dice problem, as in the case of classical thermodynamics, it is possible to prove mathematically that the maximum entropy solution is far more likely than any other solution. And in both these instances the maximization of entropy appears to be the most efficacious method of locating the optimal solution. The two situations are extremely similar: both involve essentially random processes (dice tossing, molecular motion), and both involve linear constraints (energy, average). Here the maximum entropy principle is at its best.


    The maximum entropy principle is most appealing when one is dealing with linear constraints. There is a simple, straightforward proof of its correctness. But when talking about the general task of intelligence, we are not necessarily restricted to linear constraints. Evans (1978) has attempted to surmount this obstacle by showing that, given any constraint F(p1,...,pn)=K, the overwhelmingly most likely values pi=P(Yi) may be found by maximizing

    -[p1log(p1/k1) + ... + pnlog(pn/kn)]

where k=(k1,k2,...,kn) is some "background distribution". The trouble with this approach is that the only known way of determining k is through a complicated sequence of calculations involving various ensembles of events.

    Shore and Johnson (1980) have provided an alternate approach, which has been refined considerably by Skilling (1989). Extending Cox's proof that probability theory is the only reasonable method for uncertain reasoning, Shore and Johnson have proved that if there is any reasonably general method for assigning prior probabilities in Bayes' Theorem, it has to depend in a certain way upon the entropy. Here we will not require all the mathematical details; the general idea will suffice.

    Where D is a subset of {Yi}, and C is a set of constraints, let f[D%C] denote the probability distribution assigned to the domain D on the basis of the constraints C. Let m={m1,m2,...,mn} denote some set of "background information" probabilities. For instance, if one actually has no background information, one might want to implement the Principle of Indifference and assume mi=1/n, for all i.

    Assume f[D%C] is intended to give the most likely probability distribution for D, given the constraints C. Then one can derive the maximum entropy principle from the following axioms:

Axiom I: Subset Independence

    If constraint C1 applies in domain D1 and constraint C2 applies in domain D2, then the product f[D1%C1]f[D2%C2] = f[D1,D2%C1,C2]. (Basically, this means that if the constraints involved do not interrelate D1 and D2, neither should the answer.) This implies that f[D%C] can be obtained by maximizing over a sum of the form S(p,m)=m1Q(p1)+...+mnQ(pn), where Q is some function.

Axiom II: Coordinate Invariance

    This is a technical requirement regarding the way that f[(p1,...,pn)%C] relates to f[(p1/q1,...,pn/qn)%C]: it states that if one expresses the regions in a different coordinate system, the probabilities do not change. It implies that S(p,m)=m1Q(p1/m1)+...+mnQ(pn/mn).

Axiom III: System Independence

    Philosophically, this is the crucial requirement. "If a proportion q of a population has a certain property, then the proportion of any sub-population having that property should properly be assigned as q.... For example, if 1/3 of kangaroos have blue eyes... then [in the absence of knowledge to the contrary] the proportion of left-handed kangaroos having blue eyes should be 1/3".

It can be shown that these axioms imply that f[Y%C] is proportional to the maximum of the entropy H(p1,...,pn) subject to the constraints C, whatever the constraints C may be (linear or not). And since it must be proportional to the entropy, one may as well take it to be equal to the entropy.

    These axioms are reasonable, though nowhere near as compelling as Cox's stunningly simple axioms for probable inference. They are not simply mathematical requirements; they have a great deal of philosophical substance. What they do not tell you, however, is by what amount the most likely solution f[Y%C] is superior to all other solutions. This requires more work.

    More precisely, one way of summarizing what these axioms show is as follows. Let m=(m1,...,mn) be some vector of "background" probabilities. Then f[D%C] must be assigned by maximizing the function

    S(p,m) = m1Q(p1/m1) + ... + mnQ(pn/mn) = -[p1log(p1/m1) + ... + pnlog(pn/mn)],

where the second form follows from taking Q(x) = -xlog(x), so that S(p,m) is the entropy of p relative to the background m.
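Assuming the entropy functional here takes the standard form S(p,m) = -[p1log(p1/m1)+...+pnlog(pn/mn)] (i.e. Q(x) = -xlog(x) in the Axiom II expression), it is zero exactly when p=m and negative otherwise, so maximizing it pulls the answer toward the background distribution. A minimal numerical check (the function name is illustrative):

```python
import math

def S(p, m):
    # Entropy of p relative to background m:  S(p,m) = -sum p_i*log(p_i/m_i).
    # Equivalent to sum m_i*Q(p_i/m_i) with Q(x) = -x*log(x).
    return -sum(pi * math.log(pi / mi) for pi, mi in zip(p, m))

m = [0.5, 0.3, 0.2]
print(S(m, m))                 # 0.0: S is maximized, at zero, when p = m
print(S([1/3, 1/3, 1/3], m))   # negative: any other p scores below zero
```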
    Evans has shown that, for any constraint C, there is some choice of m for which the maximum entropy principle gives a distribution which is not only correct but dramatically more likely than any other distribution. It is implicit, though not actually stated, in his work that given the correct vector (m1,...,mn), the prior probabilities {pi} in Bayes' formula must be given by

    pi = exp(aS)/Z,

where S = S(p,m) as given above, Z is a normalizing constant, and a is a parameter to be discussed below. Skilling has pointed out that, in every case for which the results have been calculated for any (m1,...,mn), with linear or nonlinear constraints, this same formula has been the result. He has given a particularly convincing example involving the Poisson distribution.

    In sum: the maximum entropy principle appears to be a very reasonable general method for estimating the best prior probabilities; and it often seems to be the case that the best prior probabilities are considerably better than any other choice. Actually, none of the details of the maximum entropy method are essential for our general theory of mentality. What is important is that, in the maximum entropy principle, we have a widely valid, practically applicable method for estimating the prior probabilities required by Bayes' Theorem, given a certain degree of background knowledge. The existence of such a method implies the possibility of a unified treatment of Bayesian reasoning.


    In order to use Bayes' rule to determine the P(Yi%X), one must know the P(X%Yi), and one must know the P(Yi). Determining the P(X%Yi) is, I will propose, a fundamentally deductive problem; it is essentially a matter of determining a property of the known quantity Yi. But the P(Yi) are a different matter. The maximum entropy principle is remarkable but not magical: it cannot manufacture knowledge about the P(Yi) where there isn't any. All it can do is work with given constraints C and given background knowledge m, and work these into a coherent overall guess at the P(Yi). In general, the background information about these probabilities must be determined by induction. In this manner, Bayes' rule employs both inductive and deductive reasoning.


    It is essential to note that the maximum entropy method is not entirely specified. Assuming the formulas given above are accurate, there is still the problem of determining the parameter a. It appears that there is no way to assign it a universal value once and for all -- its value must be set in a context-specific way. So if the maximum entropy principle is used for perception, the value of a must be set differently for different perceptual acts. And, furthermore, it seems to me that even if the maximum entropy principle is not as central as I am assuming, the problem of the parameter a is still relevant: any other general theory of prior probability estimation would have to give rise to a similar dilemma.

    Gull (1989) has demonstrated that the parameter a may be interpreted as a "regularizing parameter". If a is large, prior probabilities are computed in such a way that distributions which are far from the background model m are deemed relatively unlikely. But if a is very small, the background model is virtually ignored.

    So, for instance, if there is no real background knowledge and the background model m is obtained by the Principle of Indifference, the size of a determines the tendency of the maximum entropy method to assign a high probability to distributions in which all the probabilities are about the same. Setting a high would smooth away whatever structure the data contain -- "under-fitting". But, on the other hand, if m is derived from real background knowledge and the signal of which the Yi are possible explanations is very "noisy," then a low a will cause the maximum entropy principle to yield an optimal distribution with a great deal of random oscillation -- "over-fitting" of the noise. In general, one has to keep the parameter a small to get any use out of the data, but one has to make it large to prevent the maximum entropy principle from paying too much attention to chance fluctuations of the data.
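This trade-off can be illustrated with a toy two-outcome version of the problem. The function below, which maximizes a likelihood plus a times the relative entropy, and the use of golden-section search, are illustrative choices of mine, not part of Gull's method:

```python
import math

def penalized_fit(counts, m, a, iters=200):
    """Toy two-outcome illustration of the regularizing parameter a.

    Maximizes  log-likelihood(counts | p) + a * S(p, m)  over p = (p, 1-p),
    where S(p,m) = -sum p_i*log(p_i/m_i).  Golden-section search suffices
    because the objective is concave in p.
    """
    c1, c2 = counts
    m1, m2 = m

    def f(p):
        q = 1.0 - p
        return (c1 * math.log(p) + c2 * math.log(q)
                - a * (p * math.log(p / m1) + q * math.log(q / m2)))

    gr = (math.sqrt(5) - 1) / 2
    lo, hi = 1e-9, 1.0 - 1e-9
    x1, x2 = hi - gr * (hi - lo), lo + gr * (hi - lo)
    for _ in range(iters):
        if f(x1) < f(x2):          # maximum lies in [x1, hi]
            lo, x1 = x1, x2
            x2 = lo + gr * (hi - lo)
        else:                       # maximum lies in [lo, x2]
            hi, x2 = x2, x1
            x1 = hi - gr * (hi - lo)
    return (lo + hi) / 2

# 80 heads in 100 tosses, uniform background model:
print(penalized_fit((80, 20), (0.5, 0.5), a=0.01))   # close to 0.80: data wins
print(penalized_fit((80, 20), (0.5, 0.5), a=1e4))    # close to 0.50: background wins
```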


    As an alternative to setting the parameter a by intuition or ad hoc mathematical techniques, Gull has given a method of using Bayesian statistics to estimate the most likely value of a for particular p and m. Often, as in radioastronomical interferometry, this tactic or simpler versions of it appear to work well. But, as Gull has demonstrated, vision processing presents greater difficulties. He tried to use the maximum entropy principle to turn blurry pictures of a woman into accurate photograph-like images, but he found that the Bayesian derivation of a yielded fairly unimpressive results.

    He devised an ingenious solution. He used the maximum entropy principle to take the results of a maximum entropy computation using the value of a arrived at by the Bayesian method -- and get a new background distribution m'=(m1',...,mn'). Then he applied the maximum entropy principle using this new background knowledge, m'. This yielded beautiful results -- and if it hadn't, he could have applied the same method again. This is yet another example of the power of hierarchical structures to solve perceptual problems.

    Of course, one could do this over and over again -- but one has to stop somewhere. At some level, one simply has to set the value of a based on intuition, based on what value a usually has for the type of problem one is considering. This is plainly a matter of induction.

    In general, when designing programs or machines to execute the maximum entropy principle, we can set a by trial and error or common sense. But this, of course, means that we are using deduction, analogy and induction to set a. I suggest that similar processes are used when the mind determines a internally, unconsciously. This hypothesis has some interesting consequences, as we shall see.

    As cautioned above, if the maximum entropy method were proved completely incorrect, it would have no effect on the overall model of mind presented here -- so long as it were replaced by a reasonably simple formula, or collection of formulas, for helping to compute the priors in Bayes' formula; and so long as this formula or collection of formulas was reasonably amenable to inductive adjustment. However, I do not foresee the maximum entropy principle being "disproved" in any significant sense. There may indeed be psychological systems which have nothing to do with it. But the general idea of filling in the gaps in incomplete data with the "most likely" values seems so obvious as to be inevitable. And the idea of using the maximum entropy values -- the values which "assume the least", the most unbiased values -- seems almost as natural. Furthermore, not only is it conceptually attractive and intuitively attractive -- it has been shown repeatedly to work, under various theoretical assumptions and in various practical situations.

9.3 The Logic of Perception

    Now, let us return to the perceptual hierarchy as briefly discussed in Section 9.0. I propose that this hierarchy is composed of a network of processors, each one of which operates primarily as follows:

1. Take in a set of entities consisting of stimuli, patterns in stimuli, patterns in patterns in stimuli, etc.

2. Use Bayes' rule and the maximum entropy principle (or some other tool for determining priors) -- perhaps aided by induction, deduction and analogy -- to obtain a small set of most likely "interpretations" of what its input represents.

3. Seek to recognize the most meaningfully complex approximate patterns in these interpretations. Where %x% is the minimum complexity assigned to x by any processor that inputs to processor P, processor P should use %x% as its measure of the complexity of x.

4. Output these newly recognized patterns, along with perhaps portions of its input.

    Step 3 is basically a form of Occam's razor: it states that the mind looks for the simplest interpretation of the data presented to it.
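Steps 1 and 2 of such a processor can be sketched as a toy Bayesian update; all names, shapes and numbers here are hypothetical, standing in for whatever entities and probabilities a real processor would handle:

```python
def interpret(likelihoods, priors, k=2):
    """Step 2, sketched: given P(X|Yi) and prior P(Yi) for each candidate
    interpretation Yi, apply Bayes' rule and keep the k most likely."""
    posterior = {y: likelihoods[y] * priors[y] for y in priors}
    total = sum(posterior.values())
    posterior = {y: v / total for y, v in posterior.items()}
    return sorted(posterior.items(), key=lambda kv: -kv[1])[:k]

# Hypothetical shapes as interpretations of a noisy input X:
priors = {"circle": 0.5, "ellipse": 0.3, "square": 0.2}
likelihoods = {"circle": 0.1, "ellipse": 0.6, "square": 0.05}
print(interpret(likelihoods, priors))
# ellipse wins: its high likelihood outweighs circle's higher prior
```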

    On lower levels, this pattern recognition will have to be severely limited. Processors will have to be restricted to recognizing certain types of patterns -- e.g. lines, or simple shapes -- rather than executing the optimizations involved in pattern recognition over a general space of functions. This is similar to the situation considered in Chapter 3, when we discussed "substitution machines." A substitution machine was a very special kind of pattern, but it turned out that much more general types of patterns could be formed from hierarchies of substitution machines. Here we have a hierarchy of restricted pattern recognizers, which as a whole is not nearly so restricted, because it deals routinely with patterns in patterns, patterns in patterns in patterns, and so on.

    And what about the "aid" provided to Bayes' rule in Step 2? This also will have to be severely restricted on the lower levels, where speed is of the essence and access to long-term memory is limited. For instance, calculating the P(X%Yi) is a matter of deduction; and on lower levels this deduction may be carried out by rigid "hardware" routines, by fixed programs specific to certain types of X and Yi. But as the level becomes higher, so does the chance that a processor will refer to more general, more intricate deductive systems to compute its P(X%Yi). And, of course, one cannot use a general, flexible deductive system without recourse to sophisticated analogical reasoning and therefore to a structurally associative memory.

    Also, as the level becomes higher and higher, the P(Yi) are more and more likely to be calculated by sophisticated inductive processing rather than, say, simple entropy maximization. Technically speaking, induction may be used to provide a good background knowledge vector {m1,...,mn} and meaningful constraints C. On the lower levels, the set of possible interpretations Yi is provided by the hardware. But on the higher levels, the Yi may be entirely determined by induction: recall that the output of the general induction algorithm is a set of possible worlds. Once the level is high enough, no or essentially no entropy maximization may be necessary; the prior may be supplied entirely or almost entirely by induction. The regularization parameter a may be set very low. On the other hand, intermediate levels may get some of the Yi from induction and some from hardware, and entropy maximization may be invoked to a significant degree.

    Also, the regularization parameter a may be adapted by induction to various degrees on various levels. On very low levels, it is probably fixed. Around the level of consciousness, it is probably very small, as already mentioned. But on the intermediate levels, it may be adaptively modified, perhaps according to some specialized implementation of the adaptation-of-parameters scheme to be given in the following chapter.

    In sum: on the lower levels of the perceptual hierarchy, experience does not affect processing. The structurally associative memory is not invoked, and neither are any general pattern recognition algorithms. Lower level processors apply Bayes' rule, using hardware to set up the {Yi} and to deduce the P(X%Yi), and maximum entropy hardware to get the P(Yi). One result of this isolation is that prior knowledge has no effect on low-level pattern recognition -- e.g. familiar shapes are not necessarily more easily perceived (Kohler, 1947).

    On higher levels, however, the structurally associative memory must be invoked, to aid with the analogical reasoning required for estimating the P(X%Yi) either by induction or according to a flexible deductive system. Also, as will be discussed below, induction is required to set up the {Yi} -- which are not, as on lower levels, "wired in." And sophisticated parameter adaptation is required to intelligently set the regularization parameter a and, possibly, other parameters of the process of prior estimation. The structure of the perceptual hierarchy is still important, but it is interconnected with the structure of the central processing systems related to induction, deduction, analogy and parameter adaptation.


    So far I have only discussed the progression of information upward through the perceptual hierarchy. Upward progression builds more complex, comprehensive forms out of simpler, more specific ones. But downward progression is equally essential.

    It was the genius of the Gestalt psychologists to recognize that the understanding of the whole guides the perception of the part. This may be reconciled with the present framework by assuming that, in many cases, a processor on level n and a processor on level n-1 will refer to respectively more and less "local" aspects of the same phenomenon. For instance, a processor on level -8 might refer to lines, and a processor on level -7 to shapes composed of lines. In this framework, the Gestalt insight means: the results obtained by processors on level n of the perceptual hierarchy are used to tell processors on level n-1 what to look for. Specifically, I suggest that they are used to give the processors on level n-1 some idea of what the set {Yi} should be, and what the P(X%Yi) are.

    In Gestalt Psychology, Wolfgang Kohler (1947, p.99) gave several classic examples of this kind of top-down information transmission. For instance, if someone is shown Figure 8a and asked to draw it from memory, they will correctly draw the point P on the center of the segment on which it lies. But if someone is shown Figure 8b and asked to draw it from memory, they will place P to the right of the center. And, on the other hand, if they are shown Figure 8c and asked to draw it from memory, they will usually place P to the left of the center. Hundreds of experiments point to the conclusion that this sort of thing is not a quirk of memory but rather a property of perception -- we actually see dots in different places based on their surroundings.

    This is only the most rudimentary example. It has been conclusively demonstrated (Rock, 1983) that a stationary object appears to move if it is surrounded by a moving object, that a vertical line appears tilted if it is seen within a room or rectangle that is tilted, that the perceived speed of a moving object is a function of the size of the aperture through which it is perceived, et cetera. And Figure 9 (Rock, 1983) is an example of a picture which at first looks like a meaningless arrangement of fragments, but actually looks entirely different once recognized.

    All these examples illustrate that the operation of a processor at level n can be affected by the conclusions of a processor at level n+1. Each one of these particular cases is undoubtedly highly complex. For the purpose of illustration, however, let us take a drastically oversimplified example. Say the processor which perceives points and lines is on level -8. Then perhaps the processor which puts points and lines together into shapes is on level -7. According to this setup, Kohler's simple example illustrates that the operation of level -8 is affected by the results of a level -7 computation. Roughly speaking, the processor on level -7 takes in a bunch of points and lines and guesses the "most likely shape" formed out of them. It then gives the processor on level -8 a certain probability distribution {mi}, which indicates in this case that the point is more likely to be to the right of the center of the line.


    Above I said that on lower levels, a great deal of perceptual processing is executed by "hardware", by biological programs that may have little to do with deduction, induction, analogy or probability theory. But it is very difficult to estimate exactly how much this "great deal" is -- and this difficulty has nothing to do with the specifics of the present model. There is little consensus in biology or psychology as to how much of the processing involved in perception is "thoughtlike" as opposed to "hardware-like". In fact, there is little consensus as to what "thoughtlike" means.

    In the past, there have essentially been two camps. The Gestalt psychologists believed that, in the words of Irwin Rock,

        the determinant of a perception is not the stimulus but spontaneous interactions between the representations of several stimuli or interactions between the stimulus and more central representations. Such interaction could take any form consistent with the known principles of neurophysiology. The essence of this theory is that... complex interactive events that ensue following stimulation... can allow for known effects such as those of context, constancy, contrast, perceptual changes without stimulus changes, illusions, and the like. (p.32)

Rock opposes this "spontaneous interaction theory" to the "cognitive theory", in which "reference is made... to thoughtlike processes such as description, rule following, inference or problem solving."

    The theory given here is plainly cognitive in nature. However, it leaves a great deal of room for "spontaneous interaction." First of all, as mentioned above, the theory assigns a significant role to "hardware", which in the case of the human brain is likely to be independently operating self-organizing neural circuitry. How great or how small the role of this independently operating circuitry is, we cannot yet say.

    In any event, the entire debate may be a matter of semantics. I have made no restrictions on the nature of the physical systems underlying minds, except that they must behave similarly to the way they would if they followed the algorithms given. It is certainly not unlikely that the brain "spontaneously" self-organizes in such a way as to execute processes of cognition. As observed in Chapter 1, some structure must be involved; an unstructured neural network will virtually never demonstrate intelligent behavior. The self-organizing neurodynamics in which the Gestaltists placed so much faith may indeed play a dominant role.

    So, this is a cognitive theory, but it does not rule out the existence of noncognitive processes underlying or supplementing cognition.

    For instance, consider Figure 9 above. The framework I have given provides a rather cognitive analysis of this phenomenon. Suppose that, say, the part of the vision processing module that resides on level -3 contains a "shape interrelation" or "form organization" processor. Assume that this processor works by recognizing patterns in the input provided to it by level -4 processors, such as "shape detection processors". Then, once it has recognized the pattern "horse and rider", which greatly simplifies the picture, things change significantly. First of all, if the memory stores the picture it will most likely store it as a specific instantiation of the pattern "horse and rider", very differently from the way it would store it if no such prominent pattern had been recognized. And, more to the point, the level -3 processor will adjust the {Yi} used by a level -4 shape recognition processor in a very specific way: it will tell it to look for shapes that look like legs, tails, feet, et cetera. And, perhaps more significantly, it will adjust the P(Yi), too -- it will tell the shape recognition processor that a shape is more likely if it looks like some part of a horse.

    It is possible that these changes will propagate further down. For instance, suppose that the level -4 shape recognition processor receives output from a level -5 curve recognition processor. What if the new shapes on which the shape recognition processor has been instructed to concentrate all have some common factor, say a gentle sloping organic nature rather than an angular nature or a spiraling nature? Then the shape recognition processor might well instruct the curve recognition processor to look for the appropriate curves -- i.e. it might supply the curve recognition processor with new Yi, or with new P(Yi).
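The prior-adjustment step described in these paragraphs might be sketched as follows; the shapes, multipliers and function name are purely illustrative, not part of the model as stated:

```python
def reweight_priors(priors, context_boost):
    """Toy top-down adjustment: a higher-level processor multiplies a lower
    level's prior P(Yi) by how well each Yi fits its current interpretation
    (e.g. "horse and rider"), then renormalizes."""
    adjusted = {y: p * context_boost.get(y, 1.0) for y, p in priors.items()}
    total = sum(adjusted.values())
    return {y: v / total for y, v in adjusted.items()}

shape_priors = {"leg": 0.1, "tail": 0.1, "blob": 0.8}
# Level -3 has recognized "horse and rider", so horse parts become likelier:
horse_context = {"leg": 6.0, "tail": 6.0, "blob": 0.5}
print(reweight_priors(shape_priors, horse_context))
# "leg" and "tail" now dominate the prior that the shape processor uses
```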

    Again, this analysis is highly oversimplified, but it indicates the general direction that a detailed analysis might take. It is cognitive in that it implies that reasoning is indeed occurring below the conscious level. But how does the shape recognition, or the form organization recognition, or the curve recognition take place? I have little doubt that these pattern recognition problems are solved by specific self-organizing networks of neural clusters. In this sense, "spontaneous interaction" undoubtedly plays an essential role. Neural networks seem to be very good at organizing themselves so as to obtain the solutions to pattern-recognition problems. But I think that certain very general cognitive structures are also necessary, in order to systematically direct these solutions toward the goal of intelligent behavior.


    I have suggested that information can progress downward through the perceptual hierarchy, but I have not yet said exactly how this information transmission is organized. The most natural strategy for this purpose is multilevel optimization.

    This is especially plain when, as in vision processing, lower and lower levels refer to more and more local aspects of the same phenomenon. In cases such as this, operations on lower levels roughly correspond to searches in smaller subsets of the space of all patterns in the phenomenon in question, so that the regulation of the perceptual hierarchy appears very much like the regulation of a search, as discussed in Chapter 2.

    In general, what the multilevel philosophy dictates is that, after a processor on level n makes its best guess as to the k most likely probability distributions pi=(p1i,...,pni), it sends down messages to L>k processors on level n-1. To each of these processors it "assigns" one of the pi -- the most likely distribution gets more processors than the second most likely, and so on. The proportion of processors to be assigned each distribution could be set approximately equal to S(pi,m)/[S(p1,m)+...+S(pk,m)], in the notation introduced above.
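This proportional allocation can be sketched as follows, assuming the scores S(pi,m) have been scaled to be positive; the function name and the largest-remainder rounding rule are illustrative choices, not part of the model:

```python
def assign_processors(scores, L):
    """Allocate L lower-level processors among k candidate distributions,
    in proportion to their (positive) scores S(pi,m)."""
    total = sum(scores)
    raw = [L * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Hand out any processors lost to rounding, largest remainder first:
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                       reverse=True)
    for i in leftovers[:L - sum(alloc)]:
        alloc[i] += 1
    return alloc

print(assign_processors([0.5, 0.3, 0.2], L=10))   # [5, 3, 2]
```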

    To the processors it has assigned pi, it sends a new background distribution mi, based on the assumption that pi is actually the case. Also, it may send the processor new possibilities Yi, based on this assumption. These are not trivial tasks: determining a new mi and determining new possibilities Yi both require deductive reasoning, and hence indirectly analogy and associative memory.

    The parameter k need not be constant from level to level, but in the case of visual perception, for instance, it seems plausible that it is approximately constant. Obviously, this requires that the number of processors increases exponentially as the level decreases. But if the individual processors on lower levels deal with progressively simpler tasks, this is not unreasonable.

    The perceptual hierarchy may thus be understood to operate by interconnecting multilevel optimization, Bayes' rule and the maximum entropy principle -- and on the higher levels integrating induction and analogy-driven deduction as well.


     Finally, let us consider the Gestaltists' basic law of visual perception: Any stimulus pattern tends to be seen in such a way that the resulting structure is as simple as the given conditions permit. This rule was formulated to explain the results of numerous well-known experiments involving, for instance, drawings with multiple interpretations. As mentioned above, it has been shown that the interpretation which one places on a drawing can affect the way one actually sees the drawing.

    The key shortcoming of this Gestaltist principle is, in my opinion, the vagueness of the word "simplicity." Some Gestaltists have implied that there is a biologically innate measure of simplicity. However, experiments indicate that perception of visual stimuli is definitely influenced by culture (Segal et al, 1966). This provides a particularly strong argument for the need for a precise definition of simplicity: it shows that simplicity is not a universal intuition, but is to some extent learned.

    Shortly after the discovery of information theory, Hochberg and McAlister (1953) attempted to use it to make Gestalt theory precise. They proposed that "other things being equal, the probabilities of occurrence of alternative perceptual responses to a given stimulus (i.e. their 'goodness') are inversely proportional to the amount of information required to define such alternatives differentially; i.e., the less the amount of information needed to define a given organization as compared to the other alternatives, the more likely that figure will be so perceived."

    They defined "goodness" as "the response frequency or relative span of time ... devoted to each of the possible perceptual responses which may be elicited by the same stimulus." And they defined information as "the number of different items we must be given, in order to specify or reproduce a given pattern or 'figure', along some one or more dimensions which may be abstracted from that pattern, such as the number of different angles, number of different line segments of unequal length, etc." Wisely, they did not blindly equate intuitive "information" with information in the communication-theoretic sense. However, their definition is not really much more precise than the standard Gestalt doctrine.

    What if we replace "amount of information needed to define" in Hochberg's hypothesis with "complexity of defining relative to the patterns already in the mind," in the sense defined in Chapter 4? This seems to me to capture what Hochberg and McAlister had in mind. The "number of different items" in a set is a crude estimate of the effort it takes the mind to deal with the set, which is (according to the present model of mind) closely related to the algorithmic complexity of the set relative to the contents of the mind. To get a better estimate one must consider not only the raw quantity of items but also the possibility that a number of items which are all minor variations on one basic form might be "simpler" to the mind than a smaller number of more various items. And this line of thought leads directly to the analysis of pattern and complexity proposed in Chapter 4.

    Next, what if we associate these "alternative perceptual responses" with complementary patterns in the set of stimuli presented, in the sense given in Chapter 4? Then we have a pattern-theoretic formulation of the Gestalt theory of perception: Among a number of complementary patterns in a given stimulus, a perceiving mind will adopt the one with the least complexity relative to its knowledge base. Note that this refers not only to visual stimuli, but to perception in general. It is easy to see that this principle, obtained as a modification of standard Gestalt theory, is a consequence of the model of perception given above. Given a set of stimuli, Bayes' rule picks the most likely underlying form. But it needs some sort of prior assumption, and on the higher levels of the perceptual hierarchy this is supplied by a network of processes involving analogy, and therefore long-term memory. Thus, to a certain extent, what we deem most likely is based on what we know.

    To a large extent, therefore, we see what we know. This does not imply that the patterns we perceive aren't "there" -- but only that, among the immense variety of patterns in the universe, we automatically tend to see those which are more closely related to what we've seen or thought before.
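As a crude computational stand-in for "complexity relative to a knowledge base," one can measure how many extra compressed bytes a candidate interpretation costs once the knowledge base has already been compressed -- a trick reminiscent of the normalized compression distance, and not something proposed in the text itself:

```python
import zlib

def relative_complexity(candidate, knowledge):
    """Crude proxy for 'complexity of candidate relative to knowledge':
    the extra compressed bytes the candidate needs once the knowledge
    base has been compressed (cf. normalized compression distance)."""
    base = len(zlib.compress(knowledge))
    return len(zlib.compress(knowledge + candidate)) - base

knowledge = b"horse rider saddle gallop horse rider" * 20
candidates = [b"horse and rider", b"meaningless arrangement of fragments"]
best = min(candidates, key=lambda c: relative_complexity(c, knowledge))
print(best)   # the familiar pattern compresses better against what is known
```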

    This has an interesting consequence for our analysis of induction. Above, we postulated that the universe possesses a "tendency to take habits," arguing that otherwise induction could not possibly work. But induction is only the process of recognizing patterns in what one perceives, and assuming they will continue. Therefore, if we assume that

1) as the Gestaltist rule suggests, when given a "choice" we tend to perceive what is most closely related to our knowledge base;

2) the set of "external" patterns simple enough to be contained in our minds are presented in a fairly "unbiased" distribution (e.g. a distribution fairly close to uniform, or fairly close to the distribution in which probability of occurrence is proportional to intensity, etc.);

then it follows that the universe as we perceive it must possess the tendency to take habits. Of course, this line of thought is circular, because our argument for the general Gestalt rule involved the nature of our model of mind, and our model of mind is based on the usefulness of pattern-recognitive induction, which is conditional on the tendency to take habits. But all this does serve to indicate that perception is not merely a technical issue; it is intricately bound up with the nature of mind, intelligence, and the external world.