Structure of Intelligence -- Copyright Springer-Verlag © 1993 |
Mental process involves a large variety of computational problems. It is not entirely implausible that the mind deals with each of them in a unique, context-specific way. But, unlike Minsky and many cognitive scientists, I do not believe this to be the case. Certainly, the mind contains a huge number of special-purpose procedures. But nearly all the computational problems associated with mental process can be formulated as optimization problems. And I propose that, by and large, there is one general methodology according to which these optimization problems are solved.
Optimization is simply the process of finding that entity which a certain criterion judges to be "best". Mathematically, a "criterion" is simply a function which maps a set of entities into a set of "values" which has the property that it is possible to say when one value is greater than another. So the word "optimization" encompasses a very wide range of intellectual and practical problems.
For instance, virtually all the laws of physics have been expressed as optimization problems, often with dramatic consequences. Economics, politics, and law all revolve around finding the "best" solution to various problems. Cognitive science and many forms of therapeutic psychology depend on finding the model of a person's internal state which best explains their behavior. Everyday social activity is based on maximizing the happiness and productivity of oneself and others. Hearing, seeing, walking, and virtually all other aspects of sensation and motor control may be viewed as optimization problems. The Traveling Salesman problem is an optimization problem -- it involves finding the shortest path through n cities. And, finally, the methodological principle known as Occam's razor suggests that the best explanation of a phenomenon is the simplest one that fits all the facts. In this sense, all inquiry may be an optimization problem, the criterion being simplicity.
Some of these optimization problems have been formulated mathematically --e.g. in physics and economics. For others, such as those of politics and psychology, no useful formalization has yet been found. Nonmathematical optimization problems are usually solved by intuition, or by the application of extremely simple, rough traditional methods. And, despite a tremendous body of sophisticated theory, mathematical optimization problems are often solved in a similar manner.
Although there are dozens and dozens of mathematical optimization techniques, virtually none of these are applicable beyond a very narrow range of problems. Most of them -- steepest descent, conjugate gradient, dynamic programming, linear programming, etc. etc. (Dixon and Szego, 1978; Torn et al, 1990) -- rely on special properties of particular types of problems. It seems that most optimization problems are, like the Traveling Salesman problem, very hard to solve exactly. The best one can hope for is a PAC solutions. And, in the "classical" literature on mathematical optimization, there are essentially only two reasonably general approaches to finding PAC solutions: the Monte Carlo method, and the Multistart method.
After discussing these methods, and their shortcomings, I will introduce the multilevel philosophy of optimization, which incorporates both the Monte Carlo and the Multistart methods in a rigid yet generally applicable framework which applies to virtually any optimization problem. I will propose that this philosophy of optimization is essential to mentality, not least because of its essential role in the perceptual and motor hierarchies, to be discussed below.
The Monte Carlo philosophy says: If you want to find out what's best, try out a lot of different things at random and see which one of these is best. If you try enough different things, the best you find will be almost certainly be a decent guess at the best overall. This is a common approach to both mathematical and intuitive optimization problems. Its advantages are simplicity and universal applicability. Its disadvantage is, it doesn't work very well. It is very slow. This can be proved mathematically under very broad conditions, and it is also apparent from practical experience. In general, proceeding by selecting things at random, one has to try an awful lot of things before one finds something good.
In contrast to the Monte Carlo philosophy, the Multistart philosophy depends on local search. It begins with a random guess x0, and then looks at all the possibilities which are very close to x0. The best from among these possibilities is called x1. Then it looks at all the possibilities which are very close to x1, selects the best, and calls it x2. It continues in this manner -- generating x3, x4, and so on -- until it arrives at a guess xn which seems to be better than anything else very close to it. This xn is called a local optimum -- it is not necessarily the best solution to the optimization problem, but it is better than anything in its immediate vicinity.
Local search proceeds by looking for a new answer in the immediate locality surrounding the best answer one has found so far. The goal of local search is to find a local optimum. But, as Figure 1 illustrates, a local optimum is not always a good answer. It could be that, although there is nothing better than xn in the immediate vicinity of xn, there is something much better than xn somewhere else.
In mathematical optimization, it is usually easy to specify what "very close" means. In other domains things may be blurrier. But that doesn't mean the same ideas aren't applicable. For instance, suppose a politician is grappling with the problem of reducing carbon monoxide emissions to a safe level. Maybe the best idea she's found so far is "Pass a law requiring that all cars made after 1995 emit so little carbon monoxide that the total level of emissions is safe". Then two ideas very near this one are: "Pass a law giving tax breaks to corporations which make cars emitting safe levels of carbon monoxide", or "Pass a law requiring that all cars made after 1992 emit so little carbon monoxide that the total level of emissions is safe." And two ideas which are not very near x0 are: "Tax automakers more and give the money to public transportation" and "Give big tax breaks to cities which outlaw driving in their downtown areas." If she decides that none of the ideas near "Pass a law requiring that all cars made after 1995 emit so little carbon monoxide that the total level of emissions is safe" is as attractive as it is, then this idea is a local optimum (from her point of view). Even if she felt that taxing automakers more and giving the money to public transportation were a better solution, this would have no effect on the fact that giving tax breaks to corporations that make safe cars was a local optimum. A local optimum is only better than those things which are very similar to it.
The Multistart philosophy says: Do a bunch of local searches, from a lot of different starting points, and take the best answer you get as your guess at the overall best.
Sometimes only one starting point is needed. For many of the optimization problems that arise in physics, one can pick any starting point whatsoever and do a local search from that point, and one is guaranteed to arrive at the absolute best answer. Mathematically, a problem of this sort is called convex. Unfortunately, most of the optimization problems that occur in politics, sensation, motor control, biology, economics and many other fields are nonconvex. When dealing with a convex optimization problem, the only thing you have to worry about is how well you go about picking the best from among those entities close to your best guess so far. Each year dozens of papers are written on this topic. But convexity is a very special property. In general, local search will not be effective unless it is applied according to the Multistart philosophy.
The Multistart philosophy works well for problems that don't have too many local optima. For instance, it would take a very long time to solve the problem in Figure 1 according to the Multistart philosophy. In this case the Monte Carlo approach would be preferable; the local searches are essentially a waste of time.
In recent years a new approach to global optimization has become popular, one which combines aspects of Monte Carlo search and local search. This method, called simulated annealing, is inspired by the behavior of physical systems. Statistical mechanics indicates that the state of many systems will tend to fluctuate in a random but directed manner.
To understand this, we must introduce the "state space" of a system, a mathematical set containing all possible states of the system. In state space, two states A and B are understood to be neighbors if there is a "simple, immediate" transition between the two. Let E(A) denote the energy of the state A.
In the particular case that the system involved is computational in nature, each of its possible states may be described by a finite sequence of zeros and ones. Then two states are neighbors if their corresponding sequences differ in exactly one place. This situation arises in "spin glass theory", a rapidly growing field which connects optimization theory and physics.
In the case of spin glasses, physics dictates that, if A and B are neighboring states, the probability of the state of the system changing from A to B is determined by 1) the quantity E(A) - E(B), and 2) the temperature, T, of the system. The schematic formula for the probability of going from state A to state B is
P(B%A) = 1/[1+exp([E(B)-E(A)]/kT)],
where k is Boltzmann's constant (Mezard, 1987).
Temperature corresponds to randomness. If T=0, the system has probability one of going to a state of lower energy, and probability zero of going to a state of higher energy. So when T=0, the system will automatically settle into a local minimum of the energy function. The higher T is, the more likely it is that the law of energy minimization will be violated; that there will be a transition to a state of higher energy. The analogy with optimization is obvious. At T=0, we have local search, and at T=infinity we have P(B%A)=1/2, so we have a random search: from any state, the chance of going to either of the two neighbors is equal. At T=infinity, the system will continue to fluctuate at random forever, never expressing a preference for any particular state or set of states. This process is called thermal annealing.
In optimization problems, one is not concerned with energy but rather with some general function f. Let us assume that this function assigns a number to each finite string of zeros and ones. Then, in order to minimize f, one may mimic the process of thermal annealing. Starting from a random initial sequence, one may either remain there or move to one of the two neighbors; and the probability of going to a given neighbor may be determined by a formula like that involved in thermal annealing.
In practice, the spin-glass formula given above is modified slightly. Starting from a random initial guess x, one repeats the following process:
1. Randomly modify the current guess x to obtain a new guess y,
2. If f(y)<f(x) then let x=y and return to Step 1,
3. If f(y)>f(x) then let x=y with probability exp([f(y)-f(x)]/T), and return to Step 1
The tricky part is the way the "temperature" T is varied as this process is repeated. One starts with a high temperature, and then gradually decreases it. The idea is that in the beginning one is locating the general region of the global minimum, so one does not want to be stuck in shallow local minima; but toward the end one is presumably already near the local minimum, so one simply wants to find it.
Philosophically, this is somewhat similar to the multilevel approach to be described in the following section. Both involve searches on various "levels" -- but here they are levels of risk, whereas with the multilevel method they are levels of "magnification". Neither approach is perfect; both tend to be too slow in certain cases. Probably the future will yield even more effective algorithms. But it is not implausible that both simulated annealing and multilevel optimization play significant roles in the function of the mind.
The basic principles of multilevel optimization were enounced in my Ph.D. thesis (Goertzel, 1989). There I gave experimental and theoretical results regarding the performance of a number of specific algorithms operating according to the multilevel philosophy. Shortly after completing this research, however, I was surprised to find that the basic idea of the multilevel philosophy had been proposed by the sociologist Etzione (1968), in his Adaptive Society, as a method for optimizing the social structure. And a few months later I became aware of the strong similarity between multilevel optimization and the "discrete multigrid" method of Achi Brandt (1984) (who introduced the term "multilevel" into numerical analysis). Brandt's ideas were introduced in the context of spin-glass problems like those described above. These parallels indicate how extremely simple and natural the idea is.
The first key concept is that the search for an optimum is to be conducted on a finite number of "levels", each one determined by a certain characteristic distance. If the levels are denoted 1,2,...,L, the corresponding distances will be denoted h1,...,hL, and we shall adopt the convention that hi<hi+1. The multilevel philosophy assumes the existence of some method of "search" which finds an optimum value "about a point x on level i." There are many ways of executing such search. For instance, one may execute Monte Carlo search over the sphere of radius hi about x. Or one may execute Monte Carlo search over the surface of the sphere of radius hi about x. Such choices constitute specific multilevel methods operating within the framework of the multilevel philosophy. The multilevel philosophy has to do not with the nature of the searches but with the relation between searches executed on various levels.
A method operating within the multilevel philosophy may or may not incorporate a "zero level," a local optimization method. First let us consider thecase L=1, with a zero level. In this case the concept is as follows. Given an initial guess x0, first execute the local optimization method is executed at this point. When the local optimization routine stops (having found a local extremum), stops proceeding fast enough (according to some preassigned threshold), or finishes a preassigned number of steps at some point w0, then search on level 1 is executed about w0, yielding a new point z0. Local optimization is then executed about z0, until it is halted by one of the three criteria, yielding a new point y0. Next, f(y0) is compared with f(x0). If f(y0) is better than f(x0), then the entire procedure is begun from y0; i.e. x0 is set equal to y0 and the algorithm is restarted. But if f(x0) is better, the program is terminated; x0 is the "answer." The idea is to avoid getting stuck in a shallow local optimum, or getting stuck crawling up an extremely gentle slope, by "jumping" away from the optimum by the nonlocal search on level 1.
If no zero level were implemented, and the local optimization routine in the above description were replaced with the identity mapping, one would still have a viable optimization method, which we shall call the one-level method. If h1 is very small, then the one-level method is a general-purpose local optimization method. In fact, in the case of a Boolean function one may take h1=1 (Hamming distance) and take the level-1 search to be an exact search on the surface of the sphere of radius 1 (there is no interior but the center). One then has the standard discrete steepest-descent method. And, in the continuous case, if one takes the level-1 search method to be a Monte Carlo search on the surface of the sphere of radius h1, then one has a simple, unoriginal approach to steepest-descent optimization which is probably as good as anything else for local optimization of functions with extremely "rugged" graphs.
Next, consider the case L=i, i>1. Here, given an initial guess x0, we first execute the algorithm for L=i-1 about this point. When the L=i-1 routine stops (having found an "answer"), stops proceeding fast enough (according to some preassigned threshold), or finishes a preassigned number of steps at some point w0, then search on level i is executed about w0, yielding a new point z0. The L=i-1 routine is then executed about z0, until it is halted by one of the three criteria, yielding a new point y0. Next, f(y0) is compared with f(x0). If f(y0) is better than f(x0), then the entire L=i procedure is begun from y0; i.e. x0 is set equal to y0 and the algorithm is restarted. But if f(x0) is better, the program is terminated; x0 is the "answer."
For L=2, this procedure, if it has a zero level, first seeks a local optimum, then seeks to jump out of it by searching on level 1, and then seeks to jump out of the result of this jumping-out by searching on level 2. L=2 without a zero level is the same as L=1 with the one-level method as a zero-level.
Similarly, the L=i procedure seeks to jump out of the result of jumping out of the result of jumping out of... the result of jumping out of the result of the lowest level.
The following instance may give an heuristic conception of the crux of the multilevel philosophy. For simplicity, we assume no zero level, and we assume the first of the three criteria for stopping search: search on level i is stoppedonly when an "answer" on level i-1 is found. The same example may just as easily be applied to the other cases.
A SIMPLE EXAMPLE
Consider a function which maps a numerical value to each house in the world, and suppose a person is trying to find the house with the highest number. If the distribution of numbers is totally random, it doesn't matter what order he checks the various houses in. But what if there is some intrinsic, perhaps subtle, structure to it? What does the multilevel philosophy tell him to do?
Starting from a randomly selected house, he should first check all houses on that block and see which one has the highest number. Then he should check the neighboring block in the direction of this optimal house. If no house on that block is better, he should call the best house he's found so far his block-level optimum. But if some house on that block is better, then he should proceed to check the neighboring block in the direction of this new optimal house. And so on, until he finds a block-level optimum.
Once he finds a block-level optimum, he should then take a rough survey of the town in which the block sits, and make a guess as to which areas will be best (say by the Monte Carlo method). He should pick a block in one of the areas judged best and execute block-level search, as described above, from this block, and so on until he reaches a new block-level optimum. Then he should compare the two block-level optima and call the best of them his tentative town-level optimum.
Then he should proceed to the town in the direction of this optimum and there execute town-level optimization as described above. He should compare his two tentative town-level optima and, if the old one is better, call it his town-level optimum. But if the new one is better, then he should proceed to the neighboring town in its direction and locate a new tentative town-level optimum. And so on, until he obtains a town-level optimum.
Then he should make a rough survey of the county in which this town sits, and make a guess as to which areas will be best (say by the Monte Carlo method). He should pick a town in one of the areas judged best and execute town-level search, as described above, from this town, and so on until he reaches a new town-level optimum. Then he should compare the two town-level optima and call the best of them his tentative county-level optimum.
Then he should proceed to the county in the direction of this optimum and there execute county-level optimization as described above. He should compare his two tentative county-level optima and, if the old one is better, call it his county-level optimum. But if the new one is better, then he should proceed to the neighboring county in its direction and locate a new tentative county-level optimum. And so on, until he obtains a county-level optimum. Applying the same logic, he could obtain state-wide, nation-wide and global optima...