Structure of Intelligence -- Copyright Springer-Verlag © 1993 |
After reading a certain amount of philosophy, it is easy to become confused as to exactly what the problem of induction is. For our purposes, however, the problem of induction is very simple and straightforward. Why is it justified for a system to assume that a pattern it has observed in its environment or itself will continue to be a pattern in its environment or itself in the future?
Leibniz (1704) was the first major philosopher to comprehend the seriousness of the problem of induction. He did not presume to have solved it, but rather claimed to have a research programme which would eventually lead to its solution. His "Universal Characteristic" was to be a systematization of all human knowledge in mathematical form, in such a way that the various possible answers to any given question could be assigned precise probabilities based on the background knowledge available.
But although Leibniz made important advances, Hume (1739) was the first to give the problem of induction its modern form. He gave a sequence of brilliant arguments to the effect that human knowledge is, in fact, induction. For instance, he spent a great deal of effort demonstrating that the "fact" that one thing causes another, say that fire causes smoke, is "known" to us only in the sense that we have seen it do so consistently in the past, and therefore assume it will continue to do so in the future. At the time this was an largely novel insight into the nature of human knowledge. The crux of his analysis of induction was his argument that it is fundamentally unjustifiable; however, without the conception of knowledge as inductive, his clever insight in this regard would have been meaningless.
After all -- to summarize and simplify Hume's argument -- how do we knowinduction works? Either we know it by "divine revelation" or "deduction" or some other principle which has nothing to do with the specific properties of the real world, or we know it because of some reasoning as to specific properties of the real world. If the latter, then how do we know these specific properties will continue into the future? This assumption is itself an act of inductive reasoning! So the two alternatives are: justify induction by some a priori principle unconnected with the real world, or justify induction by induction. And the latter is inadmissible, for it is circular reasoning. So induction is just another form of dogmatism.
I take it for granted that Hume was right -- induction is unjustifiable. But nonetheless, we execute inductions every day. As I see it, the practical problem of induction is the problem of coming up with a simple, general, useful model of the universe according to which induction is indeed possible. This is quite distinct from the philosophical problem of induction. In solving the practical problem, we are permitted to justify induction in terms of some principle divorced from observable reality. The objective is to find the best way of doing so.
A SIMPLE EXAMPLE
Consider the sequence 1010101010.... Given no background knowledge except that there will indeed be a next term, and that it will be either 0 or 1, simple intuitive inductive reasoning indicates that the next term should be a 1. One reasons: "Every time a 0 has occurred, it has been followed by a 1; hence with probability 1 this 0 is followed by a 1."
Similarly, given the sequence 010101001010101010... and the same minimal background information, one could reason: "Eight out of the nine times a 0 has occurred, it has been followed by a 1; hence with probability 8/9 this zero is followed by a 1."
But this sort of reasoning is, of course, plagued by serious problems. It makes the implicit assumption that the probability distribution of 0's and 1's will be the same in the future as it was in the past. So it makes an inductive assumption, an assumption as to the "regularity" of the world. There is no "a priori" reason that such assumptions should be justified -- but we intuitively make them, and they seem to work fairly well.
INDUCTION AND DEDUCTION
Many people would be willing to accept logical deduction on faith, and justify induction in terms of deduction. This would be one way of solving the practical problem of induction; unfortunately, however, it doesn't seem to work. Even if one does take it on faith that the universe is constituted so that the familiar rules of deductive logic are valid, there is no apparent way of solving thepractical problem of induction. The rules of deductive logic give no reason to assume that the regularities of the past will continue into the future.
In one sense this is a technical point, regarding the specific forms of deductive logic now known. It might be argued that we simply don't know the true deductive logic, which would justify induction. But that is not a very convincing argument; it is certainly not something on which to base a theory of mind.
And modern physics has added a new wrinkle to this controversy. In 1936, Von Neumann and Birkhoff proposed that a special non-classical "quantum logic" is required for the analysis of microscopic phenomena. Over the past few decades this suggestion has evolved into a flourishing research programme. Mittelstaedt (1978) and other quantum logicians contend that the choice of a deductive system must be made inductively, on the basis of what seems to work best for describing reality -- that what we call logic is not absolute but rather a product of our particular physical experience. If one accepts this postulate, then clearly there is no point in trying to justify induction deductively, for this would be just as circular as justifying induction inductively.
The American philosopher Charles S. Peirce founded a vast philosophical system on the principle of "the tendency to take habits." By this he meant, roughly, the following:
Peirce's Principle: Unless restrained by the extension of another habit, a habit will tend to extend itself.
Here "extend" is taken to refer to both space and time, and "habit" is essentially synonymous with "pattern." In a way this is a distant relative of Newton's First Law of Motion: unless somehow obstructed, an object will travel in a straight line.
Peirce never seriously sought a proof of this principle; he considered it primary and irreducible. He did, however, provide an amusing cosmogonic argument, which begins with the assumption that, out of the primordial void, various habits emerged at random, with extremely small intensities. Once the tendency to take habits comes into existence to an indefinitely small degree, he argued, then by the tendency to take habits the particular habit which is the tendency to take habits is extended -- et cetera. Hence the tendency to take habits is strengthened, and hence more and more habits which emerge to a miniscule degree will be extended, and this constitutes an extension of the tendency to take habits, and hence the tendency to take habits will be further extended -- et cetera.
Clearly, this sort of reasoning -- though intriguing -- contains all sorts of hidden assumptions; and there is not much point in debating whether or not itis really a "justification" of Peirce's principle. It is best to take the simpler path and assume Peirce's principle outright.
I will assume, first of all, that "a pattern tends to extend itself through time". This does not imply that all patterns continue indefinitely through time; this is obviously absurd. Merely a tendency is posited: merely that, given that a pattern X has occurred in the past, the probability that X occurs in the future is greater than it would be if the events at each time were determined with no reference whatsoever to previous events. Roughly speaking, this is equivalent to the hypothesis that the environment is self-organizing.
Clearly, this assumption requires that the world, considered as a dynamical system, is not highly S.S.-sensitive. If the world were highly S.S.-sensitive, then one would need essentially complete knowledge regarding the structure of the past world in order to predict the structure of the future world. But if the world possessed the tendency to take habits, then there would be a good chance the patterns one recognized would be continued to the future, thus limiting the possible degree of S.S.-sensitivity. This relationship can be spelled out in a precise inequality; but there seems little point. The basic idea should be clear. Conversely, one might wonder: does low S.S.-sensitivity inherently imply low tendency to take habits? It seems not. After all, low S.S.-sensitivity implies that it is possible to compute future structure from past structure, not that future structure is similar to past structure.
It may be worth phrasing this distinction more precisely. The structure of a dynamical system over an immediately past interval of time (the "past structure") does not, in general, determine the structure of the system over an immediately future interval of time (the "future structure"). But the past structure does place certain constraints upon the future structure; it enforces a certain probability distribution on the set of all future structures. That is to say, future structure is dependent upon past structure according to some stochastic function F. Then, all low S.S.-sensitivity says is that, if X and Y are close, F(X) and F(Y) are reasonably close (here X and Y are structures, i.e. sets of patterns). But what the tendency to take habits says is that X and F(X) are reasonably close. From this it is apparent that the tendency to take habits implies low S.S.-sensitivity, but not vice-versa.
Let us return to our previous example. In the case of 0101010101010..., the tendency to take habits translates the assumption that the next term will be a 0 into the assumption that the pattern x=y*z will be continued, where x is the sequence involved, y is the function f(A)=AAA...A which juxtaposes A n times, * is function evaluation, and z=01. Clearly %y%, %z% and C(y,z) are rather small, so that this will be a pattern in reasonably short sequences of the form 0101010101010.... It is also important to note that, in the case 0101010010101010..., the most natural application of the tendency to take habits involves the same y and z as above, but in this case as an approximate pattern. One might consider d(y*z,x)=1, since y*z may be changed into x by one insertion.
I have not yet been any more specific than Peirce as to what "the tendencyto take habits" actually is. Clearly, if one saw the sequence 0101010101010... in reality, one might or might not assume that, by induction, the next term was a 1. It would depend on context. For instance, if one were receiving such numbers as output from a computer program, and the last twelve outputs one had received were all either 0101010101000 or 1010101010111, then having seen 01010101010 one would probably assume the next term was going to be a 0. Obviously, this is also induction; one is merely inducing relative to a larger data set. What the tendency to take habits, properly formulated, should tell us is that given no other relevant knowledge, one should assume a given pattern will continue, because there is a certain tendency for patterns to continue, and if one does not assume this there is nothing to do but assume a uniform distribution on the set of possible outcomes. When two patterns are in competition -- when they cannot both continue -- then one must make a decision as to which one to assume will continue.
This example might be taken to suggest that the pattern based on the largest data set is always the best bet. But this is not the case. For what if one's computer outputs the following five sequences: 9834750940, 2345848530, 0000000000, 9875473890, 1010101010. Then when one sees for the sixth output 010101010, what is one to assume will be the last term? Does one follow the pattern extending over all five of the prior inputs, that all sequences end in 0? Or is one to obey the internal logic of the sequence, bolstered by the fact that the fifth sequence, with a very similar structure, and the third sequence, with a fairly similar structure, were each continued in a way which obeyed their internal logic? According to my intuition, this is a borderline case.
One could concoct a similar example in which the clear best choice is to obey the structure of the individual sequence. Indeed, if one simply replaced the final digit of the first sequence with a 1, then ending in 0 might still be an approximate pattern, but according to my intuition the best guess for the next term of the sixth sequence would definitely be 1.
If nothing else, the examples of the previous paragraph demonstrate that the choice of pattern is a matter of intuition. The tendency to take habits is a powerful, but it doesn't tell you what to assume when experience or logic tells you that two patterns, both historically prominent, cannot both occur. And the case of two contradictory patterns is a relatively simple one: in reality, every mind has recognized a huge set of patterns, each one contradicting numerous others.
In order to resolve this dilemma, I will propose a strengthening of Peirce's formulation of the tendency to take habits. I suggest that, when it possesses little or no information indicating the contrary, an intelligence should assume that the most intense of a group of contradictory patterns will continue. This strategy can only work, of course, if the universe operates according to the following special principle:
Strengthened Peirce's Principle: A pattern tends to continue, and the more intense a pattern it is, the more likely it is to continue.
This principle does not say how much more likely. But in general, a mind may want to consider the probability of a given pattern occurring in the future. In that case, it would need to know the exact nature of the relation between intensity and chance of continuance. One might think that this relation could be determined by induction, but this would lead to circular reasoning, for the execution of this induction would require some assumption as to the nature of this induction. At some level one must make an a priori assumption.
In sum: I have not attempted to "justify" induction; I have rather placed a condition on the universe under which a certain form of induction will generally be effective. This condition -- the strengthened Peirce's principle -- rules out high S.S.-sensitivity but not high L.- S.- and R.S.-sensitivity, and it is not implied by low S.S.-sensitivity. It is clear that, in a universe obeying the strengthened Peirce's principle, a system can achieve a degree of intelligence by recognizing patterns and assuming that they will continue.
This idea makes sense no matter how complexity is defined; it relies solely on the definition of pattern. But if one restricts attention to Turing machines, and considers complexity to mean KCS complexity, then the present approach to induction becomes extremely similar to the "classical" proposal of Solomonoff (1964). His paper, "A Formal Theory of Induction," was one of the three original sources of the KCS complexity. His essential idea was a mathematical form of Occam's razor: the simplest explanation is the one most likely to be correct. He took this as a given, defined an explanation of a binary sequence as a program computing it, and went on to construct the KCS complexity as a measure of simplicity.
The strengthened Peirce's principle is only a beginning. It is a basic philosophical assumption which ensures the possibility of intelligence through pattern recognition. All the standard mathematical methods for predicting the future are based on probability theory. In this section I will show that it is indeed possible to combine prediction based on pattern with prediction based on probability. Most of the ideas here are completely straightforward, and the reader who prefers to avoid mathematical formalism will lose little by skimming over this section.
From the abstract philosophical point of view, this approach may be controversial; it assumes that elementary probability theory works, and probability is a troublesome concept. This difficulty will be discussed in Chapter 9, in a different context. For now, let us simply apply the theory of probability without worrying about justifications.
Let P(X) denote the probability that the proposition X is true; let P(X%E) denote the probability that X is true given that the proposition E is true. Recall that P(X%E)=P(XE)/P(E), unless P(E)=0. Let us consider events Yi of the form "Pattern Pj is present in the environment to intensity Kl over time period(t,v)", where j and K are integers, and set pi=P(Yi). If there are n patterns and O%K<M, then there will be N=nM events Yi. Essentially, the task of making a coherent model may be summarized as follows. Where t denotes the present time, let Qs denote the proposition "Patterns Qs1, Qs2,..., Qsn have been observed during the interval [t-s,t] with respective intensities Ks1,...,Ksn". The question is: what is P(Yi%Qs), for each s? Strictly speaking, this is a different problem for each s. However, the knowledge generated in solving it for, say, s=1000, would obviously be useful in solving it for s=1100, 2000, or 10000. In general, the larger s is, the harder the problem is, since there will be more Qj.
The estimation of the probabilities P(Yi%Qs) is a very difficult problem. Probably there will always be a need for rough heuristics. But still, a more detailed analysis can lend some insight into what is involved. {Qsj} is the set of patterns observed in the time interval [t-s,t]. Let us assume that all the possible combinations of elements of {Qsj} have been listed in some specified order, and that {Qsj(k,1), Qsj(k,2),...} refers to the k'th one. Then for any v, and any u between s+v and t, we may introduce a set of propositions, {Ruvk}, which includes all propositions of the form "Qsj(k,1) and Qsj(k,2) and ... have been observed during (u-v,u) with respective intensities Ksj(k,1), Ksj(k,2),...." And, finally, we may define Ps(Ruvk) as the result of applying Algorithm 3.1 to the set of patterns referred to by Ruvk [in the notation of that algorithm, let Qsj(k,1)=(y1,z1), etc.]. According to the strengthened Peirce's principle, this indicates how likely it is for the k'th conjunction of patterns to occur.
So, in this framework, estimating P(Yi%Qs) comes down to first using the Ps(Ruvk) to estimate a set of most likely future scenarios, a set of subsets of {Qsi} which are intense enough and mutually consistent enough to be plausible; and second, using the Ps(Ruvk) to estimate the P(Yi%Qsj) for those Qsj in this future scenario.
The second part, estimating the P(Yi%Qsj), is relatively tractable; it doesn't require all the Ps(Ruvk), but only the relatively miniscule subset involving conjunctions of the form PjQl, where Ql is an element of some Qsj and Pj is the pattern to which the proposition Yi refers. In practice an intelligent system would probably have to forego the testing of all the Yi, and for those Yi which it selected to test it would have to ignore some of the conjunctions PjQl. For those Yi and Qsj selected, it would simply have to use the formula P(Yi%Qsj)=P(YiQsj)/P(Qsj).
Unfortunately, the task of choosing a set of most likely futures, is far, far less palatable. This is a very difficult optimization problem, and what's worse, in general it refers to the entire set of Ps(Ruvk). The array {Ruvk} is so large that no conceivable intelligence could ever compute it for a situation involving a realistic number of patterns. So in order for intelligent prediction to be possible, some sort of very rough approximation method is required.
Let's assume -- for reasons that will become clear in later chapters -- that certain patterns in Qs have been labeled as important. By what process doesn't matter right now. Define the prominence of a pattern as the product of its intensity with its importance. Then one rough approximation algorithm is asfollows. Consider perhaps seven different values of s -- e.g. a second, a minute, an hour, a day, a week, a year, a lifetime -- and seven different values of v. For each of these values of s, start with, say, the seven most prominent patterns, and consider all the various combinations of them. Estimate for each of these 896 combinations the appropriate average over all u of Ps(Ruvk), where v runs through all seven values. For each v retain the seven which yield the highest numbers (there could be some repetition in this set if the same pattern were prominent on two different time scales). Then, for each v, for each of these seven patterns P, consider all the combinations of P with the six next most prominent patterns in Qs (where s corresponds to the time scale on which P was selected as prominent). For each of these combinations, compute the appropriate average over all u of Ps(Ruvk) (where v corresponds to the time scale according to which the initial element of Ruvk was retained). Select the seven which yield the highest numbers. Then, for each v, for each of these seven patterns P, consider all the combinations of P with the next most prominent patterns in Qs....
Obviously, the numbers could be tinkered with. Perhaps they could be selected more intelligently. But the basic idea should be clear: with luck, you can accrete a reasonably likely projection on any future time scale v by starting with an intense, important pattern and progressively adding less and less intense and important patterns onto it. At each stage a number of possible additions are considered, and those which contradict least are retained. This is a very crude methodology. It relies essentially on the precept that more intense habits can be expected to continue with greater certainty -- without this assumption, one might as well start with randomly selected "important" patterns as with intense ones.
The moral of all this is that if the philosophical problem of induction is unsolvable, the general practical problem of induction is formidable. By proposing the tendency to take habits as a fundamental law, we have analyzed induction in terms of pattern recognition. By proposing that more intense patterns have more propensity to continue we have given a clue as to how to go about resolving the contradictory predictions which the tendency to take habits inevitably gives rise to. But this is nothing more than a general precept, a hint as to how to go about doing things: even if, on average, the more intense of a set of contradictory patterns prevails, it is folly to invariably assume that the more intense pattern one has perceived will prevail. To go beyond mere philosophical maxims requires a general framework for talking about the probabilities of various combinations of patterns, and this leads into a maze of sets and indices. It is extremely difficult to use past experience to tell which of two contradictory patterns will prevail.
But this task is necessary for intelligence: what we are talking about is no less than building a plausible model of the future on the basis of the past. Hopefully the homeliness of the necessary formalism does not obscure the very basic nature of the questions being addressed. Although induction is generally recognized to be essential to mentality, there has been very little work done on the general practical problem of induction. Philosophers have tirelessly debatedthe fine points of arguments for and against the justifiability and possibility of induction; and computer scientists have made a great deal of progress studying induction in limited contexts (Blum and Blum, 1975; Daley and Smith, 1986). But, cumbersome as the ideas of this section imply a hybrid pattern-theoretic/probabilistic method of predicting, on the basis of patterns recognized in the world in the past, the probability that a given pattern X will occur with intensity K in the future on a time scale [t,t+v], where t is the present.
In conclusion, let us now return to the question of intelligence. Assuming that the world is unpredictable and yet possesses the tendency to take habits to a significant degree, let us ask: how should a system act in order to approximately maximize the given "appropriateness" function A? In other words, let us ask the Kantian question: how is intelligence possible? One reasonable answer is, by repeating the following schematic steps:
1. Recognize patterns in the environment
2. Construct a model of what the future will be like, based on reasoning by induction, assuming the strengthened Peirce's principle and using probability theory. That is, where t denotes the present time, construct a set of statements of the form "according to my best guess, the probability that the environment will exhibit pattern X, with intensity K, over time interval [t,t+v], is P(X,K,v)", where P(X,K,v) is as high as possible.
3. Estimate what strategy will maximize A according to such an assumption.
In later chapters it will be proposed that this three-step process is applied recursively -- that each of the three steps will often involve the application of all three steps to certain subproblems. And the process will be embedded in a larger, subtler web of structures and processes. But nonetheless, these three steps form the core about which the ideas of the following chapters will be arranged.