Structure of Intelligence -- Copyright Springer-Verlag © 1993 Back to Structure of Intelligence Contents

# Quantifying Structure

## 3.0 Algorithmic Complexity

What does it mean to say that one thing is more complex than another? Like most words, "complexity" has many meanings. In Chapter 1 we briefly discussed the "complexity" of computation -- of problems and algorithms. In this chapter we will consider several approaches to quantifying the complexity of individual entities, beginning with the simple Kolmogorov-Chaitin-Solomonoff definition.

Throughout this chapter, when I speak of computers I will mean ordinary Turing machines, not stochastic or quantum computers. As yet, no one really knows how to deal with the complexity of objects in the context of stochastic or quantum computation, not in complete generality. Since a quantum computer can compute only those functions that a Turing machine can also compute, this limitation is not fatal.

It turns out that the easiest way to approach the complexity of objects is via the complexity of sequences of numbers. In particular, I will concentrate on binary sequences: sequences of zeros and ones. As is common in mathematics, the general issue can be resolved by considering what at first sight appears to be a very special case.

The standard approach to the complexity of binary sequences was invented independently by A.N. Kolmogorov, Gregory Chaitin, and Solomonoff (Chaitin, 1987), so we shall call it the KCS complexity. In my opinion, what the KCS definition measures is not very well described by the word "complexity." Lack of structure would be a better term.

Given any computer A, the KCS complexity of a sequence x is defined to be the length of the shortest self-delimiting program on A which computes x. The restriction to "self-delimiting" programs is necessary for technical purposes and will not worry us much here; roughly speaking, a self-delimiting program is one which contains a segment telling the computer which runs it how long it is. In the following, I may occasionally refer to "shortest programs" instead of "shortest self-delimiting programs"; but it should be implicitly understood thatall programs discussed are self-delimiting.

For instance, the KCS complexity of the sequence 10011010010010010 on an IBM PC is the length of the shortest program which, when loaded into the PC, causes it to output 10011010010010010 on the screen. In what follows, I will occasionally refer to the KCS complexity of a sequence x as KCS(x).

There is some vagueness here, as to what "length" means. For one thing, there are large differences between the various programming languages on the market today. There are a number of "high-level" languages, which allow one to type in programs vaguely resembling mathematical formulae: Fortran, Pascal, Cobol, Snobol, Ada, Prolog, C, Basic, Lisp, Forth, and so on. A program which is short in Pascal may be long in Cobol; and a program which is short in Basic may be long in Pascal. And then there is "assembly language", which refers directly to the hardware of the computer. A program in assembly language is usually very long. However, before a computer can use a program written in a high-level language, it must translate it into assembly language. (The program which does the translation is called the "compiler"). When figuring the length of a program written in Fortran, should we use the number of characters in the program as originally typed in, or the number of characters in the assembly-language translation of the program?

From the point of view of the mathematical theory of complexity, none of these issues matter. We can simply assume we are dealing with a universal Turing machine. Translating from a foreign language is essentially the same as simulating another computer. So if a sequence is long enough, its KCS complexity is essentially language-independent and computer-independent. For example, say you have a sequence x consisting of a billion 0s and 1s. Suppose it can be computed by a program of length 1,000,000 on an IBM PC. Suppose a VAX computer has been programmed to simulate a PC, and suppose this simulation program has length 200,000. Then the shortest program for computing x on the VAX cannot be any longer than 1,200,000. Because, if all else fails, one can compute x on the VAX by simulating a PC. These numbers are highly unrealistic, but the point is that as the sequences get longer and longer, the size of the simulation program remains the same. When the sequences are a trillion digits long, the 200,000 length of the simulation program will mean next to nothing.

Some of the newer programming languages are actually "universal programming languages" in a very practical sense: any contemporary programming language can be compactly written in them. For instance, one could write a C program which simulated Pascal or Lisp or Fortran, or a Lisp program that simulated Pascal or Fortran or C. (In fact, what is now known as Lisp was originally written in a simpler form of Lisp, and what is now known as C was originally written in a simpler form of C.) Let's say a certain sequence x could be computed by a very short Fortran program. Then one way to compute the sequence on a machine with a built-in Lisp compiler would be to write a Lisp program to simulate Fortran. If there were no simpler way to compute the sequence in Lisp, this might yield the shortest program forcomputing the sequence on that particular machine.

Again: the beauty of theoretical computer science is that, as long as we are talking about long sequences, we don't have to worry about the properties of specific machines or languages. This is what differentiates theoretical computer science from practical computer science. Naturally, the latter is more visible in the everyday world. However, both the theory and the practice of computer science are essential to the study of mind.

What is the KCS complexity of the sequence 0101010101010101 010101010101010101010101? It should be very small on any computer, because one can write a program saying "Print '01' twenty times". And what is the complexity of the sequence consisting of 01 repeated 1,000,000,000,000 times? This should still be very small on any computer, because one can write a program saying "Print '01' 1,000,000,000,000 times."

Note that this program is not quite as short as "Print 01 20 times". Our program for repeating '01' 1,000,000,000,000 times is 31 characters long, but our program for repeating '01' 20 times is only 16 characters long. The difference is, obviously, that it takes more space to write '1,000,000,000,000' than it does to write '20'. As n increases, the KCS complexity of repeating something over and over again n times increases. But it does not increase very fast. After all, 1,000,000,000,000 is 50,000,000,000 times as large as 20. But, according to the programs written above, the KCS complexity of repeating '01' 1,000,000,000,000 times is only 31/16 times the KCS complexity of repeating '01' 20 times.

The ratio of program sizes, here 31/16, may vary from computer to computer, from programming language to programming language. It would be different if this book were written in Spanish rather than English, because the equivalent of the word "print" would not have exactly five letters in it. But it is difficult to imagine a computer on which it would approach 50,000,000,000. Mathematically, we may say that as n gets bigger and bigger, the size of the shortest program for repeating something n times gets closer and closer to log(n). This is little more than common sense, because the number of digits in the decimal expansion of a large number n is always very close to log(n).

What about the KCS complexity of the sequence 010011000111 00001111000001111100000011111100000001111111? This depends on

whether it is shorter to say

Print 0100110001110000111100000111110000001111110000000             1111111

or to say

Do the following for k=1, then k=2, and so on up to k=7:

Print k '0's and then k '1's"

In this case the former is a bit shorter. But consider the sequence 01001100011100001111000001111100000011111100000001111111

00000000111111110000000001111111110000000000111111111100000000000

1111111111100000000000011111111111100000000000001111111111111

0000000000000011111111111111? Here there is no doubt that the latter sort of program is shorter.

Actually determining the KCS complexity of a sequence is a difficult matter. There are sequences which look completely random and can nonetheless be computed by short programs. For instance, if one printed the first ten thousand digits of the binary expansion of pi, virtually no human being would recognize any structure in it.

On the other hand, what is the complexity of a sequence x which is completely random in the sense of having no structure whatsoever? In this case the best way to compute x is to write a program saying "Print x". This program is about as long as x is. If x has n digits, this program has length n+c, for some small constant c. In the case of the program as written above, c=5. According to the KCS definition, a completely structureless sequence such as 10010100101001000011101001101001010010110001100101010001110110101 010001001010010100100101001010110101

is the most complex kind of sequence, with a complexity approximately equal to n. On the other hand, a sequence with a very simple structure, such as 1111111111111111111111, is the least complex kind of sequence, with a complexity approximately equal to log(n). Sequences with more intricate structures fall somewhere inbetween.

It can be shown that no program can compute the KCS complexity of an arbitrary sequence. For any program P, there is some X the KCS complexity of which P cannot compute.

## 3.1 Randomness

It is natural to define a random sequence as one which has no statistical regularities (von Mises, 1957). For instance, one might propose that in a random binary sequence 1 should occur exactly as often as 0. Also, one might require that the four doublets 00, 01, 10 and 11 should occur equally often. And perhaps the eight triplets 000, 001, 010, 011, 100, 101, 110, and 111 should occur with equal frequency. And so on. It would clearly be desirable to define a random sequence as one in which all subsequences of length n occur with equal frequency. Let us call this the natural definition.

Clearly, this definition does not apply to finite sequences. Each sequence of length n contains exactly one subsequence of length n (itself), so it certainly does not contain all sequences of length n equally often. According to this definition only an infinitely long sequence can be random.

Early in this century it was discovered that there is a basic flaw in this approach. The restrictions imposed by the natural definition are so stringent that no sequence, finite or infinite, can possible satisfy them. However, they are not beyond repair. A normal sequence is defined as one in which, as you go further and further out in the sequence, the frequencies of all the subsequences of length n get closer and closer to being equal. For instance, if one tooksamples of a normal sequence near the beginning, one might well find a lot more 00s than 01s, 10s or 11s. But eventually, if one took samples far enough out, one would have to find 00s, 01s, 10s and 11s equally often.

Intuitively speaking, if you tossed a coin and recorded 0 whenever tails came up, 1 whenever heads came up, you would expect the list of 0's and 1's to be a normal sequence. Essentially, a normal sequence is a sequence in which, as you go further and further out, each digit has less and less to do with the others. Just as, in a series of coin tosses, each toss has essentially nothing to do with the others.

That is one approach to randomness. There is another approach, involving the KCS definition of complexity, which also involves infinite sequences. What is remarkable is that the two different approaches turn out to be closely related.

RANDOMNESS AND COMPLEXITY

Consider an infinitely long binary sequence x. Let x[n] denote the first n terms of x. For instance, if x = 01001101010001001010..., then x[7] = 0100110. The idea behind the KCS approach to randomness is that the complexity of the infinite sequence x can be defined in terms of the complexities of the finite sequences x[n]. The first step is to ask: as n gets bigger and bigger, what happens to the KCS complexity of x[n]? If x = 0000000..., then the question has an easy answer. The sequence x[n] consists of n zeros, and KCS(x[n]) complexity is about log(n). And, intuitively speaking, if x is totally structureless, then x[n] has a KCS complexity of about n. These considerations lead up to the crucial insight, due to Kolmogorov and Per Martin-Lof. Look at what happens to the ratio KCS(x[n])/n as n gets bigger and bigger.

If, as n gets bigger and bigger, KCS(x[n])/n gets closer and closer to 1, then it follows that for large n, KCS(x[n]) is close to n. And this means that, for large n, x[n] essentially has no structure. On the other hand, if KCS(x[n])/n gets closer and closer to zero as n increases, this means that for large n there is indeed some structure in x[n]. It means that, for large n, there is indeed a better way of computing x[n] than just saying "Print 'x[n]'".

What if x looks like this: 01010000010000010001000100 00000100010001...? Here every other digit is a zero: the first, the third, the fifth, and so on. But the even-numbered digits follow no apparent pattern. What if x continued this way forever? Then x[n] could be computed by a program of the form "Print this sequence, putting a '0' between every two terms: '110010...'", where '110010...' is a finite sequence consisting of the odd-numbered terms of x[n]. How long is this program? Well, the sequence consisting of the odd-numbered terms of x[n] is about n/2 digits long. So here KCS(x[n]) is about n/2. Thus KCS(x[n])/n is about 1/2.

Ignoring a number of technical issues, we may define a random infinite sequence x as a sequence for which, as n gets bigger and bigger, KCS(x[n])/n does not approach zero, but rather approaches some other number. A randominfinite sequence x is one for which there is a fairly easy way of computing x[n], when n gets large. It can be proved that almost all infinitely long sequences are random in this sense -- algorithmically random.

One way to understand this definition is as follows: A random sequence is an infinite sequence which cannot be summed up in any formula of finite length. For instance, 00000000... can be summed up in the formula "Repeat '0' forever". And 010010001000010000010000001... can be summed up in the formula "Repeat '0' k times and then print '1', for k=1,2,3,4,...." But a random sequence x cannot be summed up in any formula, because if it could then that formula would provide a way to compute x[n].

Clearly, every sequence which is random in this sense is not normal. Think about the sequence given three paragraphs up, whose odd-numbered digits are all zeros but whose even-numbered digits have no structure to them. No matter how far out you look, 0's and 1's are not going to occur equally often in this sequence. There will always be more 0's. The best you can say about this sequence is that it has a subsequence -- the sequence of its even-numbered digits -- which looks to be normal.

PROBLEMS WITH RANDOMNESS

The theories of randomness sketched above are not very useful in practice, for obvious reasons. It only deals with infinitely long sequences. In reality, we are always faced with finite collections of data.

This restriction to infinite sequences leads to a couple of interesting conceptual paradoxes. First of all, the very proof that random sequences exist is somewhat troublesome. We have proved that almost every infinite sequence is random, but can we prove that any one particular sequence is random? We cannot, because random sequences are precisely those infinite sequences which cannot be summarized in a finite formula. In fact, the set of random sequences is precisely the set of all sequences which we cannot write down in any way. We have proved that the set exists, but we cannot demonstrate that any particular sequence belongs to it, because in order to do so we would have to write down that particular sequence. This is not exactly a logical paradox, but it is certainly disconcerting.

G. Spencer-Brown has discovered a particularly poignant way of illustrating the implications of this logical peculiarity. Suppose, he says, that you have built a random number generator -- a machine which is intended to generate numbers in such a way that each number it generates has absolutely nothing to do with the others. Then how can you test it to see if it works?

Suppose you tested it and it gave out a hundred zeros in a row. You would probably assume that it was broken. But, statistically speaking, a truly random number generator would generate a hundred zeros in a row sometimes. Not very often, but sometimes. There's no reason that rare occasion shouldn't comefirst. After all, the sequence consisting of a hundred zeros in a row is no less likely than any other sequence of a hundred numbers.

So you run it some more. And the same thing happens -- it keeps giving tens. Still, you're not really justified in concluding that it doesn't work. The same argument applies. The fact is that no matter what the machine does for the first n trials, it can be argued that a true random number generator would be just as likely to generate that sequence as any other. So no matter what the machine does, you can't judge its effectiveness.

You could examine the mechanism of your random number generator to see if it looks as though it is operating randomly. For instance, you could supply it with a mechanical coin tosser. Then you'd probably be confident its answers were random, since you're probably confident the results of a coin toss are random. But this is nothing more or less than intuition: you're assuming something is random because you haven't seen any structure to it in the past. An intuition is not the same as a theoretical guarantee.

Essentially, this paradox arises from the assumption that a random number generator must give out every sequence of length n with equal frequency. But is there any other way to define a random number generator? One could define a random number generator as a machine which generates a normal sequence of numbers, but it is easy to see that there is no way to prove a finite sequence is part of a normal sequence. This is because the definition of normality involves going "far enough out" in a sequence. Once you go far enough out in a normal sequence, all subsequences of length n must occur equally often. But as n increases, so does the precise meaning of "far enough". Say you determine that the first million terms of a sequence present all subsequences of length 1000 or less with the appropriate frequencies. That still doesn't tell you whether or not the sequence repeats a certain subsequence of length 1,000,000,000 too often. In fact, for all you know, the sequence could consist of the same million-digit-long sequence repeated over and over and over.

And remember, normality is in a sense a weaker concept than algorithmic randomness -- it says that the decimal expansion of pi is random. It is even more obvious that there is no way to tell if a finite sequence is part of an infinite (algorithmically) random sequence. After all, if we just see the fragment 01001001010011001001, how do we know it's part of a random sequence and not part of the sequence 01001001010011001001 01001001010011001001 01001001010011001001 01001001010011001001... which repeats the same fragment over and over again. This dilemma is reminiscent of the problem of induction, to be discussed in Chapter 5.

Our discussion has been phrased in terms of binary sequences, but it could be generalized to deal with any other mathematical objects in exactly the same way. The conclusions would not change a bit. The only things that are truly, mathematically random are infinitely large entities which can never even be summarized in finite formulas. And there is no practical way to tell if a given physical machine produces mathematically random sequences.

In sum: randomness is a phantom. In reality, all we can do is assume thatthose things we've detected no structure in are random. But this is a working assumption, not a mathematical guarantee. And therefore, the question of whether the mind and brain are stochastic or deterministic is a phantom as well. Thus there can never be an empirical reason to say that the mind or the brain is a stochastic computer rather than a deterministic computer. Because, scientifically speaking, to say that X is random is only to say that X has aspects in which we cannot detect any order. To declare that there is consequently no order there is unjustified.

## 3.2 Pattern

Charles S. Peirce, the turn-of-the-century American philosopher, liked to talk about the "one law of mind." He gave this law many different formulations, the most suggestive of which was only five words: "the tendency to take habits". This simple, potent idea is at the heart of the theory of mind to be presented in the following chapters. But instead, of "habits", I prefer to speak of "patterns". And rather than taking "habit" or "pattern" as a primitive undefined term, I will begin by providing a sketchy but completely rigorous mathematical theory of pattern.

As I understand it, the concept of pattern relies on two simpler ideas: combination and complexity. More precisely, in order to talk about certain entities being patterns in other entities, we must have

1) some way of combining certain pairs of entities y and z to obtain a third entity called y*z

2) some way of computing, for every entity x, a nonnegative real number %x% called the complexity of x

Any set of entities which fulfills these requirements may be called a pattern space. Formally, we may say:

Definition 3.1: A pattern space is a set (S,*,% %), where S is a set,     * is a binary operation defined on some subset of SxS, and % % is a map from S into the nonnegative real numbers.

Let's consider a simple example: Turing machines and finite binary sequences. If y is a Turing machine and z is a finite binary sequence, then there is a natural way of combining x and y -- just put y on the input tape of the Turing machine, extending to the right of the tape head. In this case, we can define x*y to be the binary sequence which appears on the tape of the Turing machine y after, having been started with z on its tape, its program finishes running. It is true that there is no guarantee the Turing machine will ever stop running. But if it doesn't, we can simply consider x*y to be undefined, and leave it at that.

As for requirement number 2, we can define the complexity %z% of a finite binary sequence z as its length. And, roughly speaking, we can define the complexity %y% of a Turing machine y as the length of its program. More precisely, %y% might be defined as length of the code number which, when fed into a certain universal Turing machine, enables that Turing machine to act exactly like machine y in every situation. Here %y% and %z% are nonnegative numbers, so the set of Turing machines and finite binary sequences is a pattern space.

Since we shall be returning to this example again and again, it is worth formulating a specific notation. Let %z%T denote the length of a finite binary sequence z, and let %y%T denote the length of the program y.

Now we are prepared to ask: what is a pattern?

First of all, a pattern is a pattern in something, in some entity x. Secondly, a pattern is an ordered pair of entities, denoted (y,z). And finally, we have what I shall call the fundamental pattern inequality:

Definition 3.2: Let a, b, and c denote constant, nonnegative numbers.

Then an ordered pair (y,z) is a pattern in x if x=y*z and

a%y% + b%z% + cC(y,z) < %x%.

The only unfamiliar term here is C(y,z). This denotes the complexity of obtaining x from (y,z). If y is a Turing machine program and z is a finite binary sequence, we shall let CT(y,z) denote the number of time steps which the Turing machine takes to stop when equipped with program y and given z as initial input.

For many purposes, the numbers a, b and c are not important. Often they can all be taken to equal 1, and then they don't appear in the formula at all. But in some cases it may be useful to set a=b=1 and c=0, for instance. Then the formula reads %y% + %z% < %x%. The constants lend the formula an element of flexibility.

Intuitively, an ordered pair (y,z) is a pattern in x if the complexity of y, plus the complexity of z, plus the complexity of getting x out of y and z, is less than the complexity of x. In other words, an ordered pair (y,z) is a pattern in x if it is simpler to represent x in terms of y and z than it is to say "x". The constants a, b and c just weight things: if a=3/4 and b=5/4, for example, then the complexity of y counts less than the complexity of z.

The definition of pattern can be generalized to ordered n-tuples, and to take into account the possibility of different kinds of combination, say *1 and *2.

Definition 3.3: An ordered set of n entities (x1,x2,...,xn) is a     pattern in x if x=x1*1x2*2...*n-1xn and a1%x1%+a2%x2%+...+an%xn%+ an+1C(x1,...,xn) < %x%, where C(x1,...,xn) is the complexity of computing x1*1x2...*n-1xn and a1,...,an+1 are nonnegative numbers.

Also, the following concept will be of use:

Definition 3.4: The intensity in x of a ordered pair (y,z) such

that y*z=x may be defined as

IN[(y,z)%x] = ( %x% - [a%y% + b%z% + cC(y,z)] )/%x%.

Obviously, this quantity is positive whenever (y,z) is a pattern in x, and negative or zero whenever it is not; and its maximum value is 1.

AN EXAMPLE: GEOMETRIC PATTERN

Most of our discussion will be devoted to Turing machines and binary sequences. However, the definition of pattern does not involve the theory of computation. Essentially, a pattern is a "representation as something simpler"; and simplicity need not necessarily be defined in terms of computation. Instead of Turing machines and binary sequences let us now consider pictures. Suppose that A is a one inch square black-and-white picture, and B is a five inch square picture made up of twenty-five non-overlapping copies of A. Intuitively, it is simpler to represent B as an arrangement of copies of A, than it is to simply consider B as a "thing in itself". Very roughly speaking, it would seem likely that part of the process of remembering what B looks like consists of representing B as an arrangement of copies of A.

This intuition may be expressed in terms of the definition of pattern. Where x and y are square regions, let:

y*1z denote the region obtained by placing y to the right of z

y*2z denote the region obtained by placing y to the left of z

y*3z denote the region obtained by placing y below z

y*4z denote the region obtained by placing y above z.

And, although this is obviously a very crude measure, let us define the complexity %x% of a square region with a black-and-white picture drawn in it as the proportion of the region covered with black. Also, let us assume that two pictures are identical if one can be obtained by a rigid motion of the other.

The operations *1, *2, *3 and *4 may be called simple operations. Compound operations are, then, compositions of simple operations, such as the operation (x*1w*2x)*4w. If y is a compound operation, let us define its complexity %y% to be the length of the shortest program which computes the actual statement of the compound operation. For instance, %(x*1w*2x)*4w% is defined to be the length of the shortest program which outputs the sequence of symbols "(x*1w*2x)*4w".

Where y is a simple operation and z is a square region, let y*z denote the region that results from applying y to z. A compound operation acts on a number of square regions. For instance, (x*1w*2x)*4w acts on w and x both. We may consider it to act on the ordered pair (x,w). In general, we may consider a compound operation y to act on an ordered set of square regions (x1,x2,...,xn), where x1 is the letter that occurs first in the statement of y, x2 is the letter that occurs second, etc. And we may define y*(x1,...,xn) to be theregion that results from applying the compound operation y to the ordered set of regions (x1,...,xn).

Let us return to the two pictures, A and B, discussed above. Let q=A*1A*1A*1A*1A. Then, it is easy to see that B=q*4q*4q*4q*4q. In other words, B = (A*1A*1A*1A*1A)*4(A*1A*1A*1A*1A)*4(A*1A*1A*1A*1A)*4

(A*1A*1A*1A*1A)*4(A*1A*1A*1A*1A). Where y is the compound operation given in the previous sentence, we have B=y*A.

The complexity of that compound operation, %y%, is certainly very close to the length of the program "Let q=A*1A*1A*1A*1A; print q*4q*4q*4q*4q". Note that this program is shorter than the program "Print (A*1A*1A*1A*1A)*4(A*1A*1A*1A*1A)*4 (A*1A*1A*1A*1A)* (A*1A*1A*1A*1A)*4(A*1A*1A*1A*1A)", so it is clear that the latter should not be used in the computation of %y%.

We have not yet discussed the term C(y,(B1,...,Bn)), which represents the amount of effort required to execute the compound operation y on the regions (x1,...,xn). For simplicity's sake, let us simply set it equal to the number of times the symbol "*" appears in the statement of y; that is, to the number of simple operations involved in y.

So, is (y,A) a pattern in B? Let us assume that the constants a, b and c are all equal to 1. We know y*A=B; the question is whether

%y%+%A%+C(y,A) < %B%.

According to the above definitions, %y% is about 37 symbols long. Obviously this is a matter of the particular notation being used. For instance, it would be less if only one character were used to denote *1, and it would be more if it were written in binary code.

And C(y,z) is even easier to compute: there are 24 simple operations involved in the construction of B from A.

So we have, very roughly speaking, 37 + %z% + 24 < %x%. This is the inequality that must be satisfied if (y,z) is to be considered a pattern in x. Rearranging, we find: %z% < %x% - 61. Recall that we defined the complexity of a region as the proportion of black which it contains. This means that (y,z) is a pattern in x if and only if it the amount of black required to draw B exceeds amount of black required to draw A by more than 61. Obviously, whether or not this is the case depends on the units of measurement.

This is a very simple example, in that the compound operation y involves only one region. In general, we may define %(x1,...,xn)%=%x1%+...+%xn%, assuming that the amount of black in a union of disjoint regions is the sum of the amounts of black in the individual regions. From this it follows that (y,(x1,...,xn)) is a pattern in x if and only if a%y% + b(%x1%+...+%xn%) + cC(y,(x1,...,xn)) < %x%.

Results similar to these could also be obtained from a different sort of analysis. In order to deal with regions other than squares, it is desirable to replace *1, *2, *3, *4 with a single "joining" operation *, namely the set-theoretic union %. Let z=(x1,...,xn), let y be a Turing machine, let f be a method for converting a picture into a binary sequence, let g be a method for converting abinary sequence into a picture. Then we have

Definition 3.5: If x = x1 % x2 %...% xn, then (y,z,f,g) is a     pattern in x if a%y%+b%z%+c%f%+d%g%+eC(y,z,f,g) < %x%.

We have not said how %f% and %g% are to be defined. This would require a detailed consideration of the geometric space containing x, which would take us too far afield. This general approach is somewhat similar to that taken in Chaitin (1978).

ORDERS OF COMPLEXITY

It should be apparent from the foregoing that complexity and pattern are deeply interrelated. In this and the following sections, we shall explore several different approaches to measuring complexity, all of which seek to go beyond the simplistic KCS approach. Remember, according to the KCS approach, complexity means structurelessness. The most "random", least structured sequences are the most complex. The formulation of this approach was a great step forward. But the next step is to give formulas which capture more of the intuitive meaning of the word "complexity".

First, we shall consider the idea that pattern itself may be used to define complexity. Recall the geometric example of the previous section, in which the complexity of a black-and-white picture in a square region was defined as the amount of black required to draw it. This measure did not even presume to gauge the effort required to represent a black-and-white picture in a square region. One way to measure the effort required to represent such a picture, call it x, is to look at all compound operations y, and all sets of square black-and-white pictures (x1,...,xn), such that y*(x1,...,xn)=x. One may then ask which y and (x1,...,xn) give the smallest value of a%y% + b(%x1% + ... + %xn%) + c(y,(x1,...,xn)). This minimal value of a%y% + b(%x1%+...+%xn%) may be defined to be the "second-order" complexity of x. The second-order complexity is then be a measure of how simply x can be represented -- in terms of compound operations on square regions.

In general, given any complexity measure % %, we may use this sort of reasoning to define a complexity measure % %%.

Definition 3.6: If % % is a complexity measure, % %% is the complexity measure

defined so that %x%% is the smallest value that the quantity a%y% + b%z% + cC(y,z) takes on, for any (y,z) such that y*z=x.

%x%% measures how complex the simplest representation of x is, where complexity is measured by % %. Sometimes, as in our geometric example, % % and % %% will measure very different things. But it is not impossible for them to be identical.

Extending this process, one can derive from % %% a measure % %%: the smallest value that the quantity

a%y%% + b%z%% + cC(y,z)

takes on, for any (y,z) such that y*z=x. %x%% measures the complexity of the simplest representation of x, where complexity is measured by % %%. It might be called second-order complexity. And from % %%, one may obtain a measure % %%, third-order complexity. It is clear that this process may be continued indefinitely.

It is interesting to ask when % % and % %% are equivalent, or almost equivalent. For instance, assume that y is a Turing machine, and x and z are binary sequences. If, in the notation given above, we let % %=% %T, then %x%% is a natural measure of the complexity of a sequence x. In fact, if a=b=1 and c=0, it is exactly the KCS complexity of x. Without specifying a, b and c, let us nonetheless use Chaitin's notation for this complexity: I(x).

Also, let us adopt Chaitin's notation I(v%w) for the complexity of v relative to w.

Definition 3.7: Let y be a Turing machine program, v and w binary

sequences; then I(v%w) denotes the smallest value the quantity a%y%T+cCT(y,w) takes on for any self-delimiting program y that computes v when its input consists of w.

Intuitively, this measures how hard it is to compute v given complete knowledge of w.

Finally, it should be noted that % % and % %% are not always substantially different:

Theorem 3.1: If %x%%=I(x), a=b=1, and c=0, then there is some K     so that for all x % %x%% - %x%%% < K.

Proof: a%y%% + b%z%% + cC(y,z) = %y%% + %z%%. So, what is the

smallest value that %y%% + %z%% assumes for any (y,z) such that y*z=x? Clearly, this smallest value must be either equal to %x%%, or very close to it. For, what if %y%% + %z%% is bigger than %x%%? Then it cannot be the smallest %y%% + %z%%, because if one took z to be the "empty sequence" (the sequence consisting of no characters) and then took y to be the shortest program for computing x, one would have %z%%=0 and %y%%=%x%%. And, on the other hand, is it possible for %y%%+%z%% to be smaller than %x%%? If %y%%+%z%% were smaller than x, then one could program a Turing machine with a program saying "Plug the sequence z into the program y," and the length of this program would be less than %x%%, or at least greater than %x%% by no more than the length of the program P(y,z)="Plug the sequence z into the program y". This length is the constant K in the theorem.

Corollary 3.1: For a Turing machine for which the program P(y,z)

mentioned in the proof is a "hardware function" which takes only one unit of length to program, % %%=% %%.

Proof: Both % %% and % %% are integer valued, and by the theorem, for any x,     %x%% % %x%% % %x%%+1.

PATTERNS IN PATTERNS; SUBSTITUTION MACHINES

We have discussed pattern in sequences, and patterns in pictures. It is also possible to analyze patterns in other patterns. This is interesting for many reasons, one being that when dealing with machines more restricted than Turing machines, it may often be the case that the only way to express an intuitively simple phenomenon is as a pattern in another pattern. This situation will arise in our analysis of the perceptual hierarchy, several chapters down the road.

Let us consider a simple example. Suppose that we are not dealing with Turing machines, but rather with "substitution machines" -- machines which are capable of running only programs of the form P(A,B,C)="Wherever sequence B occurs in sequence C, replace it with sequence A". Instead of writing P(A,B,C) each time, we shall denote such a program with the symbol (A,B,C). For instance, (1,10001,1000110001100011000110001) = 11111. (A,B,C) should be read "substitute A for B in C".

We may define the complexity %x% of a sequence x as the length of the sequence, i.e. %x%=%x%T, and the complexity %y% of a substitution program y as the number of symbols required to express y in the form (A,B,C). Then, %1000110001100011000110001%=25, %11111%= 5, and %(10001,1,z)%=11. If z=11111, (10001,1,z)= 1000110001100011000110001. For example, is (10001,1,z), 11111) a pattern in 1000110001100011000110001? What is required is that

a(11) + b(5) + cC((10001,1,z),11111) < 25.

If we take a=b=1 and c=0 (thus ignoring computational complexity), this reduces to

11 + 5 < 25.

This is true, so it is indeed a pattern.

If we take c=1 instead of zero, and leave a and b equal to one, then it will still be a pattern, as long as the computational complexity of obtaining 1000110001100011000110001 from (10001,1,11111) does not exceed 9. It would seem most intuitive to assume that this computational complexity C((10001,1,z),11111) is equal to 5, since there are 5 1's into which 10001 must be substituted, and there is no effort involved in locating these 1's. In that case the fundamental inequality reads

11 + 5 + 5 < 25,

which verifies that a pattern is indeed present.

Now, let us look at the sequence x = 1001001001001001000111001 1001001001001001001011101110100100100100100100110111 100100100100100100. Remember, we are not dealing with general Turing machines, we are only dealing with substitution machines, and the only thing a substitution machine can do is plug one sequence in for another. Anythingwhich cannot be represented in the form (A,B,C), in the notation given above, is not a substitution machine.

There are two obvious ways to compute this sequence x on a substitution machine. First of all, one can let y=(100100100100100100,B,z), and z= B 0111001B1011101110B110111B. This amounts to recognizing that 100100100100100100 is repeated in x. Alternatively, one can let y%=(100,B,z%), and let z%= BBBBBB0111001BBBBBB1011101110BBBBBB 110111BBBBBB. This amounts to recognizing that 100 is a pattern in x. Let us assume that a=b=1, and c=0. Then in the first case %y% + %z% = 24 + 27 = 51; and in the second case %y%% + %z%% = 9 + 47 = 56. Since %x% = 95, both (y,z) and (y%,z%) are patterns in x.

The problem is that, since we are only using substitution machines, there is no way to combine the two patterns. One may say that 100100100100100100 a pattern in x, that 100 is a pattern in x, that 100 is a pattern in 100100100100100100. But, using only substitution machines, there is no way to say that the simplest way to look at x is as "a form involving repetition of 100100100100100100, which is itself a repetition of 100".

Let us first consider %x%%. It is not hard to see that, of all (y,z) such that y is a substitution machine and z is a sequence, the minimum of %y% + %z% is obtained when y=(100100100100100100,B,z), and z= B 0111001 B 1011101110 B 110111 B. Thus, assuming as we have that a=b=1 and c=0, %x%%=51. This is much less than %x%, which equals 95.

Now, let us consider this optimal y. It contains the sequence 100100100100100100. If we ignore the fact that y denotes a substitution machine, and simply consider the sequence of characters "(100100100100100100,B,z)", we can search for patterns in this sequence, just as we would in any other sequence. For instance, if we let y1=(100,C,z1), and z1=CCCCCC, then y1*z1=y, %y1%=10, and %z1%=6. It is apparent that (y1,z1) is a pattern in y, since %y1% + %z1% = 10 + 6 = 16, whereas %y% = 18. By recognizing the pattern (y,z) in x, and then recognizing the pattern (y1,z1) in y, one may express both the repetition of 100100100100100100 in x and the repetition of 100 in 100100100100100100 as patterns in x, using only substitution machines.

Is (y1,z1) a pattern in x? Strictly speaking, it is not. But we might call it a second-level pattern in x. It is a pattern in a pattern in x. And, if there were a pattern (y2,z2) in the sequences of symbols representing y1 or z1, we could call that a third-level pattern in x, etc.

In general, we may make the following definition:

Definition 3.8: Let F be a map from SxS into S. Where a first-

level pattern in x is simply a pattern in x, and n is an integer greater than one, we shall say that P is an n'th-level pattern in x if there is some Q so that P is an n-1'th-level pattern in x and P is a pattern in F(Q).

In the examples we have given, the map F has been the implicit map fromsubstitution machines into their expression in (A,B,C) notation.

APPROXIMATE PATTERN

Suppose that y1*z1=x, whereas y2*z2 does not equal x, but is still very close to x. Say %x%=1000. Then, even if %y1%+%z1%=900 and %y2%+%z2%=10, (y2,z2) is not a pattern in x, but (y1,z1) is. This is not a flaw in the definition of pattern -- after all, computing something near x is not the same as computing x. Indeed, it might seem that if (y2,z2) were really so close to computing x, it could be modified into a pattern in x without sacrificing much simplicity. However, the extent to which this is the case is unknown. In order to incorporate pairs like (y2,z2), we shall introduce the notion of approximate pattern.

In order to deal with approximate pattern, we must assume that it is meaningful to talk about the distance d(x,y) between two elements of S. Let (y,z) be any ordered pair for which y*z is defined. Then we have

Definition 3.9: The ordered pair (y,z) is an approximate pattern

in x if [ 1 + d(x,y*z) ][ a%y% + b%z% + cC(y,z) ] < %x%, where a, b, c and C are defined as in the ordinary definition of pattern.

Obviously, when x=y*z, the distance d(x,y*z) between x and y*z is equal to zero, and the definition of approximate pattern reduces to the normal definition. And the larger d(x,y*z) gets, the smaller a%y%+b%z%+cC(y,z) must be in order for (y,z) to qualify as a pattern in x.

Of course, if the distance measure d is defined so that d(a,b) is infinite whenever a and b are not the same, then an approximate pattern is an exact pattern. This means that when one speaks of "approximate pattern", one is also speaking of ordinary, exact pattern.

Most concepts involving ordinary or "strict" pattern may be generalized to the case of approximate pattern. For instance, we have:

Definition 3.10: The intensity of an approximate pattern (y,z) in x is          IN[(y,z)%x] = (%x%-[1+d(x,y*z)][a%y%+b%z%+cC(y,z)])/%x%.

Definition 3.11: Where v and w are binary sequences, the

approximate complexity of v relative to w, Ia(v,w), is the smallest value that [1+d(v,y*w)][a%y%+cC(y,w)] takes on for any program y with input w.

The incorporation of inexactitude permits the definition of pattern to encompass all sorts of interesting practical problems. For example, suppose x is a curve in the plane or some other space, z is a set of points in that space, and y is some interpolation formula which assigns to each set of points a curve passing through those points. Then Ia[(y,z)%x] is an indicator of how much use it is to approximate the curve x by applying the interpolation formula y to theset of points z.

## 3.3 Meaningful Complexity

Koppel [8] has recently proposed an alternative to the KCS complexity measure. According to Koppel's measure, the sequences which are most complex are not the structureless ones. Neither, of course, are they the ones with very simple structures, like 00000000000.... Rather, the more complex sequences are the ones with more "sophisticated" structures.

The basic idea [10] is that a sequence with a sophisticated structure is part of a natural class of sequences, all of which are computed by the same program. The program produces different sequences depending on the data it is given, but these sequences all possess the same underlying structure. Essentially, the program represents the structured part of the sequence, and the data the random part. Therefore, the "sophistication" of a sequence x should be defined as the size of the program defining the "natural class" containing x.

But how is this "natural" program to be found? As above, where y is a program and z is a binary sequence, let %y% and %z% denote the length of y and z respectively. Koppel proposes the following algorithm, defined with respect to a Turing machine that has two tapes instead of just one, a program tape and a data tape:

1) search over all pairs of binary sequences (y,z) for which the two-tape     tape Turing machine with program y and data z computes x, and find those pairs for which %y% + %z% is smallest,

2) search over all pairs found in Step 1, and find the one for which %y% is biggest. This value of %z% is the "sophistication" of x.

All the pairs found in Step 1 are "best" representations of x. Step 2 searches all the "best" representations of x, and find the one with the most program (as opposed to data). This program is assumed to be the natural structure of x, and its length is therefore taken as a measure of the sophistication of the structure of x.

There is no doubt that the decomposition of a sequence into a structured part and a random part is an important and useful idea. But Koppel's algorithm for achieving it is conceptually problematic. Suppose the program/data pairs (y1,z1) and (y2,z2) both cause a Turing machine to output x, but whereas %y1%=50 and %z1%=300, %y2%=250 and %z2%=110. Since %y1%+%z1%=350, whereas %y2%+%z2%=360, (y2,z2) will not be selected in Step 1, which searches for those pairs (y,z) that minimize %y%+%z%. What if, in Step 2, (y1,z1) is chosen as the pair with maximum %y%? Then the sophistication of x will be set at %y1%=50. Does it not seem that the intuitively much more sophisticated program y2, which computes x almost as well as y1, should count toward the sophistication of x?

In the language of pattern, what Koppel's algorithm does is:

1) Locate the pairs (y,z) that are the most intense patterns in x according     to the definition of pattern with % %, a=b=1, c=0

2) Among these pairs, select the one which is the most intense pattern in x     according to the definition of pattern with % %, a=1, b=c=0.

It applies two different special cases of the definition of pattern, one after the other.

How can all this be modified to accommodate examples like the pairs (y1,z1), (y2,z2) given above? One approach is to look at some sort of combination of %y%+%z% with %y%. %y%+%z% measures the combined length of program and data, and %y% measures the length of the program. What is desired is a small %y%+%z% but a large %y%. This is some motivation for looking at (%y%+%z%)/%y%. The smaller %y%+%z% gets, the smaller this quantity gets; and the bigger %y% gets, the smaller it gets. One approach to measuring complexity, then, is to search all (y,z) such that x=y*z, and pick the one which makes (%y%+%z%)/%y% smallest. Of course, (%y%+%z%)/%y% = 1 + %z%/%y%, so whatever makes (%y%+%z%)/%y% smallest also makes %z%/%y% smallest. Hence, in this context, the following is natural:

Definition 3.12: The crudity of a pattern (y,z) is %z%/%y%.

The crudity is simply the ratio of data to program. The cruder a pattern is, the greater the proportion of data to program. A very crude pattern is mostly data; and a pattern which is mostly program is not very crude. Obviously, "crudity" is intended as an intuitive opposite to "sophistication"; however, it is not exactly the opposite of "sophistication" as Koppel defined it.

This approach can also be interpreted to assign each x a "natural program" and hence a "natural class". One must simply look at the pattern (y,z) in x whose crudity is the smallest. The program y associated with this pattern is, in a sense, the most natural program for x.

LOGICAL DEPTH

Bennett [9], as mentioned above, has proposed a complexity measure called "logical depth", which incorporates the time factor in an interesting way. The KCS complexity of x measures only the length of the shortest program required for computing x -- it says nothing about how long this program takes to run. Is it really correct to call a sequence of length 1000 simple if it can be computed by a short program which takes a thousand years to run? Bennett's idea is to look at the running time of the shortest program for computing a sequence x. This quantity he calls the logical depth of the sequence.

One of the motivations for this approach was a desire to capture the sense in which a biological organism is more complex than a random sequence. Indeed, it is easy to see that a sequence x with no patterns in it has the smallest logicaldepth of any sequence. The shortest program for computing it is "Print x", which obviously runs faster than any other program computing a sequence of the same length as x. And there is no reason to doubt the hypothesis that biological organisms have a high logical depth. But it seems to us that, in some ways, Bennett's definition is nearly as counterintuitive as the KCS approach.

Suppose there are two competing programs for computing x, program y and program y'. What if y has a length of 1000 and a running time of 10 minutes, but y' has a length of 999 and a running time of 10 years. Then if y' is the shortest program for computing x, the logical depth of x is ten years. Intuitively, this doesn't seem quite right: it is not the case that x fundamentally requires ten years to compute.

At the core of Bennett's measure is the idea that the shortest program for computing x is the most natural representation of x. Otherwise why would the running time of this particular program be a meaningful measure of the amount of time x requires to evolve naturally? But one may define the "most natural representation" of a given entity in many different ways. Bennett's is only the simplest. For instance, one may study the quantity dC(y,z) + e%z%/%y% + f(%y%+%z%), where d, e and f are positive constants defined so that d+e+f=3.     The motivation for this is as follows. The smaller %z%/%y% is, the less crude is the pattern (y,z). And, as indicated above, the crudity of a pattern (y,z) may be interpreted as a measure of how natural a representation it is. The smaller C(y,z) is, the less time it takes to get x out of (y,z). And, finally, the smaller %y%+%z% is, the more intense a pattern (y,z) is. All these facts suggest the following:

Definition 3.13: Let m denote the smallest value that the quantity

dC(y,z) + e%z%/%y% + f(%y%+%z%) assumes for any pair (y,z) such that x=y*z (assuming there is such a minimum value). The meaningful complexity of x may then be defined as the time complexity C(y,z) of the pattern (y,z) at which this minimum m is attained.

Setting d=e=0 reduces the depth complexity to the logical depth as Bennett defined it. Setting e=0 means that everything is as Bennett's definition would have it, except that cases such as the patterns (y1,z1), (y2,z2) described above are resolved in a more intuitive matter. Setting f=0 means that one is considering the time complexity of the most sophisticated -- least crude, most structured -- representation of x, rather than merely the shortest. And keeping all the constants nonzero ensures a balance between time, space, and sophistication.

Admittedly, this approach is not nearly so tidy as Bennett's. Its key shortcoming is its failure to yield any particular number of crucial significance -- everything depends on various factors which may be given various weights. But there is something to be said for considering all the relevant factors.

## 3.4 Structural Complexity

We have discussed several different measures of static complexity, which measure rather different things. But all these measures have one thing in common: they work by singling out the one pattern which minimizes some quantity. It is equally interesting to study the total amount of structure in an entity. For instance, suppose x and x% both have KCS complexity A, but whereas x can only be computed by one program of length A, x% can be computed by a hundred totally different programs of length A. Does it not seem that x% is in some sense more complex than x, that there is more to x% than to x?

Let us define the structure of x as the set of all (y,z) which are approximate patterns in x, and denote it St(x). Then the question is: what is a meaningful way to measure the size of P(x). At first one might think to add up the intensities [1+d(y*z,x)][a%y%+b%z%+cC(y,z)] of all the elements in P(x). But this approach has one crucial flaw, revealed by the following example.

Say x is a sequence of 10,000 characters, and (y1,z1) is a pattern in x with %z1%=70, %y1%=1000, and C(y1,z1)=2000. Suppose that y1 computes the first 1000 digits of x from the first 7 digits of z1, according to a certain algorithm A. And suppose it computes the second 1000 digits of x from the next 7 digits of z1, according to the same algorithm A. And so on for the third 1000 digits of z2, etc. -- always using the same algorithm A.

Next, consider the pair (y1,z1) which computes the first 9000 digits of x in the same manner as (y2,z2), but computes the last 1000 digits of x by storing them in z1 and printing them after the rest of its program finishes. We have %z2%=1063, and surely %y2% is not much larger than %y1%. Let's say %y2%=150. Furthermore, C(y2,z2) is certainly no greater than C(y1,z1): after all, the change from (y1,z1) to (y2,z2) involved the replacement of serious computation with simple storage and printing.

The point is that both (y1,z1) and (y2,z2) are patterns in x, but in computing the total amount of structure in x, it would be foolish to count both of them. In general, the problem is that different patterns may share similar components, and it is unacceptable to count each of these components several times. In the present example the solution is easy: don't count (y2,z2). But one may also construct examples of very different patterns which have a significant, sophisticated component in common. Clearly, what is needed is a general method of dealing with similarities between patterns.

Recall that Ia(v%w) was defined as the approximate version of the effort required to compute v from w, so that if v and w have nothing in common, Ia(v,w)=Ia(v). And, on the other hand, if v and w have a large commoncomponent, then both Ia(v,w) and Ia(w,v) are very small. Ia(v%w) is defined only when v and w are sequences. But we shall also need to talk about one program being similar to another. In order to do this, it suffices to assume some standard "programming language" L, which assigns to each program y a certain binary sequence L(y). The specifics of L are irrelevant, so long as it is computable on a Turing machine, and it does not assign the same sequence to any two different programs.

The introduction of a programming language L permits us to define the complexity of a program y as Ia(L(y)), and to define the complexity of one program y1 relative to another program y2, as Ia(L(y1)%L(y2)). As the lengths of the programs involved increase, the differences between programming languages matter less and less. To be precise, let L and L1 be any two programming languages, computable on Turing machines. Then it can be shown that, as L(y1) and L(y2) approach infinity, the ratios Ia(L(y1))/Ia(L1(y1)) and Ia(L(y1)%L(y2))/Ia(L1(y1)%L1(y2)) both approach 1.

Where z is any binary sequence of length n, let D(z) be the binary sequence of length 2n obtained by replacing each 1 in z with 01, and each 0 in z with 10. Where w and z are any two binary sequences, let wz denote the sequence obtained by placing the sequence 111 at the end of D(w), and placing D(z) at the end of this composite sequence. The point, as usual, is that 111 cannot occur in either D(z) or D(w), so that wz is essentially w juxtaposed with z, with 111 as a marker inbetween.

Now, we may define the complexity of a program-data pair (y,z) as Ia(L(y)z), and we may define the complexity of (y,z) relative to (y1,z1) as Ia(L(y)z%L(y1)z1). And, finally, we may define the complexity of (y,z) relative to a set of pairs {(y1,z1),(y2,z2),...,(yk,zk)} to be Ia(L(y)z%L(y1)z1L(y2)z2...L(yk)zk). This is the tool we need to make sense of the phrase "the total amount of structure of x".

Let S be any set of program-data pairs (x,y). Then we may define the size %S% of S as the result of the following process:

Algorithm 3.1:

Step 0. Make a list of all the patterns in S, and label them     (y1,z1), (y2,z2), ..., (yN,zN).

Step 1. Let s1(x)=Ia(L(y1)z1).

Step 2. Let s2(x)=s1(x)+Ia(L(y2)z2)I(L(y1)z1).

Step 3. Let s3(x)=s2(x)+Ia(L(y3)z3%L(y1)z1L(y2)z2))

Step 4. Let s4(x)=s3(x)+Ia(L(y4)z4%L(y1)z1L(y2)z2L(y3)z3))

...

Step N. Let %S%=sN(x)=sN-1(x)+

Ia(L(yN)zN%L(y1)z1L(y2)z2)...L(yN-1)zN-1)

At the k'th step, only that portion of (yk,zk) which is independent of {(y1,z1),...,(yk-1,zk-1)} is added onto the current estimate of %S%. For instance, in Step 2, if (y2,z2) is independent of (y1,z1), then this step increases the initial estimate of %S% by the complexity of (y2,z2). But if (y2,z2) is highly dependent on (y1,z1), not much will be added onto the first estimate. It is not difficult to see that this process will arrive at the same answer regardless of the order in which the (yi,zi) appear:

Theorem 3.2: If Ia=I, the result of the above algorithm is invariant under permutation of the (yi,zi).

Where St(x) is the set of all patterns in x, we may now define the structural complexity of x to be the quantity %St(x)%. This is, I suggest, the sense of the word "complexity" that one uses when one says that a person is more complex than a tree, which is more complex than a bacterium. In a way, structural complexity measures how many insightful statements can possibly be made about something. There is much more to say about a person than about a tree, and much more to say about a tree than a bacterium.

ALGEBRA AND TOPOLOGY OF PATTERN SPACE

From the definition of structural complexity we may obtain the extremely useful notion of structural similarity, which we shall also refer to as pattern-distance. As the name suggests, this is a measure of how "close" two entities are, structurally speaking. We denote the structural similarity of x and y by d#(x,y), and define it as the structural complexity of the symmetric difference of St(x) and St(y). It measures the amount of pattern in x but not y, or in y but not x. This concept will play a central role in our treatment of memory and analogy.

The following concept, though it refers only to structure, not structural complexity, is equally essential.

Definition 3.14: The emergence between x and y is defined as

Em(x,y) = St(xUy) - St(y) - St(y)

The intuitive meaning of emergence should be clear: it is what is present in the whole but not the parts. In the realm of pattern, the whole is in general more than the sum of the parts, in the sense that not all the patterns in the whole are patterns in the parts.

Finally, the following idea, though not so crucial, sheds some light into certain matters.

Definition 3.15: (y1,z1) is said to be complementary to (y2,z2) in x tothefollowing extent:

1 - IN(x,y1,z1)/[IN(L(y1),y2,z2) + IN(z1,y2,z2)]. If y1 is complementary to y2 in x and y2 is complementary to y1 in x, then y1 and y2 are said to be complementary in x.

Complementarity, intuitively, is a very weak form of negation. If y1 is highly complementary to y2 in x, that means that although one can effectively represent x in terms of either y1 or y2, once one has represented x in terms of y1, one cannot effectively represent the elements of this representation in terms of y2. If y1 and y2 are both entirely "independent" in St(x), this will usually be the case.

Crude intuitive examples of this phenomenon may be drawn from nearly any field of study. For instance, in quantum mechanics one may represent an electron as a wave or as a particle, but once one has represented it as either one cannot interpret one's representation in terms of the other one. Or: one may view the American economy as a battleground of class struggle, or as an arena of gradual evolution by natural selection through competition -- but once one has diagrammed the economy in terms of conflicting classes, one cannot analyze the diagram in Social Darwinist terms; and vice versa. Of course, these are little more than analogies; in order to make such applications at all precise one would have to provide exact definitions of the terms involved.

COMPUTABLE APPROXIMATIONS

The structural complexity of x is a measure of the total size of the set of all regularities in x. Unfortunately, as I noted above, it is uncomputable. Step 0 of Algorithm 3.1, in its full generality, is not programmable. Therefore, in order to apply the notion of structural complexity to real-world problems, it is necessary to restrict oneself to some particular class of patterns. One way to do this is via schematic structural complexity (SSC), a simple notion that will lead us naturally to more interesting complexity measures such as the Lempel-Ziv complexity and the n'th order Boolean complexities.

In the theory of genetic classifier systems (Goldberg, 1989), a schema is a sequence such as 1001**1011101010*1**111, where * is a "don't care" symbol signifying that either a 0 or a 1 may occupy the indicated place. Let us consider a slightly more general notion of schema, in which each "don't care" symbol has a tag (*1,*2, etc.), and each "don't care" symbol may stand for a binary sequence of any length. And let us define a schematizer as a function which maps schema into binary sequences, by inserting in place of each occurence of the "don't care" symbol *i some binary sequence wi. Each pair (schematizer, schema) is a schematic program; it defines a unique binary sequence.

The schematic structural complexity (from here on, SSC) of a binary sequence may be described in terms of Algorithm 3.1 with a modified initial step.

Algorithm 3.2:

Step 0'. Make a list of all schematic programs that are patterns in x, and label them y1,...,yN

Steps 1-N. As in Algorithm 3.1

The result of this algorithm may be called the schematic size of x, the well-definition of which follows from the following generalization of Theorem 2.1.

Theorem 3.1. The result of Algorithm 4.1 is invariant with respect to permutation of the yi.

In analogy to structural complexity, the SSC of a sequence x may be defined as the schematic size of the set of all patterns in x, S(x).

A schematic program represents the simplest sort of compression: it compresses an image or sequence by abstracting repeated figures. A more flexible approach may be obtained by considering more general programs. For instance, one might define a metaschematizer as a map from schema to schema, which takes in a schema and replaces each occurence of the "don't care" symbol *i with some schema Si. A metaschematic program is then any program whose action may be represented in the form schematizer(m1(m2(...(mk(schema))...)), where the mi are metaschematizers.

Given this, one may define the "metaschematic structural complexity" of a sequence in an obvious way. This complexity measure recognizes not only repeated figures, but repeated figures within repeated figures and so on. It is new in detail but not in concept -- it is a very close approximation to the Lempel-Ziv complexity (Lempel and Ziv, 1978).

The Lempel-Ziv complexity misses a great deal of structure. By definition, no computable approximation to the structural complexity can capture every type of structure. However, there is a large gap between repetitions and completely general patterns. One way to fill this gap is with the n'th order Boolean structural complexity. This complexity measure may be developed by analogy to the schematic and metaschematic structural complexities.

We will need to use vector Boolean operations, which act coordinatewise on Boolean sequences. For example the vector negation of 1010110 is 0101001; the vector conjunction of 11010 and 01011 is 11011. And we will say that a Boolean function (of an arbitrary number of variables) is of order n if it can be written using less than n+1 disjunctions and conjunctions.

An n'th order Boolean schematic program for computing an array x may be defined as a pair (schema, schematizer), where the schema is a sequence composed of of 0's, 1's, unlabeled "don't care" symbols *, and k different labeled "don't care" symbols *1,...,*k. The schematizer consists of: 1) a memory, which consists of a set of l binary sequences, and 2) a collection of k vector Boolean functions f1,...,fk of order n, each of which takes exactly l sequences as arguments. The k'th array Boolean function computes the array to be substituted for the "don't care" symbol *i.

From the n'th order Boolean schematic programs one may obtain n'th order Boolean metaschematic programs in a natural way. The n'th order Boolean structural complexity is then defined in terms of these programs, just as the metaschematic structural complexity is defined in terms of ordinary metaschematic programs. It should be clear that any pattern can be expressed as an n'th order Boolean schematic program for some n. Therefore, the n'th order Boolean programs are one way of forming a bridge between Lempel-Ziv complexity and general structural complexity.

These approximations are rather technical; they are not as conceptually elegant as the structural complexity itself. However, it is clear that at least the n'th order Boolean complexity is relevant to mentality, because no brain can compute Boolean functions of arbitrarily high order.