The Evolving Mind -- Copyright Gordon and Breach © 1993
The argument-structure of this book is a complex one. We shall begin with the evolution of immune systems, then turn to the evolution of species, and only then return to our main topic of interest, evolutionary processes in the mind and brain.
But before we can speak about evolution at all, we will need to discuss a few important background ideas, mostly from theoretical computer science. These ideas are not usually mentioned in connection with evolutionary theory. However, I believe that they are absolutely essential to the understanding of the self-organizing processes involved in evolutionary systems.
Let us begin with a simple yet crucial ontological axiom, which Gregory Bateson called the "Metapattern."
Recall that, more than two millennia ago, Democritus suggested that every physical object is made of "atoms." By "atom," he simply meant "something so small you can't break it down into component parts." What we today call an atom is not a Democritan atom, because it can be broken down into protons, neutrons and electrons (and protons and neutrons can in turn apparently be broken down into quarks -- there is no telling how far down this hierarchy extends; in fact, the late Stanislaw Ulam proposed that inside every particle, no matter how small, there is some smaller particle). Bateson's Metapattern states, roughly, that patterns are the Democritan atoms of the biological world: that everything having to do with biological systems can be explained in terms of patterns -- and that, on the level of biological, self-organizing systems, patterns cannot be "decomposed" into anything besides more patterns.
Bateson applied this idea, this Metapattern, to the study of cultural and biological evolution, and human learning. However, he was not the first to sense its importance. Benjamin Whorf used it as a tool for probing the intricacies of psycho-anthropolinguistic structures. And the American philosopher Charles S. Peirce -- most noted for his theory of "pragmatism," which William James borrowed and altered -- founded his vast metaphysical/epistemological/semiotic system on a very similar concept. He used the word "habit" rather than "pattern," but the core idea was the same.
Although the Metapattern is a very simple idea, it is far from trivial. It leads to a fundamentally novel perspective on important biological issues. To see this, one need merely compare the present treatment of evolution with the thermodynamics-based approach pursued by, e.g., Jantsch (1984) and Prigogine, Nicolis and Babloyantz (1972). The pattern-theoretic approach does not exclude thermodynamic concepts such as entropy, phase transition and so forth -- we will discuss a certain "phase transition" in Section 7.2. But neither does it emphasize them. Similarly, the thermodynamics-based approach does not inherently exclude pattern-based principles of organization; it simply does not give a language in which to speak of them. This comparison will be pursued further in Chapter 3.
1.1 A FORMAL THEORY OF PATTERN
In order to make practical use of the Metapattern, we will require a better idea of what a pattern actually is.
What is meant intuitively by a "pattern" is, essentially, a representation as something simpler. When one represents x in terms of something simpler than x, one has obtained a pattern in x. A representation is a kind of relation, between the representer and the thing represented. So, as one would expect, the Metapattern reduces self-organization to certain relations between entities.
Let us be more precise. Assume that the entity z represents the entity x. Then, since z is not x, to see that z corresponds to x requires some process of transformation, translation or interpretation. Let us call this process y. Then, we shall say that a process-entity pair (y,z) is a pattern in an entity x if
1) applying the process y to the entity z yields the entity x
2) the complexity of the process y, plus the complexity of the entity z, plus the complexity of applying y to z, is less than the complexity of x
Intuitively, what this means is that the process-entity pair (y,z) is a short-cut to x.
It is not difficult to formalize this definition. First of all, we may call the result of applying the process y to the representation z, y*z. The symbol "*" is thus being used to denote the operation of applying a process to an entity. If z is a representation of x, and if y reveals this, then we must have y*z = x. Next, let us assume that we have some way of measuring how complex an entity or a process is. For any entity or process G, let |G| denote the complexity of G: the larger |G| is, the more complex G is, and the less simple it is. For simplicity, we will assume that complexities are non-negative real numbers. Finally, let us assume that we have some way of measuring how hard it is to apply a given process to a given entity. Let us denote the difficulty (or complexity) of applying y to z by C(y,z). Then, we may say that a process-entity pair (y,z) is a pattern in x if
y*z=x
and
|y| + |z| + C(y,z) < |x|.
This says that the representation of x in terms of z yields an increase in simplicity, even when all auxiliary processes and entities are taken into account.
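As a toy illustration of the two conditions -- with string length standing in for the complexity measure, one crude choice among many, and the names x, y, z mirroring the definition above:

```python
# x: a long repetitive entity; z: a short representation of it;
# y: the process that expands z back into x, written out as program text
x = "ab" * 10000                  # the entity: 20,000 characters
z = "ab"                          # the simpler representation
y_src = "lambda s: s * 10000"     # the process y, as a short program
y = eval(y_src)

assert y(z) == x                  # condition 1: y*z = x

# condition 2: |y| + |z| + C(y,z) < |x|, taking complexity = length
# and treating the cost C(y,z) of applying y as a small constant
C_yz = 1
is_pattern = len(y_src) + len(z) + C_yz < len(x)   # (y,z) is a pattern in x
```

Here the left-hand side comes to a few dozen, against twenty thousand on the right: the representation yields an enormous increase in simplicity.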
To see the potential difficulties with this formalism, let us preview an example which will be considered in more detail in Chapter 5. Suppose that x is a picture of a flower on a television screen. To store this picture in computer memory requires a great deal of memory space. However, Michael Barnsley (1988) has devised a clever technique for compressing this picture into a very small computer file. Let us call this small file z. The process y, then, is the process of decompression, of expanding the full picture from the file. C(y,z) is the difficulty of expanding the full picture from the compressed file, which is not very great. Now, y*z, the decompressed file, may not be exactly the same as the original picture x. Nonetheless, we should like to say that the process has achieved a representation of the flower as something simpler.
So, to make our formalism fit with intuition, we will have to introduce another process y' defined by y'*z = touch-up * ( y*z ), where touch-up is a process that corrects the errors in y*z and turns it into x. It is (y',z) which is actually a pattern in x. But if not much touching up is required -- if y' can be computed from y without much work (in the language to be introduced below, if I(y'|y) is small) -- then we may say that (y,z) is very close to a pattern in x. It is approximately a pattern, but not exactly.
Another way to deal with this is to redefine the criterion for pattern to say that
[1 + d(y*z,x)] [|y| + |z|] < |x|,
where d(y*z,x) is some measure of the distance between y*z and x (e.g. the proportion of bits different, or some constant multiple thereof).
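The relaxed criterion is easy to state in code. In this sketch d is taken as the proportion of differing characters, one of the measures just suggested, and the complexities of y and z are passed in as plain numbers:

```python
def d(a, b):
    # distance between two strings: the proportion of positions
    # at which they differ
    return sum(c1 != c2 for c1, c2 in zip(a, b)) / max(len(a), len(b))

def is_approximate_pattern(y_of_z, x, y_len, z_len):
    # the relaxed criterion: [1 + d(y*z, x)] [|y| + |z|] < |x|
    return (1 + d(y_of_z, x)) * (y_len + z_len) < len(x)
```

A reconstruction that differs from x in one character out of two thousand barely inflates the left-hand side, so a genuinely short (y,z) still qualifies.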
When, in the following pages, we speak about minds or brains or antibodies "recognizing patterns," it is to be understood that these patterns are not necessarily exact: all biological systems make some errors, so that the patterns involved may be approximate.
1.2 ALGORITHMIC INFORMATION
If the world of self-organizing systems is composed of patterns, and a pattern is a representation as something simpler, then a tremendous importance must be attached to the question: what is simplicity? Or, equivalently: what is complexity?
Obviously, what is simple to me may not be simple to you, and vice versa. In this sense, pattern is not an objective quantity. For instance, I might represent a certain painting to myself as a portrayal of the skyline of a city. But to a native of the jungle, unfamiliar with cities or perspectival drawing, the painting might look like a random jumble of lines and angles, or it might have some other form of order, but it would certainly not look like the skyline of a city.
The subjectivity of pattern cannot be ignored. However, it is not entirely irrational to hope for an objective ground from which various subjectivities can be compared. One may make a valid argument that any such "objective" ground is also subjective in some way (Goertzel, 1992d), but let us put this objection aside and proceed.
There are many ways to objectify the concept of complexity. However, given the various strengths, weaknesses and prejudices of modern science, one strategy is by far the most convenient: we shall approach the problem of a general, "objective" definition of complexity via the concept of universal computation. This is a relatively simple way of doing things, and it has the merit of connecting the concept of pattern with recent work in computer science, logic and dynamical systems theory.
A universal computer is, simply, a general-purpose computer. It is not a special kind of computer. The Cray supercomputer, with its immense speed and its vector processing capabilities, is no more of a universal computer than the IBM PC. The distinguishing characteristic of a universal computer is that, given
1) enough peripheral memory (e.g. floppy disc drives), and
2) the right instructions,
it can deliver a perfect simulation of any other computer. By a "perfect simulation," I mean that, given any instruction, the simulating computer will give exactly the same response as the simulated computer would.
One might at first wonder: but if an IBM PC is a universal computer, and can therefore simulate a Cray or any other computer, then why build huge computers like Crays at all? The problem is, of course, that the length of a program matters, as does the time a program takes to run.
To give a very crude example, consider that an IBM 386 runs around four times as fast as an IBM PC, so that even if one simulates a 386 on a PC, the same processes will always take at least four times as long. Or, consider that a Cray has vector processing and can therefore do, say, 500 things at once. An IBM PC can do only one thing at a time. The IBM PC can be programmed, given sufficient memory, to simulate the Cray; but every set of 500 things that the Cray does at one time, it will have to do one after the other. Or, finally, consider a computer like the Commodore Amiga which has sophisticated sound and graphics features built in. Every music video which one can make on the Amiga, one can make also on the IBM PC. But the program will be much longer on the IBM, because when on the Amiga one would simply refer to a "hard-wired" function, on the IBM one must spell everything out. We build faster computers, vector processing computers, and specialized computers not because they can do more than simpler, slower computers, but because they can do more things with less programming effort, or within a shorter period of time.
The simplest model of a universal computer was also the first: the universal Turing machine. This consists of a very long -- "potentially infinite" -- tape which is marked off into squares. All the squares are the same size, but some of them are marked "1" and some of them are marked "0". In addition to the tape, there is a moving tape head, which can
1) look at one square at a time and determine whether it is marked "0" or "1";
2) move to the left or to the right along the tape;
3) change the number of a square on the tape.
And, finally, there is a processing unit. Each time the tape head reads something, the processing unit takes this into account and tells the tape head what to do next: e.g. it might give the instruction "move five squares to the right and change the number."
The processing unit may be considered to contain a list of instructions such as "if the last three squares read contained ones, and the squares to their left contained zeros, then move three squares to the right, and if there is a one there, replace it with a zero." It is obviously very tiresome to program a machine with a long list of instructions of this sort, but it is not all that different from what real-life programmers do when they program in "assembly language." If a higher-level language such as BASIC or FORTRAN or PASCAL is used, its interpreter or compiler translates it into assembly language before it is run, so that all programs on real computers are in effect lists of this sort.
It turns out that this remarkably simple device, the Turing machine, if its processing unit is equipped with the proper program, can function as a universal computer: it can simulate any computer whatsoever with perfect accuracy. The potentially infinite tape takes the place of random-access memory, screen, keyboard and floppy disk drives: it serves as a "scratch pad" or working memory, a place from which to read instructions, and a place on which to write replies. As one might imagine, the Turing machine is not particularly speedy; the tape head will often have to make very long trips. But given enough time and appropriate instructions, it can simulate any computer whatsoever.
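The tape-head machinery just described can be simulated in a few lines. The rule format and the particular rule list below are my own illustration, not anything from the text:

```python
def run_tm(rules, tape, state="start", steps=10000):
    # rules maps (state, symbol read) -> (symbol to write, move, next state);
    # tape is a dict from square index to 0/1, absent squares reading as 0
    head = 0
    for _ in range(steps):
        if state == "halt":
            break
        symbol = tape.get(head, 0)                # read the current square
        write, move, state = rules[(state, symbol)]
        tape[head] = write                        # change the number of the square
        head += move                              # move left (-1) or right (+1)
    return tape

# a tiny rule list: walk right, marking each square, until a 1 is found
rules = {
    ("start", 0): (1, +1, "start"),
    ("start", 1): (1, 0, "halt"),
}
```

Starting this machine on a tape with a single 1 at square 3 fills squares 0 through 3 with ones and halts; a real universal machine differs only in having a vastly longer rule list.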
Now, how does all this help us to define complexity? The Kolmogorov-Chaitin-Solomonoff (KCS) definition of complexity says roughly that the complexity of x is the length of the shortest program for computing x. More precisely, it is better to say that the complexity of x is the length of the shortest self-delimiting program for computing x, where a self-delimiting program is one which contains, at the beginning, a message saying how long it is (Chaitin, 1987).
At first there seems to be a problem here: the shortest program on what computer? But a universal computer U can simulate any other computer V, if it is supplied with the appropriate program. Let us call this program SimUV. Then the length of the shortest program for computing x on U is no greater than the length of the shortest program for computing x on V, plus the length of the program SimUV. So, as long as the KCS complexity of x on both U and V is much greater than the KCS complexity of SimUV, it matters very little whether one is talking about computation on U or on V.
Of course, there is always the possibility that part of the hardware of computer V is a simple copy of the entity x. In this case it is possible that most of the program SimUV would in fact be taken up by the copy of x, so that it would matter whether one was talking about computation on U or on V: computation on V would be vastly superior. To compute x on V, one would have to give one command to summon x from hardware memory, but to compute x on U, one would have to either start from scratch or simulate V, which would be essentially the same thing.
For this reason, the theory of KCS complexity is only strictly applicable to "extremely large" entities x, to entities which are too large to be hard-wired into any of the computers under consideration. But of course this is a relative concept, as theoretically one may consider an arbitrarily large computer. Additionally, it has been proven that the KCS complexity is literally impossible to compute. There can never be a general algorithm for finding the shortest program computing a given entity _ the very concept of such an algorithm is a logical contradiction.
The KCS complexity is not a good general measure of complexity. In what follows I will discuss some of the problems with it, and give a few methods for circumventing them. But, nonetheless, the KCS complexity is indispensable for the study of complex systems.
Finally, we will often need to measure the distance between two entities. Conceptually, we wish to define the distance between v and w as the average of the difficulty of obtaining v from a complete knowledge of w, and the difficulty of obtaining w from a complete knowledge of v. Formally, for any two entities v and w, let I(v|w) denote the smallest value that |y| + C(y,w) takes on for any program y which computes v when its input consists of a minimal-length program for computing w. Note that I(w|v) and I(v|w) will not generally be equal, so that neither of the two is a good measure of the distance between w and v. One desires the distance between w and v to be the same as the distance between v and w. To attain this goal one may define d(w,v) = [I(w|v) + I(v|w)]/2. This is the measure of distance with which we shall work.
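Since the conditional quantity I is, like the KCS complexity itself, uncomputable, any working example must substitute a computable stand-in. A common trick, used here as my own illustration, is to let a general-purpose compressor play the role of the shortest program:

```python
import zlib

def K(s):
    # compressed length: a crude, computable stand-in for complexity
    return len(zlib.compress(s.encode()))

def dist(v, w):
    # symmetric distance in the spirit of the averaged definition above:
    # the conditional complexity of v given w is approximated by how much
    # more it costs to describe w followed by v than to describe w alone
    return ((K(w + v) - K(w)) + (K(v + w) - K(v))) / 2
```

Two copies of the same repetitive string come out far closer to each other than either is to an unrelated string, which is the qualitative behavior the definition demands.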
1.3 COMPUTATION
We have just passed rather rapidly from the general, philosophical theory of pattern to the theory of computing machines. This move was essential because, in every one of the following chapters, we will be discussing models involving binary sequences and computable functions mapping them. The underlying philosophical concepts will concern only pattern, not computation -- but the illustrative examples will be computational. Furthermore, many of the mathematical results will involve computational ideas, and a number of crucial arguments will rely upon the results of computer simulations.
It is therefore worth pausing for a moment to ask why this step is justified. Why is it reasonable to model evolving systems as computers? This question of justification will become particularly relevant in Chapters 4 and 6, where we will be dealing with evolution in the brain and mind. Few have ever worried about whether an ecosystem can be appropriately modeled as a computational system; however, the question whether brains or minds can be so modeled has been hotly debated for decades.
Classical and Quantum Computation
As I pointed out in The Structure of Intelligence, Deutsch's (1985) recent work shows that much of the brain/mind/computation controversy is misguided. Deutsch has proposed the "quantum computer" as an alternative to the Turing machine as a universal model of computation. And he has shown that -- according to the known principles of quantum physics -- the quantum computer is capable of simulating any finite physical system to within any degree of accuracy.
Okay, you may be thinking -- so one of these quantum computers can simulate any physical system, but what about a plain old Turing machine computer? But that's the best part -- Deutsch has shown that, while a quantum computer can do everything an ordinary computer can, it cannot compute any functions besides those which an ordinary computer can compute. Quantum computers do have certain unique properties, such as the ability to generate truly random numbers, rather than the pseudo-random numbers produced by a digital computer's "random number generator." But they cannot compute any novel functions. Any function that manifests itself in a finite physical system can be approximated arbitrarily well on a Turing machine.
Penrose (1989) finds this result disappointing, because he would prefer that the brain not be modelable as a Turing machine. So he has hypothesized that, once quantum theory is replaced by a Grand Unified Theory that synthesizes quantum mechanics with general relativity, Deutsch's result will no longer hold. But this is a very weak line of argument -- although quantum theory is surely not final, we have no way of knowing what the theory that supplants it will be like. Ironically, Penrose's own attempt at a Grand Unified Theory, his intriguing "twistor theory," is very computational in concept: its ultimate goal is to explain all of physics in terms of the combinatorial properties of finite graphs.
Because of Deutsch's theorems, the assertion that evolving systems can be modeled as quantum computers is not a psychological hypothesis but a physical, mathematical fact. And so is the assertion that all functions computed by evolving systems can be expressed in terms of Turing machines.
In SI I argue that, in fact, nearly all of the structures and processes found in evolving systems are completely explicable in terms of ordinary digital computation. There is, I argue there, only one exception: consciousness, which relies on the peculiar nonlocality properties of quantum systems. In the present book, however, we shall not need to deal with the problem of consciousness; we shall stay within the safer realm of digital computation.
A Conceptual Fallacy
The concept of quantum computation resolves the question of whether physical systems such as brains can be modeled in computational language. But to me, it is equally interesting to ask why anyone would doubt the validity of such modeling in the first place. What is so unsettling about the idea that physical phenomena are at bottom computable?
Consider the following passage, from noted author Alvin Toffler's Foreword to Prigogine and Stengers' excellent book Order Out of Chaos:
The ideas of the Brussels school, based heavily on Prigogine's work, add up to a novel, comprehensive theory of change. Summed up and simplified, they hold that while some parts of the universe may operate like machines, these are closed systems, and closed systems, at best, form only a part of the physical universe. Most phenomena of interest to us are, in fact, open systems, exchanging energy or matter (and, one might add, information) with their environment. Surely biological and social systems are open, which means that the attempt to understand them in mechanistic terms is doomed to failure.
This suggests, moreover, that most of reality, instead of being orderly, stable and equilibrial, is seething and bubbling with change, disorder and process.
In Prigoginian terms, all systems contain subsystems, which are continually "fluctuating." At times, a single fluctuation or a combination of them may become so powerful, as a result of positive feedback, that it shatters the preexisting organization. At this revolutionary moment -- the authors call it a "singular moment" or a "bifurcation point" -- it is inherently impossible to determine in advance which direction change will take: whether the system will disintegrate into "chaos" or leap to a new, more differentiated, higher level of "order" or organization, which they call a "dissipative structure." (1984, p.xv)
The problem with Toffler's claims about "mechanism" is that Prigogine's work involves nonlinear differential equations which can only infrequently be solved analytically. In order to understand how these differential equations behave, it is usually necessary to combine mathematical results with numerical results obtained from computer routines. In fact, the whole theory of chaotic dynamics (bifurcations, phase transitions) to which Toffler alludes owes its very existence to the advent of modern computers, which allowed obstinate nonlinear differential and difference equations to be systematically studied for the first time.
The problem here is a confusion of two concepts. Open systems are not "machines" in the same thermodynamic sense that a Carnot engine is a machine. And complex systems like brains, immune systems, ecosystems and test tubes full of chaotically fluctuating chemicals are very, very different from more intuitively "mechanical" systems like personal computers, auto engines, waterwheels, battery-powered radios and bicycles. But, if a system can be modeled by the differential equations of physical chemistry, then it can be modeled to within arbitrary accuracy by computable difference equations. And from this it follows that the system is, to within the accuracy of any possible experimental test, modelable as a Turing machine program.
To say it once again: the modeling of the brain (or any other physical system) using computational language needs no further justification. Penrose's speculations aside, there is no scientific reason to doubt the general physical applicability of Turing machine computable functions. The Turing machine is an incredibly general model, and it is beside the point that the words "computer" and "machine" have historically been associated with narrow-minded theories of psychology, neuroscience and system behavior.
1.4 LOGICAL DEPTH
So, the theory of computation will be our fundamental context of analysis. The set of binary sequences and computers will be our basic space of entities. But there are many different ways of structuring this space. For example, there are many different ways of measuring the complexity of sequences and computers.
The KCS complexity is one way of measuring the complexity of static objects such as number sequences. But, as I mentioned above, it does have certain shortcomings. There are many entities which look very complicated -- which are, structurally, very complicated -- and yet can be generated by short programs. Fractals and cellular automata, to be considered shortly, are two examples of this. Another example, less striking but less elaborate, is the number pi.
If x is the first ten billion digits of pi, then the KCS complexity I(x) is very small. We may write I(x) = C + log10(10,000,000,000) = C + 10. Here C is the length of the shortest program P(n) that, given a number n, generates the first n digits of pi.
In terms of the definition of pattern, we may let y = P(n) and z = 10,000,000,000. If we define the complexity of a program as its length and the complexity of a number as the length of its decimal representation, so that C = |P(n)| and |z| = 10, it is clear that |y| + |z| = C + 10 < |x| = 10,000,000,000. We may or may not know the shortest program P(n) for computing the first n digits of the decimal expansion of pi, but we do know many formulas -- e.g. pi = 4(1 - 1/3 + 1/5 - 1/7 + ...) -- which can be transformed into universal Turing machine programs of length many orders of magnitude less than ten billion.
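For concreteness, here is one such short program. It is a sketch using Machin's formula rather than the slowly converging series just quoted, and the guard-digit count is a rough choice of mine rather than anything from the text:

```python
def pi_digits(n):
    # first n decimal digits of pi, via Machin's formula
    #   pi = 16*arctan(1/5) - 4*arctan(1/239),
    # evaluated in fixed-point integer arithmetic with 10 guard digits
    def arctan_inv(x, prec):
        # arctan(1/x) * 10**prec by the alternating Taylor series
        total = term = 10**prec // x
        k, sign = 3, -1
        while term:
            term //= x * x
            total += sign * (term // k)
            k, sign = k + 2, -sign
        return total
    prec = n + 10
    pi = 16 * arctan_inv(5, prec) - 4 * arctan_inv(239, prec)
    return str(pi)[:n]
```

A few hundred characters of program text thus suffice for any n; what grows with n is not the program's length but its running time, which is exactly the point of the next section.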
The first n digits of pi, where n is ten billion, ten trillion or one googol (10^100), have a fairly low algorithmic complexity. However, they take a long time to compute.
Bennett defines the logical depth of a sequence as the running time of the shortest program which computes it. Thus, the logical depth of the sequence consisting of the first ten billion digits of pi is very high, because even though there are short programs for computing this sequence, these programs take on the order of ten billion steps to run.
It is important to remember that there are indeed fast-running programs for computing the first ten billion digits of pi. After all, one could write a program which contained a list of the first ten billion digits of pi. But this program would not be the shortest program -- not even close.
A truly random sequence x would have a very low logical depth, because the program containing a list of x and the command "print this list" would be the shortest program for computing x; and this program has a minimal running time.
One may conceive examples in which this crude approach to logical depth is counterintuitive. For instance, suppose program A has length 1000 and computes x in ten years, but program B has length 1001 and computes x in thirty seconds. Then it would be absurd to assign x a logical depth of ten years. One way to avoid this is to consider, instead of the running time of the shortest program, the running time of the program for which the average of program length and running time is at a minimum.
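The fix is simple to state as code. This toy illustration uses made-up (length, running time) pairs for the two hypothetical programs:

```python
def crude_depth(programs):
    # programs: (length, running_time) pairs for programs computing x.
    # Crude logical depth: the running time of the shortest program
    # (tuple comparison picks the smallest first element, i.e. length)
    length, time = min(programs)
    return time

def tempered_depth(programs):
    # the fix: the running time of the program minimizing the
    # average of program length and running time
    length, time = min(programs, key=lambda p: (p[0] + p[1]) / 2)
    return time

# program A: length 1000, runs ten years; program B: length 1001, thirty seconds
programs = [(1000, 10 * 365 * 24 * 3600), (1001, 30)]
```

On these two candidates the crude measure assigns x a depth of ten years, while the tempered measure assigns it thirty seconds.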
1.5 FRACTALS
Figs. 1-5 are examples of fractal sets. So much has been written about fractals over the last ten years, in both the popular press and the technical literature, that there is no need to dwell on the matter here. The interested reader is referred to Mandelbrot's Fractal Geometry of Nature (1982), the book which made fractals popular; for a mathematically rigorous treatment, to Falconer's Fractal Geometry (1989); or, finally, to Michael Barnsley's undergraduate text Fractals Everywhere.
Much of the interest in fractals stems from their curious "dimensional" properties. For instance, the Sierpinski triangle shown in Figure 1 is in a certain sense fuller than a line drawing but emptier than a filled-in square. This sense is captured in mathematical language by the concept of Hausdorff dimension. Whereas a line has Hausdorff dimension 1, and a plane has Hausdorff dimension 2, the von Koch snowflake curve has dimension log 4/log 3 = 1.2619.... A "fractal" is often defined as an object whose Hausdorff dimension is not an integer.
Here, however, it is not the dimensional properties of fractals that interest us, but rather their informational aspect. Therefore, we are not exactly interested in fractals in the technical sense of Hausdorff dimension. Rather, we are interested in self-similar sets, regardless of whether or not they have noninteger Hausdorff dimension. A self-similar set is a set which is made of parts that "look exactly like the whole." Most of the fractals commonly portrayed are self-similar sets; and most of the interesting self-similar sets are fractals.
More precisely, let us say that a set R is similar to a set S if, by expanding or shrinking R, one can obtain a set which is identical to S except for position (more rigorously: if S is mapped onto R by some similarity transformation). Then we shall say that a set S is self-similar if it is made up of n parts S1,...,Sn, each of which is similar to S. If the parts are only approximately similar to S, we may say that S is approximately self-similar.
As Figures 2 and 3 show, self-similar sets can have extremely complex and beautiful structures. They can mimic very closely the structures of natural phenomena such as mountains, clouds and leaves. They can be generated by very simple rules, so that their KCS complexity is low. However, their depth complexity is not so low -- it appears that many self-similar structures must, like the decimal expansion of pi, be computed "step by step."
For instance, images like the "viscous fingers" of Figure 5 -- an approximately self-similar set -- may be computed as follows. Start with one dot in the middle of the computer screen. Then put another dot at a "random" point on the edge of the screen, and let it walk around the screen, at each step proceeding "at random" from one spot to an adjacent spot. Keep doing this until the second dot walks onto a spot next to the first dot. Then introduce another dot: let it do a similar "random walk" until it walks onto a spot next to one of the two previous dots. Continue this for a few million dots.
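The procedure just described (a form of diffusion-limited aggregation) can be sketched as follows; the grid size, particle count and wrap-around boundary are my own simplifications of the "screen":

```python
import random

def viscous_fingers(size=41, particles=15, seed=1):
    # each new dot random-walks until it lands next to the growing cluster
    random.seed(seed)
    grid = [[False] * size for _ in range(size)]
    grid[size // 2][size // 2] = True      # one dot in the middle of the screen
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    def next_to_cluster(x, y):
        return any(grid[(y + dy) % size][(x + dx) % size] for dx, dy in moves)
    for _ in range(particles):
        # start at a "random" point on the edge of the screen
        x, y = random.choice([(random.randrange(size), 0),
                              (random.randrange(size), size - 1),
                              (0, random.randrange(size)),
                              (size - 1, random.randrange(size))])
        while not next_to_cluster(x, y):
            dx, dy = random.choice(moves)
            x, y = (x + dx) % size, (y + dy) % size   # one step to an adjacent spot
        grid[y][x] = True                  # the walker sticks
    return grid
```

Even at this toy scale the essential cost is visible: each dot must wander over a significant portion of the grid before it sticks.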
The word "random" has quotes around it here because a digital computer can never truly act randomly; it is always following a program. However, it can simulate randomness well enough, with the aid of standard "pseudorandomness" techniques.
Pseudorandomness can be generated by a fairly short program. Hence the viscous fingers have a low KCS complexity. However, they appear to have a large logical depth. There are tricks for speeding up the generation of viscous fingers, but there seems to be no way to avoid having each dot do its own random walk over some significant portion of the screen; and this takes a large amount of time.
For a deterministic example, let us consider the affine IFS's -- iterated function systems -- made popular by Michael Barnsley. The general idea of image generation by iterated function systems is to take a set or "system" of functions w1, w2,...,wn -- say, functions mapping the plane into the plane -- and consider the process W which maps any set B in the plane into the set
W(B) = w1(B) ∪ ... ∪ wn(B)
That is, one determines the set where w1 maps B, the set where w2 maps B, and so on up to wn, and then one pieces these sets together into one set. Then, one does this over and over again -- one constructs W(W(W(...W(B)...))), for some set B. If the wi meet certain technical conditions -- e.g. if they are "contraction maps" -- then the result of applying the process W infinitely many times will be a self-similar set, a fractal.
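As a concrete sketch, one can iterate a single random point through randomly chosen maps of a system (the "chaos game"). The Sierpinski-triangle maps are used here, since their coefficients are standard: each wi moves a point halfway toward one vertex of a triangle, so each is a contraction map:

```python
import random

def chaos_game(n=5000, seed=0):
    # IFS for the Sierpinski triangle: three contraction maps, each
    # moving the current point halfway toward one vertex
    random.seed(seed)
    verts = ((0.0, 0.0), (1.0, 0.0), (0.5, 1.0))
    x, y = 0.5, 0.5
    points = []
    for _ in range(n):
        vx, vy = random.choice(verts)      # pick one map wi at random
        x, y = (x + vx) / 2, (y + vy) / 2  # apply the contraction
        points.append((x, y))
    return points
```

Plotting the returned points reproduces the Sierpinski triangle of Figure 1; swapping in a different set of affine maps produces a different self-similar set by exactly the same iteration.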
For instance, Barnsley's "Black Spleenwort Fern," shown in Figure 3, can be generated by applying the following system of functions to any plane set B.
Given these equations -- and a computer -- it is very simple to generate a pictorial image of the Black Spleenwort Fern. One way to do this is to start with a randomly chosen region, and repeatedly iterate the process W described above. Another way is to start with a random point, and at each discrete time step apply one of the three equations, selected at random.
The important thing is that intricate, biological-looking forms such as those shown in Figures 3 and 4 can be generated by simple processes which may be summarized in compact mathematical formulas. To me, at least, it is truly amazing that such an excellent simulation of biological form can be so generated. The algorithmic complexity of biological objects may not be so high as appearance suggests. If this is indeed the case, then it is hardly surprising that so much of the information required to specify an entire organism can be encoded in a tiny string of DNA.
So, the omnipresence of fractals implies the omnipresence of apparently complex structures which are algorithmically simple. The depth complexity of fractal structures is another matter entirely. Viscous fingers, as mentioned above, appear to have a very high depth complexity. And the depth complexity of fractals generated by IFS's may vary a great deal, from extremely low to impractically high.
1.6 CELLULAR AUTOMATA
Cellular automata were first conceived by John von Neumann (1966), as part of an attempt to resolve an old biological question _ the nature of self-reproduction. Until the advent of molecular biology it was not uncommon to hear claims that self-reproduction was somehow incomprehensible in terms of the laws of physics, that it required some "vital force." Von Neumann set out to destroy this myth once and for all by designing a self-reproducing machine. He understood that it would be a long time before technology permitted the construction of such a machine. But nonetheless, he realized that the principles required were quite simple.
At first he considered a mobile robot assembling a replica of itself from parts floating in a lake; but he found that he was spending too much time dealing with the mechanics of moving parts around in the lake. So he and his friend Stanislaw Ulam came upon the idea of an imaginary two-dimensional world made up of identical-sized squares arranged in a lattice.
Each square could contain one of 29 possible values, and the set of squares evolved over time in a discrete manner: from time zero to time one, from time one to time two, and so on. The evolution followed a set of mechanical rules, by which the value of each square at each time depended only on the value which that square had at the previous time, and the value which those squares immediately adjacent to it had at the previous time. Von Neumann managed to select the rules in such a way that, as time progressed, a certain complex pattern on the lattice would create an exact replica of itself.
This was the first cellular automaton. In general, a cellular automaton is any set of "cells," each containing a finite number of possible values, each one updating its value at each time on the basis of the values of itself and its neighbors at the previous time.
Von Neumann's rules were rather complicated, and their importance now is only historical. The British mathematician John H. Conway has constructed an extremely simple cellular automaton called Life (Poundstone, 1985), which may be easily simulated on a home computer and seen to give rise to remarkably intricate structures and processes. And it has been proven that Life can be used as a universal Turing machine, to produce anything which any cellular automaton, or any computer, can produce _ including self-reproducing patterns.
The rules of Life are as follows. Consider a lattice of equally sized squares, each one either white or black (0 or 1), and at each time, update the value of each square as a function of the values of its eight neighbors, according to the following rules:
1) If, for a given square, the number of black neighbors is exactly two, the value of the square does not change at the next time step;
2) If the number of black neighbors is exactly three, the square will be black at the next time step;
3) If the number of black neighbors is zero, one, four, five, six, seven or eight, the square will be white at the next time step.
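The three rules above can be sketched directly. A minimal, illustrative implementation (the sparse-set representation and all names are my own):

```python
from collections import Counter

def life_step(black):
    """One synchronous update of Life.  `black` is the set of (row, col)
    coordinates of the black cells; everything else is white."""
    # Count, for every cell adjacent to some black cell, its black neighbors.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in black
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    next_black = set()
    for cell, n in counts.items():
        if n == 3:                       # rule 2: exactly three -> black
            next_black.add(cell)
        elif n == 2 and cell in black:   # rule 1: exactly two -> unchanged
            next_black.add(cell)
        # rule 3: any other count -> white (cell simply not added)
    return next_black

# A "blinker": three black cells in a row oscillate with period two.
blinker = {(0, -1), (0, 0), (0, 1)}
```

Iterating `life_step` on a random initial set produces exactly the swarming behavior described below.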
It is truly astounding that such a simple arrangement as this can serve as a universal Turing machine.
The name "Life" is appropriate for many reasons, one being that a 200 x 200 or larger display of cells iterating according to the rules of Life looks remarkably like a Petri dish full of microscopic organisms. Beginning with a random configuration at time zero, most cells immediately turn white. Then one sees swarms of black cells move slowly along the white background in all directions, combining to form larger swarms, colliding with each other and splitting apart. If luck permits, some more complex configuration will evolve _ or, if one desires, one can "program" such a structure into the initial configuration. One may construct "factories" which produce gliding swarms of black cells, or intricate structures of squares within squares within squares, continually changing in detail but retaining the same overall design.
And cellular automata are not only mathematical curiosities. For example, they have been used to model the behavior of fluids. According to the laws of physics, the dynamics of a fluid can be predicted by a system of partial differential equations, which give the state of the fluid at each point in space x, at each time t, in terms of the states of "nearby" points at times in the "immediate" past. If instead of speaking about points we speak about small regions of space, these partial differential equations translate into rules for a cellular automaton. Each cell represents a certain region of the fluid, and instead of two possible states as in Life, each cell can have a large (but finite) number of possible states. The state of each cell at each time is determined by the state of that cell and adjacent cells at previous times. This method is extremely useful in many situations for which the original differential equations cannot be solved.
Cellular automata are also interesting in the context of algorithmic information theory. When one generates Life patterns on a computer screen, one usually starts from a state determined by a pseudo-random number generator. If one then lets the pattern iterate for ten thousand time steps, one gets another pattern, let us call it X. X may be another pseudo-random pattern, it may be an intricate design, or it may be the usual colony of undulating amoeba-like swarms. But whatever one gets, it is not unlikely that the shortest program for computing X is the one which starts with the pseudo-random initial pattern and then iterates. The pseudo-random number generator is a very short program, and the rules for Life also fit into a very short program. However, if the shortest program for computing X is indeed of this nature, then the logical depth of X is very high _ in this example, ten thousand.
We have discussed only a few types of cellular automata: Life, von Neumann's original automaton, those which simulate fluids.... Many others have been investigated, although none have yielded dynamics as rich as those of Conway's Life. But the vast majority of cellular automaton rules remain unexplored. In this regard, I must not omit the recent work of Langton, Li and Packard (1990). They were concerned with one-dimensional cellular automata, in which the cells are intervals, and _ as in Life _ the state of each interval is either black or white. The state of each interval at each time depends on the states of the k nearest intervals at the previous time. An example is given in Figure 6. What Langton, Li and Packard did was to study statistically the space of all one-dimensional cellular automata. Their main tool was the parameter L, which measures, for each set of cellular automata rules, the proportion of situations in which a cell is not turned white. Their conclusion was that there is a certain critical value for L. When L is too small and a high proportion of situations lead to a white cell, the dynamics are uninteresting because they are periodic; the same patterns repeat themselves over and over again in a very simple way. When L is too high and not enough situations lead to white cells, the result is apparent chaos: no interesting patterns emerge; the dynamics are pseudorandom. But if L is near the critical point, there is a decent chance that an interesting, attractive, complex pattern will emerge. This is not always true: there are some sets of rules with optimal L which lead to periodic behavior, and some which lead to chaos. However, if L is not right, there is essentially no chance of interesting dynamics.
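As a hedged illustration of this setup (the random rule-table construction and all names below are mine, not Langton, Li and Packard's), one can build a one-dimensional, two-state rule table, measure its L parameter as the proportion of neighborhood values not mapped to white, and iterate it on a ring of cells:

```python
import random

def make_rule(k, p_black, seed=0):
    """Random rule table for a 1D two-state automaton: each of the 2**k
    possible neighborhood values maps to black (1) with probability p_black."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_black else 0 for _ in range(2 ** k)]

def L_parameter(rule):
    """The L parameter of the text: the proportion of neighborhood values
    that are not mapped to white."""
    return sum(rule) / len(rule)

def step(cells, rule, k=3):
    """One synchronous update of a ring of cells; each cell's new value
    depends on the k nearest cells (itself and its neighbors)."""
    n = len(cells)
    out = []
    for i in range(n):
        index = 0
        for j in range(-(k // 2), k - k // 2):
            index = (index << 1) | cells[(i + j) % n]
        out.append(rule[index])
    return out

rule = make_rule(3, p_black=0.3)
rng = random.Random(1)
row = [rng.randint(0, 1) for _ in range(64)]
```

Sweeping `p_black` from low to high values and iterating `step` is a crude way to watch the transition the text describes: periodic behavior, then complex structure near the critical value, then pseudorandom chaos.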
1.7 STRUCTURAL COMPLEXITY AND EMERGENCE
So far we have discussed two measures of complexity: the KCS algorithmic complexity, and Bennett's logical depth. Neither of these, however, really captures the sense in which a human is more complex than a tree, which is more complex than a bacterium, which is more complex than an iron atom.
To put it another way: neither algorithmic information nor depth captures the sense in which certain cellular automata patterns are more interestingly, intricately structured than others. For their study of one-dimensional automata, Packard, Li and Langton applied a hodgepodge of statistical and information-theoretic techniques; this served their purpose well, but it is not very satisfying conceptually.
In (Goertzel, 1991a), I proposed a novel definition of complexity, and called it structural complexity. This idea will be essential to the theory of natural selection to be developed below. Structural complexity is a measure of the total amount of pattern in an entity.
Given any entity X, let S(X) _ the "structure" of X _ denote the set of all patterns in X. For instance, consider a complex cellular automaton configuration, such as that produced by a particularly interesting round of Life. Among the patterns present will probably be that of repeated "gliders," as shown in Figure 7. More formally, the pattern will be (y,z), where y = "put in locations a, b, c,...," and z = (glider, location a, location b, location c,...). Other repeated forms may also be present, as well as perhaps more complex patterns. For instance, if a configuration were full of "boxlike" shapes, or self-similar shapes, it is likely that there would be a pattern (i.e. an approximate pattern) corresponding to this. Or if, every time a boxlike shape appeared, a circlelike shape appeared beneath it, this would also constitute a pattern. All these patterns are part of the structure of the configuration.
These patterns are there in the configuration, so they are implicit in the minimal program for computing the configuration. But there may be other configurations with equal or greater KCS complexity, which display far fewer such patterns. The amount of structure in an entity is not correlated in any obvious way with the algorithmic complexity of that entity, nor for that matter with its logical depth.
The structural complexity measures the size of the structure of an entity. This is a very simple concept, but certain difficulties arise when one attempts to formulate it precisely. An entity may manifest a number of closely related patterns, and one does not wish to count them all separately. Roughly speaking, when adding up the sizes of all the patterns in x, one must adhere to the following process: 0) put all the patterns in a certain order, 1) compute the size of the first pattern, 2) compute the size of that part of the second pattern which is not also part of the first pattern, 3) compute the size of that part of the third pattern which is not also part of the first or second patterns, etc. The definition of "part of" here is slightly technical; it is given in the following section.
Finally, before discussing the implications of these concepts, we must introduce the crucial concept of emergence.
I define a "fuzzy set" as a collection of elements, each of which has a certain nonnegative "degree of membership." For instance, one might say "w belongs to S to degree 1/2," or "the platypus belongs to the set of mammals with degree 3, whereas the dog belongs to the set of mammals with degree 100." There is no maximum degree of membership. Where S and T are fuzzy sets, I define S-T as follows: the degree of membership of w in S-T is the degree of membership of w in S, minus the degree of membership of w in T.
Where S(x) is the structure of x, S(x) may be considered as a fuzzy set: the degree of membership of an element (y,z) of S(x) may be set equal to the "intensity" 1 - [1+d(y*z,x)][|y|+|z|]/|x| of (y,z) in x.
Nontechnically, this means that the degree to which a pattern (y,z) belongs to the structure of x, is the amount of simplification obtained by expressing x in terms of (y,z).
The emergence between two entities w and x is defined as the fuzzy set
Em(w,x) = S(w U x) - S(w) - S(x).
That is, it contains every pattern in w U x, the entity consisting of w and x put together _ except insofar as these patterns are patterns in w or x individually. If a pattern has intensity 100 in w U x, but intensity 10 in w and intensity 20 in x, then it has intensity 70 in Em(w,x). Emergence is a mathematical formulation of the age-old concept of Gestalt: it measures that which is in the whole but not the part. It is important because the structure of a whole is very often more than the union of the structures of its parts.
Technically, by writing S(w U x), we are assuming that w U x is encoded by some binary sequence. But in the spirit of algorithmic information theory, we need not linger over bounded-length coding programs.
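The fuzzy-set arithmetic above is easy to sketch. In the toy code below each structure is represented as a map from pattern labels to degrees of membership; this representation, the dropping of non-positive differences, and all names are my assumptions, not the text's:

```python
def emergence(S_union, S_w, S_x):
    """Fuzzy-set difference Em(w,x) = S(w U x) - S(w) - S(x).  Each
    argument maps a pattern label to its degree of membership; patterns
    whose difference is not positive are dropped, so that degrees of
    membership stay nonnegative (my reading of the text, which only says
    degrees are nonnegative)."""
    return {
        p: d - S_w.get(p, 0.0) - S_x.get(p, 0.0)
        for p, d in S_union.items()
        if d - S_w.get(p, 0.0) - S_x.get(p, 0.0) > 0.0
    }

# The worked example from the text: a pattern with intensity 100 in the
# whole, 10 in w and 20 in x, has intensity 70 in Em(w,x).
S_whole = {"glider": 100.0, "box": 5.0}
S_part_w = {"glider": 10.0}
S_part_x = {"glider": 20.0, "box": 5.0}
```

Note that "box" vanishes from the emergence: its intensity in the whole is fully accounted for by the parts, so it is not an emergent pattern.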
1.8 THE FORMAL DEFINITION OF STRUCTURAL COMPLEXITY
This section is somewhat technical and the nonmathematical reader will probably want to skim it over or skip it entirely.
Before defining structural complexity, we must dispense with a few formal preliminaries. Where z is any binary sequence of length n, let D(z) be the binary sequence of length 2n obtained by replacing each 1 in z with 01, and each 0 in z with 10. Where w and z are any two binary sequences, let wz denote the sequence obtained by placing the sequence 111 at the end of D(w), and placing D(z) at the end of this composite sequence. The point, as usual, is that 111 cannot occur in either D(z) or D(w), so that wz is w juxtaposed with z, with 111 as a marker between them.
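A minimal sketch of this coding, with the inverse included to show why the 111 marker works (function names are mine):

```python
def D(z):
    """Doubling code: each 1 becomes 01 and each 0 becomes 10, so the
    substring 111 can never occur inside a coded sequence."""
    return "".join("01" if bit == "1" else "10" for bit in z)

def join(w, z):
    """The composite sequence wz of the text: D(w), then 111, then D(z)."""
    return D(w) + "111" + D(z)

def split(code):
    """Recover w and z: read D(w) two bits at a time until the pair 11,
    which can only be the start of the marker."""
    i = 0
    while code[i:i + 2] in ("01", "10"):
        i += 2
    w_part, z_part = code[:i], code[i + 3:]   # skip the 111 marker

    def decode(s):
        return "".join("1" if s[j:j + 2] == "01" else "0"
                       for j in range(0, len(s), 2))

    return decode(w_part), decode(z_part)
```

For example, join("101", "0") yields D("101") = 011001, then 111, then D("0") = 10, and split inverts this exactly.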
Next, let us assume some standard "programming language" L, which assigns to each program y a certain binary sequence L(y). The specifics of L are irrelevant, so long as it is computable on a Turing machine, and it does not assign the same sequence to any two different programs. Where U is a universal Turing machine and v and w are binary sequences, we may now propose:
Definition 1.1: The relative information I(v|w), relative to U and L, is the length of the shortest self-delimiting program which computes v, on U, from input L(y), where y is a minimal-length self-delimiting program for computing w on U.
Obviously, if v and w have nothing in common, I(v|w)=I(v). And, on the other hand, if v and w have a large common component, then both I(v|w) and I(w|v) are very small.
In a similar manner, we may define the complexity of a program y as I(L(y)), the complexity of y relative to another program y1 as I(L(y)|L(y1)), and the complexity of y relative to a set of programs {y1,y2,...,yk} as I(L(y)|L(y1)L(y2)...L(yk)). One may show that, as the lengths of the programs involved increase, these quantities become machine and language independent.
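These quantities are uncomputable in general, but a crude computable analogue conveys the idea. Below, compressed length stands in for shortest-program length _ purely an illustrative assumption, since zlib output is nothing like a minimal program:

```python
import zlib

def approx_I(v, w=b""):
    """A crude, computable stand-in for the relative information I(v|w):
    the extra compressed bytes that v costs once w is already in hand.
    Compressed length is only a rough proxy for shortest-program length,
    which is uncomputable."""
    return max(0, len(zlib.compress(w + v)) - len(zlib.compress(w)))

v = b"abcdefgh" * 50
w = b"abcdefgh" * 50        # v and w share everything
u = bytes(range(256))       # v and u share essentially nothing
```

As the definition predicts, v is nearly free given w (a large common component) but not given u.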
Finally, let S={y1,...,yN} be any set of programs. Then we may define the size |S| of S as the result of the following process:
Algorithm 1.1:
Step 0. Make a list of all the patterns in S, and label them y1, y2,...,yN
Step 1. Let s1(x)=I(L(y1))
Step 2. Let s2(x)=s1(x)+I(L(y2)|L(y1))
Step 3. Let s3(x)=s2(x)+I(L(y3)|L(y1)L(y2))
Step 4. Let s4(x)=s3(x)+I(L(y4)|L(y1)L(y2)L(y3))
...
Step N. Let |S|=sN(x)=sN-1(x)+I(L(yN)|L(y1)L(y2)...L(yN-1))
At the kth step, only that portion of yk which is independent of {y1,...,yk-1} is added onto the current estimate of |S|. For instance, in Step 2, if (y2,z2) is independent of (y1,z1), then this step increases the initial estimate of |S| by the complexity of (y2,z2). But if (y2,z2) is highly dependent on (y1,z1), not much will be added onto the first estimate.
The size |S| is defined as the average of the estimates obtained in this manner from all possible orderings of the (yi,zi).
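Algorithm 1.1, including the averaging over orderings, can be sketched with a compressed-length stand-in for program length (again an illustrative assumption, and all names are mine):

```python
import itertools
import zlib

def cost(seqs):
    """Compressed length of several sequences coded together -- a crude
    stand-in for shortest-program length, which is uncomputable."""
    return len(zlib.compress(b"".join(seqs)))

def size(S):
    """|S| as in Algorithm 1.1: for each ordering of the patterns, add on
    only the part of each new pattern that is independent of its
    predecessors, then average the estimates over all orderings."""
    total = 0.0
    orderings = list(itertools.permutations(S))
    for order in orderings:
        est = 0
        for k in range(len(order)):
            # conditional cost of order[k] given order[:k]
            est += max(0, cost(order[:k + 1]) - cost(order[:k]))
        total += est
    return total / len(orderings)

a = b"0101" * 30
b = bytes(range(120))
```

A duplicated pattern contributes almost nothing the second time, exactly the double-counting the averaging procedure is meant to prevent.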
Where St(x) is the set of all patterns in x, we may now define the structural complexity of x to be the quantity |St(x)|. This is the size of the set of all patterns in x _ or, more intuitively, the total amount of regularity observable in x. It is minimal for a random sequence, and very small for a repetitive sequence like 000...0. It deems 0101010...10 slightly more complex than 000...0, because there is a greater variety of economical methods for computing the former (for instance, one may repeat 10's, or one may repeat 01's and then append a 0 at the end). It considers all the different ways of "looking at" a sequence.
It is worth pausing to mention an interesting unsolved problem related to structural complexity. For each positive integer n, let M(n) denote the maximum value taken on by the structural complexity of any sequence of length n. Then it is natural to ask, how does M(n) behave for large n? In other words, how much algorithmic structure can one fit into a given number of bits?
1.9 EVOLUTION AND ALGORITHMIC INFORMATION
The physical, mental and biological worlds all have a very high structural complexity. The miracle of cellular automata is that they also have a rather high structural complexity, but they are known to have a very low algorithmic complexity. They also appear to have a high logical depth. The moral is that intricate, beautiful structure may emerge from extremely simple instructions, but a long time may be required for the process to run its course.
This has interesting implications for natural selection, which we shall discuss more thoroughly later on. Natural selection works by random mutation of the genetic material of organisms, i.e. of the programs for computing organisms. In Chapter 3, I will suggest that the following principle guides the creation of these programs: an ecosystem, in selecting the genetic programs of its member organisms, seeks to minimize KCS complexity while maximizing structural complexity.
One may argue from natural selection that a biological system would do better to have a short program than a long one. A long program is valuable when it gives better results, but if a short program will do just as well, it will be preferable. To see this, one need merely consider the problem of error sensitivity. In the space required to store a program of length 100,000, one could store two copies of a program of length 50,000. If the programs were of approximately equal utility, and there were an error in the storage process, the organism with two copies of the shorter program would almost certainly be better off: it would have more intact information. Therefore, it seems quite plausible that natural selection militates in favor of shorter genetic programs, those which have smaller KCS complexity.
It is when we consider that the environment of each organism is largely made up of other organisms, that structural complexity comes to the fore. In a self-organizing system of organisms, the nature and behavior of each organism will evolve so as to most effectively interact with the organisms around it, so that the structure of each organism in some sense best "matches" the structures of the organisms which surround it. Another way of saying this is: the emergence between the structure and behavior of an organism and that of its fellow organisms has a significant structural complexity.
On the basis of these cursory, schematic arguments, then, it seems that there are two algorithmic principles at work here: the tendency toward low KCS complexity of the organism, and the tendency toward high structural complexity of emergence between the organism and its environment. And the patterns which constitute the emergence between one organism and another are simply part of the structure of the overall ecosystem, the whole system of interacting organisms. So, on the ecosystem level, we have the general principle: low KCS complexity, high structural complexity. The first principle comes from organism-level natural selection, and the second from ecology.
In other words, it would seem that natural selection tends to produce the same sort of situation that one finds in the more interesting cellular automata, such as Conway's Life.