From Complexity to Creativity -- Copyright Plenum Press, © 1997 |
Part II. Formal Tools for Exploring Complexity
CHAPTER 5. DYNAMICS AND PATTERN
CHAPTER FIVE
DYNAMICS AND PATTERN
In this chapter I will return to the question of structure versus dynamics, as discussed at the end of Chapter One. In a psychological context, the psynet model goes a certain distance toward resolving this question: it shows how mental structures might arise as a consequence of mental dynamics. But here I will confront the question in a different way, by describing some formal tools for thinking about and measuring the algorithmic patterns emergent from system dynamics.
I will begin by introducing symbolic dynamics as a means for inferring formal languages describing attractor structure. The Chaos Language Algorithm, a new computational technique for inferring algorithmic patterns from system dynamics, will be described; and some general hypotheses regarding the appearance of emergent patterns in complex systems will be presented.
The relevance of these ideas for the psynet model will be discussed. It will be argued that symbolic dynamics provides a natural mechanism by which a psynet might store symbolic information in its attractor structure.
Finally, in the last section, I will turn to physics, and explore the connections between dynamics, algorithmic pattern, and physical entropy.
Attractors are patterns in dynamics -- geometric patterns. It is interesting and important to find ways to represent these geometric patterns symbolically. This leads us to what is perhaps the most interesting tool in the dynamical systems theory toolkit: symbolic dynamics. Symbolic dynamics relates the continuous mathematics of iterations on real and complex spaces to the discrete mathematics of computation and formal linguistics.
Only in the past few decades has symbolic dynamics emerged as a useful mathematical tool. Its conceptual roots, however, go back much further, at least to Leibniz [Leibniz, 1679; see Loemker, 1969]. Leibniz proposed a "universal character for 'objects of imagination'"; he argued that a kind of alphabet of human thoughts can be worked out and that everything can be discovered and judged by a comparison of the letters of this alphabet and an analysis of the words made from them
And this systematization of knowledge, he claimed, would lead to an appreciation of subtle underlying regularities in the mind andworld:
[T]here is some kind of relation or order in the characters which is also in things... there is in them a kind of complex mutual relation or order which fits things; whether we apply one character or another, the products will be the same.
In modern language, the universal characteristic was intended to provide for the mathematical description of complex systems like minds, thoughts and bodies, and also to lead to the recognition of robust emergent properties in these systems, properties common to wide classes of complex systems. These emergent properties were to appear as linguistic regularities.
Leibniz didn't get very far with this idea. He developed the language of formal logic (what is now called "Boolean logic"), but, like the logic-oriented cognitive scientists of the 1960's-1980's, he was unable to build up from the simple formulas of propositional logic to the complex, self-organizing systems that make up the everyday world. Today, with dynamical systems theory and other aspects of complex systems science, we have been able to approach much closer to realizing his ambitions. (One might argue, however, that in the intervening centuries, Leibniz's work contributed to the partial fulfillment of his programme. The formal logic which he developed eventually blossomed into modern logic and computer science. And it is computer power, above all, which has enabled us to understand what little we do about the structure and dynamics of complex systems.)
Abraham and Combs [1995] point out the remarkable similarity between Leibniz's Universal Characteristic and the modern idea of a unified theory of complex system behavior; they single out the concepts of attractors, bifurcations, Lyapunov exponents and fractal dimensions as being important elements of the emerging "universal characteristic." It is symbolic dynamics, however, which provides by far the most explicit and striking parallel between Leibniz's ideas and modern complex systems science. Symbolic dynamics does precisely what Leibniz prognosticated: it constructs formal alphabets, leading to formal "words" and "sentences" which reveal the hidden regularities of dynamical systems. As yet it does not quite live up to Leibniz's lofty aspirations -- but there is reason to be optimistic.
Symbolic Dynamics
The basic idea of symbolic dynamics is to divide the state space of a system into subsets called cells, numbered 1 through N. Usually the cells are assumed disjoint, and are assumed to completely fill the state space of the system. Every possible state of the system is thus assigned some code number. One then charts each trajectory of the dynamical system as an itinerary of cells: "this trajectory goes from cell 14 to cell 3 to cell 21 to cell 9, and so forth...." The collection of code sequences obtained in this way can tell one a great deal about the nature of the dynamics. In essence, one is translating the dynamics of the system into a collection of abstract words!
Symbolic dynamics is particularly useful where the system involved is chaotic. Chaos, which involves dynamical unpredictability, does not rule out the presence of significant purely dynamic patterns. These patterns reveal themselves as the structure of the chaotic system's strange attractor. Examples will be given in the following section.
For instance, consider the Baker map
xn+1 = 2xn mod 1 (1)
Using the two "cells" [0,1/2] and (1/2,1], one finds that the dynamics of the Baker map lead to the following map on code sequences
s(a1a2a3...) = a2a3a4... (2)
Specifically, the code sequence corresponding to a certain initial value x0 is precisely the binary expansion of the number x0. Defining a topology on code sequence space by
d(a1a2a3..., b1b2b3...) = 2-k|ak - bk| (3)
it is easy to see that s is chaotic on code sequence space: it is sensitively dependent, topologically transitive, and has dense repelling periodic points.
Probabilistic Symbolic Dynamics
Unfortunately, at present most of the mathematical theory of symbolic dynamics is only useful for analyzing "toy iterations." But for computational purposes, the probabilistic version of symbolic dynamics, often called Markov analysis, is
of extremely general utility. While a little less elegant than its deterministic counterpart, it is far more versatile. In Markov analysis, instead of charting which cell-to-cell transitions are possible, one assigns a probability to each cell-to-cell transition. Thus the dynamics on the attractor are represented by a matrix of probabilities pmn, representing the chance of a point in cell m moving to cell n at the next iteration step.
There is a beautiful result called Pesin's Theorem which connects Markov analysis with Liapunov exponents. It says that, under appropriate conditions (e.g. if f is a diffeomorphism), the sum of the Liapunov exponents is equal to a quantity called the metric entropy. If the Liapunov exponents are not constant, this sum must be integrated over the attractor; see Pesin, 1977). But the the intriguing thing is that metric entropy is also intimately connected with thermodynamics. The metric entropy involves the entropy of the collection of first-order Markov probabilities -- i.e.
- pmn log pmn (4)
One divides this entropy by , and then takes the supremum (i.e., the least upper bound) of this quotient over all possiblepartitions into cells. The metric entropy thus defined is a qualitative, information-theoretic measure of the degree of "information-scrambling" implicit in a dynamical system. It is very satisfying indeed that the metric entropy should turn out to connect so simply with the Liapunov exponents, which are defined in terms of calculus rather than information.
Dynamical Systems that "Know" English
As a rule, it is is very difficult to construct the symbolic dynamics underlying a given iteration, even a "toy" iteration. But there are exceptions. One of these is the class of "piecewise linear" interval maps -- maps which, for some division of the interval into finitely many subintervals, are linear on each subinterval. The Baker map is piecewise linear, but it is nonrepresentatively simple. Piecewise linear maps are extremely flexible in structure. For instance, it is amusing to observe that one can construct a piecewise linear map which "knows" English.
What I mean by "knowing" English is this. Suppose that one partitions the interval [0,1] into a collection of subintervals, and assigns a subinterval to each English word. Then, if one produces a trajectory of the map from an arbitrary starting point, the symbolic dynamics of this trajectory will be a series of grammatical English sentences.
The key to this construction is the notion of probabilistic grammar. A probabilistic grammar is a finite set of rewrite rules, involving elements of a finite list of words, each one tagged with a certain probability. Here we will be directly concerned only with formal grammars, but of course, these formal grammars would be of little interest if not for their connection with the grammars of natural languages. Numerous linguists, most notably Zellig Harris (1988), have argued that the syntax of natural language can be expressed in terms of probabilistic grammars. Harris gives a fairly complete treatment of English syntax using tables indicating the probability of a given "operator" carrying a given "argument," combined with a handful of simple transformation rules.
It is easy to construct a dynamical system whose state space represents a given grammar. First, one simply takes each of the N words involved, and assigns to them, in alphabetical order, the N subintervals (i/N,(i+1)/N), where i = 0,...,N-1. Let Iw denote the subinterval assigned to word w. Divide each subinterval Iw into M equally sized subintervals (subsubintervals) Iw,r, r = 1,...,M. Here the number M should be very large; for instance, M might represent the size of the corpus from which the grammar was constructed, and should be divisible by N. The function f is defined to be linear over each subinterval Iw,r. Each word x gets a number of subintervals Iw,r proportional to the probability of the transition from w to x. Over each subinterval Iw,r which is assigned to x, the function f is defined to be the linear function mapping the left endpoint of Iw,r into the left endpoint of Ix, and the right endpoint of Iw,r into the right endpoint of Ix.
Now, given this construction, suppose one constructs symbolic dynamics on the partition defined by the Iw, for all thewords w. One finds that the probabilistic grammar obtained in this way is precisely the probabilistic grammar from which the dynamical system was constructed. For models incorporating k'th order transitions, where k exceeds one, the standard trick suffices: instead of looking at intervals corresponding to words, look at intervals corresponding to k-tuples of words. If the probabilistic grammar in question is, say, a grammar of English, then the dynamical system knows English! According to Harris's analysis of English, all English sentences can be obtained by taking the results of this dynamical system and transforming them by a handful of simple transformation rules.
One may well doubt whether there is any purpose in constructing dynamical systems which know English! The answer, of course, is that while there is no inherent value in such a highly contrived system as the one given here, the possibility of encapsulating language in the attracting behavior of a dynamical system is rife with potential applications. If, instead of a piecewise linear map, one had a dynamical system of some relevance to neural or psychological function, then one would have quite an interesting result! It is not clear exactly how complexity of grammar generation corresponds with dynamical complexity; but this would seem to be a promising area for research.
To understand the complexity that can arise from even the simplest iterations, general considerations will not suffice; one must look at specific examples. An example which I have found interesting is the generalized Baker map
xn+1 = Phi(xn) = (beta xn + alpha) mod 1
(5)
1 < beta < 2
A detailed study of this map is quite rewarding. A good special case to think about is beta = 3/2, alpha = 0. The iteration function for this case is as shown in Figure 3.
If one does symbolic dynamics on the generalized Baker map, using the natural cells [0,1/beta] and (1/beta,1], one finds that each point gives rise to a code sequence which is its expansion in base beta (see e.g. Blanchard, 1989). Expansions in noninteger bases being somewhat subtler than binary expansions, one runs into certain complications: the "code sequence space" in question no longer contains all binary sequences, but only those special binary sequences that correspond to expansions in base beta. But the argument essentially works the same; the iteration can be proved chaotic by reference to the chaoticity of a shift map on this strange code space (Goertzel, Bowman and Baker, 1993).
Taking the case beta = 3/2, let us consider some sample expansions. We have, for instance,
.100... = 2/3
.0100... = 4/9
.00100... = 8/27
.10100... = 2/3 + 8/27 = 26/27
All these are unproblematic. Things get trickier, however, when one looks at something like
.1100... = 2/3 + 4/9 = 10/9.
How can a "decimal," a number with zero on the left hand side of the "decimal point," be greater than one? Clearly something is wrong here. What is wrong is, specifically, that one is not dealing with an integer base!
Since a double one is illegal in the leading position, it is also illegal everywhere else. After all, it is equally unsettling to look at
.01100... = 10/27 > 2/3 = .100...
And double ones are not the only forbidden sequences. Consider, for instance,
.1010100... = 2/3 + 8/27 + 32/243 = 266/243
The pattern 10101 is therefore disallowed in any base 3/2 expansion. On the other hand, a simple computation shows that
.100100100...
is just fine -- skipping two zeros does the trick!
With a little creativity one can construct a huge variety of disallowed sequences. This gives rise to the question: precisely how many sequences are allowed? As the sequence length n tends to infinity, what percentage of the 2n possible sequences are permitted? There is only one answer which makes any intuitive sense: approximately (3/2)n sequences should be allowed. In the general case, approximately betan sequences should be allowed. Leo Flatto (personal communication) has proved this result using number-theoretic techniques; however, his proof works only for the case alpha = 0. Working independently, Harold Bowman and I (Goertzel et al, 1993) proved the same theorem for the case of general alpha, in a way which relates naturally with the dynamics of the map. Both of the proofs, however, involve fairly advanced mathematics, and therefore neither will be given here.
The map Phi does not have an "attractor" in the strict sense; what is has is an "invariant measure." This means that, no matter what value of x0 one begins with, the values of xi will wander all over the interval [0,1], but there will be a pattern to their distribution. If one charts the probabilities with which the xi fall into different subintervals, one will obtain an approximation to the invariant measure of the map. The measure is invariant because these probabilities are the same for almost any initial value.
However, the distinction between attractors and invariant measures is really not philosophically important. If one constructs the "Frobenius-Perron" operator corresponding to Phi,an operator which tells how Phi maps probability distributions onto probability distributions, then the invariant measure re-emerges as an attractor of this operator.
The motion of the iterates of Phi through the invariant measure are chaotic. It is not difficult to see why. Conjugacy to the shift map follows in the same way that it does for the ordinary Baker map; the difference is that not all sequences are allowable. To show topological transitivity, one must therefore be a little cleverer. Let A denote the set of all finite allowable sequences, and suppose that the elements of A are ordered {a1,a2,...}. Then the sequence
.a100a200a300a4... (6)
is allowable, and is topologically transitive, because running the shift dynamics on it will bring one arbitrarily close to any infinite allowable sequence. It is worth pausing a moment to realize why this sequence must be allowable. The reason is that, as observed above, .1001001001... is allowable, and because all of the ai are allowable, none of the expressions .ai can exceed 1 in value.
Finally it is interesting to think about the periodic points of the generalized Baker map. How many periodic points of period n are there? Quite elegantly, it turns out that the answer is: on the order of betan. The argument is as follows. Suppose one takes an allowable sequence of length n-2 and sticks two zeros on the end of it; then this sequence can be repeated over and over again forever, to give a periodic allowable sequence of period n. On the other hand, not every periodic sequence of period n is an allowable periodic sequence of period n. Thus the number of allowable periodic sequences of period n is greater than the number of allowable sequences of length n-2, and less than the number of allowable sequences of length n. But the number of allowable sequences of length n-2 is about betan-2, and the number of allowable sequences of length n is about betan. So the number of periodic points of period n is on the order of betan.
All this is fairly elementary. The remarkable thing is the elegance and subtlety that arises from considering this very simple iteration. One may well wonder, if this much structure underlies an iteration constructed from drawing two lines above the unit interval, we can possibly hope to understand what is going on with real-world dynamical systems? The answer is that we will have to ask different questions. Counting the periodic points of a given order will be out of the question; even proving chaoticity will usually be too difficult. We will have to learn how to ask questions reflecting on qualitative behavior; otherwise it is unlikely to be possible to go beyond simple "toy iterations" like the Baker map and its generalizations.
5.4 THE CHAOS LANGUAGE ALGORITHM
We have discussed symbolic dynamics, and then placed the concept of system pattern in a more general context. In thissection I will describe a computational algorithm for extracting patterns from systems, which is an extension of symbolic dynamics. The algorithm is still in a developmental stage, but it has already yielded some interesting results. A little later we will describe an application of the algorithm to the exploratory analysis of mood cycle data.
The algorithm, called the Chaos Language Algorithm (CLA), was developed collaboratively by the author and Gwen Goertzel. In its current implementation, the CLA is being used for the recognition of purely dynamic patterns, and it recognizes only a narrow class of patterns: patterns based on repetition. But this type of pattern is very common in practice, thus even this simplistic implementation is of definite use in the analysis of real systems. Above all, the CLA serves to put some "meat on the bones" of the abstract concept of algorithmic information in complex systems, by giving an explicit account of an important class of algorithmic patterns -- context-free grammars.
The concept at the heart of the CLA is the use of formal languages to represent patterns in the trajectories of a dynamical systems. In its simplest version, the CLA is a three-stage process, consisting of
1. discretization of trajectories of a dynamical system by symbolic dynamics. One divides the potential state space of the system into a finite number of disjoint regions, and assigns each region a code symbols, thus mapping trajectories of the system into series of code symbols. This series of code symbols may be interpreted as a lengthy text in an unknown formal language; the problem of the CLA then being to perform an Harris-style linguistic analysis of this text.
2. tagging of symbolic trajectory sequences using a self-organizing tagging algorithm. One takes the symbols derived in the first step and assigns each one to a certain "category" based on the idea that symbols which play similar roles in the trajectory should be in the same category. The "tag" of a code symbol is a number indicating the category to which it belongs; The code sequence is thus transformed into a "tag sequence." In linguistic terms, one is assigning words to grammatical categories based purely on frequency information.
3. inference of a grammar from the tag sequence produced in the previous step, and iterative improvement of the tagging in order to maximize grammar quality.
The end result of this algorithmic process is a formal language which captures some of the statistical and algorithmic structure of the attractor of the dynamical system. Each of the three stages may be carried out in a variety of different ways; thus the CLA is as much a "meta-algorithm" as it is an algorithm in itself.
Step 1, in the current implementation, is carried out using the standard method of symbolic dynamics. This is straightforward for the case where the system's state space is a subset of Rn, as occurs, for example, when one's data regarding the system takes the form of a table of numbers.
Step 2 is carried out by computing the "successor structure" corresponding to each code symbol. This structure is a compact way of representing the various Markov probabilities associated with a given code symbol. The idea is that the relationship of a symbol a with other symbols may be quantified in terms of the collection of subsequences containing a. For each symbol a, one may construct a vector Succ1(a) of the form
Succ1(a) = (count(a0),count(a1),...,count(a(N-1)) (14)
where count(ab) gives the number of times that the subsequence ab occurs in the symbol sequence. And, in a similar fashion, one may build a vector Succ2(a) of all two-symbol sequences that occur immediately after a at any point in the symbol sequence, and so forth. One thus derives, for each symbol a, a "successor structure"
Succ(a) = (Succ1(a),...,Succk(a)) (7)
where k is the order of the structure. By omitting all sequences of zero count and representing Succ(a) as a tree, one saves memory and arrives at very efficient insertion and retrieval operations.
The distance between two successor structures Succ(a) and Succ(b) may be defined in various ways, the simplest of which is:
d(Succ(a),Succ(b)) =
d(Succ1(a),Succ1(b)) + ... d(Succk(a),Succk(b)) (8)
d(Succi(a),Succi(b)) =
w1 [Succi(a)1 - Succi(b)1 ] + ... + wM [Succi(a)M - Succi(b)M]
where (w1,...,wM) is a vector of weights and M=Nk. This distance is a crude quantitative measure of the syntactic similarity of a and b, of how similarly a and b relate to the other symbols in the sequence s. The weight vector determines the statistical model underlying the distance measure.
The goal of Step 2 is a collection of categories with the property that each category consists of symbols which are, in an appropriate sense, mutunally similar in their relationship with other symbols. This self-referential characterization of categories leads to a computational problem: if the membership of symbol a in category Cj can only be gauged by a's syntactic similarity with other symbols in category Cj, then where does one start? This problem is solved by the idea of a self-organizing tagging algorithm. One constructs a dynamical iteration on the space of categorizations, the fixed points of which are satisfactory categorizations. Then, beginning from a random categorization, one iterates until one reaches the fixed point. In this way, the tagging is allowed to progressively construct itself.
The idea of self-organizing tagging has deep connections with the cognitive models of (Goertzel, 1994); perhaps surprisingly, it also lies at the root of one of the standardcategorization methods from statistics, the k-means method. Thus, the current implementation of the CLA carries out the crucial categorization step by k-means clustering on the space of successor structures. The k-means method has numerous shortcomings, but it is a "rough and ready" self-organizing tagging method which is eminently suitable for an initial implementation.
Finally, the third step of the CLA is the one which is most "wide open" in the sense that it involves a somewhat arbitrary measure of grammar quality. In the experiments with mood cycle data, to be reported in Chapter Eight, this step has been foregone, and the results of k-means clustering on successor structures have simply been used to derive grammatical rules. This approach has the advantage of simplicity. In general, however, it is not adequate. In (Goertzel and Goertzel, 1996) it is shown that without iterative improvement in Step 3 of the CLA, the CLA is incapable of inferring the natural partitions underlying simple mathematical iterations on the unit interval. The "three-halves" map discussed here is an example of this phenomenon.
Here we have focused on purely dynamic pattern; however, similar methods may be used for purely static or static/dynamic pattern. The only difference is the dimensionality of the entity from which repeated structures are extracted. Instead of a sequence of code numbers, one will have, in general, an n-dimensional lattice of code numbers. Instead of pairs, triples, etc. of code numbers, one will have n-cubes of code numbers of diameter 2, 3, etc. The same fundamental method may be used, but the computational complexity will obviously be higher. Purely dynamic pattern is the simplest case, and hence a natural starting point. Crutchfield (1991) has already applied his related epsilon-machine algorithm to the static/dynamic patterns traced out by two-dimensional cellular automata over time, with only partial success but very interesting results.
The CLA and Chaos Theory
To see what the CLA means in chaos theory terms, recall that attractors are typically divided into categories, such as fixed points, limit cycles, strange attractors, or chaotic attractors. The technical definition of the latter terms is a subject of some minor technical debate; my preference, followed here, is to call any attractor that is neither a fixed point nor a limit cycle a strange attractor. Strange attractors are often though not invariably associated with chaotic behavior. "Chaotic behavior" also admits several definitions. The one quality that is invariably associated with chaos, however, is sensitive dependence on initial conditions.
Chaos implies the unpredictability of the precise numerical state of a dynamical system. But this kind of unpredictability does not imply a similar unpredictability on the level of algorithmic structure. Walter Freeman, for example, has written about "attractors with wings" -- attractors with a pronounced sectional structure, each section corresponding to a certain type of system state (Freeman, 1991). Even though the system is chaotic, if one knows which section of the attractor the systemis in, one may make meaningful statistical predictions regarding the section of the attractor the system will visit next. This is a simple example of the kind of attractor structure that can be captured by a formal language.
The Generalized Baker Map Revisited
To illustrate the application of the CLA, let us return to the "three-halves map" (Goertzel et al, 1993):
xn+1 = (1.5 xn) mod 1 (9)
This iteration, begun from almost any x0 in the unit interval, leads to a chaotic trajectory on the interval. But this trajectory, though chaotic, is not devoid of dynamic pattern. If one divides the unit interval into 10 equally sized nonoverlapping subintervals, a trajectory of the system is encoded as a series of integers from 0 to 9, where the tag i represents the subinterval [i/10,(i+1)/10]. The CLA, with appropriately chosen parameters, is able to form the optimal categories
Category 0:
0 1 2 3 4 5 6
Category 1:
7 8 9
These categories give rise to natural grammatical rules. For instance, one finds that
00
01
10
are all grammatical constructions, but "11" is forbidden. Just as "V V" is forbidden in English (one does not have two consecutive verbs, though there are apparent exceptions, e.g. gerunds), "11" is forbidden in the language of the three-halves map. Similarly,
000
010
100
001
are all permitted, but "101" is not.
These grammatical rules are (approximate) patterns in the system. They allow one to make predictions about the system: if at time t the system is in a state that falls into Category 1, then at time t+1 it will definitely not be in a state that falls into Category 1. If at time t it is in Category 0, and at time t-1 it was in Category 1, then at time t+1 it must remain in Category 0. Furthermore there are statistical predictions, e.g.: if in Category 0 at time t, there is a 2/3 chance of remaining in Category 0 at time t+1. This kind of prediction can be madedespite the underlying chaos of the dynamical system. The key is in choosing a categorization which leads to a suitably restrictive grammar. This is what the CLA attempts to do. In the case of the three-halves map, the CLA can find the correct partition only if grammatical information is given enough weight in the tagging process. Purely statistical information is in this case misleading and the most effective tactic is to explicitly search for the tagging that leads to the most informative grammar.
5.5 ORDER AND CHAOS IN MOOD FLUCTUATIONS
In this section I will describe a simple attempt to use the CLA to analyze actual psychological data (rather than data generated by a mathematical function such as the Baker map). The focus is more on the methodology than on the actual results, which are quite tentative. The data in question concern human mood fluctuations.
Moods are complex things; and, as we all know from personal experience, they are difficult to predict. They are influenced by multiple, interlinked factors. Most obviously there are biological rhythms, and events in the external environment. There are also the possibilities of purely psychological rhythms, not directly tied to any specific chemical processes; and cooperative rhythms, generated by the coupling of the moods of two different people (e.g. husband and wife).
Combs et al (1994) reports an empirical study of mood fluctuations. In his experiments, each participant recorded their mood half-hourly during all waking hours of the day, for a total of 550-750 mood reports per participant. Each mood report consisted of markings on two Likert scales: one indicating excitement vs. relaxation, the other indicating happiness vs. sadness. These mood reports may be understood as points in the plane, and the data set for each participant may then be interpreted as the trajectory of a dynamical system.
Based on this data, Combs et al (1994) concludes that mood fluctuations display the characteristics of chaotic dynamics (i.e., dynamics which are neither stable, periodic, nor random). However, the identification of chaos here cannot be considered conclusive. The intrinsic fuzziness of the data, together with the short length of the time series and the breaks in the time series representing periods of sleep, all combine to render a truly rigorous numerical analysis impossible. All that can be said with certainty is that the data could plausibly be the result of noisy chaotic dynamics. On the other hand, they could also be the result of simpler underlying dynamics polluted by noise.
In this section I will reinterpret the Combs data using the tools of symbolic dynamics, rather than numerical dynamical systems theory. In the symbolic dynamics approach, as pursued here, one does not need to ask whether the underlying dynamics are chaotic or merely periodic, and one does not need to question the statistical properties of the environmental perturbations or "noise." Instead, the focus is on the emergent statistical/algorithmic patterns in the time series. Thequestion is: what patterns are there that might allow us to make approximate predictions about a person's mood at a given time, based on their moods at previous times? Specifically, a computational method called the Chaos Language Algorithm (CLA) is introduced, which infers probabilistic regular grammars as emergent patterns in discretized versions of time series.
Like Combs et al (1994), this is primarily an exploratory, methodological investigation, rather than an attempt to posit a solid new theory of mood fluctuations. Complexity science is still in an early stage of development, and the psychological applications of complexity are particularly undeveloped. The need at the present time is for the clarification of basic theoretical concepts, and the development of innovative, theoretically-informed techniques for data analysis. The theoretical concept underlying the present investigation is that the analysis of complex psychological data requires a focus on emergent algorithmic patterns rather than merely numerical variables. The application of the CLA to mood fluctuation data is presented as a concrete exploration and illustration of this abstract proposition.
The Dynamics of Human Moods
What are the results of the CLA applied to the Combs data? No tremendously striking results have been found. Roughly and intuitively speaking, the essential result is that each individual has a certain "normal mood," represented a region of the state space of their "mood trajectory." The movements away from the normal mood seem to be either random or chaotic. Once the system moves away from normal, into a different part of the state space, it is likely to stay there a while, before wandering back. However, certain parts of the "non-normal" region would seem to have more internal coherence than others, as represented by a longer average residency time. These relatively coherent regions would seem to indicate relatively integral non-normal mood states; they also tend to be fairly similar to the normal mood state. It should be noted that the Combs study did not involve individuals with manic-depressive disorder, multiple personality disorder, or other conditions involving severe mood swings.
For the analysis of the Combs data, the state space of the two-dimensional mood system was divided into 36 equally-sized regions. The CLA was set to divide these 36 regions into 3 categories. The regions are defined by dividing the rectangle formed from the minimum and maximum values on each dimension into 36 equally-sized squares. The numbering for the regions is the natural one, given by
1 2 3 4 5 6
7 8 9 ...
13 14 ...
...
The choice of the numbers 3 and 36 here is somewhat arbitrary; they were arrived at by experimentation. However, since the original data were recorded on a Likert scale, sixcertainly seems a reasonable degree of granularity. Each of the two scales is being divided into six regions. Variation within these six regions may easily be ascribed to "measurement noise." Regarding the number of categories were also tried (2,4, and 5, specifically), but 3 seemed to provide the maximally informative analysis of the Combs data. Given the tendency of the systems involved to cluster in certain narrowly defined regions, multiplication of categories serves only to divide up infrequently-visited regions into categories in pseudo-random ways.
There was moderately significant inter-subject variation in the CLA results. For two subjects, no clear categorizations was found: different runs of the CLA led to a disparate variety of different categorizations. Out of the five subjects considered in the Combs study, only two gave rise to particularly clear results.
Results for one of these subjects were as follows. The bounding rectangular region was:
4.000000 < x < 30.000000
2.000000 < y < 40.000000
The inferred categories were:
Category 0
17 23 6 7
Category 1
0 1 10 11 14 18
19 2 20 22 24 28
3 5 8
Category 2
12 13
Regions 12 and 13, isolated here as Category 2, represent a normal mood for this subject. They are adjacent regions and thus represent, in essence, a single mood state.
By far the most salient pattern in this subject's data is: Usually in regions 12 and 13, moving back and forth between one and the other, with occasional deviations to other mood states. However, other patterns beyond this are present. The regions isolated in Category 0 display a certain "coherence" beyond that which an arbitrary collection of regions would display. They lead to each other slightly more often than they lead back to the normal mood state. Category 1, on the other hand, is a "garbage collection" category, a set of apparently unrelated mood states. Thus, categories 6 and 7 represent a weakly defined but identifiable "non-normal mood."
The first-order transition frequencies between the categories are as follows (where "a b # x" indicates that a transition from region a to region b was observed x times in the subject's data set):
0 0 # 52.000000
0 1 # 17.000000
0 2 # 46.000000
1 0 # 16.000000
1 1 # 37.000000
1 2 # 45.000000
2 0 # 47.000000
2 1 # 44.000000
2 2 # 419.000000
The second-order transition frequencies are less revealing, but do contain some surprises. For instance, we have
0 0 0 # 29.000000
0 0 1 # 4.000000
0 0 2 # 19.000000
suggesting that, once the non-normal mood has persisted for two reports in a row, it is particularly likely to be observed in the next mood report. In other words, this mood state has a slight but noticeable tendency toward persistence. This is the sort of tendency that would be more convincingly observable from a longer data set.
The CLA does not always arrive at the same categorization. However, when an appropriate number of categories is used, there usually is only minor variation between different runs. The categorization given above was the most common for the subject in question; the second most common was the following:
Category 0
0 1 10 11 14 17
19 2 20 22 24 28
3 5 8
Category 1
18 23 6 7
Category 2
12 13
The third most frequent state was identical but contained both 17 and 18 in the category along with 23, 6 and 7. The occurence of a variety of different categorizations is not problematic: it only indicates that the CLA, which is itself a complex dynamical system, has a number of attractors clustered close together in its state space. The attractors with larger basins will tend to be found more often. It generally seems to be the case that the higher-quality categorizations have larger basins; but this is not known to be the case in general.
Another subject gave rise to the following results. Based on the bounding rectangular region
4.000000 < x < 34.000000
4.000000 < y < 37.000000
the following categories were obtained:
Category 0
12
Category 1
0 10 16 2 20 22
26 5 6 8 9
Category 2
1 11 13 15 17 18
21 3 7
0 0 # 86.000000
0 1 # 9.000000
0 2 # 36.000000
1 0 # 9.000000
1 1 # 184.000000
1 2 # 65.000000
2 0 # 36.000000
2 1 # 66.000000
2 2 # 196.000000
Here the "normal" mood 12 was not entered into all that often. However, the other two categories display a definite difference in their behavior with respect to Category 0. Category 1 has a very strong tendency not to lead to category 0. It, rather, leads to itself, or else to Category 2. Category 2, on the other hand, has a fairer chance of leading to Category 0. Similarly, Category 0 almost never leads to Category 1. This is a significant grammatical rule. The word 10 is infrequent, almost "ill-formed" in this grammar.
An obvious interpretation of this is that Category 2 consists of predominantly "near-normal" states -- after all, Categories 2 contains three of regions 12's nearest neighbors (11, 6 and 18). However, the key point is that 12 was coherent enough to consistently emerge as its own category, distinct from the "near-normal" states. This person has a definite pattern of moving from normal mood states, to near-normal mood states, then sometimes back to normal mood states, or sometimes to further outlying mood states. This pattern may also be inferred from the first subject, considered above. However, it is not so marked there due to the extreme frequency of the normal mood state.
Another, much less frequently obtained categorization of this subject's data, is as follows. Here 11 and 12 are bunched together as the "normal mood." The other two categories seem on the surface quite different, but a similar qualitative pattern emerges.
Category 0
16 17 6 7 8
Category 1
11 12
Category 2
0 1 10 13 15 18
2 20 21 22 26 3
5 9
0 0 # 266.000000
0 1 # 49.000000
0 2 # 34.000000
1 0 # 44.000000
1 1 # 141.000000
1 2 # 9.000000
2 0 # 38.000000
2 1 # 7.000000
2 2 # 100.000000
Here, once again, we see certain "forbidden expressions": 12 and 21. The composition of Category 2 is somewhat pseudo-random in the sense that moving a low-frequency region from Category 2 to Category 0 is not going to make much difference to the overall quality of the categorization, and may well lead to another attractor of the CLA. However, the same general pattern is observed here as in the more common categorization: one has a group of outliers (Category 2) and a group of near-normal moods (Category 1).
As noted above, there is no theory of statistical significance for the inference of grammatical patterns from symbolic data sets. Thus, these results must be considered as intuitive and exploratory rather than rigorous. In the end, however, most tests of significance are based on rather arbitrary statistical assumptions; and the ultimate evaluation of results is always intuitive. Linear statistics overlooks subtle patterns in complex data, but these patterns are there, and must be taken seriously. The CLA is one technique for eliciting such patterns.
Speculative Extensions
Putting aside for the moment the practical problems of analyzing the Combs et al data, let us now speculate as to what sort of patterns one might expect to find in an hypothetical "really good" set of human mood data. To keep things simple, instead of looking for optimal categories, let us think about only four categories:
A -- happy and excited
B -- happy and relaxed
C -- sad and excited
D -- sad and relaxed
Given this coding, the data set produced by a person over a period of time would look like
AABCDBCADCCBBCDDACCCDCCDCAAABBC...
The idea of symbolic dynamics is that list of letters is going to harbor implicit grammatical rules.
What kinds of linguistic rules might arise in these time series? The simplest rules are first-order constraints, constraints on what tends to follow what -- these, as discussed above, are the kinds of rules that have been observed in the real mood fluctuation data so far. One might find, for instance, that a person rarely moves from A to D or from D to A -- that the pairs AD and DA rarely appear in the series. This leads to the probabilistic grammatical rule that an A tends to be followed byan A, B or C. Or one might find that a certain person rarely goes from B to C or from C to B. These constraints would be pretty reasonable. Who goes from happy and excited all the way to sad and relaxed, without passing through intermediate stages?
Pushing a little further, one might find rules saying: "A has got to be followed by B or by C." Or, say: "A will nearly always be followed by B or by C." It could even happen, conceivably, that A, B, C and D only occurred in a few combinations, say ABACAB, DABACA, BABA... -- then one could look for patterns in the occurence of these combinations, these emergent "words."
These first-order transition rules may not seem like a very interesting kind of "language." They are what the Chaos Language Algorithm, in its present incarnation, looks for -- but they are only the beginning. For starters, there will be higher-order transition rules, i.e. rules involving sequences of length greater than two. An example would be: "If A is followed by C, then D is very likely to occur." In other words, ACD is probable, while ACA is unlikely. According to this rule, having been happy and excited, then sad and excited, one is not that likely to become happy and excited again; it is much more probable that one will get sad and relaxed.
In general, one finds that the simpler rules are more universal in nature, whereas the more complicated rules will tend to be more individual. For instance, while ACD may be a very likely sequence for some people, for others it may be rare. A manic-depressive person might well tend to have excitement stay high while happiness and sadness oscillated back and forth, meaning, in symbolic dynamics terms, a lot of cycles ACACACA..., or perhaps ACDACDA... or ACDCACDCA....
The patterns we have been discussing so far are all context-free rules. The next step up, in terms of complexity, is to context-sensitive rules, such as: "A follows B, but only in contexts where A occurs between C and D." Context-sensitive rules can always be summarized as collections of context-free transition rules, but they are more ordered than arbitrary collections of transition rules. They are higher-order patterns. For instance, one can list the following permissible sequences:
CABB
AABD
CABB
BACC
...
or else, using a context-sensitive rule, one can say something simple like: XABY is only a rule when X is C and Y is D.
Examples of context-sensitive rules are transformation rules, such as one finds in the transformational grammar approach to human languages. The simplest transformation rules are movement rules, such as "Whenever you have ABCDB, you can replace it with BCDAB, and still have a permissible sequence." In this rule the A is moving from the first to the penultimate position in the word, without disrupting the "meaning."
One can, potentially, have very complicated situations here. For instance, suppose
ADCBD
ACBCC
ADBDD
ABDCB
are all permissible. This would lead one to believe the rule would be
AX__X
But instead, it might well happen that the rule
ABCDB
that fits the pattern isn't a rule; instead one might have something like
BCDAB
as a rule instead. This situation, if it were to occur, would be a symbolic-dynamics analogue of what one finds in human linguistics, where
* I know the person was where
is ungrammatical even though
I know the person was here
I know the person was there
I know the person was in Topeka
I know the person was higher
...
are all permissible.
Based on the limited data that have been analyzed so far, there is no reason to believe that such subtle patterns actually exist in human moods. On the other hand, there is no reason to rule it out either. Mood fluctuations appear chaotic, but we know that chaos is just unpredictability in detail. Statistical and algorithmic predictions can still be made about chaotic systems; and symbolic dynamics is the foremost tool for making such predictions. Perhaps when we "get to know" a person, part of what we are doing is recognizing higher-order linguistic patterns in the time series of their behaviors.
Remarks on Modelling Mood Fluctuations
The model of mood fluctuations suggested by this data, and these speculations, is a simple one. Much as Combs (1996) views states of consciousness as attractors of psychological systems, one is here led to view mood states as psychological attractors. Individuals' mood reports are functions of their mood state attractors. In the case of a normal individual, there is a "normal" mood state attractor, in which the person tends toreside a majority of the time. They also have other mood state attractors, which they move in and out of.
The nature of the underlying psychological system is not really relevant for this qualitative model of mood fluctuations. However, it is well worth observing how well this understanding of moods fits in with the psynet model. The psynet model view mood states as autopoietic magician systems. In this view the attractors observable in numerical data such as that collected by Combs et al are images, in the numerical domain, of more fundamental attractors in the process domain.
In the laboratory, after all, one cannot directly study mental processes. One can only study observable "properties" of mental processes -- and these are usually numerical properties. The attractors of these numerical data sets reflect but do not entirely capture underlying process dynamics. Some of the noise and "messiness" involved in psychological time series data has to do with data-set limitations such as small size, but some of it also has to do with the inherent inaccuracy involved in the step from largely unknown process dynamics to observable but meaning-obscure numerical dynamics.
Freeman (1995) has observed, in the context of his analysis of EEG data, that the mathematical concept of "attractor" is really too rigid to apply to complex biological and psychological systems. A real psychological attractor, he points out, is only approximately defined: it is not guaranteed to be attracting, and once it has been reached, it may eventually be moved away from. This complaint, in my view, has to do with the fact that mental entities are attractors of magician systems, process systems, rather than numerical systems such as Likert scales or EEG readings.
Symbolic dynamics provides a way of abstracting away some of the peculiarities imposed by numerical measurements. What is meant by an "attractor" in a symbolic dynamics context is simply a stable region of state space: a set of symbols which tend to map into each other with reasonably high probabilities. These are the "attractors" that the CLA finds. The idea is that the symbols identified in numerical space will correspond to meaningful symbols in mental process space. One does not need the particular numbers involved in the experiment to mean much of anything, individually.
Returning to mood, one may well ask whether the mood state attractors are moved in and out of spontaneously, or only in response to external perturbations. This pertains to the question of whether chaos is present. One might hypothesize that each mood state attractor represents a wing of a strange attractor. The chaotic dynamics keeps the mood within that wing for a period of time, but sometimes it may move the mood out of the wing into another one.
Unfortunately, however, the question of spontaneous mood transitions cannot be answered on the basis of the Combs data. Furthermore, this inability does not seem to be a consequence of easily remedied flaws in this particular data set (e.g. short length, small dimensionality of mood reports, etc.). In general, the degree of environmental fluctuation in natural life is such that there will always be "reasons" to make the transition from one mood state to another. Thus, to distinguish natural,intrinsic transitions from extrinsically-induced transitions would be an extremely difficult task. Intuitively, one suspects that, with a similar data set perhaps ten to twenty times longer, one might be able to start to make some headway on this issue. On the other hand, the situation might well be the same. It may be pertinent to observe that, in the course of ordinary life, people often have extraordinary difficulty determining whether their own mood transitions are internally or externally induced.
Conclusion
This section reports an exploratory exercise in data analysis, involving the elicitation of emergent statistical/algorithmic patterns from complex psychological data. The data set in question, the Combs data on mood fluctuations, is not a "high quality" data set: it is very short, very noisy, and is riddled with breaks. Unfortunately, however, it is fairly typical of data in behavioral science. It is thus of interest to see just how much information can be squeezed out of such data.
On this point, the results here are somewhat equivocal. The number of large-basined attractors of the CLA would seem to depend on the quality of the data. If the data set is insufficiently informative, then there will tend to be a large variety of attractors, and a conclusive categorization and ensuing grammar are difficult to come by. This problem occurred severely with two of Combs' five subjects, and moderately with a third. Only for two of the subjects did the CLA have sufficiently regular behavior to justify detailed analysis of the results.
The results obtained for these two subjects do make intuitive sense. The pattern is one of a pattern of a normal mood with periodic deviations into non-normality. Non-normal moods, similar to the normal mood state, have a degree of persistence on their own. Jumps between normal moods and highly abnormal, outlying moods are unlikely; these unusual moods are generally reached via near-normal moods with their own internal coherence.
But, while these results are sensible, the clearest empirical conclusion to be drawn from this investigation is that, in order to obtain really good grammars for human mood fluctuation, one needs significantly better data than is provided by the Combs study. Since the amount of noise in the data would seem to be essentially outside the experimenter's control, what this means in practice is the collection of longer time series. Longer time series would allow clearer discrimination of various non-normal mood states and the transition between them. With the data used here, it is impossible to make inferences on such transitions with a comfortable degree of reliability.
This moderately pessimistic conclusion does not, however, cast doubt on the value of symbolic, pattern-theoretic methods as opposed to numerical methods of data analysis. What should be clear from the exploratory results presented here is that thetwo methods of analysis provide fundamentally different types of information. The numerical paradigm can tell us, in principle, whether or not chaos is present, how much chaos is present, and so forth; in the case of periodicity, it can tell us the period involved. This is useful information. The pattern paradigm, however, gives us information of more direct and intuitive psychological meaning. It tells us what kinds of states people get into, and how they transition between these states. The key point is that our natural-language understanding of psychological phenomena such as moods is based on the recognition of non-numerical patterns. Non-numerical, pattern-oriented tools for data analysis, such as the CLA, are thus entirely appropriate.
5.6 TWO POSSIBLE PRINCIPLES OF COMPLEX SYSTEMS SCIENCE
In the Introduction, the lack of precise, general laws in complexity science was mentioned. The need for an appropriate complex-systems "philosophy of science" was discussed. Instead of general laws, it was argued, one should expect complex systems science to provide us with abstract heuristic principles and general computational tools.
The ideas of this chapter follow this philosophy. The Chaos Language Algorithm is an example of a general computational tool. And, in this section, I will give some abstract heuristic principles which go along with the CLA: the Chaos Language Hypothesis, and the Structure-Dynamics Principle. These "principles" are not at this stage proven; they are merely plausible hypotheses. However, they are eminently falsifiable. Further work on the inference of structure from complex systems will automatically put them to the test.
The Chaos Language Hypothesis, first of all, states that the CLA will tend to produce similar grammars even when applied to very different psychological or social systems. In other words, it claims that psychological and social systems demonstrate a small number of "archetypal" attractor structures, and that the attractor structures observed in real psychological and social systems approximate these archetypal attractor structures. Mathematically speaking, the approximation of these archetypes should reveal itself as a clustering of inferred formal languages in formal language space. Thus one obtains the following formal hypothesis:
Chaos Language Hypothesis: The formal languages implicit in the trajectories of psychological and social dynamical systems show a strong tendency to "cluster" in the space of formal languages.
This hypothesis suggests the following three-stage research programme:
1. By computational analysis of data obtained from empirical studies and mathematical models, try to isolate the archetypal formal languages underlying complex psychological and social systems.
2. Analyze these languages to gain an intuitive and mathematicalunderstanding of their structure
3. Correlate these languages, as far as possible, with qualitative characterizations of the systems involved
If the postulated phenomenon of clustering could be shown, this would be a way of using dynamical systems theory to find precise, useful mathematical structures representing the qualitative properties of complex psychological and social systems. And this would, needless to say, be an extremely significant advance.
The Chaos Language Hypothesis postulates similarity of structure across different systems. The next hypothesis to be discussed, the Structure-Dynamics Principle, also postulates a similarity of structure, but in a somewhat different context: not between different systems, but between the purely static and purely dynamic aspects of the same system.
The essential idea of the Structure-Dynamics Principle is that, in many cases, specific components of a system are able to place the entire system in well defined states. Consider a system S which possesses two properties:
1. The components Si are each capable of assuming a certain degree of activation.
2. There are perceptible pathways between certain pairs (Si,Sj) of system components, indicating a propensity for activation of Si to lead to activation of Sj
The paradigm case here is obviously the brain: activation has a clear definition on the neural level, in terms of neural firing, and pathways are defined by dendritic connections and synaptic conductance. Similarly, if, following Edelman (1987) and others, one takes the fundamental components to be neuronal groups, one finds that the neuronal definitions of activation and pathways naturally extend up to this level.
Now let Sysi denote the collection of global system states that immediately follow high-level activation of system component Si. Then the question is whether the sets Sysi will emerge as natural categories from application of the CLA (or some more sophisticated, related algorithm) to the trajectory of the overall system S. Suppose this is the case: then what we have is a grammar of overall system states corresponding to the "grammar" of individual system components that indicates the order in which system components will be activated. But the latter grammar is, by the second assumption above, perceptible from purely structural information, from the observation of pathways between system components. Thus one has a grammar which is both a purely structural pattern (a pattern emergent in the collection of pathways in the system at a given time) and a purely dynamical pattern (a pattern emergent in the symbolic dynamics of the trajectory of the system). This is a most remarkable thing.
At this stage, there is no hard evidence that the Structure-Dynamics Principle is obeyed by real systems. However, the hypothesis is a plausible one, and, especially in the context of neuroscience and AI, it would go a long way toward solving somevexing problems. A little later, I will briefly consider the problem of knowledge representation from the perspective of the Structure-Dynamics Principle.
5.7 SYMBOLIC DYNAMICS IN NEURAL NETWORKS
With the concept of symbolic dynamics under our belts, we may return to the idea raised at the end of Chapter Two: that symbolic information is encoded in the brain/mind in the form of attractor structures. Symbolic dynamics gives a straightforward way by which logical and linguistic information can emerge out of complex, chaotic dynamics. It fits the psynet model like a glove.
But can brain systems really read each others' symbolic dynamics? In the end this is a question for the neuroscientist. As a mathematical psychologist, however, one may still gather valuable, pertinent evidence. For instance, one may ask: Are there biologically plausible ways for neural network architectures to encode and decode information in the symbolic dynamics of other neural networks? This is the question that we will address here.
Regarding encoding, first of all, it is quite possible to envision training neural networks to display given symbolic dynamics structures. This may even be done "blindly": it seems that the genetic algorithm may be used as a tool for evolving neural networks with given symbolic dynamics. Using networks of five to ten neurons, each one following the "discrete chaotic neuron" iteration of G. Mayer-Kress (1994; this is not a very simple formal model but a more complex iteration which attempts biological realism), and a population size in the range 20-50, it is possible to evolve neural networks whose strange attractors display various patterns of permissible and impermissible words (e.g., the second-order pattern of the generalized Baker map, in which 11 and 101 are disallowed). These are only simple, preliminary experiments, but they indicate that evolutionary processes are at least plausible candidates for the "training" of neural networks to produce formal languages. Given the existence of a strong body of evidence for the evolutionary character of neurodynamics and learning (Edelman, 1988; Goertzel, 1993), this is an encouraging datum.
Next, the question of decoding brings us back to the Chaos Language Algorithm. For the question is: Are there plausible neural network models that infer grammars from dynamics, in particular, from the dynamics of other neural networks.
The CLA, as currently programmed, is a fully "symbolic" implementation of the Markovian framework for grammar induction. It uses special trees for successor structures, and carries out categorization using the k-means method on tree space. However, there is no escaping the naturalness with which Markovian grammar induction lends itself to neural network implementation. Though relatively inefficient when run on today's serial computers, such neural network designs are of considerable theoretical interest. In this section I will describe one such design, which I call the "Markovian Language Network" or MLN. The MLN gives an affirmative answer to the question posed above. Yes, it says,biologically reasonable neural networks can infer grammars.
The Markovian Language Network
Alexander (1994) has designed a formal neural network that
recognizes repeated subsequences in strings. This network is "recursively modular" in structure, meaning that it consists of clusters within clusters within clusters.... This sort of network architecture is particularly interesting insofar it has been proposed to be essential to brain function (Alexander, 1995) (this idea will recur in the final section). Alexander's results show that the construction of successor structures can be carried out in a "biologically natural" fashion. The drawback of Alexander's network is that it requires a large number of iterations to carry out the recognition task, which makes serial simulation extremely slow. Recognizing the subsequences of length up to 4 in a single sentence can take a Sun workstation several days. However, calculations show that, if this algorithm were implemented in the brain, it would operate with acceptable speed.
Categorization, on the other hand, has been studied much more thoroughly by the neural network community. The most elegant and biologically sensible neural network model of categorization is Kohonen's self-organizing feature map (Kohonen, 1988). A feature map can be used to map n-dimensional successor structure space into the 2-dimensional space of a neuronal lattice. Categorization is then carried out by forming word classes from words whose successor structures activate neurons in the same region of the lattice.
Consider, then, the following architecture, consisting of four modules:
Module 1: a recursively modular network computing successor structures (Markov information) from a linear array of inputs representing linguistic units
Module 2: a self-organizing feature map, applied to the output of the Markov-information-gathering network, sorting the successor structures (and hence the original morphemic units) into categories.
Module 3: a simple network mapping the input array into a new array in which two elements have the same state if they correspond to two linguistic units in the same category. The output of this network is the "tag array."
Module 4: a recursively modular network (perhaps the same one that has been called Module 1) computing successor structures from the tag array, to be interpreted as "rules," and, possibly, to be fed into Module 2 again.
This type of neural network is what I call the "Markovian Language Network." The output of Module 4 is a list of the grammatical rules implicit in the input array. More sophisticated variations on the network can also be given, for instance, designs which incorporate a priori information on word categories in various ways.
The MLN requires only the addition of two extra "preprocessing" step to become a neural network implementationof the CLA. Specifically,
Module 5: a self-organizing feature map, used to provide a categorization of a neural system's behavior, mapping the network's states into a collection of categories.
Module 6: a simple network mapping trajectories of the neural system being studied into category labels ("code symbols"), based on the results of the feature map (in the manner of Module 3 of the MLN).
The code symbols are the linguistic units which the Markovian network in Module 1 of the MLN architecture receives in its input array. An MLN can thus be used to infer the grammatical patterns arising from another neural network's dynamics. A neural network is inferring the emergent grammatical patterns overlying the chaotic behavior of another neural network.
This train of thought has, it would seem, immense potential importance for neuroscience. Since the pioneering work of Walter Freeman (1993), it has become increasingly clear that strange attractors exist in neural systems and play an important role. Alexander (1995), generalizing Freeman's observations, has hypothesized that chaos is used by the brain on many different levels, as a general strategy for search and innovation. The MLN fits into this train of thought quite nicely: it provides a mechanism by which one neural system might understand and predict the complex, chaotic dynamics of another.
Pushing this line of reasoning to its natural conclusion, one might propose to understand the the brain as a whole as a system of neural networks mutually inferring languages in one another's behavior. This concept, while admittedly speculative, has deep meaning for cognitive science, as it hints at a possible strategy for bridging the gap between symbolic and neural network ideas. Chaos, in this view, may be seen as the link between logical, linguistic formalism and messy, self-organizing neurodynamics. And language is not a specialized, unnatural "add-on" to the routine processes of brain function, but rather, part and parcel of intermediate-level neural net dynamics.
5.8 DYNAMICS, PATTERN AND ENTROPY
Now, for the final section of this chapter, let us turn from psychology to physics, and look at some of the thermodynamic implications of dynamical patterns in system behavior.
My own slant on complex systems science is avowedly computational. I am not a natural scientist but rather a cognitive scientist, with expertise in mathematics, computer science, psychology and philosophy. However, one may also take a quite different approach to the emerging science of complexity: an approach grounded in the physics of complex systems. In this section I will deal with the relationship between dynamics and pattern from a physics point of view, by exploring some of the relationships between entropy, chaos, and algorithmic information. As this material will not be explicitly taken up elsewhere in the book, the impatient reader may wish to skip itover.
First I will review a result which I call "Brudno's Equation," which relates the metric entropy of the trajectory of a dynamical system to the computational properties of this trajectory, using the notion of algorithmic information. Algorithmic information is an important concept which will arise again in later chapters.
Then I will give some inequalities due to Carlton Caves (1990), related to Brudno's Equation, which connect chaotic dynamics with computational complexity. Finally, I will discuss Baranger's Law (Baranger, 1993), a recent discovery in far-from-equilibrium thermodynamics which explains entropy production in terms of chaotic dynamics. Drawing these various ideas together, I will argue that physical entropy bounds degree of chaos, which bounds algorithmic information -- a nice relationship which joins thermodynamics, dynamical systems theory and computation into a single formula.
Algorithmic Information and Entropy
The algorithmic information of an entity, introduced above, is defined roughly as the length of the shortest program which computes it. If an entity has algorithmic information less than its length, this means it has at least one pattern. If one has an infinite entity, say an infinite sequence of numbers like
123456789101112131415161718192021...
then the algorithmic information is defined as the value of
(algorithmic information of the first N digits)/N
for N very large.
This business of "program length" might seem a little dubious -- doesn't it depend on what computer you use? But this is exactly the crucial point. Suppose one has a huge collection of computers -- say, ten million or so. Some of them are PC's with built-in BASIC compilers, some are PC's without, some are CRAY supercomputers, some are tiny little Timex-Sinclairs from the late 1970's, some are sophisticated massively parallel wristwatch supercomputers from the year 2029,.... Then, if one takes a sequence of letters that is long enough, all the computers will basically agree on which programs are patterns in the sequence and which ones are not. The reason is that all of them are universal computers, and hence all of them can, in principle, simulate each other. The program for simulating another computer may be long, but there are only ten million computers, so there are at most (ten million) * (ten million) necessary simulation programs. Just take a sequence vastly longer than the longest one of these simulation programs. This fundamental insight is due to Gregory Chaitin.
But what does all this abstract computation theory have to do with entropy and chaos? In 1978, the Russian mathematician A.A. Brudno proved the following truly remarkable equation:
ALGORITHMIC INFORMATION = METRIC ENTROPY (10)
Now, this result is not completely general. It applies only to dynamical systems that are ergodic (a technical way of saying that every trajectory of the system must lead to basically the same behavior; time averages must equal ensemble averages). Or, for a dynamical system that is not "ergodic," it holds on the system's "ergodic set" ... the set of trajectories for which the system behaves as if it were ergodic. But most of the typical examples of chaotic systems are also ergodic. So Brudno's result, despite its limited scope, brings chaos theory up to a whole new level. It implies that, for every single trajectory of such a dynamical system, the algorithmic information is the same as the amount of chaos.
But the algorithmic information is just a measure of how difficult it is to predict the behavior of a system using computer programs. So we reach the following conclusion: for ergodic systems, truly chaotic systems, chaos is unpredictability. This is the most intuitive result in the world: this is exactly what chaos is supposed to be ... deterministic unpredictability. But Brudno's Theorem puts some meat on the idea: it explains why chaos is unpredictability.
The restriction to "ergodic" systems, though somewhat tiresome, cannot be escaped. These results are too powerful to hold with complete generality. But some recent work by the laser physicist Carlton Caves provides at least a partial generalization of Brudno's Equation. Caves has proved an equation which implies that, in every case,
METRIC ENTROPY < AVERAGE ALGORITHMIC INFORMATION <
2 * METRIC ENTROPY
(11)
(The reasoning leading up to this relation was borrowed from two Russians named Zvonkin and Levin, who in 1970 discovered a primitive form of Brudno's Equation).
This is not so precise as Brudno's Equation, but it still says quite a bit. In general, always, regardless of what kind of dynamical system we're dealing with, the average algorithmic information of a trajectory is trapped between the metric entropy and its double. Or in other words: the average algorithmic complexity of a system is between one and two times the amount of chaos in the system.
Entropy
Now let us turn from the mathematical notion of metric entropy to the physical concept of entropy. Operationally, in the chemistry lab, entropy measures the amount of heat that must be expended to set up a system in a given state, using reversible operations. But heat is really nothing more than random molecular motion -- and thus the entropy of a collection of molecules may be regarded as our degree of ignorance regarding the positions of the molecules. This what makes entropy so interesting. It is subjective and objective at the same time.
More formally, recall that each possible microscopic state of a system can be represented as a point in an abstract multidimensional space called "phase space." The entropy of a system in which all states are equally likely is just the logarithm of the total volume in phase space occupied by all the states of the system. And if all states are not equally likely, then one just divides the phase space up into regions R(1),...,R(N), assigns each region a probability p(i), and defines the entropy as the weighted average
-k[ p(1)logp(1) + ... + p(N)logp(N) ] (12)
where k is a number called "Boltzmann's constant." A standard limiting process extends this definition to the case of an infinite number of divisions of phase space; the sum becomes an integral, and the entropy becomes coordinate-dependent.
The First Law of Thermodynamics says that energy is conserved for closed macroscopic systems, just as it is for microscopic systems. An open system, one which interacts with its environment, may appear to gain or lose energy ... but if one considered the overall system of "open system + environment," one would find energy conserved once again. The Second Law of Thermodynamics, on the other hand, states that entropy must always increase. Entropy can never, under any circumstances, go down; and in an idealized mathematical system it might stay the same, but in a real system it will always go up. In other words, we can never get more knowledgeable, only more ignorant. We may gain knowledge about those things which are important to us -- but in doing so we will randomize the motions of other molecules and thus create ignorance in their regard. Life itself is a process of localized entropy decrease: living systems are highly ordered; their very premise is predictability and hence low entropy. But living systems can only maintain themselves by creating heat and hence increasing the overall entropy of the universe.
Mathematically, when one puts the Second Law together with the First Law of Thermodynamics, one obtains a contradiction. The First Law, the macroscopic conservation of energy, implies that, as a system evolves, volume in phase space doesn't change. This conclusion follows from a result called "Liouville's Theorem." The region of phase space representing a physical system can change its shape ... but its total area must always remain the same.
What gives? The facile answer is the the first law applies only to closed systems, whereas entropy is used to study open systems -- systems that are "dissipative" rather than "conservative." But this answer is actually no answer: every dissipative system is just an approximation to some conservative system. Every open system can be closed: just add in the environment, and consider it part of the system.
The real answer is much more interesting. In fact, it is the apparent contradiction between the First and Second Laws that gives entropy its meaning. The trick is that the region of phase space representing a system can spread out like a multidimensional octopus -- it can get stringy, and distribute itself very thinly over a very large range. This doesn't changeits mathematical volume. But what happens when an actual person goes to compute the volume ... to measure the system?
A standard observation in fractal geometry is that to compute the length of the coastline of Britain, one has to use a ruler of finite length. And the length of one's ruler determines the measured length of the coastline. The actual coastline doesn't change, of course; it's a matter of how many inlets and mini-penninsulas the ruler takes account of. Where fractals are concerned, measurement must take place on a certain specific scale.
A related conclusion holds concerning the measurement of volume in phase space. In order to actually measure the volume of a region, one must use some "measuring instrument" with a finite scale. For instance, one can divide phase space up into little tiny boxes and count how many boxes contain part of the region.
Multiplying by the volume of a single box, this should give the total volume of the region concerned, right? Well ... not exactly. The problem is that, when a region spreads out and gets all stringy, it will have a little tiny portion in a great number of regions. So it will appear to take up more space than it actually does.
This, as it turns out, is the source of entropy. Though it is measurable in the chem lab, entropy is nonetheless at bottom a subjective quantity. It arises from the necessity for finite-scale measurement accuracy. When measuring an evolving dynamical system, if one wants to keep up a constant accuracy of measurement, one has to continually measure on a finer and finer scale. But then one is no longer computing entropy on the same partition of states ... the entropies measured at two different times cannot be fairly compared. Entropy is free to increase ... as the Second Law says it must.
Entropy and Metric Entropy
So a system can be represented as a collection of states, each of which is a point in "phase space." And the region of phase space representing a system will tend to spread out, octopuslike, blurring the process of measurement. But why does this happen?
The answer is nothing but chaos. The mathematics of chaos theory may be used to make the informal idea of "spreading out like an octopus" quite precise. Chaos causes nearby trajectories to diverge from each other -- and hence blurs phase space, producing entropy. The following two conclusions, both simple mathematical results, are truly remarkable in their implications. They provide a complete mathematical justification for the informal "spreading out and getting stringy" argument that was used above to explain the origin of entropy.
First of all, there is an inequality relating the metric entropy with the real entropy:
METRIC ENTROPY < ENTROPY/k (13)
This is simple to interpret. It means that, if the entropy of an evolving system is small, then the system doesn't spreadthings out much. On the other hand, if the system spreads things out a lot, then the entropy is big.
And then there is an equation, actually a "limiting equation," regarding what happens once a system has been evolving for a long time:
RATE OF CHANGE OF ENTROPY --> METRIC ENTROPY (14)
Suppose that, as the system evolves longer and longer, one computes the rate of change of the entropy at every time step. How fast does the entropy increase? Eventually, this equation says, the rate of entropy increase is just about equal to the amount of spreading ... to the metric entropy.
These are not "deep" results; they follow easily from the technical definitions involved. But they do a most admirable job of connecting chaos theory with far-from-equilibrium thermodynamics. Subtracting all the technical stuff, what they say is quite simply that chaotic systems have a lot of entropy, and they produce entropy faster than nonchaotic systems. Note that we have not ruled out the possibility of a nonchaotic system with high entropy. But this system, if it exists, will produce entropy slowly. If a system produces entropy fast, it's got to be chaotic.
Baranger's Law
The Second Law says that entropy goes up. But how fast does it go up? One possibility is that it always increases as fast as it can, given the constraints of the particular situation. This principle of "maximum entropy production" was put forth in 1989 by the system theorist Rod Swenson; its philosophical implications were extensively explored by Sally Goerner in her 1993 book The Evolving Ecological Universe.
When one considers the matter carefully, one finds that Swenson's law of maximum entropy production is a serious overstatement. It is simply not true that, in all circumstances, entropy increases at the fastest possible rate. There are situations in which a system has a choice between two states -- and it choose the one which produces entropy more slowly. A simple example is water being forced through a pipe. As it rushes faster and faster, it makes a transition from smooth, "laminar" flow to chaotic turbulence -- and during this transition, it passes through a stage known as "intermittency," in which smooth and turbulent flow rapidly succeed one another. In this sort of transition, pathways of rapid entropy production are routinely passed up in favor of those which are slowpokes in the entropy game. Swenson's law does not hold.
But there are also interesting situations in which Swenson's idea works. One example is a peculiar apparatus called a Benard cell. A Benard cell is a flat box filled with some fluid, often water. The bottom of the box is a metal plate, and the critical parameter of the system is the temperature. With no heat, the cell is random; the only action is random molecular collisions. Brief molecular inhomogeneities arise, but these are mere ephemeral fluctuations from the equilibrium state. And if you take a Benard cell and heat it up just a little, the heat willdiffuse slowly through the cell, returning the system gradually to equilibrium. But suppose instead you heat the cell up a lot -- that's where the fun starts!
Hot collections of molecules are lighter than cold ones -- and if a collection is hot enough, its buoyancy will be sufficient to overcome the viscous drag of the surrounding fluid, and it will float around as a coherent unit. Once a certain critical temperature is reached, some of these overheated collections will shoot up all the way to the top of the cell, and will pull other fairly hot molecules up along with it.
But then, what happens when the hot molecules finish the voyage to the top? The air outside cools them down ... they sink back down, pulling other fairly cool molecules with them. Then, once they get to the bottom, they heat up again. Up and down, up and down the molecules cycle -- thus setting a trend for other molecules to follow. Before long a whole region of the cell is rolling over and over. This is a classic example of self-organizing behavior. No one tells the cell what to do: it just organizes itself into a coherent high-level pattern, based on a tiny change in the critical parameter of temperature.
Why doesn't the cell return to equilibrium? In dynamical language, the rolling motion is a kind of attractor. As the molecules roll up and down, they create friction and pull in more energy from hotter parts of the cell -- so there is no way for the situation to settle down, for the energy difference to disappear. This would not be possible in a closed system -- but the system is dramatically open; it is rapidly dissipating energy, and increasing entropy. The system keeps itself away from equilibrium. Instead of the "fixed point" attractor of equilibrium, it converges to the more complex attractor of continual rolling motion.
Eventually, uniform "rolls" emerge, and occupy the whole cell. Very elegant, totally spontaneous -- and, as shown by the computer simulation of David Hogg (1992), entropy production maximizing. As one increases the temperature of the Benard cell, the system systematically chooses the pathway that produces entropy fastest. When it finally opts for pulsating rolls instead of statistical homogeneity, this is because the former situation zooms us more quickly toward heat death!
Two Types of Bifurcation
Fluid forced through a pipe doesn't obey maximum entropy production. Emergence of rolls in the Benard cell does. Thus, an intellectual puzzle emerges. What is it that differentiates the two types of experiment? When is entropy maximized?
As it turns out, one difference between the two is the type of bifurcation involved. A "supercritical" bifurcation is a situation in which, when a certain parameter of a system passes a certain critical value, the system reaches a "choice" between two mathematically possible paths. One path is stable, the other unstable. The stable path, in practice, is the one selected -- and the transition is, in a way, smooth ... there is no discontinuous "jump" in system behavior. A "subcritical" bifurcation, on the other hand, involves a sudden leap from one state to another -- such as when water running through a pipehops from smooth to turbulent flow. In technical language, a supercritical bifurcation has a continuous bifurcation diagram, while a subcritical bifurcation does not.
Rolls in a Benard cell arise through a supercritical bifurcation: when the temperature passes a certain level, the future of the system splits up into two possible paths, one stable, one not. Turbulent flow through a pipe, on the other hand, arises subcritically: there is no smooth motion toward a "choice point" ... as the speed of the flow increases, there is a sudden jump.
Given these facts, analogical reasoning leads to the following conjecture: that Swenson's principle of maximum entropy production works only when a system is presented with a choice that is a supercritical bifurcation.
But, as it turns out, even this hypothesis is too strong -- another restriction is needed. What is necessary is first of all that, before the bifurcation, the system's state have a certain property which is a "dynamical invariant."
Losing Structure through Chaos
A "dynamical invariant" is any property of a system that doesn't change as the system evolves. For instance, the Benard cell, until the bifurcation, has the property of statistical homogeneity ... meaning it looks the same everywhere. This property doesn't change as the system evolves: it's a dynamical invariant. Often, in physics, dynamical invariants take the form of symmetries ... for instance, the uniform appearance of the Benard cell may be characterized as a "translational symmetry," meaning that when you shift your eyes from one part of the cell to another (i.e. translate your coordinate system), the appearance of the cell remains exactly the same.
What is needed to make Swenson's principle work is that a system have a dynamically invariant property, which is lost in a supercritical bifurcation. In other words, the bifurcation splits the system's evolution into two "possible" branches, one stable and one unstable ... and the invariant property follows the unstable branch, the branch that doesn't happen. In this case, as it turns out, the stable branch without the property has a higher entropy. This, finally, is Baranger's Law -- proved by Michel Barenger in 1993, in a brief paper modestly entitled "An Exact Law of Far-from-Equilibrium Thermodynamics."
In the Benard cell, when one reaches the bifurcation point, the homogeneous, symmetric condition is no longer stable. There is, mathematically, a possibility for the system to continue in a symmetric way -- for the cell to forever keep from rolling. But this possible path never materializes, because it is unstable -- the slightest random fluctuation is enough to destroy it. On the other hand, the stable state, the complex self-organizing rolling behavior of the cell, fails to display the symmetry property that the system had before the temperature reached its critical value.
The proof is somewhat involved, but the reason for this beautiful law is not so hard to see. Consider: one never has perfect obedience to any invariant property -- in reality one has situations that are "almost homogeneous", "almost circular", etc. Even before the bifurcation, what one is actually observing is not a trajectories possessing a dynamical invariant, but a trajectory almost possessing a dynamical invariant. Before the supercritical bifurcation, these near-invariant-possessing paths stay near the "phantom" paths that really have the dynamically invariant property. Afterwards, though, chaos sets in ... large Liapunov exponents develop in directions away from the phantom paths, and the real paths lose the property. But we have seen that chaos -- Liapunov exponents, metric entropy -- is proportional to rate of entropy production. Nature, by following chaos, maximizes entropy production!
A complex line of reasoning ... but yet, a very simple conclusion. What we have concluded is, quite simply, that smoothly losing structure increases entropy. The "structure" may be symmetry or any other dynamical invariant. When a supercritical bifurcation occurs and this structure is broken, the structure-shattering chaos produces entropy. In the process, the system may well develop new, more complex structures, like the rolls in the Benard cell ... but the old invariant structure is lost.
Chaos and Computation
Let us return to the Benard cell, one more time. Unfortunately for Swenson's Law, the emergence of rolls is not the whole story. It turns out that, if the temperature is increased at just the right rate, the rolls give way to a sort of "honeycomb" of equally sized hexagonal cells. If the entropy went up here, one would have another example of entropy increase driving form creation.
But in fact there is no reason to believe this is the case. This is a subcritical bifurcation, not a supercritical one. The hexagons, which are after all a more interesting structure than the rolls, are not formed by entropy increase at all.
What Baranger's Law doesn't get at is the production of new structure that often follows the destruction of an old structure. It explains the demise of the homogeneous structure of the Benard cell, but not the emergence of the rolling structure ... let alone the emergence of the hexagonal structure. But it is these phenomena, rather than the initial symmetry, which make the Benard cell interesting.
To see the relation between chaos and structure, we have to return to computation theory. Given the relationship between metric entropy and entropy, Brudno's Equation is easily seen to imply that
ALGORITHMIC INFORMATION < 2*ENTROPY/k (15)
So if a physical system has almost no entropy, all its trajectories will be very simple to compute. Only a system with large entropy can have algorithmically complex trajectories. But then, complex trajectories lead to high metric entropy, which means rapid entropy production. Everything connects with everything else. Abstract computational patterns are not allowed to forget their origins in an actual, physical dynamical system. Thermodynamics, chaos and computational complexity are threedifferent views of the same thing: complexity.