The mathematical methods of nonlinear time series analysis are idealized versions of the techniques that the mind uses to understand the time series of perceptual data that are continually presented to it. In particular, a strong parallel exists between
the chunking of perceptions into wholes, and their subsequent entry into long-term memory, and manipulation via associative and hierarchical relationships, and the analysis of time series via dimensional embedding (production of
delay vectors), and the carrying out of prediction, noise reduction and global modeling using
neighborhoods of delay vectors, and centroids of these neighborhoods
Key words: dimensional embedding, delay vector, perceptual chunking, associative memory, categorization
Complex systems philosophy tells us that the interesting structures we see in the world around us are emergents, arising via the self-organization of smaller-scale structures, interacting in relatively simple ways. But the processes of self-organization and emergence are much easier to identify qualitatively than quantitatively. Time series analysis is about taking numerical data derived from complex systems, and using this data to understand the self-organization and emergence qualitatively observed in the system.
Traditional methods of time series analysis are based on linear mathematics. However, linearity and self-organization are antithetical; to study self-organization and emergence we need to do nonlinear time series analysis (see Footnote 1 for a philosophical explanation of why). The techniques for analysis of time series resulting from complex, nonlinear processes are numerous and sophisticated; many are summarized e.g. in Kantz and Schreiber (1997). However, these techniques, at present, have the appearance of a grab-bag of assorted methods rather than a unified conceptual approach.
My goal in this article is to come at the topic from a different direction, and to present a unified conceptual framework for nonlinear time series analysis, based on the complex systems philosophy of mind outlined in my previous publications (Goertzel 1993, 1993a, 1994, 1996, 1997). I will then explain how the standard methods of nonlinear time series analysis fit into, and in many cases emerge from, this framework.
The basic premise of the analysis given here is that understanding a nonlinear time series, describing a complex process, is a cognitive operation, similar in structure and dynamics to other complex cognitive operations. In particular, a strong parallel will be noted between
Specific aspects of the mapping proposed are the following:
perceptual chunking : decomposing series into delay vectors associative spreading finding neighborhoods between chunks stored : in delay vector space in long-term memory finding categories finding centroids among chunks : of delay vectors living in (building hierarchical neighborhoods mental structure)
In essence, it is argued that the mathematical methods of nonlinear time series analysis are idealized versions of the techniques that the mind uses to understand the time series of perceptual data that are continually presented to it.
Finally, although this aspect is not enlarged on here in detail, it should be noted that these observations have dramatic implications for the design of intelligent time series analysis software. They give a conceptual blueprint for integrating nonlinear time series analysis into the cognitive activity of a general-purpose self-organizing artificial intelligence system.
In Goertzel (1993) I define an "intelligent" system as a system which can achieve complex goals in complex environments. It follows from this that an intelligent system must be able to understand multi-part systems that change in complex ways over time. Every intelligence must do this in its own way, but there are certain emergent properties which, I claim, are universal to all intelligent systems.
Rather than justifying these "emergent universals of intelligence" in detail here, I will simply summarize some of them. A variety of justifications may be found in the publications referenced above.
Now, taking this framework for granted, consider the situation of an intelligent system when presented with perceptual information. The perceptual organs of the system are obviously only giving it a fairly narrow "window" on the complex systems in its environment. Through its perceptual organs, it receives a stream of input, which is a projection of the stream of information present in the actual environment. In order to understand the environment, the system must reconstruct, in some sense, the emergent structures in the actual environment from the projection it gets from its sensory system.
Why must it reconstruct the emergent structures in the environment? Because, in order to achieve goals in a complex environment, an intelligence must be able to predict what is going to happen in the future of the environment based on what has happened in the past; and to infer what is happening in one part of the environment based on what is happening in another. And the only way to predict a complex system is to recognize its emergent structures -- chaos implies that predictability on the level of individual data points is unreliable, whereas prediction on the level of emergent structures has at least a hope of being approximately reliable!
Of course, this "reconstruction" need not be explicit -- an intelligent system doesn't have to build an inner simulacrum of its environment. But, it has to behave with understanding of the emergent structures in its environment. In order to achieve goals based on the understanding of these emergent structures, it must be able to manipulate internally its implicit or explicit representations of these structures. It must be able to infer relationships between emergent structures, most notably associative and hierarchical (e.g. sequencing) relationships.
The lineal awareness system is crucial here. The basic element of perception is not the atemporal instant but the temporal chunk, a series of perceptions grouped into an instantaneously perceived whole. That the perceptual chunk is simultaneously temporal and atemporal is the essence of all phenomenological weirdness -- the "twistiness of being," one might call it. A temporal chunk represents an elemental observation of the environment, which, if sufficiently interesting, is entered into the long term memory as a new, autonomously acting, information-bearing agent. The long-term memory -- the general, self-organizing agent system of mind -- records the relationships between notable temporal chunks, e.g. which ones follow which other ones, which ones associate with which other ones, which ones are parts of which other ones.
In the instance of language processing, a temporal chunk is a brief phrase, such as "Bill tickles me" or "a temporal chunk is a phrase." In music, a temporal chunk is a musical phrase -- and much interesting music achieves its interestingness by playing around with the boundaries of temporal chunking. You're never quite sure when one chunk is going to end and the next begin! There are cases, such as fugues or samba drumming, where multiple overlapping temporal chunks are intentionally set up by a composer or performer, but these are not necessarily counterexamples -- one may argue that such compositions derive their meaning from the way they partially parallelize a fundementally serial perceptual chunking mechanism.
In many cases chunks are non-overlapping. One musical phrase begins when the previous one ends. However, overlapping chunks also have their roles, in modern music and poetry, and in other contexts as well. Language parsing involves the perception of many overlapping chunks, representing possible understandings of parts of a sentence. The ultimate parsing of a sentence involves dividing the sentence into a series of nonoverlapping chunks; but the semantic understanding of the sentence may involve some of the overlapping chunks discarded in the process of forming the "main," syntactic meaning of the sentence. It is not only the nonoverlapping chunks that are entered into long-term memory.
Reasoning is then done using temporal chunks and the relationships inferred between them. This reasoning takes many forms. For example, when a temporal chunk is activated in memory, temporal chunks that are similar -- associated -- are also activated, via association-bearing mental agents. Furthermore, temporal chunks that follow it are activated, via succession-relation-bearing mental agents. (The nature of these agents will vary depending on the hardware underlying the mind in question.)
In order for intelligence to be possible, the inferred relationships between the temporal chunks extracted from sensory projections of the environment has got to mimic, approximately, the actual relationships between the states of the environment corresponding to the formation of the temporal chunks. For example, the inferred dynamics of the temporal chunks (which chunks follow which ones with what probabilities) must approximately mimic the actual dynamics of the environment (which environment states follow which ones with what probabilities). Categories recognized among temporal chunks should group together chunks corresponding to environmental states that, in fact, play similar roles in environmental dynamics. Etc.
Intuitively, this kind of mimicry is clearly possible if the mapping between states-of-the-environment and temporal chunks is generally one-to-one; i.e., if each state of the environment corresponds to a different temporal chunk. If the mapping is not one-to-one, then the mimicry will be more difficult, since a single temporal chunk will refer to a large number of different environmental conditions.
In reality, of course, the mapping is very rarely one-to-one. Some simple mathematics may be revealing here. In a very system, the N elements comprising a length-N temporal chunk will be largely independent of each other, and so if each element can take one of K different values, there will be KN possible temporal chunks. If one wishes to think in terms of continuous-valued attributes, the same type of information limitation may be arrived at using a different language. For instance, physicists like to talk about "degrees of freedom", and would say that a length-N temporal chunk has N degrees of freedom, and hence can only directly mimic an environment with N degrees of freedom or less.
If the environment has more possible states than the collection of temporal chunks can accomodate, then a one-to-one mapping is not possible. Instead, however, one may ask that an individual temporal chunk corresponds to a natural region or "cell" in the state space of the environment. The relations between temporal chunks then correspond to relations between cells in the environment's state space. The degree of this correspondence depends on many different things.
Many different environmental states will correspond to the same sensory input I. However, when one considers a series of two sensory inputs -- say, I(T) at time T and I(T-1) at time T-1 -- then the number of environmental states at time T consistent with these sensory inputs becomes smaller. Many of the environmental states consistent with I(T) would not likely be preceded by I(T-1). The longer the temporal chunk, the more one narrows down the possibilities; the smaller the cells in environmental-state-space corresponding to a given temporal chunk will get; and the better the system's understanding becomes.
On the other hand, longer temporal chunks become more and more difficult for a mind to manage. A large pool of short chunks is information-rich; relationships amongst chunks can be extracted very easily, because each chunk can be considered in the context of many other similar chunks. Each long chunk will not have as much context, and so will not be analyzed as carefully.
In short, one factor mitigating in favor of long temporal chunks is the fact that they can, in principle, provide better understanding of the environment. On the other hand, chunks that are too long relative to the lifespan and overall memory capacity of the intelligent system will be of no use, because it will be impossible to carry out statistically significant reasoning in regard to these chunks. These are by no means the only factors conspiring to determine the capacity of the mind's lineal awareness system; but the question of lineal awareness system capacity will not be considered any further here.
Much of nonlinear time series analysis is based on dimensional embedding, i.e., the construction and manipulation of delay vectors. Given a long series of data, one creates a bunch of short data samples from it, subseries of length m called "delay vectors". One then considers this series of delay vectors as a time series in itself, and analyzes the dynamics of this derived time series, under the assumption that it will mimic the dynamics of the complex system that gave rise to the original time series under analysis.
The mimicry of the original dynamical system by the dynamical system implicit in the delay vector time series is mathematically "guaranteed" to almost always hold, by some highly technical theorems, proved by Takens (1985) and Sauer (1993). However, this mathematical guarantee applies only to infinitely long time series known with infinite accuracy, and getting it to work in reality can prove highly difficult. The conditions of the mathematical guarantee are as follows. If the series of actual states of the system under observation falls into an attractor of fractal dimension D, then a collection of delay vectors of length > 2D will mimic the dynamics of the original system.
In fact, delay vectors are not an arbitrary mathematical trick, but rather a reflection in abstract mathematics of the psychological technique of perceptual chunking. What the embedding theorems say is that a perceptual chunking system which produces chunks of length N can be expected to produce a subset of long-term memory that accurately mimics a real-world system whose attractor has fractal dimension less than N/2. For instance, if we have a chunk size of 7, then we're able to accurately mimic systems whose attractors are 3-dimensional, but not 4-dimensional. Note that the important variable here is not how many variables ("degrees of freedom") the system being observed appears to have in this or that mathematical model, but rather the dimensionality of the system's attractor -- which indicates how many dimensions are minimally needed to describe the system's long-term behavior.
Mathematically, the root of the "embedology" results is Whitney's (1936) embedding theorem, which says that any D-dimensional surface can be embedded in 2D+1 dimensional space. This is a pleasant, intuitive result, which may be proved inductively in the spirit of the following observations. If a D-dimensional surface is embedded in D dimensional space, it may overlap itself in the manner of a Klein bottle, and the overlap will be of dimension D. If it is embedded in D+1 dimensional space, the overlap will be of dimension D-1; and so forth, until in 2D+1 dimensional space, the overlap disappears.
So, suppose one has taken the D-dimensional attractor A of a certain dynamical system, and embedded it in M < 2D+2 dimensional space. The region of M dimensional space containing the attractor may be divided up into cells, and the movement of the system's states through the attractor may be captured as a table of probabilities of cell transitions. One may ask, for each cell x: how long can the system be expected to stay in cell x; and, if the system is in cell x right now, what's the chance that y is the next cell it visits after it leaves x? Clearly, if
Now, suppose one is not observing the dynamics of the entire system, but only has access to some one-dimensional measurement of the system. Clearly, the one-dimensional measurement at time T does not generally capture all the detail of the system's state at time T. Since the system's attractor is a D-dimensional surface, embeddable in M-dimensional space, we know that at least M independent measurable values must be needed to capture the system's state at time T.
However, one may construct a delay vector of one-dimensional measurements, of length M. This chunk of length M does have the right number of variables to characterize the system's state at a given time. Given experimental error, one must consider that one actually does not have a precise delay vector, but rather a cell in M-dimensional space centered on particular delay vector. (For even greater precision, we would consider this a fuzzy cell, but this will do for now.)
One may now ask whether each cell in the M-dimensional space, representing a given collection of system states, corresponds uniquely to a cell in the M-dimensional space, representing a collection of delay vectors. I.e., is each system state uniquely characterized by a history of M one-dimensional measurable values leading up to that system state, or is there duplication? If this unique characterization property holds, then we can make a quite remarkable mapping: each delay vector corresponds to a certain cell in state space. Furthermore, if cell x in state space is nearly always followed by cell y, then the delay vector corresponding to x will nearly always be followed by the delay vector corresponding to y. In other words, the dynamics of the delay vectors will mimic the dynamics of the original system. The mechanism of perceptual chunking allows one to overcome the limitation imposed by projecting a multidimensional system into a single-dimensional measurement. Ignorance of aspects of the world, imposed by narrow perceptual bandwidth, is overcome by attention to process.
The weak point in this argument is the assumption of unique characterization: why should it happen that each cell of system states corresponds uniquely to a cell of delay vectors? Why not have different states that have the same history of one-dimensional measurements leading up to them? What is known, mathematically, is basically this: If you pick a random dynamical system from an appropriate distribution on the space of dynamical systems, then the unique characterization property will hold for virtually all cells. Now, real-world dynamical systems are arguably non-random. However, the method has been shown to work on a number of interesting special cases, and so it seems to be sound in practice -- assuming one gets everything right, i.e. one has measurement samples at the right frequency, one guesses the dimension M correctly, there is not too much randomness contaminating the determinism of the data, etc.
If the parameters are not chosen correctly, or if there is a great deal of noise polluting the data, then the method will still work in a probabilistic sense -- the table of transition probabilities between delay vectors will still approximate the table of transition probabilities between system states. But the accuracy will be degraded; and the attractor obtained will make less geometric sense viewed from a coordinate-geometry rather than transition-probability point of view.
This is fascinating mathematics -- but it is also more than this. It takes only a moment's reflection to see that dimensional embedding is essentially the same operation as perceptual chunking. It is simply an idealized case. Instead of recording only the most interesting chunks in long-term memory, we keep a record of all chunks. Instead of mostly distinct chunks, we take all overlapping chunks. And instead of settling for rough probabilistic correspondences between real-world dynamics and internally inferred dynamics, we seek to get the parameters exactly right, so we can perfectly mimic real-world dynamics.
Looking at nonoverlapping chunks, as the mind habitually does, simply gives a less filled-in sampling of points on the attractor. And having rough probabilistic correspondence rather than exact correspondence means that the mind can't rely on visual similarity of inferred attractors, but only on approximate similarity of transition probability tables.
Many scientists involved with applying dimensional embedding to behavioral data have been frustrated with the results, but this is because their situation is more like the situation of a mind confronted with a highly complex environment. Given the complexity of the systems they are studying, they cannot expect highly accurate, geometrically beautiful reconstructions, but only rough probabilistic approximations of real-world attractors.
In this section, I will review several of the nonlinear time series analysis methods described in Kantz and Schreiber (1997), and explain how they reflect the complexity-based temporal information-processing framework described in the Section 2. I will not discuss every algorithm presented by Kantz and Schreiber, but I will discuss the most important ones, and in a way that gets at the root ideas common to all the variations. References to chapters and pages, in this section, should be assumed to be to Kantz and Schreiber.
Most of the techniques for nonlinear time series analysis described in Kantz and Schreiber are techniques for manipulating delay vectors, i.e., idealized perceptual chunks. The manipulations used are highly reminiscent of the basic psychological operations described in the previous section, involving association and categorization, i.e., heterarchical and hierarchical structure, and the interaction between the two.
Let us begin at the purely mechanical level. Appendix 2 of Kantz and Schreiber describes an simple box-counting algorithm for
Many of the time series algorithms described by Kant and Schreiber involve taking the collection of delay vectors in the neighborhood of delay vector V as a kind of surrogate for V -- under the principle that, given the error involved in measurement, V might as well have been any of these neighbors. This is the simplest form of reasoning by analogy; it is merely association-spreading. To find out about V, this heuristic technique says, ask about things associated with V as well.
In particular, this technique is proposed in Chapter 4 as an approach to time series prediction. To predict what will happen next given a certain series of data, find all delay vectors associated with this series of data, and take the average prediction.
It is also proposed as a basic method for noise reduction. To remove noise from a data value, consider the delay vector V with that value in the middle of it; and replace the data value with the average of the numbers in the middle of the delay vectors similar to (associated with) V.
These are useful specialized techniques for analogical reasoning, for using associative context to transcend information limitations. They embody a basic psychological strategy: if you don't know enough about one thing, use information you've gathered about other things similar to that thing. They are identical to spreading activation in a associative network of perceptual chunks.
I have described these methods in terms of associations, but they may also be described in terms of hierarchy. We many consider the notion of the center of mass or "centroid" of a collection of delay vectors. This is closely related to psychological theories of categorization. The most basic theory of mental categorization is the "exemplar" theory, which states that the degree of membership in a category is given by the distance to the exemplar of the category, which is taken as the category element closest to the average of all category members. This is not a perfect theory of categorization by any means, but it roughly reflective of very many human categories. Furthermore, algorithmically, it is equivalent to the k-means method, which is a very good technique for performing an initial categorization on a large body of data.
Categorization is the root of the hierarchical network that complements, in the mind, the heterarchical network of associations. We thus have, reflected in nonlinear time series analysis, a small version of the dual network of mind -- hierarchy/heterarchy superposed. The simple method for time series prediction described above may be summarized as: use associations to form categories, and take as the prediction for a delay vector, the prediction for its category. Simple noise reduction may be summarized as: replace a delay vector by its association-inferred category.
The basic cognitive operation of nonlinear time series analysis thus emerges as:
Proceeding onwards, the "locally linear noise reduction algorithm" given in Chapter 10 is a fairly convoluted example of this same cognitive operation. Actually, I believe that this is a fairly unnatural algorithm, without any direct counterpart in human psychology. However, it is interesting to consider the algorithm in perceptual-chunk terms anyway, in order to see just in what way sophisticated mathematics can cause us to deviate from psychological principles of data understanding.
The basic question addressed by locally linear noise reduction is: What do we do if we've introduced too many dimensions in the construction of delay vectors, so that our reconstructed attractor has more dimensions than the real attractor? We need to identify the extra dimensions, which, because of noise, may look a bit like the genuine dimensions. For instance, suppose the correct embedding dimension was 2, but the embedding dimension chosen was 3. Then, ideally, one would obtain a 2-dimensional surface sitting in a 3-dimensional space, and one would realize that one had used one dimension too many. In practice, however, the attractor obtained from the series of delay vectors will be a 2-dimensional surface fuzzed out into 3 dimensions, and the degree of fuzzing may make it difficult to tell which are the correct 2 dimensions.
The approach taken here is to compare each delay vector V to the centroid C of the collection of delay vectors in its local neighborhood (delay vectors within distance less than epsilon from it). One then identifies the directions in which V differs least from C, and moves V back toward C in these directions -- figuring that these directions the extra ones, in which V would not differ from C at all if not for noise. Having done this for every delay vector, one then adjusts each individual data value by the average adjustment over all delay vectors containing the data value.
The essential concept underlying this method is the same as in the simple noise reduction method: one is doing is moving each perceptual chunk closer to the category that contains it. The subtle point here, however, is that one is not merely moving it along the straight line to the category centroid. Rather, one is trying to find the optimum direction along which to move it toward the category centroid -- moving it along the directions in which it differs least from the category. Most simply, for instance, suppose that a delay vector was very similar to its category centroid in its third element: then the third element in the chunk could be made even more similar to the category's third element, on the principle that this third element was actually supposed to the be same for all points on the attractor. In general, the directions involved will not be specific coordinates of the delay vectors in the given representation, but rather linear combinations of coordinates of delay vectors; but the idea remains the same.
Locally linear noise reduction feels unnatural to me in much the same way that the backpropagation method for neural net weight updating does. It seems to be using fancy mathematics to do efficiently what the mind does in a simpler but less efficient way. Locally linear noise reduction is a technique that is definitely valuable in cases where one has a clear-cut, low-dimensional chaotic attractor polluted by a small amount of noise. This situation does occur in time series analysis, especially in physical science. However, it is quite different from the situations that environments typically present to intelligent systems. Just as backpropagation is useful for training neural networks to solve simple problems, but scales poorly to complex ones; so, one suspects, locally linear noise reduction will fail to scale well to high-dimensional or severely noise-polluted problems.
A far more psychologically natural reflection of the hierarchical network of mind is found in the "radial basis function" method for global time series modeling, described in Chapter 12.4.2. This is, in essence, a method for approximating the lower levels of a perceptual hierarchy in terms of higher levels. It involves selecting a series of delay vectors as basis elements yi, and expressing other delay vectors y as
y = f(x) = a0 + a1 Phi( | x - y1 | ) + ... + an Phi( | x - yn | )for appropriate x, where | | is a norm and Phi(r) is a decreasing function of r, e.g. a multidimensional Gaussian. The basis elements yi must be chosen appropriately, to be well-distributed across the attractor of the system.
For example, the basis vectors yi are well taken as the centroids of categories found by applying a clustering algorithm to the data values, e.g. the k-means clustering algorithm. One is then approximating the individual data values as combinations of the category centroids -- one is using the categorization, the hierarchical coarse-graining of the system, to understand the items categorized. One is decomposing the system into a collection of "elemental behaviors" -- the basis vectors yi -- and then attempting to describe the system's whole dynamics as a linear combination of these elemental behaviors.
Of course, this is only the simplest possible way of expressing a system's behaviors in terms of basis behaviors. A subtler approach would use a function F(x) that was a nonlinear combination of the basis functions Phi(|x-yi|). Maja Mataric (1995), in her work on mobile robotics, expresses a robot's repertoire of behaviors in terms of combinations of basis behaviors; where the combinations are either linear, or sequential, meaning that x can be formed by yi followed by yj. This approach might be useful here, if one were operating in a space of delay vectors of various lengths.
Mathematical formalism aside, the basic idea underlying the radial basis function method is nonetheless quite psychologically natural. One is taking particular perceptual chunks as "elemental" and expressing other perceptual chunks as combinations of these elemental perceptions. Like all modeling, this is essentially a compression technique. Here, in order to cope with a vast, unwieldy and noisy pool of raw data, one abstracts higher-level categorial structure from the data, builds hierarchical relationships from data values to categories, and then uses the abstracted structure to recapture as much of the individual detail as possible.
The purpose of this kind of strategy, in an intelligent system, is simply that not every perceptual chunk observed can be retained forever. In order to retain some information from perceptual chunks that have been left by the wayside in order to make room for new ones, one seeks to express wide fields of perceptual chunks as simple combinations of a few abstract categories, archetypal chunks.
The discussion of specific nonlinear time series analysis methods in psychological context could continue indefinitely .. but the main points have, I think, been adequately made. Dimensional embedding is idealized perceptual chunking; and the standard nonlinear time series analysis methods, as presented by Kantz and Schreiber, are idealized versions of common psychological strategies for dealing with perceptual chunks once they have been entered into the self-organizing network of long-term memory.
This line of thinking has obvious consequences for the future of nonlinear time series analysis. The moral of the story is: Time series analysts, go psychological!
Specifically, we should seek to go beyond elementary association-building, analogy and category formation, and embody more and more sophisticated psychological strategies for perceptual-chunk-handling in idealized mathematical number-crunching algorithms. Nonlinear time series analysis should be explicitly treated as a subset of AI, with fascinating parallels to other chunking-oriented branches of AI, such as natural language processing.
The WebMind AI software, under development by myself and my colleagues at IntelliGenesis Corp., reflects this attitude by providing unified techniques for handling numerical and textual data (see Goertzel, 1996 for an early discussion of Webmind). But that is another story!
My study of Kantz and Schreiber was motivated by a thread on the chaopsyc listserver in April and May 1998, involving detailed chapter-by-chapter reviews of this book and discussion of the practical relevance of the techniques presented.
My thinking on lineal awareness and perceptual chunking has been greatly influenced by collaborative work on natural language processing with Onar Am, carried out over the late 1997 and early 1998.