Structure of Intelligence -- Copyright Springer-Verlag 1993

Back to Structure of Intelligence Contents

Chapter 7

Long-Term Memory

7.0 Structurally Associative Memory

    It is clear that analogy cannot work effectively without recourse to an effective method of storing patterns. However, I suggest that an even stronger statement holds: the nature of analogy actually dictates a particular type of memory structure. The only way analogy can work effectively is if it is coupled with a memory that is specifically structured so as to support analogical reasoning.

    This memory must, of course, be a "long-term memory": its contents must be sufficiently permanent to carry over from one situation to another. I will argue that the entire structure of the mind's long-term memory can be elicited through the study of analogy. On the other hand, the "short-term memory" of contemporary psychology is essentially a euphemism for "consciousness", and we shall deal with it in a later chapter.

    In a completely different context, Jousselin (1987) has reached a similar conclusion regarding the relation between processing and memory. He has shown that, in a very general mathematical sense, the nature of the operations carried out by the processor of a computer actually determine the structure of the memory of that computer.

    Strictly speaking, what is given here is not a model of how memories are physically stored in the brain or anywhere else, but rather a model of how memory access must work, of how the time required to access different memories in different situations must vary. However, following Jousselin, I hold that the structure of memory access is much if not all of the structure of memory. This point will be elaborated below.

ASSOCIATION AND PATTERN

    The model which I will propose is associative in the sense that it stores related elements near each other (Kohonen, 1984; Palm, 1980). I have already suggested that mental process is founded on induction and analogy, which arebased on pattern recognition. It follows from this hypothesis that, from the point of view of mental process, two entities should be considered to be associated if and only if they have patterns in common, or are bound together as the substrate of a common pattern.

    As in Chapter 3, let IN(x,y;z) denote the intensity with which (x,y) is a pattern in z. Then the network of emergence associated with a set of functions is a weighted graph of which the nodes correspond to the functions, and in which each triple x, y an z is connected as follows

%

y %%%%%%%%%%%

%c %%%% x

z %%%%%%%%%%%

with weight c=IN(x,y;z). If IN(x,y;z)=0 then, of course, no zero-weight connection need actually be drawn. The essential aspect of this diagram is that each of x, y and z holds a unique position.

    A structurally associative memory, associated with a set of functions, is also a weighted graph. The nodes correspond to the functions, and the nodes are connected in triples. In these respects it is similar to a network of emergence. But it need not possess the perfect structure of the network of emergence. Rather, if it does not possess this perfect structure, it is required to continually adjust itself so as to better approximate the structure of a network of emergence. It may contain the same entity at many different places. It may adjust itself by connecting nodes which were not previously connected, by adjusting the intensity of existing connections, or by adding new nodes (representing new or old entities).

    The degree to which a graph is a structurally associative memory may be defined in a number of ways, the simplest of which are of the form: 1/[1-c*(the amount by which the graph deviates from the network-of-emergence structure)], where c is an appropriately chosen positive constant. But there is no need to go into the details.

    From here on, I will occasionally refer to a structurally associative memory of this sort as a STRAM. Imperfection in the structure of a STRAM may stem from two sources: imperfect knowledge of the degrees to which a certain pair of functions is a pattern in other functions; or imperfect reflection of this knowledge in the structure of the network. The former is a pattern recognition problem and relates to the interconnection of cognition and memory, to be addressed below. For now let us concentrate on the latter difficulty: given a certain set of data, how can a structurally associative memory be intelligently reorganized?

    In practice, a structurally associative memory cannot be given an unlimited number of nodes, and if it has a reasonably large number of nodes, then not every triple of nodes can be interconnected. The number of connecting wires would very soon grow unmanageable. This is the familiar combinatorialexplosion. To be practical, one must consider a fixed set of n nodes each connected to a small number k of other nodes (say k=O(logn)), and one must arrange the given functions among these nodes in such a way as to approximate the structure desired. When a new function is created, it must take a node over from some other function; and likewise with a new connection. This requires an intricate balancing; it is a difficult optimization problem. What is required is a rapid iterative solution; an algorithm which is able to continually, incrementally improve the structure of the memory while the memory is in use. We shall return to this in Section 7.2.

7.1 Quillian Networks

    It is rather difficult to study the structure of human long-term memory, since we cannot look into a person's brain or mind to determine the processes by which their memory organizes itself. The only aspect of memory which is open to study is memory access. In particular, a great deal of attention has been devoted to the time required for memory access.

    One frequently studied phenomenon is "priming": an experimenter shows the subject one letter, then shows two other letters simultaneously, and measures how long it takes for the subject to determine if the two simultaneously presented letters are the same or different. The answer comes considerably faster when the one preliminary letter is the same as one of the two simultaneous letters (Posner and Snyder, 1975). This shows that when something is summoned from memory it is somehow "at the top" of the memory for a period afterwards, somehow more easily accessible.

    According to the idea of structurally associative memory, if x is more easily accessible than y, those things which are similar to x should in general be more easily accessible than those things which are similar to y. This has been shown by many different experimenters, e.g. Rips et al (1973).

    This is an essential result; however, it is clear that this simple feature would be predicted by all sorts of different memory models. Using similar techniques, psychologists have attempted to determine the structure of memory in further detail. For instance, Collins and Quillian (1969) performed several experiments to test Quillian's network theory of memory (Fig. 3), according to which concepts are stored as nodes in a digraph. For instance, chair, couch and table would be stored on nodes emanating from the furniture node; and coffee table and dinner table would be stored on nodes emanating from the table node. In their experiments, subjects were asked to verify statements of the form "an X is a Y" -- say, "a couch is a table", or "a couch is furniture". Collins and Quillian predicted that the time required to verify the sentence would be a linear function of the number of links between the concepts in the memory digraph.

    This hypothesis at first appeared to be correct; but further experiments showed that the model has difficulty dealing with negative responses. Therefore Rips et al proposed an alternate model of memory in which the similarity of twoentities is defined as the amount by which their "semantic features" overlap. According to their experiments, this sort of similarity is a far better predictor of reaction time than Quillian's hierarchical distance.

    Collins and Loftus (1975) responded with an improvement of the Quillian model, according which concepts are stored in the nodes of a network, and each link of the network is assigned a weight corresponding to the degree of association between the concepts that it connects. Memory access then involves two stages: 1) "spreading activation", in which an activated node spreads its activation to neighboring nodes, and 2) evaluation of the most active portion of the network. This accounts for the data because it incorporates the "feature overlap" of Rips et al into the network structure. Ratcliff (1989) has criticized the model for not adequately explaining the process of evaluation; but this seems to me to be beside the point.

    The model of Collins and Loftus is somewhat similar to the structurally associative memory; the biggest difference is that in the Collins and Loftus model, "similarity" is imposed a priori from the outside. The model does not explain how the mind assigns these degrees of similarity. Psychologically, this is unsatisfactory. However, it must be noted that "Quillian networks" of the sort studied by Collins and Loftus have since become quite common in AI programs. The a priori nature of similarity is no problem here, since the writer or user of the program can specify the pertinent degrees of similarity. Quillian networks are a simple and effective way of representing knowledge.

    A unique Quillian network may be derived from any network of emergence by a very simple process. To explain this we shall require one preliminary concept.

Definition 7.1: The contextual distance between x and y, relative to the set V, is the sum over all v in V of d#[St(x%v)-St(x),St(y%v)-St(v)]. It will be denoted d##(x,y).

This measures the distance between x and y relative to the set V: not only direct structural similarities between x and y are counted, but also similarities in the ways x and y relate to elements of V.

    Then, to form the Quillian network corresponding to a given network of emergence, one must simply create a new link between each two nodes, and weight it with the contextual distance between the entities stored at the nodes, relative to the other entities in the memory.

    The structurally associative memory network refers to a deeper level than the Quillian network, but it is fully compatible with the Quillian network. It therefore seems reasonable to hypothesize that anything which the Quillian network explains, the structurally associative memory can also explain. However, there are certain phenomena which require a deeper analysis, and hence the full power of the structurally associative memory.

    For example, it will be seen below that the details of the relationship between memory and analogy fall into this category. The Quillian network supports avery rough form of analogy based on a priori "similarity", but to explain the full subtlety of analogical reasoning, the structurally associative memory is required.     In general, it would seem that the Quillian network is not really suited to be a large-scale, adaptive, self-organizing memory. The structurally associative memory is based on pattern recognition and hence it can easily be modified on the basis of new pattern recognitions. And the STRAM stores more refined data: it stores information about the type of relation between two entities.

    The remainder of this chapter will be devoted to showing that the STRAM is capable of functioning as a large-scale, adaptive, self-organizing memory; and showing how its structure relates to analogical reasoning.

7.2 Implications of Structurally Associative Memory

    Let us now return to the problem of how a STRAM is to maintain its approximate network-of-emergence structure. One way to approach this optimization problem is via the multilevel methodology (Goertzel, 1989). This application of the multilevel methodology bears a strong similarity to Achi Brandt's (1985) multilevel algorithm for the Ising spin problem.     

    The first level would be the following procedure:

1) assign nodes to any newly-introduced functions, and single out a number of

pre-existing nodes, in such a way that the nodes assigned and selected are

approximately evenly distributed across the network (so that no two are too

close to each other). Call these functions xi and their nodes N(xi).

2) switch the node of each of these functions xi with the node of one of the

functions y with which it shares a pattern -- that is, assign xi node N(y) and y node N(xi) -- and see if, according to the new arrangement, the total amount of pattern which the network indicates in xi and y is greater. If so, let N(xi)=N(y) and let N(y)=N(xi) -- that is, make the switch permanent -- and apply Step 2 to xi again. If not, proceed to one of the other functions y with which xi shares a pattern (say, proceeding through the xi in order of decreasing intensity of shared pattern). Once all the functions with which xi shares more than e (some small fixed number) worth of pattern have been exhausted, exit Step 2.

This algorithm moves functions through the network toward their "proper position." The problem is that, even if xi is not in its optimal position, its position could well be better for it than the positions of its neighbors. Ideally, one might like to try each xi out in every possible position; and although this is not possible, one may indeed improve upon the "neighbors-only" approach.

    Following the multilevel methodology, one could seek to find the optimum node for xi among those nodes at a distance between h1 and h2 from xi in the network (where the distance between two nodes is measured by the number of links separating them). One could attempt this by the Monte Carlo method,randomly seeking to switch xi with functions in this region, or one could attempt this by randomly beginning the neighbors-only search given above from points in this region.

    And if one found a new home for xi in this region, one could execute the neighbors-only search from the new home. And from the answer which this yielded, one could execute the second-level search. And one could repeat this entire process according on an arbitrary number of levels, according to the basic framework outlined in Chapter 2.

    Of course, this is only one possibility; the multilevel philosophy could be applied in many different ways, and there are many other approaches to optimization. The important point is that, by some specially-tailored optimization method, the network must continually reorganize itself so as to better approximate the network of emergence structure, and so as to integrate new data in a manner which maintains this structure. It would seem to be completely impossible to determine, at this stage, the actual method by which the human brain reorganizes its memory network.

HOLISTIC MEMORY

    It might seem that no organism could afford to continually subject its memory to such a risky, speculative optimization algorithm as that sketched above. It could be, of course, that there exists some optimization algorithm which is substantially more effective than those presently known. However, as emphasized in Chapter 2, most researchers doubt if there will ever exist a rapid, highly accurate algorithm for the solution of difficult optimization problems. In the context of associative memory, rapidity is of the essence, so it is probably true that rough approximate answers must suffice.

    If one were designing a brain, one way of partially eliminating the risk involved in adjusting memory according to a highly approximate algorithm would be to provide it with a large number of structurally associative memories, each storing largely the same functions. In fact, it has often been observed that biological memory is holistic in the sense that it appears to "store almost everything almost everywhere". Any small section of the cortex appears to contain a large part of the data which the cortex stores. This an empirical observation, validated repeatedly in the laboratory (Hubel, 1988).

    It has been suggested that holistic memory is necessary because of the error-prone nature of the brain's components. This is almost certainly true. However, the present considerations suggest another good reason: the nature of structurally associative memory seems to require that memory structure be continually subjected to radical, experimental transformations. In order for these transformations not to interfere too much with continual memory use, the mind must incorporate some sort of "workspace" in which unsuccessful transformations may be tried and discarded without global effect. In short: holistic memory may be necessary not only because of the error-prone natureof the brain's "hardware", but also because of the error-prone nature of high-level mental process.

PHYSICAL MEMORY STRUCTURES

    In this context, let us return to the question of physical memory. A physical memory is, obviously, not a weighted graph. Nonetheless we would like to call certain physical memories structurally associative memories. Therefore it is necessary to assign to each physical memory M a set of "naturally correspondent" structurally associative memories. We know of no completely satisfactory way of doing this, but the following scheme is not unreasonable. It requires only that we assign to each pair of elements (x,y) stored by M a distance DM(x,y) which measures the difficulty of locating x in memory given that y has very recently been located. Let F be a mapping which takes the set S of elements stored by M into the nodes of a graph; let L(x,y) denote the number of links in the graph F(S) along the shortest path from(x) to F(y). Then the accuracy with which F represents M may be defined by the average, over all (x,y), of %L(x,y)-DM(x,y)%. This definition requires that, in a rough sense, distance in the memory M correspond to distance in the graph F(S). Certainly there is more to be done in this direction, but the point is that it is indeed possible to model any sort of memory as a graph, and hence to analyze any sort of memory as structurally associative memory.

MEMORY AND ANALOGY

    Finally, we must now explore the manner in which a structurally associative memory may be utilized by cognitive processes. In the notation of the definition of analogy given above, consider that f and y have already been chosen. The process of selection will be discussed briefly a bit later. Then, given x, an appropriate x% may be located by a simple process of "looking up" in STRAM.

    In the case of structural analogy all that is required is to look for something which has a large amount of pattern in common with x. Specifically, a list must be made of the significant patterns in x, and then the other functions in which these functions are also patterns must be searched, with the objective of locating those which have a large number of these patterns in common. This may be a large search, and it will have to be approximate -- e.g. a Monte Carlo search could be implemented, skipping over at random some of the neighbors. The point is that the making of the list of patterns in x, and of neighbors which may also possess these patterns, is merely a matter of looking at the actual graph of the network. Of course, this method will only be accurate to the extent that the STRAM approximates the network of emergence structure.

    In the case of contextual analogy, a list must be made of all the patterns which emerge between x and v, and then a search of the neighbors of thesepatterns must be made, until a function x% is located such that many of the same patterns emerge between it and some other entity v%.

    If f is not the identity mapping, then things are slightly more involved. One knows only that f(x%) is close to x in the network; one knows nothing about x% itself. So, in general, a list must be made of all the patterns which emerge in x or between x and w, and then a search of the neighbors of these patterns must be made, until an entity z is located such that many of the same patterns emerge between it and some other entity w%. Then x% may be set equal to f-1(z). This involves the potentially difficult computation of f-1, and it is generally much more difficult than structural or contextual analogy. However, as suggested earlier, even speculative modeling analogy may be useful, as in brainstorming.

    The location of x% is Step 1 of the general process of analogy. The next step is the recognition of patterns in x%. This process may be decomposed into two stages: isolating the patterns already "known" by STRAM to be part of the structure of x%, and finding new patterns not yet "known" by STRAM. Of course, if f is the identity mapping, then it is trivial to locate all patterns in x' that have been identified by STRAM; in fact, this was already done to some degree of approximation in the selection of x%. In this case the first stage is very easy to execute. But if f is not the identity, then in order to find out what STRAM knows about x%, one must search through STRAM for x% (remember,it is f(x%) which is near x, not necessarily x% itself). One may treat this search as a minimization, over the space of all nodes of STRAM, of the function d#(x,x%). It would not be wise to execute this minimization by local search (from each node proceeding to that neighboring node which minimizes the function), because the structure of STRAM is almost certainly imperfect; a focused Monte Carlo search, a simulated annealing search, or a multilevel search would be much more effective.

    The second stage of Step 2, the location of new patterns in x%, will be implicitly addressed later, when we discuss pattern recognition in general.

7.3 Image and Process

    In Section 7.2 I explained the structurally associative memory by analogy to Quillian networks. But, as hinted there, the Quillian network has several undesirable properties not shared by the STRAM. Some of these are relatively technical, such the fact that the Quillian network has no connection with the details of analogical reasoning. But there are also more philosophical differences. In particular, I will argue that there are absolutely crucial phenomena which the Quillian network cannot even begin to explain.

    What is at issue here is what Israel Rosenfeld (1988, p.3) has called

        a myth that has probably dominated human thought ever since human beings began to write about themselves: namely, that we can accurately remember people, places and things because images of them have been imprinted and permanently stored in our brains; and that, though we maynot be conscious of them, these images are the basis of recognition and hence of thought and action.

In Rosenfeld's opinion, a careful examination of classical neurological experiments shows that "the fundamental assumption that memories exist in our brains as fixed traces, carefully filed and stored, may be wrong" (p.5).

    For instance, consider the well-known series of experiments carried out by Wilder Penfield, beginning in the 1930's. He stimulated various areas of the brain in conscious patients and noted that this appeared to elicit recollections of "forgotten memories." At first sight this would seem to speak against Rosenfeld's point -- the natural interpretation is that Penfield was touching the areas of the brain in which those memories were stored. But actually things are not so clear. Recent experiments show that these forgotten memories are actually "fragmentary impressions, like pieces of a dream, containing elements that are not part of the patient's past experiences" (p. 7).

    Also, these forgotten memories occur only when the brain stimulation is simultaneous activity with the limbic system. Since the limbic system is the seat of emotion, this is evidence in favor of Freud's observation that memory without emotion would be unrecognizable. As Gloor et al (1982) put it, describing their observations of epileptic patients:

        [W]hatever we experience with our senses... even after it has been elaborated as a percept in the temporal neocortex, must ultimately be transmitted to limbic structures in order to assume experiential immediacy. This may... imply that all consciously perceived events must assume kind sort of affective dimension, if only ever so slight.

    Rosenfeld proposes that, rather than storing traces, memory stores procedures. According to him, what happened in Penfield's experiments was that certain processes were activated, which then constructed the so-called forgotten memories on the spur of the moment, based partly on emotional factors and partly on information somehow stored in nearby parts of the cortex.

    Further support for this point of view is given by the work of Mahl et al (1964), who observed these forgotten memories depend significantly upon "the patient's mental content at the time of stimulation." Sometimes they may not be memories at all, but merely rearrangements of the ideas which occupied the patient's mind just prior to stimulation.

    No one denies that part of memory consists of procedures. For instance, every time we form a spoken word from its syllables, we are applying certain phonological procedures. However, most contemporary psychologists would agree with Broca, who argued in 1861 that there is a crucial structural difference between the image-based memory responsible for the storage of words and their meanings, and the procedural "memory for the movements necessary for articulating words."

    Against such arguments, Rosenfeld summons an impressive variety of evidence. In each case, a phenomenon which at first appears to depend onimage-based memory is seen to require procedural memory. For instance, he refers to David Marr's demonstration that shapes can be recognized as shapes without any reference to previous knowledge, merely by executing certain procedures on the appropriate visual stimuli. This shows that shape recognition probably does not depend upon searching a memory store of shapes until a match is found -- it may, rather, be a matter of summoning appropriate procedures from the memory. But if shape recognition does not require a store of shapes, then why should memory contain such a store at all?

    Rosenfeld admits that "Marr did not totally abandon the idea of fixed memories, since ultimately the naming of a shape required, in his scheme, a memory search" (p. 113). To Rosenfeld, this is the major limitation of Marr's approach. However, it seems to me that this shows exactly where Rosenfeld goes too far. Clearly he is right that the brain does not hold, somewhere in its memory, little pictures of circles or ellipses or squares. Rather, it stores certain procedures or functions which characterize these shapes. But this fact does not imply all that he says it does.

    For the purpose of illustration, let us consider a highly oversimplified example. Let y denote the function which, from P and r, generates the circle with radius r and center P; and let z=(P,r). Then, roughly speaking, we may say that a certain collection of stimuli x is "representable as a circle" to the extent that (y,z) is a pattern in x. For each shape, there will be one or more such characterizing patterns. I do not mean to imply that the mind stores shapes by fitting them to their standard mathematical equations, but only that it characterizes shapes by certain "symmetries" or "computational shortcuts" which manifest themselves as patterns. Algebraic equations are one kind of specifying pattern, but not the only kind. For example, one of the patterns characterizing the shape "square" might be "invariance under reflection and ninety-degree rotation".

    Let us suppose, then, that in the mind's "language" each shape is a certain collection of characterizing procedures. Then what is wrong with calling this collection a "label" or "trace" of the shape? It seems clear that, in general, a mind would do best to store such collections in proximity to each other. After all, they will very often be used to recognize the same shapes.

    Rosenfeld thinks Marr is wrong to speak of a "memory search". But does he believe that a mind always immediately selects the most appropriate procedures? If a mind recognizes a shape by recognizing certain symmetries and other patterns in it, then what could possibly be wrong with the hypothesis that the mind has to search a little to determine the appropriate patterns?

    A careful study of Rosenfeld's book reveals that the structurally associative memory accounts, schematically at least, not only for Marr's work but for all the phenomena which Rosenfeld adduces against the image-based model of memory. From the fact that most memory relies on procedures, one cannot conclude that these procedures are not organized and accessed according to a network structure. Returning to Quillian networks, I agree with Rosenfeld that "chair" is stored in the memory as a collections of procedures for determiningwhat is a chair. But I still maintain that the Quillian network can be a useful approximation to the actual network of interrelations between the procedures associated with various entities.

INFORMATION AND RELATION

    Extending this point of view, I concur with Erlich (1979, p. 200) that each item stored in memory should be "considered capable, by rights, of performing two different functions: the informative function and the relational function." That is: each item in memory is acted on by other items in memory, and also acts on other items in memory. The Quillian approach emphasizes the static, "acted-upon" aspect of memory; whereas the Rosenfeld approach stresses the dynamic, procedural, "acting-on" aspect, and considers actions on external stimuli as well as other memory items.

    For instance, "chair", "house", "meal" and so forth are collections of procedures which act very little to transform other entities in memory -- mainly they act on external stimuli. But logical and mathematical structures, as well as words such as "in", "on" and "besides", are primarily relational: they are collections of procedures which serve primarily to act on other entities stored in memory.

    More precisely, what I mean here by "A acts on B" is simply: "A is a function which takes B as arguments." Those entities which are patterns between other entities in memory will thus "act on" many other entities. This definition is validated by the fact that such entities will very often be invoked by analogical searches proceeding from B to A; and by the fact that if A acts on B, recognition of B as a pattern in an entity will often be followed by recognition of A.

    In sum: the STRAM is a memory model which 1) accounts for the procedural nature of memory, 2) recognizes the approximative value of static semantic networks, 3) explains the self-organizing, generative nature of memory, and 4) acknowledges the intricate structural interdependence of memory with cognition.

SPARSE DISTRIBUTED MEMORY

     Another way to look at this dichotomy is to observe that the STRAM is superficially similar to Kanerva's (1988) "sparse distributed memory", in which entities are coded as binary sequences and stored, each in several places, near other sequences that are "similar". Kanerva measures similarity by the Hamming distance -- the fewer places in which two sequences differ, the more similar they are. This is ideal for a memory storing images of physical objects, which may be conveniently characterized by a list of binary qualities. For instance, one could associate with a physical object A a binary sequence a1...an as follows: a1 is 1 if and only if the object is red, a2 is 1 if and only if the objectis green, a77 is 1 if and only if the object is dirty, etc. Given any such assignation of sequences to objects, the similarity of two objects A and B could be plausibly measured by the number of places in which the sequences a1...an and b1...bn differed.

    But for a memory storing relations, procedures and so forth, it is much more direct to measure similarity structurally, as is done in the STRAM. It would certainly be possible to encode complete information about a complex procedure in a binary sequence. This possibility lies at the core of algorithmic information theory. But this is not a natural way to relate procedures. To make the Hamming distance mesh smoothly with analogical reasoning, one would have to enumerate all possible patterns w1,...,wm, and assign each entity A a binary sequence a1,...,an based on the formula: for some fixed C, and 0%r<k, amk+r=1 if and only if wm is a pattern in A with intensity greater than rC but less than (r+1)C.

    Given this setup, Kanerva's sparse distributed memory would be extremely similar to the STRAM. But it would completely ignore the fact that the procedures being stored were procedures and not merely images of physical objects. The emphasis would be informational. The STRAM gives equal balance to the informational and relational functions of the entities stored in memory.