Recognition of Visual Invariants
using Evolutionary and Inferential Pattern Mining
on Spatiotemporal Data
This brief note suggests a new approach to the recognition of invariant patterns in visual data. The approach takes the conceptual framework underlying Dileep George and Jeff Hawkins’ recent work (2004) and implements it in the context of the powerful learning and inference mechanisms provided by the Novamente AI Engine (Looks et al, 2004).
This new approach has not been implemented yet; details on the (not too serious) obstacles in the way of practical implementation will be mentioned at the end.
The basic idea underlying George and Hawkins’ work is that object recognition, and other aspects of visual-invariant recognition, occur via the hierarchical recognition of patterns in temporal sequences. I agree that this is an interesting and quite possibly correct approach, but I think the way George and Hawkins have modeled the details of their conceptual approach is somewhat oversimplified, and oversimplified in a way that artfully conceals the real difficulty of the type of pattern recognition that their conceptual approach entails.
I’ll explain here how one can remove their simplifications in a way that better reveals the cognitive depth of the invariant visual pattern recognition problem, and in fact better reveals the power of George and Hawkins’ ideas for coping with this depth. The result is a novel design for a vision processing system, and some novel twists on George and Hawkins’ hypotheses about the neuroscience of vision.
To fully follow the discussion in this note, the reader will have to read George and Hawkins’ paper carefully; it’s brief and clearly written, and I’m not going to summarize it in detail here. I’ll just recap the basics.
George and Hawkins treat a simplified model of a hierarchical visual system with three levels: high, middle, and low. Each level contains a number of processing units, and each processing unit is concerned with a certain region of space in the visual image being processed. The regions dealt with by different processing units may overlap slightly, in order to encourage continuity. What each processing unit does is recognize temporal sequence patterns in its inputs. In the case of a low-level unit, the inputs are observed visual data (in the sample implementation they describe, each visual datum indicates whether a particular pixel is black or white; but the same basic approach obviously works if one is dealing with color or gray-scale data, etc.). In the case of a high- or middle-level unit U, the inputs are the outputs of the processing units corresponding to subregions of the region that U corresponds to. The output of a processing unit is the set of temporal patterns that the processing unit has recognized. Of course, this three-level architecture could easily be extended to N levels. Finally, a number of simplifying assumptions are made to render analysis and simulation more tractable, such as Markovicity of the probabilistic variables studied at each level, and non-overlap of the time-scales dealt with at each level.
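To make the architecture concrete, here is a minimal Python sketch of this kind of hierarchy. This is my own illustrative code, not George and Hawkins’ implementation; the class name, the window length, and the frequency threshold are all assumptions made for the sake of the example.

```python
from collections import Counter

class ProcessingUnit:
    """Recognizes frequent temporal sequences (n-grams) in its input stream.

    Illustrative sketch only: window length and frequency threshold are
    assumptions, not parameters from George and Hawkins' model.
    """

    def __init__(self, window=2, min_count=2):
        self.window = window          # length of temporal sequences sought
        self.min_count = min_count    # frequency threshold for "recognized"
        self.counts = Counter()
        self.history = []

    def observe(self, datum):
        """Feed one time step of input; datum is any hashable value."""
        self.history.append(datum)
        if len(self.history) >= self.window:
            seq = tuple(self.history[-self.window:])
            self.counts[seq] += 1

    def output(self):
        """The unit's output: the set of temporal patterns it has recognized."""
        return {seq for seq, c in self.counts.items() if c >= self.min_count}

# Two low-level units, each watching one half of a tiny 1-D "image",
# feeding a middle-level unit that watches their combined outputs.
low_left, low_right = ProcessingUnit(), ProcessingUnit()
mid = ProcessingUnit()

frames = [(0, 1), (1, 0), (0, 1), (1, 0), (0, 1), (1, 0)]  # alternating pixels
for left_pixel, right_pixel in frames:
    low_left.observe(left_pixel)
    low_right.observe(right_pixel)
    # A middle-level unit's input is the outputs of the units below it.
    mid.observe((frozenset(low_left.output()), frozenset(low_right.output())))

print(low_left.output())  # frequent 2-step sequences in the left region
```

The essential points of the model survive even in this toy form: each unit sees only its own region, recognition means "frequent temporal sequence," and a unit's output (a set of recognized patterns) is itself the input stream of the unit above it.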
Clearly, these simplifying assumptions (Markovicity, absolute non-overlap of time-scales) will have to be lifted if the George and Hawkins approach is to be applied to real image processing, or used as a detailed brain model. But I don’t think this is the only limitation of the approach. The main problem I see is a more foundational one: I don’t think that the pattern-recognition approach of looking for frequent temporal sequences is going to be good enough for the recognition of complex objects in real visual data. I think the brain is doing something more than this, and that AI systems will have to do something more than this if they are to use George and Hawkins’ conceptual framework to do useful visual invariant recognition.
What I suggest is as follows. Let’s hypothesize that the conceptual framework proposed by George and Hawkins is basically correct. This isn’t exactly a proven fact, but as Hawkins has pointed out in his book (2004a), it’s strongly suggested by a host of neuroscience evidence. However, let’s conceive of their processing units as being significantly more sophisticated animals than what George and Hawkins propose. Let’s conceive of each of their processing units as something that recognizes spatiotemporal patterns among its inputs, where these patterns may be (probabilistic) logical formulas rather than simple temporal sequences. One then has the same conceptual picture as in George and Hawkins, but a much more powerful pattern recognition framework.
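To illustrate the difference, here is a toy sketch of what a pattern-as-probabilistic-logical-formula might look like, compared with a bare temporal sequence. The formula encoding and the frequency-based probability estimate are illustrative assumptions of mine, not Novamente’s actual representations.

```python
# Illustrative sketch: a pattern is a small logical formula evaluated over a
# spatiotemporal window, with a probability estimated from its frequency.

def eval_formula(formula, window):
    """Evaluate a nested-tuple formula against a window of observations.

    window: list of dicts mapping variable name -> value, one per time step.
    formula: e.g. ('and', ('eq', 0, 'x', 1), ('eq', 1, 'y', 1)),
             where ('eq', t, var, val) means window[t][var] == val.
    """
    op = formula[0]
    if op == 'eq':
        _, t, var, val = formula
        return window[t].get(var) == val
    if op == 'and':
        return all(eval_formula(f, window) for f in formula[1:])
    if op == 'or':
        return any(eval_formula(f, window) for f in formula[1:])
    if op == 'not':
        return not eval_formula(formula[1], window)
    raise ValueError(op)

def pattern_probability(formula, stream, window_len):
    """Estimate P(formula) as its frequency over sliding windows."""
    windows = [stream[i:i + window_len]
               for i in range(len(stream) - window_len + 1)]
    hits = sum(eval_formula(formula, w) for w in windows)
    return hits / len(windows)

stream = [{'x': 1, 'y': 0}, {'x': 0, 'y': 1},
          {'x': 1, 'y': 0}, {'x': 0, 'y': 1}]
# "x is on now AND y is on at the next step": a spatiotemporal relation
# between two variables, not a literal sequence of values of one variable.
f = ('and', ('eq', 0, 'x', 1), ('eq', 1, 'y', 1))
print(pattern_probability(f, stream, 2))  # 2 of the 3 windows satisfy f
```

A plain temporal-sequence miner can only represent the special case of such formulas that spell out one exact value per time step; the general formula space also contains disjunctions, negations, and relations spanning different spatial variables, which is exactly what makes it both more expressive and harder to search.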
Furthermore, in expanding the power of each processing unit, one can make some conceptual expansions to the underlying framework. The search space for patterns is larger if one looks for general logical formulas rather than just temporal sequences, and a larger search space means there is more motivation to use context to guide one’s search. The context that is available here comes in two forms: the patterns currently recognized by nearby processing units on the same level (whose regions may overlap with the given unit’s region), and the patterns currently recognized by the units above and below the given unit in the hierarchy.
Thus, my suggestion is that the processing units should do a search through logical-formula space for patterns in their inputs, and that this search should be guided by the contextual information available to each unit from elsewhere in the network.
Now, how should this “search through logical formula space” be conducted? Of course this may be done in many different ways. In the Novamente AI system we have tools that are well-suited for this task, namely PTL, which carries out probabilistic inference, and Combo-BOA, which carries out evolutionary program learning.
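As a purely illustrative stand-in for these tools (the real PTL and Combo-BOA are far more sophisticated), a simple generate-score-select loop over conjunctions of literals conveys the overall shape of an evolutionary search through formula space. Everything here, from the candidate representation to the mutation rate, is an assumption made for the sketch.

```python
import random

# Toy evolutionary search over formula space: candidates are conjunctions of
# (time, variable, value) literals, scored by observed frequency. This is a
# stand-in for Combo-BOA, not a description of it.

random.seed(0)

def random_literal(vars_, window_len):
    return (random.randrange(window_len), random.choice(vars_),
            random.choice([0, 1]))

def holds(literals, window):
    """A candidate holds on a window if every literal is satisfied."""
    return all(window[t].get(var) == val for t, var, val in literals)

def score(literals, windows):
    """Fitness to maximize: observed frequency of the conjunction."""
    return sum(holds(literals, w) for w in windows) / len(windows)

def evolve(stream, vars_, window_len=2, pop=30, gens=20):
    windows = [stream[i:i + window_len]
               for i in range(len(stream) - window_len + 1)]
    population = [[random_literal(vars_, window_len) for _ in range(2)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda ls: score(ls, windows), reverse=True)
        survivors = population[:pop // 2]
        # Mutate the survivors to refill the population.
        children = [[random_literal(vars_, window_len) if random.random() < 0.3
                     else lit for lit in parent]
                    for parent in survivors]
        population = survivors + children
    best = max(population, key=lambda ls: score(ls, windows))
    return best, score(best, windows)

stream = [{'x': 1, 'y': 0}, {'x': 0, 'y': 1}] * 10  # strictly alternating input
best, fitness = evolve(stream, ['x', 'y'])
print(best, fitness)
```

The point of the sketch is structural: fitness-driven search over a space of formulas, rather than exhaustive counting of sequences. The real design would replace the frequency score with a measure of interestingness, and the blind mutation with BOA-style model building plus PTL inference over the contextual information discussed above.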
Due to the time-consuming nature of both PTL and Combo-BOA, the best way to architect a Novamente-based implementation of a George-and-Hawkins-esque visual invariant recognition system would be to use a distributed architecture. In the maximally efficient implementation, each processing unit in the George and Hawkins architecture would be assigned its own machine; but a flexible architecture would support K processing units per machine.
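The unit-to-machine assignment itself could be as simple as the following sketch; the machine names and the blockwise assignment policy are illustrative assumptions, not a description of Novamente’s actual distributed infrastructure.

```python
# Sketch of a flexible assignment of processing units to machines,
# supporting K units per machine.

def assign_units(unit_ids, machines, k_per_machine):
    """Map processing units onto machines, at most k_per_machine each."""
    if len(unit_ids) > len(machines) * k_per_machine:
        raise ValueError("not enough machine capacity for all units")
    assignment = {}
    for i, unit in enumerate(unit_ids):
        assignment[unit] = machines[i // k_per_machine]
    return assignment

units = [f"unit-{level}-{n}"
         for level in ("low", "mid", "high") for n in range(2)]
print(assign_units(units, ["hostA", "hostB", "hostC"], k_per_machine=2))
```

With K = 1 this reduces to the one-unit-per-machine case; larger K trades speed for hardware cost, which is the flexibility the architecture needs.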
I believe that the result of this would be an object recognition (and general visual invariant recognition) system with a high level of accuracy and generality. Furthermore, as the visual patterns recognized would be represented in probabilistic logic format, they would be easily combined with other sorts of information such as information about action, inputs from other sorts of sensors (acoustic, haptic, olfactory, etc.), and linguistic information (allowing visual grounding of linguistic terms and relationships).
The order of application of these learning algorithms, in the George and Hawkins architecture, should be quite interesting. Initially, one would proceed from the bottom up: first learning the lowest-level patterns, then proceeding to the next level up, etc. But after this first round of bottom-up learning, the patterns in the various processing units may be allowed to coevolve, ultimately leading to an overall attractor-state of patterns in the different processing units that are adapted to one another.
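This two-phase schedule can be sketched as follows, with a fixed point of the update standing in for the “attractor state.” The learn() callback, the round limit, and the toy numeric hierarchy are all illustrative assumptions.

```python
# Sketch of the training schedule: one bottom-up pass, then repeated
# coevolution sweeps until the units' states stop changing.

def train(levels, learn, coevolve_rounds=10):
    """levels: list (bottom to top) of lists of unit states.
    learn(depth, unit, levels) -> new unit state, given the whole hierarchy."""
    # Phase 1: bottom-up, one level at a time.
    for depth in range(len(levels)):
        levels[depth] = [learn(depth, u, levels) for u in levels[depth]]
    # Phase 2: let all levels coevolve until a fixed point (or round limit).
    for _ in range(coevolve_rounds):
        new_levels = [[learn(d, u, levels) for u in level]
                      for d, level in enumerate(levels)]
        if new_levels == levels:
            break
        levels = new_levels
    return levels

def toy_learn(depth, unit, levels):
    # Bottom-level units keep their observed value; each higher unit
    # summarizes the level below it (here: the mean), standing in for
    # genuine pattern learning.
    if depth == 0:
        return unit
    below = levels[depth - 1]
    return sum(below) / len(below)

result = train([[1.0, 3.0], [0.0], [0.0]], toy_learn)
print(result)
```

In the toy run the hierarchy settles after the first coevolution sweep; in the real system, where each unit’s learning changes the inputs of its neighbors above and below, convergence to a mutually adapted attractor state would be the interesting (and nontrivial) phenomenon.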
So far, I’ve basically been discussing how to make George and Hawkins’ framework more powerful by combining it with more advanced computer science. However, their focus is on neuroscience, not computer science. What do the ideas I’ve outlined here have to say about the neuroscience of vision?
I don’t really have any argument with their Figure 2, which correlates the laminar structure of cortical regions with the hierarchical structure of the processing units in their model. However, I think that when one’s neural modeling gets more fine-grained than this picture, one needs to confront the fact that the neurons in each cortical column on each layer are doing more than just looking for frequent temporal sequences in their inputs.
I hypothesize that each of George and Hawkins’ processing units corresponds to a population of cortical columns, which is an “evolving population” in the sense of Edelman’s Neural Darwinism (1987). This evolving population undergoes progressive reinforcement-learning-based selection, aimed at the construction of neuronal pathways that recognize interesting patterns in their inputs.
Furthermore, I hypothesize that there are a significant number of cross-links, breaking the purely hierarchical geometry of George and Hawkins’ diagrams. These cross-links make the network into a “dual network” (Goertzel, 1993, 1994, 1997), rather than a pure hierarchy. I suggest that Hebbian learning based on these cross-links serves to allow nonlocal (and of course generally nonlinear) correlations between distant regions to play a role in the patterns recognized in the processing units.
Of course, designing experiments to validate or refute these hypotheses using current neuroscience technology is not an easy task, and I will happily leave that to the neuroscientists!
As noted above, these ideas have not been implemented yet -- and implementation may not occur for a while, because the Novamente team has limited resources and our current focus is not on vision processing. A rough estimate is that implementing and doing initial tests of the new approach would take a team of two about six months: not a tremendous effort, but not trivial either. The reason it would be this difficult is not that the algorithm is complicated, but rather that the computation required will likely be intensive, and this will necessitate modifying parts of the Novamente system to run in a distributed manner, whereas these parts now run only within a single machine.