Inferring Genetic Regulatory Patterns

with Integrative AI

 

Part I:  Quantitative Pattern Recognition

 

 

 

Ben Goertzel1, Cassio Pennachin, Andre’ Senna,

Thiago Maia, Takuo Henmi, Guilherme Lamacie, Saulo Pinto

 

Biomind LLC

 

1 contact author: ben@goertzel.org

 

 

May, 2002

 

 

The problem treated is the inference of genetic regulatory patterns (fragments of and approximations to genetic regulatory networks) from gene expression time series data. The many challenges involved (data quality, nonstationarity, etc.) are confronted by applying a sophisticated statistical pattern recognition methodology that integrates several AI techniques, including evolutionary programming, probabilistic inference, and nonlinear forecasting.  A simplified set of pattern types is introduced, permitting easy “pattern browsing” by biologists without training in computer science or time series analysis.  The approach is illustrated by describing some patterns extracted from a standard dataset on the yeast cell cycle.   Part II of the paper will describe an extension of the method to incorporate background data from various biological databases into the integrative data analysis process. 

 

 

1.        Introduction

 

 

            The vast majority of recent work on the analysis of gene expression data has involved clustering or supervised categorization.  Clustering tools are used to group similar genes together; or supervised learning algorithms are used to create models distinguishing the expression profiles of cells in one category from the profiles of cells in another.   These are worthy applications, but they represent very coarse interpretations of the data.  A complementary, and in some cases much deeper understanding can be obtained by inferring parts of the network of interactions by which genes conspire to control cellular functions.  

Of course, this “regulatory network inference” problem is much more difficult than clustering or categorization, and it is arguable that the data currently available is too scant and too noisy, and too infrequently gathered according to an adequate experimental design, to permit its solution.   However, the work described here suggests that with sufficiently sophisticated pattern recognition technology, it is possible to get meaningful regulatory network inference results even today.  We will show that, though one cannot yet infer the entire genetic regulatory network from experimental data, one can observe interesting regulatory patterns, useful for guiding further computational and laboratory exploration.   And, as the relevant experimental equipment and methodology become more mature, these techniques will be yet more valuable. 

Compared to the standard approaches found in the bioinformatics literature, our analytical techniques are extremely sophisticated, involving the integration of several different AI techniques within a general-purpose AI system, the Novamente AI Engine (Goertzel and Pennachin, 2002).   A full review of Novamente would go well beyond the scope of this paper; here, therefore, we will give a relatively mathematical treatment of the algorithms used by Novamente as it behaves in the context of genetic regulatory network inference.  Implementation details and efficiency optimization techniques, which are highly involved, will be omitted.

The existing literature on regulatory network inference spans a variety of approaches.  Liang et al (1998) and later Akutsu et al (1999) have created algorithms for inferring Boolean networks from regulatory network data; however, they only tested their algorithm on simulated data, not real microarray data, so its functionality in practice remains unknown.  On the other hand, a variety of continuous-variable models, some of which have been tested on real lab data, are well-reviewed in (Wessels et al, 1999).   Pairwise models seek to find pairs of genes whose expression time series are appropriately  correlated.  Arkin and Ross (1997) did this statistically; Chen (1999) took a more innovative computer-science-oriented approach, involving the analysis of peaks in gene expression time series.   One problem with pairwise models is that they ignore many types of gene interactions, which are not binary in nature.  One way to go beyond this is to create  “network models” that use linear or nonlinear differential equations to model gene interactions (Someren et al, 2000); an issue with these models, however, is that the number of parameters increases very rapidly with the diversity of phenomena modeled.

Von Dassow et al (2000) is worth reading as an indication of the complexity of the equations and the number of parameters needed to really model a genetic regulatory network in detail (the segment polarity network in insects, in this case).  Their equations are derived in an integrative way from a large number of different data types, some quantitative, some qualitative.  Inferring this kind of equational model from microarray data in its current quantity and quality is, as they note, not possible.

We believe that existing linear-equation, pairwise and Boolean network models are too oversimplified to capture much of real regulatory network structure; whereas existing complex network models have too many parameters to be realistically fitted by the scant and noisy data.  The approach proposed here walks the fine line between these twin pitfalls by taking a statistical pattern recognition approach, which is complex enough to capture subtle patterns, but does not make heavily parameter-laden, hard-to-calibrate assumptions about network dynamics.

Here we will present a formalization of the regulatory network inference problem, and describe our innovative approach to solving it.  We will then present initial results obtained by applying our technique to a standard yeast cell cycle dataset (Cho et al, 1998).  The results presented here have been obtained via a command-line Linux program called “RNI  Miner,” which is a specialized interface to our general-purpose AI system, the Novamente AI Engine.  Plans are underway to embody the RNI Miner functionality in a commercial product, the Biomind Genetic Network Analyzer, which will present genetic network inference functionality in the context of a simple-to-use Web-based UI.

Roughly speaking, the analytical process we describe here has three aspects.  We:

 

1.      Define a special kind of probabilistic logical expression (a “logical pattern”) to represent a conjectural regulatory pattern (a relationship between genes)

2.      Use probabilistic logic rules to assess the extent to which a conjectural regulatory pattern fits the data or not, incorporating biological background knowledge where appropriate

3.      Use a combination of learning techniques (including evolutionary programming and probabilistic inference) to find conjectural regulatory patterns that fit the data well

 

We have implemented these structures and processes in the context of the Novamente AI Engine, a general-purpose integrative AI architecture that has applications in many domains beyond biology as well (Goertzel and Pennachin, 2002).

In this paper, Part I in a two-part series, we will consider primarily the special case of this analytical process in which no biological background knowledge is invoked.  The use of background knowledge will be presented in detail in Part II.   However, we will describe the process here in sufficiently generality to encompass the incorporation of background knowledge, and we will give some simple examples of background-knowledge-incorporating genetic regulatory patterns.

One major issue that arises in this work is the tradeoff between accuracy and comprehensibility.   Put simply, the most significant patterns in a given dataset may be difficult for humans to understand.   And comprehensibility is very important here, because the main role of the patterns recognized by a system such as ours, is to guide future laboratory work.  If the patterns recognized by the system are not intuitively meaningful to human biologists, then pragmatically speaking they are of no value.   To address this problem, we have developed two simplified “pattern templates,” which we call symbolic tabular patterns and directional symbolic tabular patterns.  We will give several examples of these along with more general logical patterns (representing probabilistic logic rules). We will also describe a novel method for interactively exploring patterns in expression data using tabular patterns, in which the user fills in parts of a tabular pattern and the remainder is filled in by pattern-recognition software.  As we shall illustrate with concrete examples, sometimes the more general logical patterns have a great intuitive power to go along with their high accuracy; but in other cases the simpler tabular patterns provide more insight.

 

 

2.   A Mathematical Formulation of the Genetic Regulatory Network Inference Problem

 

In (Wessels et al, 2001), an elegant mathematical formulation of the genetic regulatory network inference problem is presented.  However, the formulation given there is specifically oriented toward particular types of inference algorithms.  Here we present a formalization that takes its cue from their treatment, but has a somewhat more general scope.

First some notation: 

 

 

In this notation, we may represent a gene expression time series as 

 

S = { E(t(1)), E(t(2)),….,E(t(n)) }

 

where E(t(i)) is the expression level vector associated with time t(i), and

 

i>j è t(i)> t(j)

 

In general (and in most real datasets) the elements of the series are not always equispaced, i.e. t(i+1) – t(i) may be nonconstant.

Next, let

 

Sj,k = { E(t(j)),…,E(t(k)) }

 

be a subseries of S.  Then the regulatory network inference problem can be cast as follows:

 

Given S, find F  so that

·        åi>n+1 || F(Si,i-n) - E(t(i+1))|| is small 

·        F is as simple (un-complex) as possible

 

This is a classic inverse problem, a problem of inferring dynamics from data.  James Crutchfield has written a great deal about this type of problem under the general name of Computational Mechanics (see e.g. Crutchfield and Young, 1990; Crutchfield, 1989). Supposing that there is a real F underlying the data, which we may call F*, then the goal is to find a simple F that accurately approximates F* . 

            Conceptually speaking, the notion of complexity that we use here is that of algorithmic information theory (Chaitin, 1987).  A function is simple if it can be computed by a compact program.  The complexity of a function is the length of the shortest program for computing it.  Pragmatically, we will consider genetic functions within a certain class of functions, which has a simple, easily computable heuristic complexity measure associated with it.

            The quality of a candidate function F as a genetic dynamic may be formally defined as

 

W – cost(F)

 

where

 

cost(F) =  w *complexity(F) + (1-w) åi>n+1 || F(Si,i-n) - E(t(i+1))||

 

and 0<w<1 is an adjustable weight.  The cost factor is small for high-quality F; the constant W must be chosen sensibly.   For instance, if one has a probability measure on the space of genetic dynamics, W may be plausibly chosen equal to twice the mean of cost(F).

Wessels et al (2001) describe a number of mathematical criteria for judging algorithms that infer regulatory networks from microarray data.  Adding simplicity to their list, and converting their criteria to the present notation, we find:

 

·        inferential power (defined as the similarity between F* and F)

·        predictive power (defined as how closely, given F* and not F, one can reconstitute the actual time series S from the initial conditions E(t(0)) )

·        robustness (how noise-tolerant is the method)

·        consistency (does the method infer too many possible networks underlying given data)

·        stability (are the predicted expression levels finitely bounded)

·        computational cost

·        simplicity (ceteris paribus, Occam’s razor leads us to prefer the simplest model)

 

A key point is that different approaches to the problem make different assumptions as to the nature of the space F of genetic dynamics.  Some assume it’s a space of linear functions, some that it’s a space of finite Boolean networks, some that it consists of nonlinear functions of some specific form, etc.  Most generally, we will consider F to be a space of probabilistic logic rules.  This is not a restrictive assumption since probabilistic logic rules as we formulate them are dense in the space of smooth functions (since they can approximate arbitrary polynomials).  In practice, due to the requirement of rule comprehensibility, we will often restrict consideration to simpler subsets of the space of possible probabilistic logic rules, e.g. what we call “tabular rules.”  Finding ways to present more complex rules in a humanly and biologically intuitive manner is an important area for future work.

 

 

3.  Pragmatic Data Issues

 

At the moment, the largest problem facing the would-be regulatory-network infererrer is the poor quality of the available experimental data.   There are four chief shortcomings: excessively noisy data, ambiguous data interpretation, overly brief time series, and insufficient replication of experiments.

A brief review of the experimental apparatus used for data collection may be appropriate at this point, in order to place these shortcomings in perspective.  Firstly, there are two major types of microarray hardware: gene chips and spotted microarrays.  Anecdotally, Affymetrix gene chips appear to be more reliable than spotted microarrays, on the whole.  Spotted microarrays are often subject to chemical impurities that cause it to look like some genes are expressed at a moderate level when they’re actually not expressed at all; because of this, many researchers using this apparatus operate by ignoring all but the most vigorously expressed genes in any given experiment.  

All current approaches to the measurement of gene expression embody a common methodology.  Single stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate depending on the type of apparatus). The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands (“target molecules”) floating in the solution.   The target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified.  The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region.  An image file is created, a photograph of some sort of the pattern of fluorescence emanating from the microarray itself. Typically the image file is then “gridded”, i.e. mapped into a pixel array with a pixel corresponding to each probe molecule. Then, there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals.

In an effort to get more reliable results, researchers generally prepare two related samples, each of which is colored with a different fluorescent substances (usually, one green, one red).  They then compare the relative amounts of expressed DNA/RNA in the two samples.  The ratio of green/red at a given location is a more meaningful number than the absolute fluorescence. Using this ratio is a way of normalizing out various kinds of experiment-specific “noise”, assuming that these noise factors will be roughly constant across the two samples.

But even if this ratio data is produced in a relatively noise-free way, there are still significant issues in the interpretation of the data.   For one thing, there are many different factors influencing the strength of the bond formed between two single stranded DNA/RNA molecules, such as the length of the bonded molecules, the actual composition of the molecules, and so forth. Errors will occur due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match. This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones. Other laboratory conditions besides temperature can have similar effects. Another problem is that the “probe molecules” affixed to the surface may fold up and self-hybridize, thus rendering them relatively inaccessible to hybridization with the target.   And, finally, microarray results do not indicate whether a given instance of gene expression is the result of transcription or RNA turnover.  It is only the transcription case that give you direct information about genetic regulatory patterns.

All these pragmatic issues mean that a single data point in a large microarray data set cannot be taken all that seriously. The data as a whole is extremely valuable and informative, but there are many ways for spurious information to find its way into the dataset. This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise.

Given all these problems, it seems evident that experiments should be repeated many times over, under conditions as close to identical as possible.  Only in this way can we know if, say, the similar movements of FKH1 and NDD2 in a given experiment are a meaningful indication of cellular dynamics under the given experimental conditions, or just a fluke.  But it is very rare to see replicate data sets reporting multiple instances of the same experiment under very similar conditions.  The yeast cell cycle datasets generously posted online and standardly used to test gene expression data analysis algorithms, do not come with replicate datasets.  The reason for this is clear: running replicates of experiments is costly.  Yet without this, it is difficult to be confident of the significance of the data, or patterns extracted from it.

Finally, there is the problem of time series length.  Most gene expression data collected is not time series data at all, but simply refers to the expression levels of genes in a given cell as a single time.  For regulatory network inference, of course, we need time series data.  But most time series data that has been collected covers only 5-20 time points.  By the ordinary standards of the discipline of time series analysis, this is terribly short.  None of the standard methods in the time series analysis toolkit are meant to apply to a series of 15 6000-dimensional vectors.  But this is what we have with standard datasets regarding the yeast cell cycle: there are over 6000 genes in yeast and microarray datasets give expression levels for every one of them.  There are significant experimental challenges in the production of longer gene expression time series; but we believe these are challenges that must and will be met over the next few years.  From a time series analysis perspective, the difference between 15 time points and 150 time points is vast.   The techniques presented here are able to give meaningful results even on the very brief time series currently available; but we believe they will show their true power only when more extensive gene expression time series data is available.  Of course, the “ideal “ (i.e. statistically acceptable) case is to have a reasonable number of replicates of experiments with reasonably long time series.  This is far from the current situation, but experimental genetics is nothing if not rapidly evolving.

 

 

 

 

4.    Multiple-Time-Series Pattern Recognition

 

In this section we will outline our general approach to regulatory network inference.  We will cast the problem as a special case of a more general problem of multiple-time-series pattern recognition (MTSPR).  MTSPR is a general framework for the analysis of interrelated time series, which may be associated with a variety of different learning mechanisms.  In the following section we will discuss some concrete learning mechanisms, drawn from the Novamente integrative AI system, that we have found valuable in an MTSPR context.

 

 

4.1     The Data Matrix

 

An MTSPR problem begins with a set of numerical time series.   In the simplest case, these all involve the same time points, so that one has a set of series Vi, each containing a value at each of a set of time point Tj.   Such data is typically presented in the form of a matrix, which we will refer to in the following as the data matrix:

 

T1            T2            T3            ...

 

V1        V11            V12            V13            

V2        V21            V22            V23               

V3        V31            V32            V33               

                               

 

Sometimes there will be some missing values, and sometimes each value Vij will be associated with some other parameters (e.g. measures of confidence, or probability-distributional parameters).  There may also, in some cases, be nonquantitative information associated with the Tj, Vi, or Vij,  or with the matrix of time series as a whole, or with subsets (typically submatrices) of the matrix of time series.

In the case of gene expression time series data, what we have is

 

Vij = expression of gene gi at time Tj

 

where gi is the gene corresponding to series Vi.

Here we’ll only explicitly consider the case where the time series Vi in question all occur at the same time points Tj.  There are no major issues in extending the concepts presented here to the more general case, but it makes the notation and details more complex, and would clutter the exposition unnecessarily.   In gene expression data analysis, the time series derived from a single experiment will all occur at the same time points.  If one is trying to align time series from multiple experiments on the same kind of cell, then one has the more complex case.

It is nicest mathematically and computationally when the Tj values are equispaced.  In microarray experiments on the cell cycle, however, this is not normally the case (though it’s usually roughly true); so we will not assume it.

One or more of the rows of the data matrix represents a time series whose pattern of behavior we want to understand.  Most simply, this means that we want to find a function F that

 

·        starts with the V matrix with time values up to TN as input

·        as an output yields the values VI(1) ,N+r … VI(k) , N+r for some set of rows {I(1),…I(k)}. 

 

The value r represents the “lookahead.”  In the case of equispaced Tj, then r = Tj+1-Tj means that we are looking one time step ahead; this is the case we will explicitly consider here.   In general, it is possible to have multiple simultaneous lookaheads. 

            In practice, for gene expression analysis, there are two major cases:

 

 

 

4.2  Defining a Function Space

 

Suppose we have a space of functions mapping Rn ´Rm à [0,1].  Assume that these functions are elements of the space of n-argument expressions generated by recursively applying operators drawn from some set of elementary operators. 

We have experimented with various operator sets.  For our practical work with gene expression analysis, so far, we have chosen a fairly specialized and restricted operator set, that will be described in Section 6.1 below.   However, these specialized operators are defined in terms of simpler and more standard operators, which we will introduce in this section.  From a mathematical perspective, the operator set defined in this section is actually quite adequate for gene expression data analysis.  From a pragmatic perspective, however the problem with it is that it often produces patterns that are complex in appearance, and difficult for humans to easily browse and intuitively comprehend.

We will refer to the set of operators including

 

> , < , =, >fuzzy , <fuzzy, = fuzzy, , + , * , - , / , AND , OR , NOT

 

as the standard operator set.

The =fuzzy operator is also called similarity; it is defined by

 

similarity(x,y) = 2-c|x-y| , c>0

 

The parameter c is set differently for each dataset, so that the mean value of similarity(x,y) over x and y in the data matrix is roughly .5.  A random sampling method may be used together with the conjugate gradients optimization algorithm, to do this optimization of c in an efficient way.

Similarly, e.g. x >fuzzy y may be defined by

 

x > fuzzy y = 0                                           if x < y

x > fuzzy y = -1 + 2/(2-c|x-y| + 1)               otherwise

 

where the c value is the same as for the similarity operator.

In the simplest case, the logical operators are considered using their probabilistic logic definitions, so that for x, y Î[0,1]

 

x AND y = xy

x OR y = x+y-xy

NOT x = 1-x

 

These definitions embody independence assumptions; in an advanced implementation these assumptions are not made, and known dependencies are taken into account, so that e.g. if

 

A = similarity(x, .5) = .3

B = similarity(x,.8) = .9

 

then it need not be true that

 

A AND B = .3*.9

 

since A and B are not truly independent.

            The simplified operator set to be introduced in Section 6.1 below, defines some higher-level operators that are compositions of the standard operators.

Let us call the space of functions mapping Rn ´Rm à [0,1], defined by recursive application of operators from this operator set, F(n,m;O).  Sometimes we will leave the operator set implicit and simply write F(n,m) - F(n,m;O). 

            Finally, where f ÎF(n,m), and the data matrix involves m rows Vi, let EN(f) denote the output of f, when applied to the input consisting of the columns of the data matrix for times TN-n+1,…,TN.   The operator EN effectively transforms f into a function of time.

 

           

 

4.3  Predictive Relationships

 

Suppose we have two functions f, h ÎF(n,m).   We may define the predictive logical pattern

 

f à h  <s>

 

to have the strength

 

s =  åN  (EN(f) * EN+1(h))   /  åN EN(f)

 

where the sum goes over all times N for which the terms in the summand are defined.

Intuitively, the semantics here is

 

 

 

åN  (truth of f over (TN-n+1,..,TN)) AND (truth of h at TN+1)

s =                   ---------------------------------------------------------------------

åN  (truth of f over (TN-n+1,..,TN))

 

 

            We may define a simultaneous logical pattern similarly, as a pattern of the form

 

f à h  <s>

 

where

 

s =  åN  (EN(f) * EN(h))   /  åN EN(f)

 

 

A predictive logical pattern uses quantities computed from the times before time T, to predict something about a quantity at time T.  Essentially, it  is a probabilistic implication between two relationships, one pertaining to the time interval (T-n+1,…,T), the other pertaining to the time point T.  On the other hand a simultaneous logical pattern is a implication relation between two expressions, both of which may directly pertain to the same time T.   Both predictive and simultaneous logical patterns may be interesting, in various cases.

            The use of a single probability value s for the truth value of a logical pattern is simple, but not always adequate.  In practice we use several options, the simplest of which is a (probability, weight of evidence) = (s,d) pair, as in

 

f à h  <s,d>

 

where d is in [0,1].   The semantics of the weight of evidence d is basically the same as in Pei Wang’s NARS system (Wang, 2001), although Wang’s conceptual interpretation of the strength value s is a little different from ours.

Generally

 

d=H/(H+k)

 

where H is a measure of the number of elementary observations underlying the probability s.  In this case, we may set

 

H = åN EN(f)

 

corresponding with the s formula given above.  The value k is called the “personality parameter”; a typical value range is k=2-10.

We may define the weight w of a logical pattern by

 

w=s*d

 

The goal of MTSPR, as currently implemented, is to find logical patterns of high weight. 

            The complexity of a predictive pattern, drawn from the function spaces defined above, may be defined pragmatically as the number of nodes in the smallest function tree representing the pattern, drawn from the space of function trees using the specified elementary operators at its nodes.  Of course, this is not the same as the algorithmic information relative to a powerful programming language like LISP, because there may be some operator trees that would be much more compactly expressed using programs containing loops or other advanced constructs.  However, we feel this is a meaningful measure of simplicity for small operator trees such as we are considering in practice here.

            Finding the highest-weight predictive or simultaneous logical patterns in a set of data appears to be a computationally intractable problem.  What we will present here is a plausible heuristic for finding reasonably high-weight predictive patterns within reasonable computational effort.

            Finally, it is not guaranteed that the highest-weight predictive patterns, when composed into an implicit genetic dynamic, will necessarily yield the highest-quality genetic dynamic.  Some interesting theoretical work could perhaps be done here, but this path has not yet been pursued.    All we can say at present is that high-weight predictive patterns, in practice, tend to lead to reasonably high-quality genetic dynamics.

 

 

4.4   The Implicit Genetic Dynamic

 

So, finally, how does all this lead to an inferred genetic dynamic?  Interestingly, although from a general conceptual perspective our goal in this work is to infer genetic dynamics, from a pragmatic point of view, at the present point in time, assembling regulatory patterns into an overall genetic dynamic is not really necessary.  This is because the main use of regulatory network inference tools, at this point, is to stimulate biological discovery, rather than to singlehandedly model the nature of the genome.  The really important thing is to find a set of prominent and relevant regulatory patterns that can stimulate further hypothesis.

For theoretical purposes, however, it is valuable to explicate the mathematical connection between the pragmatic pattern-recognition methods we are using and the general framework of genetic regulatory network inference.  We will do so in this section.

Suppose one has a set of predictive logical patterns, inferred from a data matrix.  We may write this set

 

P = { fr à hr <sr,dr> , r=1,…,Q}

 

Given a time point N, a time lag n, and a gene gi, one may gather a set P(N,n,i) of the indices r corresponding to the elementary predictive patterns in S that make predictions about Vi,N (about gene gi at time TN).  

Now, how do we derive from this a single prediction for the expression of gi at TN?  We simply merge the predictions made by the various predictive patterns in P(N,n,i).  This merging is done by the “revision” formula

 

(1/D) å sr dr

 

where

 

D = å sr dr

 

and the sums are taken over S(N,n,i).  

This process almost allows us to define a genetic dynamic that maps the M-time-point state of the dynamic genome into the present state of the genome.  There is only one catch: there may be some pairs (hi,TN) about which there are no predictive patterns, or about which the only predictive patterns have very low weight of evidence.  So what we really derive from this procedure is a subset of a genetic dynamic.  

This is why we speak of deriving “genetic regulatory patterns.”  We believe that, given the numerous problems with current microarray data, no analytic approach that focuses on inferring entire genetic dynamics can hope for significant success.  But it may yet be possible to infer parts of the genetic dynamic with reasonably high confidence.  These genetic-dynamic fragments form what we informally call genetic regulatory patterns. 

Finally, we have spoken about predictive logical patterns, but what about simultaneous logical patterns – how do they contribute to the construction of an implicit genetic dynamic?   This is a little subtler, but conceptually quite straightforward.  Simultaneous logical patterns represent probabilistic constraints, and a hypothesized genetic dynamic should be considered more plausible to the extent that it is agreeable with these constraints.

Suppose we have a candidate genetic dynamic F, an approximation to the (unknown) real genetic dynamic F*.  Then, for each time point t(i), we can look at the series

 

G(t(i)) = ( E(t(1)),…, E(t(i)), F( E(t(i)), …, E(t(i-k)) )  )

 

Then, given a set of simultaneous logical patterns, we can assess, for each G(t(i)), the weight of the simultaneous logical patterns evaluated at the time point t(i), on the series G(t(i)).  The average of this over all t(i) is an measure of how harmonious the simultaneous logical patterns are with the candidate genetic dynamic F, in the context of the particular gene expression time series E. 

            To incorporate simultaneous logical patterns in the definition of an implicit genetic dynamic, then, one must revise the revision rule given above, and use instead

 

(1/D) å zr sr dr

 

where

 

D = å zr sr dr

 

 

and zr is defined as the harmoniousness of the predictions made by the predictive pattern  fr à hr with the given set of simultaneous logical patterns.

 

 

 4.5          Augmenting the Matrix

 

As noted above, our general approach to regulatory network inference is integrative in two senses.   Our methodology is based on integrative AI: as will be discussed in the following section, we use a combination of learning methods to enable efficient search of the space of possible predictive patterns.   But we also consider it important to integrate diverse background knowledge into the data analysis process, a topic that will be taken up in detail in Part II of this paper. 

In the terminology introduced above, the simplest form this integration takes is the augmentation of the matrix with specialized rows.   These extra rows embody additional knowledge, and may be incorporated in predictive relations, along with the original rows of the matrix (representing gene expression levels).

For instance, one can run a clustering algorithm, and then add the expression values of clusters in extra rows.  This is very sensible since most current work on regulatory network inference focuses on the output of clustering.   If one has the output of supervised learning experiments, based on some classification results telling which genes are associated with which diseases for example, then these category labels can be included as extra rows as well. 

More interestingly, one can use information from external databases.  For instance, there is knowledge about which genes tend to be active at which phases of the cell cycle.   This knowledge has been gathered across many different experiments.  One can create an auxiliary time-series for each phase of the cell cycle, so that the value of this series at time T is the average activity of genes generally active in that phase, at time T.  This is a simple yet subtle way of incorporating cross-experimental knowledge into the analysis of a single experiment.

Suppose one is interested particularly in the regulation of a handful of proteins, which may for example be known to be involved in some disease of interest.  Then one can create an auxiliary time series corresponding to the set of genes that produce these proteins.

One may also produce auxiliary time series corresponding to phenotypic characteristics.  For example, the yeast genome, the Stanford Genome Database contains information about the phenotypic meaning of each gene.  One can construct an auxiliary time series indicating the average expression, at each time point, of genes with a given phenotypic meaning (e.g. genes related to metabolism, or those involved with transcriptional regulation of a certain gene).

Adding these auxiliary time series does not change the mathematical problem of inferring predictive relationships; it just increases the fund of data available for predictive relation formation.  However, it makes a huge difference in the scope of the patterns that can be found.  Because, with these additional rows added in, one is no longer merely recognizing patterns in a particular experiment considered in a vacuum – one is recognizing patterns in a particular experiment in the context of a great deal of biological background knowledge.  Integrative AI is being used to carry out informationally-integrative data analysis. 

 

 

 

5.  Heuristic Learning Methods for MTSPR

 

 

 

            The formalism of the previous section casts the problem of genetic regulatory network inference as a problem of inferring probabilistic logic rules from data.  The quality of a probabilistic logic rule as a data model was also quantified, in terms of (strength)*(weight of evidence).   The problem of how to actually find high-quality probabilistic logic rules, however, remains to be addressed.  This is a very significant problem, because the space of possible rules is very large.  We have explored a variety of approaches to this task, and will not give a full review here.  We will restrict ourselves to describing one relatively simple, pragmatically workable heuristic technique.   

We have carried out our data analysis using Novamente (Goertzel and Pennachin, 2002), an integrative AI system in which knowledge is represented in a “dynamic semantic network,” which is acted upon by a number of different cognitive dynamics.  Probabilistic logic rules are represented inside Novamente as collections of nodes and links, and in this way the gamut of Novamente cognition mechanisms can be used to search the space of probabilistic logic rules for high-quality examples.  It is not feasible to discuss Novamente cognition in any detail in this paper.  Fortunately, however, a good sense of our methods can be gotten across without such a discussion. 

 

 

5.1  A Simplified Approach to Inferring Regulatory Patterns

 

 

            In this section we will describe a relatively simple algorithm that we have found able to infer probabilistic logic rules of the sort mentioned above, from real gene expression data.   The algorithm involves two levels; let us begin by describing the first:

 

Level 1.   A genetic algorithm is used to evolve sets of genes.   These sets are normally of size 2-10.   The fitness function of a gene-set Q is defined as the interpredictability of Q.

 

The GA used here is fairly standard, see e.g. (Goldberg, 1989).  Crossover of two sets is defined by producing a child set containing some elements from each parent set.  Mutation of a set is carried out by removing one or more elements of the set with a certain probability, and adding elements to the set with a certain probability.  The population is allowed to contain sets of varying size.

In the process of evaluating the interpredictability of a gene-set Q, a set of predictive patterns relating genes in Q with each other and with background data is produced.  This set of predictive patterns is stored.  Once the Level 1 evolution process is done with, the answer provided is not merely an collection of fit gene-sets Qi; it is the sets of predictive patterns corresponding to these gene-sets.  For, as we have seen, it is these predictive patterns that implicitly contain an inferred genetic dynamic.

But how is the interpredictability of a gene-set Q computed?  Again we avail ourselves of evolutionary programming:

 

Level 2.  Suppose we are given a gene-set Q, and are asked to evaluate its interpredictability.  We then compute the Q-predictability for each gene in the set and average these values together. 

 

The Q-predictability of a gene g in Q is defined by searching for predictive patterns that predict g based on: the other genes in Q, and the extra background-knowledge-embodying time-series appended to the data matrix.

            And how is this search for patterns predicting g conducted?  By specifying a lookback n, and executing out a genetic algorithm on the space of probabilistic logic functions involving the relevant rows of the data matrix, extending n columns in the past.  This GA is similar to Koza-style GP (Koza, 1992), in that it is acting on trees representing logic formulas. 

            This is obviously not a highly efficient search approach, as it involves a GA with another GA inside the fitness function.  However, in practice, it is much more effective than simply searching the space of all predictive patterns in the given data matrix.  

An advanced Novamente implementation of these processes allows some interaction between the different levels of evolution, rather than considering them as completely distinct, with one nested inside the other.   Chiefly, this is done by storing knowledge as to which predictive patterns have proved valuable within which gene sets.  These predictive patterns can then be used as initial guesses within other, related gene sets – a strategy that can significantly accelerate the search process.

One way or another, however, it seems to be necessary to restrict the search for patterns to patterns among relatively small sets of genes.  Otherwise the scope of the search is just too large to be practicable, even for yeast with its 6000+ genes, let alone for human-genome-inspired data.

This sort of technique is certainly not guaranteed to produce a maximum-quality genetic dynamic, in the general sense introduced above.  In fact, it is extremely unlikely to do so, as it embodies some fairly severe restrictions on the space of predictive patterns.   The correct perspective to take is that these algorithms are reasonably effective heuristics for finding reasonably high-quality predictive patterns.   Given a pattern recognition problem of such dimensionality and complexity, this may be the best that can be expected.

 

 

5.2   The Role of Further Novamente Cognition

 

Generally speaking, Novamente cognition has two things to contribute to the regulatory network inference process, beyond just providing a framework within which the algorithm of the previous subsection may be compactly implemented:

 

1.      Superior search speed through integration of cognitive techniques other than evolutionary programming (e.g. probabilistic inference)

2.      Automated augmentation of the data matrix with background-info-derived rows

 

Point 2 will be explored in the sequel to this paper, and Point 1 in other further papers.

 

 

 

6.   Simplified Logical and Tabular Patterns