Inferring Genetic Regulatory Patterns

with Integrative AI

Part I: Quantitative Pattern Recognition

Ben Goertzel¹, Cassio Pennachin, Andre’ Senna,

Thiago Maia, Takuo Henmi, Guilherme Lamacie, Saulo Pinto

Biomind LLC

May, 2002

The problem treated is the inference of genetic regulatory patterns (fragments of and approximations to genetic regulatory networks) from gene expression time series data. The many challenges involved (data quality, nonstationarity, etc.) are confronted by applying a sophisticated statistical pattern recognition methodology that integrates several AI techniques, including evolutionary programming, probabilistic inference, and nonlinear forecasting. A simplified set of pattern types is introduced, permitting easy “pattern browsing” by biologists without training in computer science or time series analysis. The approach is illustrated by describing some patterns extracted from a standard dataset on the yeast cell cycle. Part II of the paper will describe an extension of the method to incorporate background data from various biological databases into the integrative data analysis process.

1. Introduction

The vast majority of recent work on the analysis of gene expression data has involved clustering or supervised categorization. Clustering tools are used to group similar genes together; or supervised learning algorithms are used to create models distinguishing the expression profiles of cells in one category from the profiles of cells in another. These are worthy applications, but they represent very coarse interpretations of the data. A complementary, and in some cases much deeper understanding can be obtained by inferring parts of the network of interactions by which genes conspire to control cellular functions.

Of course, this “regulatory network inference” problem is much more difficult than clustering or categorization, and it is arguable that the data currently available is too scant and too noisy, and too infrequently gathered according to an adequate experimental design, to permit its solution. However, the work described here suggests that with sufficiently sophisticated pattern recognition technology, it is possible to get meaningful regulatory network inference results even today. We will show that, though one cannot yet infer the entire genetic regulatory network from experimental data, one can observe interesting regulatory patterns, useful for guiding further computational and laboratory exploration. And, as the relevant experimental equipment and methodology become more mature, these techniques will be yet more valuable.

Compared to the standard approaches found in the bioinformatics literature, our analytical techniques are extremely sophisticated, involving the integration of several different AI techniques within a general-purpose AI system, the Novamente AI Engine (Goertzel and Pennachin, 2002). A full review of Novamente would go well beyond the scope of this paper; here, therefore, we will give a relatively mathematical treatment of the algorithms used by Novamente as it behaves in the context of genetic regulatory network inference. Implementation details and efficiency optimization techniques, which are highly involved, will be omitted.

The existing literature on regulatory network inference spans a variety of approaches. Liang et al (1998) and later Akutsu et al (1999) have created algorithms for inferring Boolean networks from regulatory network data; however, they only tested their algorithm on simulated data, not real microarray data, so its functionality in practice remains unknown. On the other hand, a variety of continuous-variable models, some of which have been tested on real lab data, are well-reviewed in (Wessels et al, 1999). Pairwise models seek to find pairs of genes whose expression time series are appropriately correlated. Arkin and Ross (1997) did this statistically; Chen (1999) took a more innovative computer-science-oriented approach, involving the analysis of peaks in gene expression time series. One problem with pairwise models is that they ignore many types of gene interactions, which are not binary in nature. One way to go beyond this is to create “network models” that use linear or nonlinear differential equations to model gene interactions (Someren et al, 2000); an issue with these models, however, is that the number of parameters increases very rapidly with the diversity of phenomena modeled.

Von Dassow et al (2000) is worth reading as an indication of the complexity of the equations and the number of parameters needed to really model a genetic regulatory network in detail (the segment polarity network in insects, in this case). Their equations are derived in an integrative way from a large number of different data types, some quantitative, some qualitative. Inferring this kind of equational model from microarray data in its current quantity and quality is, as they note, not possible.

We believe that existing linear-equation, pairwise and Boolean network models are too oversimplified to capture much of real regulatory network structure; whereas existing complex network models have too many parameters to be realistically fitted by the scant and noisy data. The approach proposed here walks the fine line between these twin pitfalls by taking a statistical pattern recognition approach, which is complex enough to capture subtle patterns, but does not make heavily parameter-laden, hard-to-calibrate assumptions about network dynamics.

Here we will present a formalization of the regulatory network inference problem, and describe our innovative approach to solving it. We will then present initial results obtained by applying our technique to a standard yeast cell cycle dataset (Cho et al, 1998). The results presented here have been obtained via a command-line Linux program called “RNI Miner,” which is a specialized interface to our general-purpose AI system, the Novamente AI Engine. Plans are underway to embody the RNI Miner functionality in a commercial product, the Biomind Genetic Network Analyzer, which will present genetic network inference functionality in the context of a simple-to-use Web-based UI.

Roughly speaking, the analytical process we describe here has three aspects. We:

1. Define a special kind of probabilistic logical expression (a “logical pattern”) to represent a conjectural regulatory pattern (a relationship between genes)

2. Use probabilistic logic rules to assess the extent to which a conjectural regulatory pattern fits the data or not, incorporating biological background knowledge where appropriate

3. Use a combination of learning techniques (including evolutionary programming and probabilistic inference) to find conjectural regulatory patterns that fit the data well

We have implemented these structures and processes in the context of the Novamente AI Engine, a general-purpose integrative AI architecture that has applications in many domains beyond biology as well (Goertzel and Pennachin, 2002).

In this paper, Part I in a two-part series, we will consider primarily the special case of this analytical process in which no biological background knowledge is invoked. The use of background knowledge will be presented in detail in Part II. However, we will describe the process here in sufficiently generality to encompass the incorporation of background knowledge, and we will give some simple examples of background-knowledge-incorporating genetic regulatory patterns.

One major issue that arises in this work is the tradeoff between accuracy and comprehensibility. Put simply, the most significant patterns in a given dataset may be difficult for humans to understand. And comprehensibility is very important here, because the main role of the patterns recognized by a system such as ours, is to guide future laboratory work. If the patterns recognized by the system are not intuitively meaningful to human biologists, then pragmatically speaking they are of no value. To address this problem, we have developed two simplified “pattern templates,” which we call symbolic tabular patterns and directional symbolic tabular patterns. We will give several examples of these along with more general logical patterns (representing probabilistic logic rules). We will also describe a novel method for interactively exploring patterns in expression data using tabular patterns, in which the user fills in parts of a tabular pattern and the remainder is filled in by pattern-recognition software. As we shall illustrate with concrete examples, sometimes the more general logical patterns have a great intuitive power to go along with their high accuracy; but in other cases the simpler tabular patterns provide more insight.

2. A Mathematical Formulation of the Genetic Regulatory Network Inference Problem

In (Wessels et al, 2001), an elegant mathematical formulation of the genetic regulatory network inference problem is presented. However, the formulation given there is specifically oriented toward particular types of inference algorithms. Here we present a formalization that takes its cue from their treatment, but has a somewhat more general scope.

First some notation:

g is the space of genes
g is a particular gene
G is the space of genomes (a genome is a set of genes associated with a particular organism)
G is a particular genome
E is the space of expression level vectors
E is a particular expression level vector (each entry in such a vector gives the expression level of a particular gene, as measured in a particular experiment on a particular genome)
S is the space of gene expression level time series
S is a member of S; that is, it is a set of expression level vectors, each associated with a given point in time.
F is the space of genetic dynamics (i.e. functions mapping Eⁿ into E for some n, most simply n=1)
F is an instance of a genetic dynamic
C is the space of environmental conditions for a expression experiments
C is a particular set of conditions for a given experiment

In this notation, we may represent a gene expression time series as

S = { E(t(1)), E(t(2)),….,E(t(n)) }

where E(t(i)) is the expression level vector associated with time t(i), and

i>j è t(i)> t(j)

In general (and in most real datasets) the elements of the series are not always equispaced, i.e. t(i+1) – t(i) may be nonconstant.

Next, let

S_j,k= { E(t(j)),…,E(t(k)) }

be a subseries of S. Then the regulatory network inference problem can be cast as follows:

Given S, find F so that

· å_i>n+1|| F(S_i,i-n) - E(t(i+1))|| is small

· F is as simple (un-complex) as possible

This is a classic inverse problem, a problem of inferring dynamics from data. James Crutchfield has written a great deal about this type of problem under the general name of Computational Mechanics (see e.g. Crutchfield and Young, 1990; Crutchfield, 1989). Supposing that there is a real F underlying the data, which we may call F*, then the goal is to find a simple F that accurately approximates F* .

Conceptually speaking, the notion of complexity that we use here is that of algorithmic information theory (Chaitin, 1987). A function is simple if it can be computed by a compact program. The complexity of a function is the length of the shortest program for computing it. Pragmatically, we will consider genetic functions within a certain class of functions, which has a simple, easily computable heuristic complexity measure associated with it.

The quality of a candidate function F as a genetic dynamic may be formally defined as

W – cost(F)

where

cost(F) = w *complexity(F) + (1-w) å_i>n+1|| F(S_i,i-n) - E(t(i+1))||

and 0<w<1 is an adjustable weight. The cost factor is small for high-quality F; the constant W must be chosen sensibly. For instance, if one has a probability measure on the space of genetic dynamics, W may be plausibly chosen equal to twice the mean of cost(F).

Wessels et al (2001) describe a number of mathematical criteria for judging algorithms that infer regulatory networks from microarray data. Adding simplicity to their list, and converting their criteria to the present notation, we find:

· inferential power (defined as the similarity between F* and F)

· predictive power (defined as how closely, given F* and not F, one can reconstitute the actual time series S from the initial conditions E(t(0)) )

· robustness (how noise-tolerant is the method)

· consistency (does the method infer too many possible networks underlying given data)

· stability (are the predicted expression levels finitely bounded)

· computational cost

· simplicity (ceteris paribus, Occam’s razor leads us to prefer the simplest model)

A key point is that different approaches to the problem make different assumptions as to the nature of the space F of genetic dynamics. Some assume it’s a space of linear functions, some that it’s a space of finite Boolean networks, some that it consists of nonlinear functions of some specific form, etc. Most generally, we will consider F to be a space of probabilistic logic rules. This is not a restrictive assumption since probabilistic logic rules as we formulate them are dense in the space of smooth functions (since they can approximate arbitrary polynomials). In practice, due to the requirement of rule comprehensibility, we will often restrict consideration to simpler subsets of the space of possible probabilistic logic rules, e.g. what we call “tabular rules.” Finding ways to present more complex rules in a humanly and biologically intuitive manner is an important area for future work.

3. Pragmatic Data Issues

At the moment, the largest problem facing the would-be regulatory-network infererrer is the poor quality of the available experimental data. There are four chief shortcomings: excessively noisy data, ambiguous data interpretation, overly brief time series, and insufficient replication of experiments.

A brief review of the experimental apparatus used for data collection may be appropriate at this point, in order to place these shortcomings in perspective. Firstly, there are two major types of microarray hardware: gene chips and spotted microarrays. Anecdotally, Affymetrix gene chips appear to be more reliable than spotted microarrays, on the whole. Spotted microarrays are often subject to chemical impurities that cause it to look like some genes are expressed at a moderate level when they’re actually not expressed at all; because of this, many researchers using this apparatus operate by ignoring all but the most vigorously expressed genes in any given experiment.

All current approaches to the measurement of gene expression embody a common methodology. Single stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate depending on the type of apparatus). The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands (“target molecules”) floating in the solution. The target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified. The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region. An image file is created, a photograph of some sort of the pattern of fluorescence emanating from the microarray itself. Typically the image file is then “gridded”, i.e. mapped into a pixel array with a pixel corresponding to each probe molecule. Then, there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals.

In an effort to get more reliable results, researchers generally prepare two related samples, each of which is colored with a different fluorescent substances (usually, one green, one red). They then compare the relative amounts of expressed DNA/RNA in the two samples. The ratio of green/red at a given location is a more meaningful number than the absolute fluorescence. Using this ratio is a way of normalizing out various kinds of experiment-specific “noise”, assuming that these noise factors will be roughly constant across the two samples.

But even if this ratio data is produced in a relatively noise-free way, there are still significant issues in the interpretation of the data. For one thing, there are many different factors influencing the strength of the bond formed between two single stranded DNA/RNA molecules, such as the length of the bonded molecules, the actual composition of the molecules, and so forth. Errors will occur due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match. This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones. Other laboratory conditions besides temperature can have similar effects. Another problem is that the “probe molecules” affixed to the surface may fold up and self-hybridize, thus rendering them relatively inaccessible to hybridization with the target. And, finally, microarray results do not indicate whether a given instance of gene expression is the result of transcription or RNA turnover. It is only the transcription case that give you direct information about genetic regulatory patterns.

All these pragmatic issues mean that a single data point in a large microarray data set cannot be taken all that seriously. The data as a whole is extremely valuable and informative, but there are many ways for spurious information to find its way into the dataset. This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise.

Given all these problems, it seems evident that experiments should be repeated many times over, under conditions as close to identical as possible. Only in this way can we know if, say, the similar movements of FKH1 and NDD2 in a given experiment are a meaningful indication of cellular dynamics under the given experimental conditions, or just a fluke. But it is very rare to see replicate data sets reporting multiple instances of the same experiment under very similar conditions. The yeast cell cycle datasets generously posted online and standardly used to test gene expression data analysis algorithms, do not come with replicate datasets. The reason for this is clear: running replicates of experiments is costly. Yet without this, it is difficult to be confident of the significance of the data, or patterns extracted from it.

Finally, there is the problem of time series length. Most gene expression data collected is not time series data at all, but simply refers to the expression levels of genes in a given cell as a single time. For regulatory network inference, of course, we need time series data. But most time series data that has been collected covers only 5-20 time points. By the ordinary standards of the discipline of time series analysis, this is terribly short. None of the standard methods in the time series analysis toolkit are meant to apply to a series of 15 6000-dimensional vectors. But this is what we have with standard datasets regarding the yeast cell cycle: there are over 6000 genes in yeast and microarray datasets give expression levels for every one of them. There are significant experimental challenges in the production of longer gene expression time series; but we believe these are challenges that must and will be met over the next few years. From a time series analysis perspective, the difference between 15 time points and 150 time points is vast. The techniques presented here are able to give meaningful results even on the very brief time series currently available; but we believe they will show their true power only when more extensive gene expression time series data is available. Of course, the “ideal “ (i.e. statistically acceptable) case is to have a reasonable number of replicates of experiments with reasonably long time series. This is far from the current situation, but experimental genetics is nothing if not rapidly evolving.

4. Multiple-Time-Series Pattern Recognition

In this section we will outline our general approach to regulatory network inference. We will cast the problem as a special case of a more general problem of multiple-time-series pattern recognition (MTSPR). MTSPR is a general framework for the analysis of interrelated time series, which may be associated with a variety of different learning mechanisms. In the following section we will discuss some concrete learning mechanisms, drawn from the Novamente integrative AI system, that we have found valuable in an MTSPR context.

4.1 The Data Matrix

An MTSPR problem begins with a set of numerical time series. In the simplest case, these all involve the same time points, so that one has a set of series V_i, each containing a value at each of a set of time point T_j. Such data is typically presented in the form of a matrix, which we will refer to in the following as the data matrix:

T₁ T₂ T₃ ...

V₁ V₁₁ V₁₂ V₁₃ …

V₂ V₂₁ V₂₂ V_{23 …}

V₃ V₃₁ V₃₂ V_{33 …}

… … … …

Sometimes there will be some missing values, and sometimes each value V_ij will be associated with some other parameters (e.g. measures of confidence, or probability-distributional parameters). There may also, in some cases, be nonquantitative information associated with the T_j, V_i, or V_ij, or with the matrix of time series as a whole, or with subsets (typically submatrices) of the matrix of time series.

In the case of gene expression time series data, what we have is

V_ij = expression of gene g_i at time T_j

where g_i is the gene corresponding to series V_i.

Here we’ll only explicitly consider the case where the time series V_i in question all occur at the same time points T_j. There are no major issues in extending the concepts presented here to the more general case, but it makes the notation and details more complex, and would clutter the exposition unnecessarily. In gene expression data analysis, the time series derived from a single experiment will all occur at the same time points. If one is trying to align time series from multiple experiments on the same kind of cell, then one has the more complex case.

It is nicest mathematically and computationally when the T_jvalues are equispaced. In microarray experiments on the cell cycle, however, this is not normally the case (though it’s usually roughly true); so we will not assume it.

One or more of the rows of the data matrix represents a time series whose pattern of behavior we want to understand. Most simply, this means that we want to find a function F that

· starts with the V matrix with time values up to T_N as input

· as an output yields the values V_I(1) _,N+r … V_{I(k) , N+r}for some set of rows {I(1),…I(k)}.

The value r represents the “lookahead.” In the case of equispaced T_j, then r = T_j+1-T_j means that we are looking one time step ahead; this is the case we will explicitly consider here. In general, it is possible to have multiple simultaneous lookaheads.

In practice, for gene expression analysis, there are two major cases:

One just throws all the data at an analytic algorithm, looking for patterns among any genes whatsoever, curious to see what turns up
One has a small set of genes that one is particularly concerned with (the rows {I(1),…I(k)} in the present formalism)

4.2 Defining a Function Space

Suppose we have a space of functions mapping Rⁿ ´R^m à [0,1]. Assume that these functions are elements of the space of n-argument expressions generated by recursively applying operators drawn from some set of elementary operators.

We have experimented with various operator sets. For our practical work with gene expression analysis, so far, we have chosen a fairly specialized and restricted operator set, that will be described in Section 6.1 below. However, these specialized operators are defined in terms of simpler and more standard operators, which we will introduce in this section. From a mathematical perspective, the operator set defined in this section is actually quite adequate for gene expression data analysis. From a pragmatic perspective, however the problem with it is that it often produces patterns that are complex in appearance, and difficult for humans to easily browse and intuitively comprehend.

We will refer to the set of operators including

> , < , =, >_fuzzy , <_fuzzy, =_fuzzy, , + , * , - , / , AND , OR , NOT

as the standard operator set.

The =_fuzzy operator is also called similarity; it is defined by

similarity(x,y) = 2^-c|x-y| , c>0

The parameter c is set differently for each dataset, so that the mean value of similarity(x,y) over x and y in the data matrix is roughly .5. A random sampling method may be used together with the conjugate gradients optimization algorithm, to do this optimization of c in an efficient way.

Similarly, e.g. x >_fuzzy y may be defined by

x >_fuzzy y = 0 if x < y

x >_fuzzy y = -1 + 2/(2^-c|x-y|+ 1) otherwise

where the c value is the same as for the similarity operator.

In the simplest case, the logical operators are considered using their probabilistic logic definitions, so that for x, y Î[0,1]

x AND y = xy

x OR y = x+y-xy

NOT x = 1-x

These definitions embody independence assumptions; in an advanced implementation these assumptions are not made, and known dependencies are taken into account, so that e.g. if

A = similarity(x, .5) = .3

B = similarity(x,.8) = .9

then it need not be true that

A AND B = .3*.9

since A and B are not truly independent.

The simplified operator set to be introduced in Section 6.1 below, defines some higher-level operators that are compositions of the standard operators.

Let us call the space of functions mapping Rⁿ ´R^m à [0,1], defined by recursive application of operators from this operator set, F(n,m;O). Sometimes we will leave the operator set implicit and simply write F(n,m) - F(n,m;O).

Finally, where f ÎF(n,m), and the data matrix involves m rows V_i, let E_N(f) denote the output of f, when applied to the input consisting of the columns of the data matrix for times T_N-n+1,…,T_N. The operator E_N effectively transforms f into a function of time.

4.3 Predictive Relationships

Suppose we have two functions f, h ÎF(n,m). We may define the predictive logical pattern

f à h <s>

to have the strength

s = å_N (E_N(f) * E_N+1(h)) / å_N E_N(f)

where the sum goes over all times N for which the terms in the summand are defined.

Intuitively, the semantics here is

å_N (truth of f over (T_N-n+1,..,T_N)) AND (truth of h at T_N+1)

s = ---------------------------------------------------------------------

å_N (truth of f over (T_N-n+1,..,T_N))

We may define a simultaneous logical pattern similarly, as a pattern of the form

f à h <s>

where

s = å_N (E_N(f) * E_N(h)) / å_N E_N(f)

A predictive logical pattern uses quantities computed from the times before time T, to predict something about a quantity at time T. Essentially, it is a probabilistic implication between two relationships, one pertaining to the time interval (T-n+1,…,T), the other pertaining to the time point T. On the other hand a simultaneous logical pattern is a implication relation between two expressions, both of which may directly pertain to the same time T. Both predictive and simultaneous logical patterns may be interesting, in various cases.

The use of a single probability value s for the truth value of a logical pattern is simple, but not always adequate. In practice we use several options, the simplest of which is a (probability, weight of evidence) = (s,d) pair, as in

f à h <s,d>

where d is in [0,1]. The semantics of the weight of evidence d is basically the same as in Pei Wang’s NARS system (Wang, 2001), although Wang’s conceptual interpretation of the strength value s is a little different from ours.

Generally

d=H/(H+k)

where H is a measure of the number of elementary observations underlying the probability s. In this case, we may set

H = å_N E_N(f)

corresponding with the s formula given above. The value k is called the “personality parameter”; a typical value range is k=2-10.

We may define the weight w of a logical pattern by

w=s*d

The goal of MTSPR, as currently implemented, is to find logical patterns of high weight.

The complexity of a predictive pattern, drawn from the function spaces defined above, may be defined pragmatically as the number of nodes in the smallest function tree representing the pattern, drawn from the space of function trees using the specified elementary operators at its nodes. Of course, this is not the same as the algorithmic information relative to a powerful programming language like LISP, because there may be some operator trees that would be much more compactly expressed using programs containing loops or other advanced constructs. However, we feel this is a meaningful measure of simplicity for small operator trees such as we are considering in practice here.

Finding the highest-weight predictive or simultaneous logical patterns in a set of data appears to be a computationally intractable problem. What we will present here is a plausible heuristic for finding reasonably high-weight predictive patterns within reasonable computational effort.

Finally, it is not guaranteed that the highest-weight predictive patterns, when composed into an implicit genetic dynamic, will necessarily yield the highest-quality genetic dynamic. Some interesting theoretical work could perhaps be done here, but this path has not yet been pursued. All we can say at present is that high-weight predictive patterns, in practice, tend to lead to reasonably high-quality genetic dynamics.

4.4 The Implicit Genetic Dynamic

So, finally, how does all this lead to an inferred genetic dynamic? Interestingly, although from a general conceptual perspective our goal in this work is to infer genetic dynamics, from a pragmatic point of view, at the present point in time, assembling regulatory patterns into an overall genetic dynamic is not really necessary. This is because the main use of regulatory network inference tools, at this point, is to stimulate biological discovery, rather than to singlehandedly model the nature of the genome. The really important thing is to find a set of prominent and relevant regulatory patterns that can stimulate further hypothesis.

For theoretical purposes, however, it is valuable to explicate the mathematical connection between the pragmatic pattern-recognition methods we are using and the general framework of genetic regulatory network inference. We will do so in this section.

Suppose one has a set of predictive logical patterns, inferred from a data matrix. We may write this set

P = { f_r à h_r<s_r,d_r> , r=1,…,Q}

Given a time point N, a time lag n, and a gene g_i, one may gather a set P(N,n,i) of the indices r corresponding to the elementary predictive patterns in S that make predictions about V_i,N (about gene g_i at time T_N).

Now, how do we derive from this a single prediction for the expression of g_i at T_N? We simply merge the predictions made by the various predictive patterns in P(N,n,i). This merging is done by the “revision” formula

(1/D) å s_r d_r

where

D = å s_r d_r

and the sums are taken over S(N,n,i).

This process almost allows us to define a genetic dynamic that maps the M-time-point state of the dynamic genome into the present state of the genome. There is only one catch: there may be some pairs (h_i,T_N) about which there are no predictive patterns, or about which the only predictive patterns have very low weight of evidence. So what we really derive from this procedure is a subset of a genetic dynamic.

This is why we speak of deriving “genetic regulatory patterns.” We believe that, given the numerous problems with current microarray data, no analytic approach that focuses on inferring entire genetic dynamics can hope for significant success. But it may yet be possible to infer parts of the genetic dynamic with reasonably high confidence. These genetic-dynamic fragments form what we informally call genetic regulatory patterns.

Finally, we have spoken about predictive logical patterns, but what about simultaneous logical patterns – how do they contribute to the construction of an implicit genetic dynamic? This is a little subtler, but conceptually quite straightforward. Simultaneous logical patterns represent probabilistic constraints, and a hypothesized genetic dynamic should be considered more plausible to the extent that it is agreeable with these constraints.

Suppose we have a candidate genetic dynamic F, an approximation to the (unknown) real genetic dynamic F*. Then, for each time point t(i), we can look at the series

G(t(i)) = ( E(t(1)),…, E(t(i)), F( E(t(i)), …, E(t(i-k)) ) )

Then, given a set of simultaneous logical patterns, we can assess, for each G(t(i)), the weight of the simultaneous logical patterns evaluated at the time point t(i), on the series G(t(i)). The average of this over all t(i) is an measure of how harmonious the simultaneous logical patterns are with the candidate genetic dynamic F, in the context of the particular gene expression time series E.

To incorporate simultaneous logical patterns in the definition of an implicit genetic dynamic, then, one must revise the revision rule given above, and use instead

(1/D) å z_r s_r d_r

where

D = å z_r s_r d_r

and z_r is defined as the harmoniousness of the predictions made by the predictive pattern f_r à h_rwith the given set of simultaneous logical patterns.

4.5 Augmenting the Matrix

As noted above, our general approach to regulatory network inference is integrative in two senses. Our methodology is based on integrative AI: as will be discussed in the following section, we use a combination of learning methods to enable efficient search of the space of possible predictive patterns. But we also consider it important to integrate diverse background knowledge into the data analysis process, a topic that will be taken up in detail in Part II of this paper.

In the terminology introduced above, the simplest form this integration takes is the augmentation of the matrix with specialized rows. These extra rows embody additional knowledge, and may be incorporated in predictive relations, along with the original rows of the matrix (representing gene expression levels).

For instance, one can run a clustering algorithm, and then add the expression values of clusters in extra rows. This is very sensible since most current work on regulatory network inference focuses on the output of clustering. If one has the output of supervised learning experiments, based on some classification results telling which genes are associated with which diseases for example, then these category labels can be included as extra rows as well.

More interestingly, one can use information from external databases. For instance, there is knowledge about which genes tend to be active at which phases of the cell cycle. This knowledge has been gathered across many different experiments. One can create an auxiliary time-series for each phase of the cell cycle, so that the value of this series at time T is the average activity of genes generally active in that phase, at time T. This is a simple yet subtle way of incorporating cross-experimental knowledge into the analysis of a single experiment.

Suppose one is interested particularly in the regulation of a handful of proteins, which may for example be known to be involved in some disease of interest. Then one can create an auxiliary time series corresponding to the set of genes that produce these proteins.

One may also produce auxiliary time series corresponding to phenotypic characteristics. For example, the yeast genome, the Stanford Genome Database contains information about the phenotypic meaning of each gene. One can construct an auxiliary time series indicating the average expression, at each time point, of genes with a given phenotypic meaning (e.g. genes related to metabolism, or those involved with transcriptional regulation of a certain gene).

Adding these auxiliary time series does not change the mathematical problem of inferring predictive relationships; it just increases the fund of data available for predictive relation formation. However, it makes a huge difference in the scope of the patterns that can be found. Because, with these additional rows added in, one is no longer merely recognizing patterns in a particular experiment considered in a vacuum – one is recognizing patterns in a particular experiment in the context of a great deal of biological background knowledge. Integrative AI is being used to carry out informationally-integrative data analysis.

5. Heuristic Learning Methods for MTSPR

The formalism of the previous section casts the problem of genetic regulatory network inference as a problem of inferring probabilistic logic rules from data. The quality of a probabilistic logic rule as a data model was also quantified, in terms of (strength)*(weight of evidence). The problem of how to actually find high-quality probabilistic logic rules, however, remains to be addressed. This is a very significant problem, because the space of possible rules is very large. We have explored a variety of approaches to this task, and will not give a full review here. We will restrict ourselves to describing one relatively simple, pragmatically workable heuristic technique.

We have carried out our data analysis using Novamente (Goertzel and Pennachin, 2002), an integrative AI system in which knowledge is represented in a “dynamic semantic network,” which is acted upon by a number of different cognitive dynamics. Probabilistic logic rules are represented inside Novamente as collections of nodes and links, and in this way the gamut of Novamente cognition mechanisms can be used to search the space of probabilistic logic rules for high-quality examples. It is not feasible to discuss Novamente cognition in any detail in this paper. Fortunately, however, a good sense of our methods can be gotten across without such a discussion.

5.1 A Simplified Approach to Inferring Regulatory Patterns

In this section we will describe a relatively simple algorithm that we have found able to infer probabilistic logic rules of the sort mentioned above, from real gene expression data. The algorithm involves two levels; let us begin by describing the first:

Level 1. A genetic algorithm is used to evolve sets of genes. These sets are normally of size 2-10. The fitness function of a gene-set Q is defined as the interpredictability of Q.

The GA used here is fairly standard, see e.g. (Goldberg, 1989). Crossover of two sets is defined by producing a child set containing some elements from each parent set. Mutation of a set is carried out by removing one or more elements of the set with a certain probability, and adding elements to the set with a certain probability. The population is allowed to contain sets of varying size.

In the process of evaluating the interpredictability of a gene-set Q, a set of predictive patterns relating genes in Q with each other and with background data is produced. This set of predictive patterns is stored. Once the Level 1 evolution process is done with, the answer provided is not merely an collection of fit gene-sets Q_i; it is the sets of predictive patterns corresponding to these gene-sets. For, as we have seen, it is these predictive patterns that implicitly contain an inferred genetic dynamic.

But how is the interpredictability of a gene-set Q computed? Again we avail ourselves of evolutionary programming:

Level 2. Suppose we are given a gene-set Q, and are asked to evaluate its interpredictability. We then compute the Q-predictability for each gene in the set and average these values together.

The Q-predictability of a gene g in Q is defined by searching for predictive patterns that predict g based on: the other genes in Q, and the extra background-knowledge-embodying time-series appended to the data matrix.

And how is this search for patterns predicting g conducted? By specifying a lookback n, and executing out a genetic algorithm on the space of probabilistic logic functions involving the relevant rows of the data matrix, extending n columns in the past. This GA is similar to Koza-style GP (Koza, 1992), in that it is acting on trees representing logic formulas.

This is obviously not a highly efficient search approach, as it involves a GA with another GA inside the fitness function. However, in practice, it is much more effective than simply searching the space of all predictive patterns in the given data matrix.

An advanced Novamente implementation of these processes allows some interaction between the different levels of evolution, rather than considering them as completely distinct, with one nested inside the other. Chiefly, this is done by storing knowledge as to which predictive patterns have proved valuable within which gene sets. These predictive patterns can then be used as initial guesses within other, related gene sets – a strategy that can significantly accelerate the search process.

One way or another, however, it seems to be necessary to restrict the search for patterns to patterns among relatively small sets of genes. Otherwise the scope of the search is just too large to be practicable, even for yeast with its 6000+ genes, let alone for human-genome-inspired data.

This sort of technique is certainly not guaranteed to produce a maximum-quality genetic dynamic, in the general sense introduced above. In fact, it is extremely unlikely to do so, as it embodies some fairly severe restrictions on the space of predictive patterns. The correct perspective to take is that these algorithms are reasonably effective heuristics for finding reasonably high-quality predictive patterns. Given a pattern recognition problem of such dimensionality and complexity, this may be the best that can be expected.

5.2 The Role of Further Novamente Cognition

Generally speaking, Novamente cognition has two things to contribute to the regulatory network inference process, beyond just providing a framework within which the algorithm of the previous subsection may be compactly implemented:

1. Superior search speed through integration of cognitive techniques other than evolutionary programming (e.g. probabilistic inference)

2. Automated augmentation of the data matrix with background-info-derived rows

Point 2 will be explored in the sequel to this paper, and Point 1 in other further papers.

6. Simplified Logical and Tabular Patterns

The above mathematical framework is extremely effective if one’s goal is to infer patterns descriptive of gene expression data, for use inside other software of some sort. However, if one’s objective is to have software infer patterns that are useful to human biologists, in guiding them to new hypotheses to be explored in the lab, then the above methods are lacking, because their output is generally a set of large and hard-to-comprehend function trees.

In practice, therefore, it turns out to be valuable to restrict the power of the methodology somewhat, looking only at a smaller class of functions, which are particularly intuitively interpretable. We have been working with three classes of functions of this nature:

· Directional logical patterns

· Symbolic tabular patterns

· Directional symbolic tabular patterns

In this section we will describe these specialized pattern classes, and give examples of patterns from these classes that were inferred by our software from the standard yeast cell cycle data set (Cho et al, 1998). We will also describe a novel method of data exploration that we have created, called tabular querying, which exploits the power of tabular patterns in a very user-friendly way.

6.1 The Simplified Operator Set

To construct these simplified patterns, we consider a function space F(n,m,O) defined in terms of a simplified operator set, rather than the standard operator set introduced above.

The first step toward defining the simplified operator set, is a process we call “symbolization.” After performing a standard normalization on the data, we “symbolize” the numerical data by assigning labels to ranges of normalized expression values:

Normalized Gene Expression Value	Label
0-.5	VERY LOW
.5-1	LOW
1-1.5	MODERATELY LOW
1.5 – 2	MODERATELY HIGH
2-2.5	HIGH
2.5-3	VERY HIGH
3-	EXTREMELY HIGH

The procedure involved in creating the symbolization intervals is simply:

Throw out outliers on the high end, to arrive at an interval [0,a] in which (100-p)% of the values lie
Divide the interval [0,a] into 6 equally-sized subregions labeled as in the above table
Consider the interval [a,¥] to be labeled EXTREMELY HIGH

These symbols may be considered as probabilistic predicates derived from the above-defined basic predicates, e.g.

LOW( g, t) = ( (expression(g,t) > .5) AND (expression(g,t) < 1 ) )

In the following we will use MOD_LOW as an abbreviation for MODERATELY LOW, and introduce a couple other obvious similar abbreviations.

We also introduce three directional predicates, INCREASE, STABLE and DECREASE, defined as follows. Let U = {VERY_LOW, LOW, MOD_LOW, MOD_HIGH, HIGH, VERY_HIGH, EXTR_HIGH}, and let cat(g,t) denote the category in U to which the gene g’s expression value at time t belongs. Introduce the natural ordering on U. Then we say

INCREASE(g, t) = ( cat(g,t) > cat(g, t-1) )

DECREASE(g, t) = ( cat(g,t) < cat(g, t-1) )

STABLE(g, t) = ( cat(g,t) = cat(g, t-1) )

The simplified operator set consists of

· The symbolic and directional predicates introduced above

· The Boolean functions AND, OR and NOT

We have focused our practical efforts largely on inferring regulatory patterns based on these function trees involving the simplified operator set.

The key motivation underlying the choice to work with the simplified patterns set is that, at present, the primary use of our software is to provide patterns for human biologists to look at, so as to provide them guidance in the hypothesis process. Looking at function trees composed from the simplified operator set provides a restricted scope of pattern search, but has the advantage of producing patterns that are relatively easily comprehensible by human biologists. It seems that patterns involving these function trees, to a certain extent, match many biologists’ natural way of thinking about gene expression relationships. If the goal of the work were to provide the most explanatory patterns possible, for us in some other automated analysis process, then it would make more sense to use patterns based on function trees built using the simpler, less restrictive operator set mentioned above.

6.2 Example Logical Patterns

One class of pattern we have found particularly interesting are simultaneous logical patterns involving the simplified operator set. These rules sometimes have highly transparent biological meanings.

Two examples of such rules, inferred from the above tables for the {FKH2, MCM1, ACE2} gene set, are

INCREASE (ACE2) AND (NOT INCREASE(MCM1) ) à INCREASE(FKH2)

DECREASE(ACE2) AND (NOT DECREASE(MCM1) ) à DECREASE (FKH2)

In these rules, we have suppressed the time parameter t in the various directional predicates, because it is the same for all predicates in the rule (this being a simultaneous pattern).

Another example pair of simultaneous logical patterns, inferred using the simplified operator set for the {SIC1, PCL2, CLN3, ACE2, SWI5} gene set, is given by:

C = ( LOW(SIC1) OR MOD_LOW(SIC1) ) AND ( LOW(PCL2) ) AND (LOW(CLN3)) OR (MOD_LOW(CLN3) )

C AND EXTR_HIGH( SWI5) à

DECREASE(SWI5) AND INCREASE(ACE2)

C AND (MOD_HIGH(SWI5) OR HIGH(SWI5)) à

INCREASE(SWI5) AND INCREASE(ACE2)

This is a subtler pattern than the previous example. Here the proposition C gives a context, in which the dependence of the correlation between SWI5’s movements and ACE2’s movements upon the level of SWI5 can be observed. This is an example of the depth of relationship that the current approach is capable of detecting. The system is not only detecting a contextual relationship here, it is detecting a context in which a certain gene’s value can serve as context for a dynamic relationship between genes. This is exactly the kind of complex interrelationship that makes genetic dynamics so subtle. Of course, with the data currently available, it is only possible to infer a very small percentage of the relationships of this type that exist; as the quality and quantity of data improves, more and more such relationships will become inferable.

6.3 Symbolic Tabular Patterns

The logical patterns given above are basically completely general function trees produced over the simplified operator set. It can also be useful to restrict the structure of the function trees to produce particular forms of patterns. We have created two particular pattern types, which we call symbolic tabular patterns and directional symbolic tabular patterns. The merit of these patterns types is that they are particularly easy to visually inspect.

For instance, regarding the gene set {FKH2, MCM1, ACE2}, the following tabular patterns are found in the Cho et al dataset:

	T	t+1
FKH2	MOD LOW	LOW
MCM1	LOW	LOW
ACE2	MOD LOW	MOD LOW, DECREASE

	T	t+1
FKH2	MOD LOW	MOD HIGH
MCM1	LOW	LOW
ACE2	MOD LOW	MOD LOW

	t	t+1
FKH2	MOD LOW	MOD HIGH
MCM1	MOD LOW	LOW
ACE2	LOW	MOD LOW

	t	t+1
FKH2	MOD HIGH	MOD HIGH,
MCM1	LOW	MOD LOW
ACE2	MOD LOW	MOD LOW

	t	t+1
FKH2	VERY HIGH	LOW
MCM1	LOW	MOD LOW
ACE2	MOD LOW	LOW

The semantics of these patterns is almost transparent from their visual form. For instance, the final pattern in the list means

VERY_HIGH( FKH2, t) AND LOW(MCM1, t) AND MOD_LOW(ACE2,t) à

LOW(FKH2,t+1) AND MOD_LOW(MCM1, t+1) AND LOW(ACE2. t+1)

The superior visual inspectability of the tabular format is obvious even in the above example, which is relatively simple because it involves only three genes. It is even more striking in cases such as the following, involving a five-gene set:

	t	t+1
SIC1	LOW	MOD LOW
PCL2	LOW	LOW
CLN3	MOD LOW	LOW
ACE2	MOD LOW	HIGH
SWI5	EXTREMELY HIGH	VERY HIGH

This pattern highlights the opposite directionality of change of ACE2 and SWI5 that occurs in certain contexts. It is particularly noteworthy because these opposite-directional movements are quite large quantitatively.

On the other hand, there are other tabular patterns showing that these two genes sometimes change in unison:

	t	t+1
SIC1	MOD LOW	MOD LOW
PCL2	LOW	LOW
CLN3	LOW	LOW
ACE2	LOW	MOD LOW
SWI5	MOD HIGH	HIGH

	t	t+1
SIC1	MOD LOW	EXTR HIGH
PCL2	LOW	HIGH
CLN3	LOW	MOD LOW
ACE2	MOD LOW	MOD LOW, DECREASE
SWI5	HIGH	MOD HIGH

In the last pattern, we see that the genes ACE2 and SWI5 decrease together modestly, in the situation where SIC1 and PCL2 increase substantially. Yet this does not happen in cases where SIC1 and PCL2 do not increase substantially.

6.4 Directional Symbolic Tabular Patterns

A yet more simplified visual presentation is achieved by taking symbolic tabular patterns and replacing the items in the “t+1” columns with the symbols INCREASE, DECREASE or STABLE.

Applying this treatment to the first example given above, for example, one obtains:

	t	t+1
FKH2	MOD LOW	DECREASE
MCM1	LOW	STABLE
ACE2	MOD LOW	DECREASE

This sort of pattern is very easy to interpret. In words, this one simply says: “If at time point t, the level of FKH2 is moderately low, the level of MCM1 is low, and the level of ACE2 is moderately low, then at time t+1 it is likely that FKH2 will decrease, MCM1 will be stable, and ACE2 will decrease.”

Applying the same treatment to the other patterns given above for this gene set, we have:

	t	t+1
FKH2	MOD LOW	INCREASE
MCM1	LOW	STABLE
ACE2	MOD LOW	INCREASE

	t	t+1
FKH2	MOD LOW	INCREASE
MCM1	MOD LOW	DECREASE
ACE2	LOW	INCREASE

	t	t+1
FKH2	MOD HIGH	STABLE
MCM1	LOW	INCREASE
ACE2	MOD LOW	STABLE

	t	t+1
FKH2	VERY HIGH	DECREASE
MCM1	LOW	INCREASE
ACE2	MOD LOW	DECREASE

As compared to logical or symbolic tabular patterns, these patterns look a little more like the patterns found by traditional pairwise genetic network inference methods, which determine pairs of genes that tend to increase or decrease together. However, these directional symbolic tabular patterns go deeper than the results provided by the pairwise methods, giving information about k-tuples rather than only pairs of genes, and indicating the preconditions under which common increase and/or decrease occur.

All the patterns we have given so far involve individual gene expression levels. But, generally speaking, both tabular and logical patterns may involve, not only individual genes, but categories of genes. And gene categories may be derived by clustering, or else based on biological background knowledge. In Part II of this paper we will extensively consider patterns involving additional time series derived from biological background knowledge. A simple illustrative example of a directional tabular pattern of this sort is:

	t	t+1
FKH2	MOD LOW	INCREASE
MCM1	LOW	STABLE
involved in transcriptional regulation of CUP1	MOD LOW OR LOW	INCREASE

This pattern may be interpreted in basically the same way as a purely gene-specific directional tabular pattern, but the third row refers to the average activation of all genes that have the property “involved in transcriptional regulation of CUP1.” The knowledge of which genes possess this property is obtained by the system from the SGD database; what the system does is figures out which properties to use in connection with which gene sets, similarly to how it decides which genes to group together.

6.4 Tabular Pattern Queries

Finally, we have developed an interesting way to interactively explore the tabular patterns inferred from a dataset. The basic idea is that the user is presented with a table, and can fill in parts of the table, either with symbols (chosen from a menu) or with specific numbers (normalized expression values, chosen from a slider). The program then fills in the rest of the table automatically.

There are at least three useful options here: purely directional, directional/symbolic, or numerical. We will give examples of each type. In these examples, underlined values denote values specified by the user as parts of a query.

In the purely directional case, the user specifies directional values for some of the genes, e.g.

	t+1
FKH2	DECREASE
MCM1	??
ACE2	DECREASE

The program then fills in the blank, responding in this case with two patterns, each of which may be associated with a certain weight.

Time	t+1
FKH2	DECREASE
MCM1	INCREASE OR STABLE
ACE2	DECREASE

In the directional/symbolic case, the user fills in the preconditions (time t) and postconditions (time t+1), leaving one or more of one of the postconditions blank. It is permissible to leave all of the postconditions blank. An example query would be:

	t	t+1
FKH2	VERY HIGH	??
MCM1	LOW	??
ACE2	MOD LOW	DECREASE

The program then fills in a guess as to the value of the blank postconditions, e.g.

	t	t+1
FKH2	VERY HIGH	HIGH
MCM1	LOW	MOD LOW
ACE2	MOD LOW	DECREASE

In general, for this type of query, the postconditions may be specified either as values or as directions (INCREASE, DECREASE or STABLE).

A final alternative is to allow the user to specify quantitative values e.g.

	t	t+1
FKH2	3.1	4.0
MCM1	.6	??
ACE2	1.2	??

In this case, the preconditions and postconditions must both be quantitative. The postconditions can be either quantitative or directional. The program will fill in quantitative values for the blank postconditions.

These examples have all been in the case of two time points (t, t+1), but conceptually the same ideas can be used for k time points where k>2. This may be valuable in some cases, say for k=3 or 4.

7. Conclusion

The techniques described here go far beyond approaches pursued by other researchers, inferring an extremely general class of patterns from gene expression data. Indeed, we have found it necessary to artificially narrow the class of patterns inferred in order to arrive at more simply comprehensible results, more likely to be pragmatically valuable to biologists.

The results we are obtaining by applying our techniques to yeast data are extremely exciting. However, we must emphasize that, as with all results derived from microarray data at this stage, these results must be considered tentative. The data is too noisy, and the processes underlying it are too complex, for any conclusions derived from data analysis to be considered definite. The power of techniques such as ours, at the present time, lies in their ability to lead biologists to more complex and subtle hypotheses than were previously formulable with any degree of confidence. These hypotheses may then be further explored using more narrowly focused and reliable laboratory techniques. The ultimate validation of our algorithms’ output will come only when the software is used in collaboration with biologists doing laboratory research, who use the discovered patterns to guide them in the process of hypothesis development.

Finally, though our techniques as they currently exist are already extremely valuable, we are pursuing several avenues for ongoing improvement. In particular, the focus of our current research is on increasingly sophisticated methods for incorporating biological background information into the process, and on exploring ways to present more complex inferred logical patterns in a manner comprehensible to the average biologist.

Acknowledgements

The authors would like to thank Dr. Margeret Werner-Washburne, of the UNM Biology Department, for many useful discussions, which helped us to better understand the various biological issues that make the effective inference of genetic regulatory patterns so difficult.

Also, thanks are due to Stephan Vladimir Bugaj, with whom we had many discussions in the early stages of this work.

References

· Tatsuya Akutsu, Satoru Miyano, Satoru Kuhara (1999). Identification Of Genetic Networks From A Small Number Of Gene Expression Patterns Under The Boolean Network Model, http://citeseer.nj.nec.com/akutsu99identification.html

· Arkin, Shen and Ross (1997). A test case of correlation metric construction of a reaction pathway from measurements. Science 277

· Catherine A. Ball, Kara Dolinski, Selina S. Dwight, Midori A. Harris, Laurie Issel-Tarver, Andrew Kasarskis, Charles R. Scafe, Gavin Sherlock, Gail Binkley, Heng Jin, Mira Kaloper, Sidney D. Orr, Mark Schroeder, Shuai Weng, Yan Zhu, David Botstein and J. Michael Cherry(2000). Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Research. Online at http://nar.oupjournals.org/cgi/content/full/28/1/77

· Brazma, Avil and Jaak Vilo (2000). Gene Expression Data Analysis. FEBS Letters 480, http://www.elsevier.nl/febs/118/19/21/00023893.pdf

· Brown, Michael, William Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares Jr., David Haussler (1999). Support Vector Machine classification of microarray gene expression data. http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/genex.html

· Chen, Ting, Vladimir Filkov and Steven S. Skiena (1999). Identifying gene regulatory networks from experimental data. http://citeseer.nj.nec.com/chen99identifying.html

· Cho, R., et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell. 2:65-73, raw data online at http://cmgm.stanford.edu/pbrown/sporulation/additional/

· Crutchfield, J. P. (1989). Inferring the Dynamic, Quantifying Physical Complexity, in Measures of Complexity and Chaos, A. M. Albano, N. B. Abraham, P. E. Rapp, and A. Passamante, editors, Plenum Press, New York, p. 327, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

· Crutchfield, J.P. and K. Young (1990). , Computation at the Onset of Chaos, in Entropy, Complexity, and Physics of Information, W. Zurek, editor, SFI Studies in the Sciences of Complexity, VIII, Addison-Wesley, Reading, Massachusetts, 223-269, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

· De Risi, Iyer and Brown ( 1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278.

· Goertzel, Ben and Pennachin, Cassio (2002). Novamente: Design for an Artificial General Intelligence. In preparation.

· Goldberg, David (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.

· Hartemink, Alexander, David K. Gifford, Tommi S. Jaakkola, Richard A. Young (2001). Using Graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing. http://www.psrg.lcs.mit.edu/publications/Papers/psbabs.html

· Kirillova, Olga (2001). Cellular Automata Model for Gene Networks. http://www.csa.ru/Inst/gorb_dep/inbios/genet/Netmodel/bool/camod.htm

· Koza, John (1992). Genetic Programming. MIT Press.

· Koza, John, William Mydlowec, Guido Lanza, Jessen Yu, Martin A. Keane (2000). Reverse Engineering of Metabolic Pathways from observed data using genetic programming. http://smi-web.stanford.edu/pubs/SMI_Abstracts/SMI-2000-0851.html

· Liang, Fuhrman and Somogyi (1998). REVEAL: a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. on Biocomputing '98, 3:18-- 29

· Maki, Y. D. Tominaga, M. Okamoto, S. Watanabe, Y. Eguchi (2001). Development of a system for the inference of large-scale genetic networks. http://citeseer.nj.nec.com/maki01development.html

· Raychaudhuri S, Stuart JM, Altman RB. (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing 2000;:455-66

· Savageau, M.A. (1998). Rules for the evolution of gene circuitry. Pacific Symposium on Biocomputing, 3:54--65.

· Sherman, Fred (1998). An Introduction to the Genetics and Molecular Biology of the Yeast Saccharomyces cerevisiae, http://dbb.urmc.rochester.edu/labs/Sherman_f/yeast/Index.html

· E.P. van Someren, L.F.A. Wessels and M.J.T. Reinders (2000). Linear Modeling of Genetic Networks from Experimental Data, In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB00), p. 355-366, La Jolla, California, August 2000, see http://www.genlab.tudelft.nl/publications/

· Jaak Vilo, Alvis Brazma, Inge Jonassen, Alan Robinson, and Esko Ukkonen (2000).
Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data, ISMB-2000 August 2000, http://industry.ebi.ac.uk/~vilo/Publications/ISMB2000.pdf

· Von Dassow, Eli Meir, Edwin Munro, Garrett Odell (2000). The Segment Polarity Network is a Robust Developmental Module, Nature 406 pp.188-192, http://www.euchromatin.org/Biomath01.htm

· Wang, Pei (2001). Confidence as Higher-Order Uncertainty. Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications, pages 352--361, Ithaca, New York, June 2001

· Wen, X., S. Fuhrman, G.S. Michales, D.B. Carr, S. Smith, J. Barker and R. Somogyi. PNAS 95:334-339 (1998). "Large-Scale Temporal Gene Expression Mapping of Central Nervous System Development"

· Wessels, L.F.A., E.P. van Someren and M.J.T. Reinders (1999), Genetic network models: a comparative study, http://citeseer.nj.nec.com/vansomeren01genetic.html

· Yuh, Bolouri and Davidson (1997) . Genomic Cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279

Inferring Genetic Regulatory Patterns

Part I: Quantitative Pattern Recognition

May, 2002

3. Pragmatic Data Issues

4. Multiple-Time-Series Pattern Recognition

FKH2

FKH2

MOD LOW

FKH2

MOD LOW

FKH2

MOD HIGH

FKH2

VERY HIGH

SIC1

LOW

SIC1

MOD LOW

SIC1

MOD LOW

FKH2

FKH2

MOD LOW

FKH2

MOD LOW

FKH2

MOD HIGH

FKH2

VERY HIGH

FKH2

MOD LOW

FKH2

DECREASE

DECREASE

Time

FKH2

DECREASE

DECREASE

In the directional/symbolic case, the user fills in the preconditions (time t) and postconditions (time t+1), leaving one or more of one of the postconditions blank. It is permissible to leave all of the postconditions blank. An example query would be:

FKH2

VERY HIGH

??

LOW

MOD LOW

DECREASE

FKH2

VERY HIGH

HIGH

LOW

MOD LOW

DECREASE

FKH2

3.1

4.0

.6

1.2

??

Acknowledgements

References