## 1. INTRODUCTION

Two important issues in computational biology are the extent to which it is possible to model transcriptional interactions by large networks of interacting elements and how these interactions can be effectively learned from measured expression data [1]. The reverse engineering of transcriptional regulatory networks (TRNs) from expression data alone is far from trivial because of the combinatorial nature of the problem and the poor information content of the data [1]. An additional problem is that by focusing only on transcript data, the inferred network should not be considered as a biochemical regulatory network but as a gene-to-gene network, where many physical connections between macromolecules might be hidden by shortcuts.

In spite of these evident limitations, the bioinformatics community has made important advances in this domain over the last few years. Examples are methods like Boolean networks, Bayesian networks, and association networks [2].

This paper will focus on information-theoretic approaches [3–6] which typically rely on the estimation of mutual information from expression data in order to measure the statistical dependence between variables (the terms “variable” and “feature” are used interchangeably in this paper). Such methods have recently held the attention of the bioinformatics community for the inference of very large networks [4–6].

The adoption of mutual information in probabilistic model design can be traced back to the Chow-Liu tree algorithm [3] and its extensions proposed in [7, 8]. Later, [9, 10] suggested improving network inference by using another information-theoretic quantity, namely the multi-information.

This paper introduces an original information-theoretic method, called MRNET, inspired by a recently proposed feature selection technique, the maximum relevance/minimum redundancy (MRMR) algorithm [11, 12]. This algorithm has been used with success in supervised classification problems to select a set of nonredundant genes which are explicative of the targeted phenotype [12, 13]. The MRMR selection strategy consists in selecting a set of variables that have a high mutual information with the target variable (maximum relevance) while being mutually maximally independent (minimum redundancy among the relevant variables). The advantage of this approach is that redundancy among selected variables is avoided and that the trade-off between relevance and redundancy is properly taken into account.

Our proposed MRNET strategy, preliminarily sketched in [14], consists of (i) formulating the network inference problem as a series of input/output supervised gene selection procedures, where one gene at a time plays the role of the target output, and (ii) adopting the MRMR principle to perform the gene selection for each supervised gene selection procedure.

The paper benchmarks MRNET against three state-of-the-art information-theoretic network inference methods, namely relevance networks (RELNET), CLR, and ARACNE. The comparison relies on thirty artificial microarray datasets synthesized by two public-domain generators. The extensive simulation setting allows us to study the effect of the number of samples, the number of genes, and the noise intensity on the inferred network accuracy. Also, the sensitivity of the performance to two alternative entropy estimators is assessed.

The outline of the paper is as follows. Section 2 reviews the state-of-the-art network inference techniques based on information theory. Section 3 introduces our original approach based on MRMR. The experimental framework and the results obtained on artificially generated datasets are presented in Sections 4 and 5, respectively. Section 6 concludes the paper.

## 2. INFORMATION-THEORETIC NETWORK INFERENCE: STATE OF THE ART

This section reviews some state-of-the-art methods for network inference which are based on information-theoretic notions.

These methods require at first the computation of the mutual information matrix (MIM), a square matrix whose (*i*, *j*) element

(1) MIM_{ij} = *I*(*X*_{i}; *X*_{j}) = ∑_{x_{i}} ∑_{x_{j}} *p*(*x*_{i}, *x*_{j}) log(*p*(*x*_{i}, *x*_{j})/(*p*(*x*_{i})*p*(*x*_{j})))

is the mutual information between *X*_{i} and *X*_{j}, where *X*_{i}, *i* = 1, …, *n*, is a discrete random variable denoting the expression level of the *i*th gene.
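As a concrete illustration, the plug-in (empirical) estimator of (1) can be sketched in a few lines of Python; the function names below are ours and are only a sketch, not the implementation used in the paper.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete sample vectors."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
        px[xi] = px.get(xi, 0) + 1
        py[yi] = py.get(yi, 0) + 1
    # sum over observed cells of p(x,y) log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * np.log(c * n / (px[xi] * py[yi]))
               for (xi, yi), c in joint.items())

def mim(data):
    """Mutual information matrix of an (N samples x n genes) discrete array."""
    n_genes = data.shape[1]
    m = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            m[i, j] = m[j, i] = mutual_information(data[:, i], data[:, j])
    return m
```

The double loop makes the *O*(*N* × *n*^{2}) cost of the MIM computation discussed in Section 4.2 explicit.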

### 2.1 Chow-Liu tree

The Chow and Liu approach consists in finding the maximum spanning tree (MST) of a complete graph, where the weights of the edges are the mutual information quantities between the connected nodes [3]. The construction of the MST with Kruskal's algorithm has an *O*(*n*^{2}log *n*) cost. The main drawbacks of this method are: (i) the spanning tree contains only *n* − 1 edges, which is typically too few for nonsparse target networks, and (ii) no parameter is provided to calibrate the size of the inferred network.
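Given a precomputed MIM, the Chow-Liu construction reduces to a standard maximum spanning tree. The following sketch (illustrative naming, not the authors' code) uses Kruskal's algorithm with a union-find forest:

```python
import numpy as np

def chow_liu_tree(mim):
    """Maximum spanning tree of the complete MI-weighted graph (Kruskal).

    mim: symmetric (n x n) mutual information matrix.
    Returns the n - 1 edges (i, j) of the tree.
    """
    n = mim.shape[0]
    edges = sorted(((mim[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)                 # heaviest edges first
    parent = list(range(n))                      # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                             # keep the edge if it adds no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Sorting the *n*(*n* − 1)/2 edges dominates the cost, giving the *O*(*n*^{2}log *n*) complexity mentioned above.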

### 2.2 Relevance network (RELNET)

The relevance network approach [4] has been introduced in gene clustering problems and successfully applied to infer relationships between RNA expression and chemotherapeutic susceptibility [15]. The approach consists in inferring a genetic network, where a pair of genes {*X*_{i}, *X*_{j}} is linked by an edge if the mutual information *I*(*X*_{i}; *X*_{j}) is larger than a given threshold *I*_{0}. The complexity of the method is *O*(*n*^{2}) since all pairwise interactions are considered.
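The RELNET rule is a simple elementwise thresholding of the MIM; a minimal sketch (the threshold argument name `i0` is ours):

```python
import numpy as np

def relnet(mim, i0):
    """Adjacency matrix of the relevance network: edge iff MI exceeds i0."""
    adj = (mim > i0).astype(int)
    np.fill_diagonal(adj, 0)    # no self-loops
    return adj
```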

Note that this method is prone to infer false
positives in the case of indirect interactions between genes. For example, if
gene *X*_{1} regulates both
gene *X*_{2} and gene *X*_{3}, a high mutual
information between the pairs {*X*_{1}, *X*_{2}}, {*X*_{1}, *X*_{3}}, and {*X*_{2}, *X*_{3}} would be
present. As a consequence, the algorithm would infer an edge between *X*_{2} and *X*_{3} although these
two genes interact only through gene *X*_{1}.

### 2.3 CLR algorithm

The CLR algorithm [6] is an extension of RELNET. This algorithm computes the mutual information (MI) for each pair of genes and derives a score related to the empirical distribution of these MI values. In particular, instead of considering the information *I*(*X*_{i}; *X*_{j}) between genes *X*_{i} and *X*_{j}, it takes into account the score *z*_{ij} = √(*z*_{i}^{2} + *z*_{j}^{2}), where

(2) *z*_{i} = max(0, (*I*(*X*_{i}; *X*_{j}) − *μ*_{i})/*σ*_{i})

and *μ*_{i} and *σ*_{i} are, respectively, the mean and the standard deviation of the empirical distribution of the mutual information values *I*(*X*_{i}, *X*_{k}), *k* = 1, …, *n*. The CLR algorithm was successfully applied to decipher the *E. coli* TRN [6]. Note that, like RELNET, CLR demands an *O*(*n*^{2}) cost to infer the network from a given MIM.
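A vectorized sketch of the CLR scoring; as an assumption of this sketch, the row-wise empirical statistics are taken over full MIM rows (implementations may exclude the diagonal):

```python
import numpy as np

def clr(mim):
    """CLR scores z_ij = sqrt(z_i^2 + z_j^2), with z_i the positive part of
    the MI of the pair standardized within gene i's empirical MI distribution."""
    mu = mim.mean(axis=1)
    sigma = mim.std(axis=1)
    zi = np.maximum(0.0, (mim - mu[:, None]) / sigma[:, None])  # z from gene i's side
    zj = zi.T                                                   # z from gene j's side
    return np.sqrt(zi ** 2 + zj ** 2)
```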

### 2.4 ARACNE

The algorithm for the reconstruction of accurate cellular networks (ARACNE) [5] is based on the data processing inequality [16]. This inequality states that if gene *X*_{1} interacts with gene *X*_{3} through gene *X*_{2}, then

(3) *I*(*X*_{1}; *X*_{3}) ≤ min(*I*(*X*_{1}; *X*_{2}), *I*(*X*_{2}; *X*_{3})).

The ARACNE procedure starts by assigning to each pair of nodes a weight equal to their mutual information. Then, as in RELNET, all edges for which *I*(*X*_{i}; *X*_{j}) < *I*_{0} are removed, where *I*_{0} is a given threshold. Eventually, the weakest edge of each triplet is interpreted as an indirect interaction and is removed if the difference between the two lowest weights is above a threshold *W*_{0}. Note that by increasing *I*_{0}, we decrease the number of inferred edges, while we obtain the opposite effect by increasing *W*_{0}.
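The two-stage procedure above can be sketched as follows (a simplified sketch with our own naming, not the reference implementation):

```python
import numpy as np
from itertools import combinations

def aracne(mim, i0=0.0, w0=0.0):
    """ARACNE sketch: threshold the MIM at i0, then remove the weakest edge
    of each fully connected triplet when the gap between the two lowest
    weights exceeds w0 (data processing inequality pruning)."""
    w = np.where(mim > i0, mim, 0.0)           # keep edges above the MI threshold
    n = w.shape[0]
    for i, j, k in combinations(range(n), 3):
        trio = [(w[i, j], i, j), (w[j, k], j, k), (w[i, k], i, k)]
        trio.sort()                            # ascending: trio[0] is the weakest
        if all(t[0] > 0 for t in trio) and trio[1][0] - trio[0][0] > w0:
            _, a, b = trio[0]
            w[a, b] = w[b, a] = 0.0            # drop the indirect interaction
    return w
```

The loop over all triplets makes the *O*(*n*^{3}) cost stated below explicit.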

If the network is a tree and only pairwise
interactions are present, the method guarantees the reconstruction of the
original network, once it is provided with the exact MIM. The ARACNE's
complexity for inferring the network is *O*(*n*^{3}) since the
algorithm considers all triplets of genes. In [5], the method has been able to
recover components of the TRN in mammalian cells and appeared to outperform Bayesian
networks and relevance networks on several inference tasks [5].

## 3. OUR PROPOSAL: MINIMUM REDUNDANCY NETWORKS (MRNET)

We propose to infer a network using the maximum relevance/minimum redundancy (MRMR) feature selection method. The idea consists in performing a series of supervised MRMR gene selection procedures, where each gene in turn plays the role of the target output.

The MRMR method has been introduced in [11, 12] together with a best-first search strategy for performing filter selection in supervised learning problems. Consider a supervised learning task, where the output is denoted by *Y* and *V* is the set of input variables. The method ranks the set *V* of inputs according to a score that is the difference between the mutual information with the output variable *Y* (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e., the most informative variables to the target *Y*) should be well ranked, whereas indirect interactions (i.e., the ones with redundant information with the direct ones) should be badly ranked by the method. The greedy search starts by selecting the variable *X*_{i} having the highest mutual information to the target *Y*. The second selected variable *X*_{j} will be the one with a high information *I*(*X*_{j}; *Y*) to the target and at the same time a low information *I*(*X*_{j}; *X*_{i}) to the previously selected variable. In the following steps, given a set *S* of selected variables, the criterion updates *S* by choosing the variable

(4) *X*_{j}^{MRMR} = arg max_{*X*_{j} ∈ *V*\*S*} (*u*_{j} − *r*_{j})

that maximizes the score

(5) *s*_{j} = *u*_{j} − *r*_{j},

where *u*_{j} is a relevance term and *r*_{j} is a redundancy term. More precisely,

(6) *u*_{j} = *I*(*X*_{j}; *Y*)

is the mutual information of *X*_{j} with the target variable *Y*, and

(7) *r*_{j} = (1/|*S*|) ∑_{*X*_{k} ∈ *S*} *I*(*X*_{j}; *X*_{k})

measures the average redundancy of *X*_{j} to each already selected variable *X*_{k} ∈ *S*. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. It has been shown in [12] that the MRMR criterion is an optimal "pairwise" approximation of the conditional mutual information *I*(*X*_{j}; *Y* | *S*) between any gene *X*_{j} and the target *Y*, given the set *S* of selected variables.
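The greedy search of (4)-(7) can be sketched as follows (plain Python, illustrative naming; `mim` is the mutual information matrix, with the target included as a row/column so that `mim[target][j]` plays the role of *I*(*X*_{j}; *Y*)):

```python
def mrmr_ranking(mim, target, n_features):
    """Greedy MRMR: rank predictors of gene `target` by relevance minus the
    average redundancy with the already selected variables (score (5))."""
    candidates = [i for i in range(len(mim)) if i != target]
    selected, scores = [], {}
    while candidates and len(selected) < n_features:
        best, best_score = None, None
        for j in candidates:
            u = mim[target][j]                               # relevance u_j
            r = (sum(mim[j][k] for k in selected) / len(selected)
                 if selected else 0.0)                       # redundancy r_j
            s = u - r
            if best_score is None or s > best_score:
                best, best_score = j, s
        selected.append(best)
        scores[best] = best_score
        candidates.remove(best)
    return selected, scores
```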

The MRNET approach consists in repeating this selection procedure for each target gene by setting *Y* = *X*_{i} and *V* = *X* \ {*X*_{i}}, *i* = 1, …, *n*, where *X* is the set of the expression levels of all genes. For each pair {*X*_{i}, *X*_{j}}, MRMR returns two (not necessarily equal) scores *s*_{i} and *s*_{j} according to (5). The score of the pair {*X*_{i}, *X*_{j}} is then computed by taking the maximum of *s*_{i} and *s*_{j}. A specific network can then be inferred by deleting all the edges whose score lies below a given threshold *I*_{0} (as in RELNET, CLR, and ARACNE). Thus, the algorithm infers an edge between *X*_{i} and *X*_{j} either when *X*_{i} is a well-ranked predictor of *X*_{j} (*s*_{i} > *I*_{0}) or when *X*_{j} is a well-ranked predictor of *X*_{i} (*s*_{j} > *I*_{0}).
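Putting the pieces together, MRNET can be sketched as the following self-contained function (our own simplified sketch: a full greedy MRMR pass per target, with the pairwise scores symmetrized by taking the maximum); thresholding the returned matrix at *I*_{0} yields the inferred network:

```python
import numpy as np

def mrnet(mim):
    """MRNET sketch: run a greedy MRMR selection with each gene as target
    and score each pair {i, j} by the maximum of the two MRMR scores."""
    n = mim.shape[0]
    net = np.full((n, n), -np.inf)            # pairwise scores, to be maximized
    for target in range(n):
        candidates = set(range(n)) - {target}
        selected = []
        while candidates:
            def score(j):                     # s_j = u_j - r_j, as in (5)
                red = np.mean([mim[j, k] for k in selected]) if selected else 0.0
                return mim[target, j] - red
            j = max(candidates, key=score)
            net[target, j] = net[j, target] = max(net[target, j], score(j))
            selected.append(j)
            candidates.remove(j)
    np.fill_diagonal(net, 0.0)
    return net
```

On a three-gene chain (strong 0-1 and 1-2 interactions, weak 0-2), the indirect pair 0-2 receives a negative score and is discarded by any positive threshold.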

An effective implementation of the *MRMR* best-first search is available in [17]. This implementation demands an *O*(*f* × *n*) complexity for
selecting *f* features using
a best-first search strategy. It follows that MRNET has an *O*(*f* × *n*^{2}) complexity
since the feature selection step is repeated for each of the *n* genes. In other words, the complexity ranges between *O*(*n*^{2}) and *O*(*n*^{3}) according to the value of *f*. Note that the lower the *f* value, the
lower the number of incoming edges per node to infer and consequently the lower
the resulting complexity.

Note that since mutual information is a symmetric measure, it is not possible to derive the direction of the edge from its weight. This limitation is common to all the methods presented so far. However, this information could be provided by edge orientation algorithms (e.g., IC) commonly used in Bayesian networks [7].

## 4. EXPERIMENTS

The experimental framework consists of four steps (see Figure 1): the artificial network and data generation, the computation of the mutual information matrix, the inference of the network, and the validation of the results. This section details each step of the approach.

### 4.1 Network and data generation

In order to assess the results returned by our
algorithm and compare it to other methods, we created a set of benchmarks on
the basis of artificially generated microarray datasets. In spite of the
evident limitations of using synthetic data, this makes possible a quantitative
assessment of the accuracy, thanks to the availability of the *true* network underlying the microarray dataset (see Figure 1).

We used two different generators of artificial gene
expression data: the data generator described in [18] (hereafter referred to as
the *sRogers* generator) and the *SynTReN* generator [19]. The two generators, whose
implementations are freely available on the World Wide Web, are sketched in the
following paragraphs.

### 4.2 Mutual information matrix estimation

In order to benchmark MRNET versus RELNET, CLR, and ARACNE, the same MIM is used for the four inference approaches. Several estimators of mutual information have been proposed in the literature [5, 6, 20, 21]. Here, we test the Miller-Madow entropy estimator [20] and a parametric Gaussian density estimator. Since the Miller-Madow method requires quantized values, we pretreated the data with the equal-sized intervals algorithm [22], where the number of bins is *l* = √*N*. The parametric Gaussian estimator is directly computed by *I*(*X*_{i}; *X*_{j}) = (1/2)log(*σ*_{ii}*σ*_{jj}/|*C*|), where |*C*| is the determinant of the covariance matrix. Note that the complexity of both estimators is *O*(*N*), where *N* is the number of samples. This means that, since the whole MIM cost is *O*(*N* × *n*^{2}), the MIM computation could be the bottleneck of the whole network inference procedure for a large number of samples (*N* ≫ *n*). We deem, however, that at the current state of the technology, this should not be considered as a major issue since the number of samples is typically much smaller than the number of measured features.

### 4.3 Validation

A network inference problem can be seen as a binary decision problem, where the inference algorithm plays the role of a classifier: for each pair of nodes, the algorithm either adds an edge or does not. Each pair of nodes is thus assigned a positive label (an edge) or a negative one (no edge).

A positive label (an edge) predicted by the algorithm is considered as a true positive (TP) or as a false positive (FP) depending on the presence or absence of the corresponding edge in the underlying true network, respectively. Analogously, a negative label is considered as a true negative (TN) or a false negative (FN) depending on whether the corresponding edge is absent or present in the underlying true network, respectively.

The decision made by the algorithm can be summarized by a confusion matrix (see Table 2).

Table 2

| Edge | Actual positive | Actual negative |
|---|---|---|
| Inferred positive | TP | FP |
| Inferred negative | FN | TN |

It is generally recommended [23] to use receiver operating characteristic (ROC) curves when evaluating binary decision problems in order to avoid effects related to the chosen threshold. However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution, as typically encountered in TRN inference because of sparseness.

To tackle this problem, precision-recall (PR) curves have been cited as an alternative to ROC curves [24]. Let the precision quantity

(8) *p* = TP/(TP + FP)

measure the fraction of real edges among the ones classified as positive, and let the recall quantity

(9) *r* = TP/(TP + FN),

also known as the true positive rate, denote the fraction of real edges that are correctly inferred. These quantities depend on the threshold chosen to return a binary decision. The PR curve is a diagram which plots the precision (*p*) versus the recall (*r*) for different values of the threshold on a two-dimensional coordinate system.

Note that a compact representation of the PR diagram is returned by the maximum of the *F*-score quantity

(10) *F* = 2*pr*/(*r* + *p*),

which is the harmonic mean of precision and recall. The following section will present the results by means of PR curves and *F*-scores.
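Equations (8)-(10) translate directly into code; a minimal sketch (argument names are ours):

```python
def pr_f(tp, fp, fn):
    """Precision, recall, and F-score from confusion counts, as in (8)-(10)."""
    p = tp / (tp + fp)          # fraction of inferred edges that are real
    r = tp / (tp + fn)          # fraction of real edges that are recovered
    f = 2 * p * r / (p + r)     # harmonic mean of p and r
    return p, r, f
```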

Also, in order to assess the significance of the results, a McNemar test can be performed. The McNemar test [25] states that if two algorithms *A* and *B* have the same error rate, then

(11) *P*((|*N*_{AB} − *N*_{BA}| − 1)^{2}/(*N*_{AB} + *N*_{BA}) > 3.841459) < 0.05,

where *N*_{AB} is the number of incorrect edges of the network inferred by algorithm *A* that are correct in the network inferred by algorithm *B*, and *N*_{BA} is the counterpart.
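The test statistic in (11) is simple to evaluate; a sketch (3.841459 is the 95% quantile of the chi-square distribution with one degree of freedom):

```python
def mcnemar_significant(n_ab, n_ba):
    """McNemar test with continuity correction, as in (11): True when the
    difference between algorithms A and B is significant at the 5% level."""
    stat = (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)
    return stat > 3.841459
```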
## 5. RESULTS AND DISCUSSION

A thorough comparison would require the display of the PR curves (Figure 2) for each dataset. For reasons of space, we decided to summarize the PR-curve information by the maximum *F*-score in Table 3. Note that for each dataset, the accuracy of the best methods (i.e., those whose score is not significantly lower than the highest one according to the McNemar test) is typed in boldface.

Table 3

| | RELNET | CLR | ARACNE | MRNET | RELNET | CLR | ARACNE | MRNET |
|---|---|---|---|---|---|---|---|---|
| SN1 | 0.22 | 0.24 | 0.24 | 0.26 | | | | |
| SN2 | 0.26 | 0.21 | 0.25 | 0.25 | | | | |
| SN3 | 0.23 | 0.24 | 0.21 | 0.25 | 0.26 | | | |
| SN4 | 0.22 | 0.24 | 0.21 | 0.25 | 0.26 | | | |
| SN5 | 0.23 | 0.25 | 0.24 | | | | | |
| SS1 | 0.21 | 0.22 | 0.22 | 0.19 | 0.23 | | | |
| SS2 | 0.24 | 0.28 | 0.24 | 0.25 | | | | |
| SS3 | 0.21 | 0.24 | 0.27 | 0.24 | 0.25 | | | |
| SS4 | 0.24 | 0.24 | 0.26 | | | | | |
| SS5 | 0.28 | 0.24 | 0.26 | | | | | |
| SV1 | 0.36 | 0.3 | 0.4 | 0.38 | | | | |
| SV2 | 0.25 | 0.28 | 0.25 | 0.35 | 0.32 | | | |
| SV3 | 0.21 | 0.24 | 0.28 | 0.21 | 0.28 | 0.27 | | |
| SV4 | 0.22 | 0.24 | 0.24 | 0.26 | | | | |
| SV5 | 0.23 | 0.22 | 0.24 | 0.26 | | | | |
| S-AVG | 0.23 | 0.25 | 0.28 | 0.28 | 0.21 | 0.26 | 0.30 | 0.27 |
| RN1 | 0.59 | 0.61 | 0.89 | 0.87 | 0.92 | | | |
| RN2 | 0.89 | 0.87 | | | | | | |
| RN3 | 0.5 | 0.5 | 0.52 | 0.89 | 0.87 | | | |
| RN4 | 0.89 | 0.87 | | | | | | |
| RN5 | 0.41 | 0.4 | 0.88 | 0.86 | | | | |
| RS1 | 0.1 | 0.09 | 0.1 | 0.18 | | | | |
| RS2 | 0.32 | 0.31 | 0.31 | 0.45 | 0.44 | | | |
| RS3 | 0.32 | 0.36 | 0.58 | 0.56 | | | | |
| RS4 | 0.47 | 0.47 | 0.5 | 0.75 | 0.75 | 0.79 | | |
| RS5 | 0.58 | 0.6 | 0.64 | 0.9 | 0.86 | | | |
| RV1 | 0.46 | | | | | | | |
| RV2 | 0.49 | 0.49 | | | | | | |
| RV3 | 0.45 | 0.69 | 0.69 | | | | | |
| RV4 | 0.47 | 0.48 | 0.48 | 0.69 | 0.7 | 0.72 | | |
| RV5 | 0.47 | 0.48 | 0.7 | 0.68 | 0.73 | | | |
| R-AVG | 0.45 | 0.48 | 0.44 | 0.46 | 0.72 | 0.71 | 0.74 | 0.74 |
| Tot-AVG | 0.34 | 0.36 | 0.36 | 0.37 | 0.47 | 0.49 | 0.52 | 0.51 |

We may summarize the results as follows.

### 5.1 Feature selection techniques in network inference

As shown experimentally in the previous section, MRNET is competitive with the state-of-the-art techniques. Furthermore, MRNET benefits from some additional properties which are common to all the feature selection strategies for network inference [26, 27], as follows.

Feature selection algorithms can often deal with thousands of variables in a reasonable amount of time. This makes inference scalable to large networks.

Feature selection algorithms may be easily made parallel, since each of the *n* selection tasks is independent.

Feature selection algorithms may be made faster by a priori knowledge. For example, knowing the list of regulator genes of an organism improves the selection speed and the inference quality by limiting the search space of the feature selection step to this small list of genes. The knowledge of existing edges can also improve the inference. For example, in a sequential selection process, as in the forward selection used with MRMR, the next variable is selected given the already selected features. As a result, the performance of the selection can be strongly improved by conditioning on known relationships.

However, there is a disadvantage in using a feature selection technique for network inference. The objective of feature selection is to select, among a set of input variables, the ones that will lead to the best predictive model. It has been proved in [28] that the minimum set that achieves optimal classification accuracy under certain general conditions is the Markov blanket of the target variable. The Markov blanket of a target variable is composed of the variable's parents, the variable's children, and the variable's children's parents [7]. The latter are indirect relationships. In other words, these variables have a conditional mutual information to the target variable *Y* higher than their mutual information. Let us consider the following example. Let *Y* and *X*_{i} be independent random variables, and let *X*_{j} = *X*_{i} + *Y* (see Figure 7). Since the variables are independent, *I*(*X*_{i}; *Y*) = 0, and the conditional mutual information is higher than the mutual information, that is, *I*(*X*_{i}; *Y* | *X*_{j}) > 0. It follows that *X*_{i} carries some information about *Y* given *X*_{j} but no information about *Y* taken alone. This behavior is colloquially referred to as the *explaining-away effect* in the Bayesian network literature [7]. Selecting variables, like *X*_{i}, that take part in indirect interactions reduces the accuracy of the network inference task. However, since MRMR relies only on pairwise interactions, it does not take into account the gain in information due to conditioning. In our example, the MRMR algorithm, after having selected *X*_{j}, computes the score *s*_{i} = *I*(*X*_{i}; *Y*) − *I*(*X*_{i}; *X*_{j}), where *I*(*X*_{i}; *Y*) = 0 and *I*(*X*_{i}; *X*_{j}) > 0. This score is negative, so *X*_{i} is likely to be badly ranked. As a result, the MRMR feature selection criterion is less exposed to this drawback of most feature selection techniques while sharing their interesting properties. Further experiments will focus on this aspect.
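The explaining-away example can be checked numerically. The short sketch below (our own helper, not from the paper) evaluates *I*(*X*_{i}; *Y*) and *I*(*X*_{i}; *Y* | *X*_{j}) for two independent fair coins with *X*_{j} = *X*_{i} + *Y*:

```python
import numpy as np

def mi(p_xy):
    """Mutual information (nats) from a joint probability table p_xy."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return (p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])).sum()

# X_i and Y are independent fair coins; X_j = X_i + Y takes values 0, 1, 2.
# The joint of (X_i, Y) is uniform over the four combinations, so I(X_i; Y) = 0.
p_xi_y = np.full((2, 2), 0.25)
print(mi(p_xi_y))

# Conditioning on X_j: when X_j = 1 (probability 1/2), knowing X_i determines Y,
# so that slice alone contributes 0.5 * log(2) > 0; the slices X_j = 0 and
# X_j = 2 are deterministic and contribute nothing.
p_given_xj1 = np.array([[0.0, 0.5], [0.5, 0.0]])
i_cond = 0.5 * mi(p_given_xj1)
print(i_cond)
```

The result illustrates *I*(*X*_{i}; *Y*) = 0 while *I*(*X*_{i}; *Y* | *X*_{j}) = (1/2)log 2 > 0.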
## 6. CONCLUSION AND FUTURE WORK

A new network inference method, MRNET, has been proposed. This method relies on an effective method of information-theoretic feature selection called MRMR. Similarly to other network inference methods, MRNET relies on pairwise interactions between genes, making possible the inference of large networks (up to several thousands of genes).

Another advantage of MRNET, which could be exploited in future work, is its ability to benefit explicitly from a priori knowledge.

MRNET was compared experimentally to three state-of-the-art information-theoretic network inference methods, namely RELNET, CLR, and ARACNE, on thirty inference tasks. The microarray datasets were generated artificially with two different generators in order to effectively assess the methods' inference power. Also, two different mutual information estimation methods were used. The experimental results showed that MRNET is competitive with the benchmarked information-theoretic methods.

Future work will focus on three main axes: (i) the assessment of additional mutual information estimators, (ii) the validation of the techniques on the basis of real microarray data, (iii) a theoretical analysis of which conditions should be met for MRNET to reconstruct the true network.