ADOMETA - ADoption of Orphan METabolic Activities

The Challenge

Accurate reconstruction of metabolic networks is essential to understanding the biochemical organization of sequenced organisms. Homology-based methods can now annotate a significant fraction of metabolic genes in a given organism. However, even the state-of-the art homology methods cannot annotate sequences with no or little sequence identity to known enzymes. This resulted in orphan metabolic activities - metabolic reactions without any responsible sequences. We can classify orphan metabolic activities to be locally or globally orphan (Osterman and Overbeek 2003, Karp 2004). Global orphan activities do not have a single representative sequence in any organism. In contrast, local orphan activities represent reactions for which we do not have a representative sequence in a given organism, although one or several sequences catalyzing the reaction may be known in other organisms. Because of various reasons, major metabolic databases and models do not always agree upon enzymatic functional annotations. Thus the situation occurs that sometimes one metabolic activity is deemed to be locally orphan in a specific organism based on one model while not the other. Nevertheless, a simple calculation reveals that the number of global orphan activities is substantial. Also a list of global orphan activities as of Feb 2006 can be found here.

In recent years, so-called genomic context based methods, such as phylogenetic profile, mRNA co-expression profile, gene clustering and protein fusion have shown promising points in recovering metabolic networks, particularly for sequences with little or no sequence similarity to any known sequences. We believe that the combination of the homology and context-based methods will enhance our ability of identifying genes for orphan activities trememdously. Towards this goal have been developing a bioinformatics resource named ADOMETA, ADOption of Orphan METabolic Activities, as part of the research projects in the Vitkup Laboratory at Columbia University. ADOMETA currently provides computationally generated predictions of genes for orphan activities in the model bacteria Escherichia coli and B. subtilis as well as the baker's yeast S. cerevisiae.

Metabolic Networks

Metabolic networks are represented by the metabolic gene connectivity graph. The nodes of the graph correspond to metabolic genes, and the edges correspond to the dependencies (connections) established by metabolic reactions (See the figure below). Metabolic genes X and Y will be connected if the corresponding proteins share a common metabolite among their reactants or products.

The construction of a metabolic network from a list of metabolic reactions and visualization of a “network gap” (orphan activity). Abbreviations: M: Metabolite; E: Enzyme. Metabolites serve as edges in the network and enzymes as nodes. The orphan activity or “metabolic gap” is designated by “?” and surrounded by known metabolic genes. We refer the "first layer neighbors" (yellow in the figure) of a target gene to describe the collection of genes with distance one to the target gene, "second layer neighbors" (grey in the figure) to describe genes with distance two and so on.

Context-based Approaches

Several context-based correlations were successfully introduced previously (Enright et al., 1999; Huynen and Bork, 1998; Marcotte et al., 1999; Pellegrini et al., 1999). Context-based correlations between genes establish a similarity in the selective pressures during evolution. This similarity allows one to infer a functional connection. The following graph shows the idea.

(a) The gene clustering method identifies genes closely located on a chromosome. (b) The Rosetta Stone method identifies gene fusion events between homologs of a gene pair in other organisms. (c) The phylogenetic profiles method identifies genes which tend to co-occur together in different genomes. (d) The method based on the similarity of genes co-expression across different conditions.Genes with similar functions tend to co-express.

Context-based Signals in a Metabolic Network

It is important to understand if context-based signals are strong enough to determine the exact network location of a hypothetical gene. The established metabolic network allows us to quantify the relationship between the gene distance in the network and the average genomic context-based information. In the figure (S. cerevisiae) below we show correlations of different context-based methods between a target gene and all other network genes separated from the target by distance one, two, three, and so on. For all types of the context-based methods, we see a clear trend of monotonical decreasing (or increasing) relationship between the context-based correlation and metabolic network distance (Chen et al).

The figure above shows different types of context-based associations versus the metabolic network distance for the S. cerevisiae metabolic network:

(a) Gene co-expression (DeRisi et al., 1997; Wu et al., 2002). Numerous studies demonstrated that genes with similar mRNA co-expression profiles usually have related functions. Co-expression distances were calculated as |1-Spearman’s rank correlation| between co-expression profiles for a given organism, for example, E. coli, B. subtilis or S. cerevisiae. E. coli co-expression profiles were obtained from the Stanford Microarray Database (Ball, 2005). S. cerevisiae co-expression profiles were obtained from the Rosetta “compendium” dataset (Hughes et al., 2000). B. subtilis co-expression information were transferred from E. coli orthologous genes, which was determined by using BLAST (Altschul et al., 1997) best bi-directional search.
(b) Gene fusion events (Rosetta Stone Methods) (Enright et al., 1999; Marcotte et al., 1999; Yanai et al., 2001). The Rosetta Stone method is based on fusion events between proteins with similar function. Such fusion event between proteins in consecutive metabolic reactions may lead to an increase in the catalysis rate (due to channeling of the first reaction products to the second reaction) and efficient co-regulation of the reactions. In order to detect protein fusion event for yeast proteins, we used BLAST to find homologs between all proteins from a given organism, say E. coli and all protein in the nr database. We considered two ORFs X and Y to have a Rosetta Stone protein if both X and Y align over 70% of their length to different regions of a third protein in another organism with an E-value smaller than 1e-3.
(c) Phylogenetic profiles. (Huynen and Bork 1998; Pellegrini et al., 1999). The phylogenetic profile method measures co-occurrence between a pair of genes across genomes. Phylogenetic profiles were constructed for all genes of the organisms of interest using automated BLAST searches against a large collection of genomes. We deliberately filtered out evolutionary similar genomes. To calculate phylogenetic profile correlations between genes we used an N-dimensional binary vector representing presence (1) or absence (0) of target gene homologs in query genomes. Alignments were considered significant if the E-value is less than 1e-3.
(d) Chromosomal clustering. (Dandekar et al., 1998; Lee and Sonnhammer 2003; Overbeek et al., 1999). Bacterial genes forming functional pathways are often organized into transcriptional units (operons). Chromosomal clustering of functionally related metabolic genes was also demonstrated in eukaryotes (Lee and Sonnhammer 2003). The degree to which orthologs of two genes are clustered on the chromosome was calculated based on the null hypothesis that genes are randomly distributed across the chromosomes. Instead of considering gene sizes and exact nucleotide positions, we concentrated on gene order statistics. Let P(dg(x, y)) be the probability of observing gene order distance dg(x, y) between a pair of genes x and y equal or less than a threshold (i.e., x and y are “clustered”) in a genome g under the null hypothesis. Assuming gene order distances are independent across different organisms, we calculated the product of P(dg(x, y)) across a set of genomes to represent the strength of the clustering. The clustering results were based on 105 bacterial and three eukaryotic genomes (S. cerevisiae, S. pombe and C. elegans). Orthology mapping was established using the KEGG SSDB best bi-directional hits database (Kanehisa et al., 2002).

Other types of genomic context-based associations, such as gene distance (measuring physical distance between genes on the chromosome of the genome of interest) are not plotted on the figure but it shows the same trend.

Additionally although no network structure exists in annotated pathways, it is oftentimes benificial to consider the context-based correlations between a candidate gene and known pathway genes. Genes that are highly correlated with known pathway genes are more likely to carry out a reaction (sometimes a gap) in that particular pathway. The pathway distance descriptor measure the shortest chromosomal distance of a candidate gene to any known genes in the pathway that the target activity is involoved.

General Strategy to Adopt an Orphan Activity using Context Data

The conceptual idea of the our approach is illustrated in the figure below. The fit of a particular gene for a network gap is determined by the strength of functional correlations with known genes close to the gap. The context-based methods are used to calculate the functional correlations. The strength of the fit is determined by fitness (cost) functions such as the one listed below:

where x is the candidate gene, |N| is the total of number genes contributing to the fitting, R is the maximal layer around the gap contributing to the fitness function, Ni is the set of genes in i-th layer (1st, 2nd, 3rd, etc.) around the gap, d(x,g) is a context-based correlation between genes x

and the gap g, wi is a weight determining the relative contribution of the different layers, p is a positive factor determining how the decrease in a context-based correlation should decrease the contribution to the fit. In the function, the first summation is for all the genes of a certain network layer around the gap, the second summation is for each contributing layer. For more information about cost functions, please see Kharchenko et al. and Chen et al.The performance of our fitting procedure can be quantified by the fraction of correct genes ranked - based on the fitness function - as the top 1 or within the top 10 genes out of a large collection of unrelated genes (for example more than 6,000 non-metabolic genes in S. cerevisiae). These ranks are shown separately in the prediction table.

Testing a set of candidate genes for the fit into a network gap (orphan metabolic activity). The genes from the candidate set are tested one by one and the fit is determined by a fitness function (see Chen et al.). The function values are higher for candidates with good context-based correlations to the genes neighboring the gap. Based on the fitness function values the genes are ranked (right column of the figure). We use the following self-consistent procedure to evaluate performance of the method: Each known gene is removed from the network – leaving a gap in its place - and the gene rank is calculated among all non-metabolic genes (>6,000 genes for yeast) for the created gap. This procedure is repeated for all known metabolic genes. The method performance is quantified by the fraction of the correct genes ranked as the top (corresponding to P value < 1.6*10-4) or within the top 10 (P value < 1.6*10-3) out of all non-metabolic genes.

Homology information

All candidate genes were aligned against the Swissprot Database (Bairoch and Apweiler, 2005) sequences to obtain homology scores using BLAST searches. Two types of scores were recorded:

(a) Best Sequence Idenity to Proteins Assigned the full EC number of interest (SEQ4) is calculated as the best sequence identity score to all known Swissprot enzymes (excluding enzymes from the investigated organism) assigned the EC number of interest. For example, if the EC number of interest is “”, the best sequence identity of a candidate gene to all Swissprot enzymes annotated as “” is recorded as SEQ4 (“4” means it is specific to ALL 4 digits of EC). Sequence identity scores obtained were considered valid if the corresponding alignment has an E-value of at least 2e-5. We use the very loose E-value in that we intend to use even the most remote homology. If, for example, no sequence homology was detected between a candidate gene and any Swissprot enzymes assigned the EC number of interest, "0" and "20" are assigned as the SEQ4 and E-value, respectively.
(b) Best Sequence Idenity to Proteins Assigned the partial EC number of interest (SEQ3) was calculated based on the same idea. The only difference was that instead of calculating the best sequence identity score to enzymes assigned the full EC number, all enzymes that have the identical first three digits (excluding enzymes assigned to the EC itself) to the EC number of interest were considered in the calculation. Using the example above, SEQ3 is calculated as the best sequence identity score to all Swissprot enzymes that have an EC number of 4.1.3 (excluding as the first three digits. The use of SEQ3 allows us to utilize homology to similar enzymes when the homology information to the exact EC is not available (e.g., in the case of global orphan activities). Again, If no sequence homology was detected between a candidate gene and all Swissprot enzymes assigned the first three digits of the EC number of interest, "0" and "20" are assigned as the SEQ3 and E-value, respectively.

Combining different associations and calculation of P-value

Currently, we use the MLjava implementaion of the AdaBoost algorithm and Alternating Decision Tree (ADT) developed by Dr. Freund et al (Ref 1, 2, 3) to combine genomic context-based associations and sequence homology score into a single accurate predictor. We plan to use several additional strategies to integrate different types of associations in the near future. The "Overall Rank" presented in the prediction table represents the overall strength of evidence that a candidate gene may be responsible for a particular orphan activity of interest.We assess the confidence in each predictions by calculating the tail probability (P-value) of each score. To do so, we need a reliable estimate of the distribution of scores under the null hypothesis, which is the assumption that the gene’s product does not catalyze the metabolic reaction we are interested in. For a given metabolic gap, the scores of virtually all candidate genes can be regarded as samples taken from the background distribution, as most genes do not encode the enzyme for that particular enzymatic activity. Hence, we can estimate the background distribution under the null hypothesis from the set of scores of all candidate genes for a particular metabolic gap. In practice, we found that this distribution does not resemble either a normal distribution or an extreme-value distribution, rendering a P-value estimation from such a fitted distribution inaccurate. Instead, having a large sample of background scores for each gap allows us to use the kernel density estimation method (Silverman, Density Estimation for Statistics and Data Analysis, 1986, Chapman and Hall, New York) (Scott, Multivariate Density Estimation; Theory, Practice, and Visualization, 1992, John Wiley and Sons, New York), which can estimate a distribution non-parametrically. For a Gaussian kernel, the estimated background distribution is

where xi are the n background scores, is the kernel bandwidth, and is the standard deviations of the background scores. To calculate the P-value for a given score x, we integrate the estimated distribution from x to infinity, yielding

In contrast to parametrically estimated distribution functions, the kernel estimate of the P-value takes the distance between the top score and runner-up scores directly into account. If the score x is much larger than all other scores xi, the error function approaches unity, and the P-value will be small. On the other hand, if the top score is attained by several genes, each of them will contribute a term 1/2n to the P-value.

E-value or the expected value takes into account the size of the candidate gene sets and is calculated simply as the P-value multiplied by the number of candidate genes in a particular gene set.