ADOMETA - ADoption of Orphan METabolic Activities

Available organisms

Currently, predictions are available for three organisms: S. cerevisiae, E. coli and B. subtilis. We will add predictions for other organisms in the near future. To see predictions for the organisms above, select the organism of interest from the "Select an organism" drop down menu.

Getting predictions

Queries can be made in a number of ways. For instance, one can query by "EC #" (Enzyme Nomenclature number). EC numbers represent enzymatic activities and are strings of four numbers separated by periods, for example "2.6.1.19". For more information on EC numbers, please see the Enzyme Nomenclature website. Several worked examples are given in the Examples section below.
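As an illustration of the EC # format described above, the following Python sketch checks that a query string is a bare, complete EC number before it is typed into the query box. It is not part of ADOMETA: the function name is_complete_ec is hypothetical, and the choices to ignore surrounding whitespace and to reject partial EC numbers such as "2.6.1.-" are assumptions made for this example.

```python
import re

# Assumed format: four numeric fields separated by periods, e.g. "2.6.1.19".
# Each field may have more than one digit. Partial ECs like "2.6.1.-" are
# rejected here; that is an assumption of this sketch, not a documented rule.
EC_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def is_complete_ec(query: str) -> bool:
    """Return True if the query is a bare four-field EC number."""
    # Surrounding whitespace is ignored (an assumption of this sketch).
    return EC_PATTERN.fullmatch(query.strip()) is not None


print(is_complete_ec("2.6.1.19"))     # True
print(is_complete_ec("EC 2.6.1.19"))  # False: prefixes are not allowed
print(is_complete_ec("2.6.1"))        # False: only three fields
```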

Alternatively, one can query by reaction name. For example, select the query type "By Reaction Name" from the "Select Type of Query" drop down menu, enter a name such as "ketol-acid reductoisomerase" in the query box and click "submit". However, unlike standard EC numbers, reaction names are often heterogeneous and/or ambiguous, so querying by reaction name is not recommended.

Examples

Example 1. Suppose you want to see predictions for E. coli for the EC number 2.6.1.19 (4-aminobutyrate transaminase). Choose "By EC #" from the "Select a Type of Query" drop down menu, input the EC # without any prefix or suffix (that is, "2.6.1.19") in the query box and click "submit".

Example 2. Choose B. subtilis from the organism list --> choose "By EC #" as the type of query --> choose any candidate gene set --> input "5.1.3.13" in the query box --> click "submit".

Example 3. Choose S. cerevisiae from the organism list --> choose "By EC #" as the type of query --> choose a candidate gene set --> input "2.1.1.10" in the query box --> click "submit".

Candidate set

Only genes from a candidate set are tested for any given orphan activity of interest. For each organism, the top 20 predictions are available for each of the following candidate sets:

1) Genes with no known metabolic function. This set includes all genes of unknown function and all genes with no known metabolic function.
2) All genes (except neighbors). This set includes all genes of unknown function, all genes with no known metabolic function, and metabolic genes that are not direct or second-layer neighbors of the orphan activity of interest. Unlike set 1), this set allows the discovery of multi-functional enzymes.
3) Only hypothetical genes. This set contains only genes of unknown function.
4) Homologous genes with no metabolic function. Same idea as in 1), but only genes with some homology to metabolic genes known to carry out the reaction of interest in other organisms are considered, regardless of their ranking based on the combined algorithm. In practice, we selected genes that have at least modest sequence homology (E value cutoff 5e-2) to known enzymes.
5) All homologous genes (except neighbors). Same idea as in 2), but only genes with at least modest homology, as described in 4), are displayed, regardless of their ranking based on our algorithm.
6) Homologous hypothetical genes. Same idea as in 3), but only genes with at least modest homology, as described in 4), are displayed, regardless of their ranking based on our algorithm.
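To make the "except neighbors" rule of candidate sets 2) and 5) concrete, here is a minimal Python sketch. It is not the ADOMETA code. It assumes the metabolic network is stored as an adjacency dictionary between activities, that "second-layer neighbor" means an activity two steps away from the orphan, and that a metabolic gene is excluded if any of its activities is within those two steps. All activity and gene names are hypothetical toy data.

```python
from collections import deque


def neighbors_within(network, start, max_depth=2):
    """Return the activities reachable from `start` in 1..max_depth steps."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the second layer
        for nxt in network.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return {node for node, d in depth.items() if d > 0}


def all_genes_except_neighbors(genes, gene_activities, network, orphan):
    """Keep genes none of whose activities is a first- or second-layer neighbor."""
    excluded = neighbors_within(network, orphan, max_depth=2)
    return [g for g in genes if not (gene_activities.get(g, set()) & excluded)]


# Toy chain of activities: X (the orphan) - A - B - C
network = {"X": ["A"], "A": ["X", "B"], "B": ["A", "C"], "C": ["B"]}
# g1 has no known activity; g2, g3 and g4 are metabolic genes.
gene_activities = {"g2": {"A"}, "g3": {"B"}, "g4": {"C"}}

selected = all_genes_except_neighbors(
    ["g1", "g2", "g3", "g4"], gene_activities, network, "X"
)
print(selected)  # ['g1', 'g4']
```

In this toy case g2 and g3 are dropped because their activities are one and two steps from the orphan, while g4 (three steps away) remains a candidate, which is how a multi-functional enzyme could still be proposed.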

Determining the status of metabolic activities

For various reasons, major metabolic databases and metabolic models do not always agree on the status of an activity (orphan or assigned to genes) in a specific organism. On the prediction page, we list the status of metabolic activities in KEGG, Swissprot or well-established metabolic models (iJR904 for E. coli and iLL672 for S. cerevisiae) for the corresponding organisms. If the status of an activity is shown as "Assigned to genes" for a database or model, the activity has been assigned to genes in that source. In contrast, if the status is shown as "Orphan", the activity is a local orphan for the specific organism of interest (i.e., not assigned to genes in that organism) according to that database or model. A reaction is listed as "Global orphan" if no responsible sequence could be identified in any known organism, not only in the three organisms considered (as of Feb 2006).

We provide organism-specific lists of reactions to facilitate browsing. For B. subtilis, the list of reactions that have genes assigned is obtained from KEGG. For E. coli and S. cerevisiae, the lists are obtained from the well-curated metabolic models iJR904 and iLL672, respectively. Therefore, it is possible that a reaction with genes assigned in other sources appears as "orphan" in the lists, or vice versa.

It is even trickier to determine the list of orphan activities that exist in an organism. Ideally, the existence of an activity in a specific organism would be determined by biochemical experiments. In practice, however, it is nearly impossible to test exhaustively for the presence or absence of all activities in all organisms. We therefore obtained the KEGG reference pathways and, if some of the ECs in a pathway are assigned in an organism, assumed that all reactions in that pathway without known enzymes are orphans in that organism. This assumption is obviously simplistic and may sometimes lead to false positives. However, we observed that if an activity is in fact absent from an organism, its neighborhood is usually composed of a large percentage of gaps, suggesting that the whole branch may be missing in that organism. These gaps usually lead to poor predictions, as indicated by the p-value.
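The pathway-based assumption described above can be sketched in a few lines of Python. This is an illustration only and not the code used to build the ADOMETA lists: the function name infer_orphans, the dictionary layout and the placeholder labels ("pathway_1", "ec_a", ...) are all hypothetical.

```python
def infer_orphans(reference_pathways, assigned_ecs):
    """Infer local orphans for one organism from reference pathways.

    reference_pathways: dict mapping a pathway name to the ECs it contains.
    assigned_ecs: set of ECs that have genes assigned in the organism.
    """
    orphans = set()
    for ecs in reference_pathways.values():
        ecs = set(ecs)
        # If at least part of the pathway is assigned in the organism,
        # every unassigned EC in that pathway is assumed to be an orphan.
        if ecs & assigned_ecs:
            orphans |= ecs - assigned_ecs
    return orphans


# Placeholder data, not real KEGG content.
reference_pathways = {
    "pathway_1": ["ec_a", "ec_b", "ec_c"],
    "pathway_2": ["ec_d", "ec_e"],
}
assigned_ecs = {"ec_a", "ec_c"}

print(sorted(infer_orphans(reference_pathways, assigned_ecs)))  # ['ec_b']
```

Here "ec_b" is reported as an orphan because the rest of its pathway is assigned, while nothing in "pathway_2" is reported because no part of that pathway has evidence in the organism.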

Why are no predictions made in some cases?

Sometimes no predictions will be displayed for an orphan activity of interest. There are several possible reasons:

1. No evidence indicates that the activity exists in the organism of interest. As a result, we did not perform predictions for such activities. If you believe the activity exists, please send us an email at lifeng.chen@dbmi.columbia.edu or vitkup@dbmi.columbia.edu.
2. The orphan activity is surrounded only by other orphan activities, in which case no context-based information can be used. If, in addition, no gene in the candidate set has any homology to Swissprot enzymes assigned the EC of interest, then the gap looks the same (i.e., equally unsuitable) to every candidate gene. Because we use the average rank when there is a tie (e.g., if three genes tie at the top rank, their ranks are all (1+2+3)/3 = 2), no gene is then predicted as one of the top 20 genes: all genes are tied at a rank of 500-2000, depending on the size of the candidate set.
3. The position is very non-specific. As a result, more than 40 genes are tied at the top rank, and all of them are therefore assigned a rank greater than 20 by the average-rank method. This situation is somewhat rare.
4. The orphan activity is not connected to other nodes in the network. This happens when a) the orphan activity involves unique metabolites that no other reaction shares or b) the reaction involves only very common metabolites, for example, nad + nadph --> nadh + nadp. Because the most common metabolites were deleted before building the network, these nodes appear "disconnected" from the rest of the network. Again, no context-based methods can be used. If homology information is also unavailable, the gap becomes unpredictable.
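The effect of average-rank tie handling in reasons 2 and 3 can be checked with a short Python sketch. This is a minimal illustration, not the ADOMETA implementation. It assumes that a higher score is better and that a gene is displayed only if its rank is at most 20; the gene names and scores are hypothetical.

```python
def average_ranks(scores):
    """Rank genes by descending score; tied genes share the mean of their positions."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Find the end (exclusive) of the run of genes tied with ordered[i].
        j = i
        while j < len(ordered) and scores[ordered[j]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # mean of positions i+1 .. j
        for gene in ordered[i:j]:
            ranks[gene] = mean_rank
        i = j
    return ranks


# Three genes tied at the top: each gets (1+2+3)/3 = 2.
print(average_ranks({"g1": 0.9, "g2": 0.9, "g3": 0.9, "g4": 0.1}))
# {'g1': 2.0, 'g2': 2.0, 'g3': 2.0, 'g4': 4.0}

# Forty genes tied at the top: each gets (1+40)/2 = 20.5.
tied = average_ranks({f"gene{k}": 1.0 for k in range(40)})
top20 = [gene for gene, rank in tied.items() if rank <= 20]
print(tied["gene0"], top20)  # 20.5 []
```

With a tie of this size every gene falls just outside the assumed cutoff, so nothing is displayed even though all of the tied genes scored equally well.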

For similar reasons, our algorithm sometimes does not output exactly 20 top predictions: a user may see only five or six top predictions because the remaining candidates are too non-specific to be picked up by our algorithm. When no predictions are displayed, a user can choose a different candidate gene set and/or query another EC # to see if predictions are available.