Metabolomics Interpretation with GO and GO-CAM

IN_PROGRESS PIPELINE

Metabolomics Interpretation with GO and GO-CAM

Overview

This project explores how the Gene Ontology (GO) and GO Causal Activity Models
(GO-CAMs) can be used to interpret metabolomics data
— for example the
differential-abundance metabolite lists that come out of studies deposited in
MetaboLights (EBI) and the
Metabolomics Workbench (NIH). The
guiding question is deliberately practical:

Can we adopt or build something that demonstrates concrete value of GO /
GO-CAM for interpreting metabolomics data, beyond the KEGG/SMPDB pathway
enrichment that is standard today?

There are two complementary handles, matching the two things GO has that a flat
pathway database does not:

  1. Reaction-grounded enzyme activities. GO molecular-function terms are
    enzyme activities, and via Rhea (whose reaction participants are
    ChEBI compounds) plus the rhea2go mapping, a metabolite can be linked
    to the GO MF terms of the enzymes that produce/consume it, and onward to the
    GO biological processes those enzymes participate in. This is exactly the
    forward direction already characterised in the
    RHEA → GO project — here we run it from the metabolite side.
  2. The full curated causal network. GO-CAM models of metabolic processes
    already encode enzyme activities connected by their shared input/output
    ChEBI molecules
    and by regulation edges. A GO-CAM is a curated
    metabolite-reaction-gene network with causal direction — the missing
    ingredient in the reaction-only networks that drive tools like mummichog.

The core technical insight is that ChEBI is the bridge. Metabolomics
repositories curate metabolite identities to ChEBI; Rhea reactions are written
in ChEBI; GO-CAM activity inputs/outputs are ChEBI; and rhea2go / GOA carry
those reactions up to GO MF/BP and to gene products. So a chain already exists:

metabolite (ChEBI)
    reaction (Rhea, ChEBI participants)
       enzyme activity (GO MF, via rhea2go)
          gene product (GOA)  &  biological process (GO BP)
             causal pathway context (GO-CAM)

The deliverable is to demonstrate that walking this chain gives a metabolomics
analyst something the KEGG/SMPDB enrichment they already run does not:
ontology-structured (closure-aware) enrichment, causal direction and
regulation, and a shared vocabulary with gene-level omics for multi-omics
integration.

There is a catch that has to be handled first (and the pilot below confirms
it dominates): Rhea writes participants in their major protonation state at
pH 7.3
(citrate(3-), ATP(4-), succinyl-CoA(5-)), whereas metabolomics
repositories report the neutral species (citric acid, ATP). A direct
ChEBI-id match between the two therefore fails almost completely. The fix is to
expand each metabolite over ChEBI's is_protonated_form_of /
is_deprotonated_form_of relations into its protonation family before
matching.

Pilot result — the bridge works, but only after protonation normalization

A reproducible coverage probe (probe/,
RESULTS.md) ran the
metabolite → Rhea → rhea2go → GO bridge on 26 central-carbon / amino-acid /
nucleotide / cofactor metabolites, reported by neutral name as a repository
would. Computed live from OLS4 (ChEBI), the Rhea REST API and GO rhea2go:

Match strategy Metabolites reaching a Rhea reaction / GO MF
Exact ChEBI id (neutral, as reported) 0 / 26
Protonation-normalized (ChEBI conjugate family) 22 / 26

So the bridge is essentially empty without protonation normalization and
near-complete with it
— e.g. neutral ATP (CHEBI:15422) matches nothing, but
its family member ATP(4-) (CHEBI:30616) (the form Rhea uses) reaches 492 GO
molecular-function terms
; NAD+ → 447, acetyl-CoA → 385. This both
validates the core idea and pins down protonation normalization as a mandatory
preprocessing step for any metabolite→GO bridge.

The 4 residual misses (isocitric acid, L-lactic acid, D-glucose, D-glucose
6-phosphate) are themselves informative: they are a second, distinct
ID-mismatch class — stereochemistry/anomer and generic-vs-structurally-
specific
ChEBI mismatch — that protonation expansion does not fix, and which a
follow-up must address with tautomer/enantiomer traversal or InChIKey-skeleton
matching.

Confirmed on a real study (MetaboLights MTBLS1)

The same probe was run on the curator-assigned ChEBI metabolites of a real
deposited study
MTBLS1
(Salek et al., type-2-diabetes urine NMR), 64 distinct metabolites pulled live
from the study's MAF. Real curated data is heterogeneous (a mix of neutral,
charged and generic forms), so it is a sterner test than the idealized demo set:

Match strategy Metabolites reaching a Rhea reaction
Exact ChEBI id (as the curators recorded) 8 / 64
Protonation-normalized 49 / 64

A 6× uplift (41 metabolites recovered only by normalization) on real data.
The 8 exact matches are almost all the uncharged metabolites (acetone,
ethanol, adenosine, uridine, creatinine, allantoin) plus the one anion the
curator happened to enter charged (2-oxoglutarate(2-)) — i.e. exactly the
metabolites with no protonation ambiguity. Everything with an ionisable group
needed normalization.

The biological kicker is in the residual misses: they were dominated by the
branched-chain amino acids isoleucine, leucine, valinethe canonical
type-2-diabetes biomarkers — plus glutamate, glutamine, histidine,
phenylalanine and citrulline. These are lost not to protonation but because the
curators used the generic (non-stereospecific) ChEBI ids (e.g. isoleucine CHEBI:24898) while Rhea uses the L-stereospecific zwitterion
(L-isoleucine zwitterion).

Second tier — structure (skeleton) normalization recovers the amino acids

That residual class is fixed by a second normalization tier: expand each
metabolite over ChEBI's broader structural relations (tautomer/enantiomer and
the generic→specific children hop) bounded to the seed's InChIKey skeleton
(the connectivity block, ignoring charge and stereochemistry). This recovers the
generic↔stereospecific mismatches:

Tier MTBLS1 metabolites in a Rhea reaction
Exact ChEBI id 8 / 64
+ protonation family 49 / 64
+ structure (skeleton) 58 / 64

The skeleton tier recovers all the BCAAs (isoleucine → L-isoleucine zwitterion), glutamate, glutamine, histidine, phenylalanine and citrulline. It
is deliberately stereo/charge-blind (a generic, stereo-unspecified ChEBI —
common for NMR — legitimately denotes either enantiomer), so it is reported as a
separate, more permissive tier. The 6 final residuals are acyl/conjugate
derivatives Rhea only represents in bound form (e.g. N-hexanoylglycine,
methyl N-acetylvalinate, the N-methylpyridone-carboxamides).

Three enrichments on the same metabolites: KEGG vs GO-MF vs GO-BP

With 58/64 metabolites connected, the bridge supports closure-aware
hypergeometric enrichment (BH-FDR), run three ways on the same metabolite
set with the same test, so the contribution of GO is directly visible:

KEGG pathway (membership) GO molecular function (activity) GO biological process (enzyme layer)
Alanine, aspartate & glutamate metabolism amino-acid transaminase activity amino acid metabolic process (4.3×, FDR 9e-44)
Citrate cycle (TCA) L-/D-amino-acid oxidase activity dicarboxylic acid metabolic process (6.0×)
Pyruvate metabolism amino-acid racemase activity amino acid transport / transmembrane transport (5–7×)
Nicotinate & nicotinamide metabolism primary methylamine oxidase activity proteinogenic amino acid metabolic process (4.2×)
Glyoxylate & dicarboxylate metabolism dicarboxylic-acid transporter activity carboxylic acid metabolic process (2.4×)

All three recover the amino-acid / central-carbon biology expected for a T2D
urine study, validating the bridge. The value GO adds is the two right
columns: where KEGG gives broad pathway buckets, GO resolves the specific
molecular activities (transamination, amino-acid oxidation/racemisation,
methylamine oxidation — the methylamine hit tracks the urinary
dimethylamine/trimethylamine/sarcosine in the panel) and, through the enzyme
layer, the specific biological processes (amino-acid metabolism and
transport, dicarboxylic-acid metabolism), each aggregated by the ontology's
is_a/part_of closure. The BP lift is organism-specific (human) by
construction and currency metabolites are excluded by Rhea reaction-degree.

Why GO is not used for metabolomics today (the gap)

GO annotates gene products, not metabolites — so out of the box GO says
nothing about a list of m/z features or ChEBI IDs. Consequently the metabolomics
field uses metabolite-centric resources instead:

The gap this project targets: none of these use GO, so metabolomics
interpretation lives in a separate ontology universe from the gene-level GO
annotations this repository curates. Bridging them via ChEBI/Rhea/GO-CAM is the
opportunity.

See METABOLOMICS/TOOLS-LANDSCAPE.md for the
full survey of MAGI, MetaboAnalyst/mummichog, Metabolomics Workbench, MetaboLights
and the Wishart-group resources, with what is reusable vs what we would build.

Existing tools surveyed

Tool Group What it does Relevance / reuse
MAGI LBNL (Northen, Bowen, et al.) Scores metabolite↔gene associations through a biochemical reaction network, boosting consensus between compound IDs and genome enzyme annotations Closest in spirit: a metabolite-reaction-gene consensus score. We can attach GO semantics (rhea2go MF, GO BP) on top of MAGI-style associations
mummichog Li (Emory) MS1 peaks → all formula matches → genome-scale reaction network → local network enrichment; bypasses up-front metabolite ID Reuse the permutation / null-model framework; swap/augment the reaction network with GO-CAM curated causal networks
MetaboAnalyst Xia / Wishart Web platform: ORA, MSEA, pathway analysis (KEGG/SMPDB), "MS Peaks to Pathways" (mummichog), multi-omics Adopt the enrichment UX + statistics; add a GO-BP enrichment backend
SMPDB / HMDB Wishart Curated small-molecule pathways and the human metabolome DB with ChEBI/KEGG cross-refs Cross-reference target; HMDB↔ChEBI gives the ID bridge into Rhea/GO
Metabolomics Workbench NIH (UCSD) Repository + RefMet standardized nomenclature (~500k annotations) + 174k-entry Metabolite DB cross-linked to ChEBI/KEGG/LIPID MAPS RefMet → ChEBI harmonization is the data on-ramp for real study inputs
MetaboLights EBI Open metabolomics repository; curators assign metabolites ChEBI IDs during curation Primary ChEBI-grounded study source for a demonstration dataset

Proposed approaches (to demonstrate value)

Approach A — GO biological-process enrichment of a metabolite set (Rhea-grounded)

A GO-native counterpart to SMPDB/KEGG enrichment:

  1. Harmonize the study's significant metabolites to ChEBI (via RefMet /
    MetaboLights mappings).
  2. Normalize protonation state — expand each ChEBI compound into its
    protonation family (is_protonated_form_of / is_deprotonated_form_of) so it
    can match the charged form Rhea uses. The pilot shows this step is not
    optional: it moves coverage from 0/26 to 22/26.
  3. For each family, collect Rhea reactions in which any member is a
    participant.
  4. Map those reactions to GO MF enzyme-activity terms via rhea2go
    (already tooled in the RHEA project), and to genes via GOA.
  5. Lift to GO biological process (via the enzymes' BP annotations and/or
    GO-CAM context) and run closure-aware ORA over the GO BP graph.

The argument for value: GO's is_a/part_of structure gives principled
closure (no arbitrary pathway boundaries), the same ontology is used for
transcriptomics/proteomics (multi-omics on one vocabulary), and it is
cross-species by construction.

Approach B — Locate the metabolite in the curated GO-CAM causal network

Use GO-CAM as the "full metabolic network" the user asked about:

  1. A perturbed ChEBI metabolite is matched to GO-CAM activity inputs/outputs.
  2. Trace upstream producing and downstream consuming activities along
    the causal chain, including regulatory edges (which mummichog's
    reaction-only network lacks).
  3. Report the enzymes/genes and processes whose curated causal neighbourhood is
    enriched for perturbed metabolites — a causal, direction-aware analogue of
    mummichog's local-enrichment idea.

The Genetics 2023 study (human glucose-metabolism GO-CAMs → mouse, mapping
pathway perturbations to phenotypes) is direct precedent that GO-CAM metabolic
models support exactly this kind of perturbation→outcome reasoning.

Approach C — MAGI-style consensus, GO-annotated

Run a MAGI-style metabolite↔gene consensus over the reaction network, then
annotate the resulting associations with GO MF/BP so the output is
GO-interpretable and joinable to gene-level omics. This is the most direct
"adopt an existing tool and add GO" path.

What we can adopt vs build

Relationship to other projects in this repo

This sits at the intersection of several existing efforts and reuses their
tooling rather than duplicating it:

Generality across studies

The pipeline has been run on four MetaboLights studies spanning biofluid,
platform and disease (urine NMR T2D; serum LC-MS cardiovascular; urine LC-MS
physiology; serum LC-MS hepatocellular carcinoma) — see
CROSS-STUDY.md. Protonation+skeleton normalization
is decisive in every study (exact match captures only 8–39% of the normalized
coverage), and GO enrichment reads out each study's own chemistry: the urine
studies surface amino-acid/organic-acid metabolism, the serum LC-MS studies
surface lipid/fatty-acid metabolism (FDR to 3e-89). The clearest remaining gap is
complex lipids, which Rhea does not carry as discrete participants — the next
methodological target.

Limitations & honest caveats

The pipeline is a scoping demonstration, not a finished tool. The main caveats,
in one place:

Toward an interactive demo

The working probe/ pipeline is the engine for an
interactive demo where a user interprets their own metabolomics data with GO
(paste a metabolite list or a MetaboLights accession → coverage + three-way
enrichment). Real-time arbitrary input is too heavy for the static Pages site, so
the plan is a precomputed static showcase here plus a small FastAPI app in a
new repo
(metabolomics-go-demo) reusing this engine. Full design, repo
strategy, phasing, and the KEGG-licensing caveat are in
DEMO-PLAN.md.

Open questions

Follow-Up Targets

Target Status Rationale
ChEBI→Rhea→rhea2go coverage probe + protonation normalization DONE (probe) Bridge connects 22/26 only after protonation normalization (0/26 exact)
Run the probe over a real MetaboLights study DONE (MTBLS1) 64 curated metabolites: 8/64 exact → 49/64 normalized (6× uplift); fetch_metabolights.py pulls any accession
Add stereochemistry/anomer + generic-vs-specific resolution DONE (skeleton tier) InChIKey-skeleton normalization; recovers the MTBLS1 diabetes BCAAs (Ile/Leu/Val), 49→58/64
Closure-aware GO enrichment (ORA) + KEGG baseline DONE (MF level) GO MF enrichment vs KEGG baseline on MTBLS1, same test
Lift the enrichment from GO MF to GO BP DONE GO BP enrichment via Rhea→UniProt human enzymes→GOA BP; amino-acid metabolism/transport (FDR 9e-44)
Run the pipeline over more MetaboLights studies DONE (4 studies) Cross-study summary: MTBLS1/90/404/19 — normalization decisive everywhere; each study recovers its own biology (urine→amino/organic acid, serum→lipid)
Extend the bridge to complex lipids TODO (next method gap) Serum LC-MS residuals are lipids absent from Rhea as discrete participants (LIPID MAPS/SwissLipids → ChEBI)
Reactome as a curated BP cross-check TODO Reactome maps reactions→pathways→GO-BP with human curation (and is ChEBI-grounded); validate the enzyme-layer BP route against it
Metabolomics → Reactome black-box-event prioritization TODO (high-leverage) The bridge names the reactions a metabolite implicates; Reactome BBEs are reactions with unknown catalyst/transporter → feed the existing Reactome gap-filling collaboration
Approach B — GO-CAM causal trace TODO Locate a perturbed metabolite as a GO-CAM activity input/output and trace upstream/downstream (Reactome is the practical substrate)
Interactive demo (precomputed showcase → FastAPI app) PLANNED See DEMO-PLAN.md: static gallery here + app in a new repo reusing the probe engine
Inventory GO-CAM metabolic models + their ChEBI input/output compounds TODO Feasibility of Approach B (the "full network" path)
Glucose-metabolism GO-CAM perturbation worked example TODO Direct analogue of the Genetics 2023 precedent

References

Project Status