Metabolomics Interpretation with GO and GO-CAM

Overview

This project explores how the Gene Ontology (GO) and GO Causal Activity Models
(GO-CAMs) can be used to interpret metabolomics data — for example the
differential-abundance metabolite lists that come out of studies deposited in
MetaboLights (EBI) and the
Metabolomics Workbench (NIH). The
guiding question is deliberately practical:

Can we adopt or build something that demonstrates concrete value of GO /
GO-CAM for interpreting metabolomics data, beyond the KEGG/SMPDB pathway
enrichment that is standard today?

There are two complementary handles, matching the two things GO has that a flat
pathway database does not:

Reaction-grounded enzyme activities. GO molecular-function terms are
enzyme activities, and via Rhea (whose reaction participants are
ChEBI compounds) plus the rhea2go mapping, a metabolite can be linked
to the GO MF terms of the enzymes that produce/consume it, and onward to the
GO biological processes those enzymes participate in. This is exactly the
forward direction already characterised in the
RHEA → GO project — here we run it from the metabolite side.
The full curated causal network. GO-CAM models of metabolic processes
already encode enzyme activities connected by their shared input/output
ChEBI molecules and by regulation edges. A GO-CAM is a curated
metabolite-reaction-gene network with causal direction — the missing
ingredient in the reaction-only networks that drive tools like mummichog.

The core technical insight is that ChEBI is the bridge. Metabolomics
repositories curate metabolite identities to ChEBI; Rhea reactions are written
in ChEBI; GO-CAM activity inputs/outputs are ChEBI; and rhea2go / GOA carry
those reactions up to GO MF/BP and to gene products. So a chain already exists:

metabolite (ChEBI)
   → reaction (Rhea, ChEBI participants)
      → enzyme activity (GO MF, via rhea2go)
         → gene product (GOA)  &  biological process (GO BP)
            → causal pathway context (GO-CAM)

The deliverable is to demonstrate that walking this chain gives a metabolomics
analyst something the KEGG/SMPDB enrichment they already run does not:
ontology-structured (closure-aware) enrichment, causal direction and
regulation, and a shared vocabulary with gene-level omics for multi-omics
integration.

There is a catch that has to be handled first (and the pilot below confirms
it dominates): Rhea writes participants in their major protonation state at
pH 7.3 (citrate(3-), ATP(4-), succinyl-CoA(5-)), whereas metabolomics
repositories report the neutral species (citric acid, ATP). A direct
ChEBI-id match between the two therefore fails almost completely. The fix is to
expand each metabolite over ChEBI's is_protonated_form_of /
is_deprotonated_form_of relations into its protonation family before
matching.

Pilot result — the bridge works, but only after protonation normalization

A reproducible coverage probe (probe/,
RESULTS.md) ran the
metabolite → Rhea → rhea2go → GO bridge on 26 central-carbon / amino-acid /
nucleotide / cofactor metabolites, reported by neutral name as a repository
would. Computed live from OLS4 (ChEBI), the Rhea REST API and GO rhea2go:

Match strategy	Metabolites reaching a Rhea reaction / GO MF
Exact ChEBI id (neutral, as reported)	0 / 26
Protonation-normalized (ChEBI conjugate family)	22 / 26

So the bridge is essentially empty without protonation normalization and
near-complete with it — e.g. neutral ATP (CHEBI:15422) matches nothing, but
its family member ATP(4-) (CHEBI:30616) (the form Rhea uses) reaches 492 GO
molecular-function terms; NAD+ → 447, acetyl-CoA → 385. This both
validates the core idea and pins down protonation normalization as a mandatory
preprocessing step for any metabolite→GO bridge.

The 4 residual misses (isocitric acid, L-lactic acid, D-glucose, D-glucose
6-phosphate) are themselves informative: they are a second, distinct
ID-mismatch class — stereochemistry/anomer and generic-vs-structurally-
specific ChEBI mismatch — that protonation expansion does not fix, and which a
follow-up must address with tautomer/enantiomer traversal or InChIKey-skeleton
matching.

Confirmed on a real study (MetaboLights MTBLS1)

The same probe was run on the curator-assigned ChEBI metabolites of a real
deposited study — MTBLS1
(Salek et al., type-2-diabetes urine NMR), 64 distinct metabolites pulled live
from the study's MAF. Real curated data is heterogeneous (a mix of neutral,
charged and generic forms), so it is a sterner test than the idealized demo set:

Match strategy	Metabolites reaching a Rhea reaction
Exact ChEBI id (as the curators recorded)	8 / 64
Protonation-normalized	49 / 64

A 6× uplift (41 metabolites recovered only by normalization) on real data.
The 8 exact matches are almost all the uncharged metabolites (acetone,
ethanol, adenosine, uridine, creatinine, allantoin) plus the one anion the
curator happened to enter charged (2-oxoglutarate(2-)) — i.e. exactly the
metabolites with no protonation ambiguity. Everything with an ionisable group
needed normalization.

The biological kicker is in the residual misses: they were dominated by the
branched-chain amino acids isoleucine, leucine, valine — the canonical
type-2-diabetes biomarkers — plus glutamate, glutamine, histidine,
phenylalanine and citrulline. These are lost not to protonation but because the
curators used the generic (non-stereospecific) ChEBI ids (e.g. isoleucine CHEBI:24898) while Rhea uses the L-stereospecific zwitterion
(L-isoleucine zwitterion).

Second tier — structure (skeleton) normalization recovers the amino acids

That residual class is fixed by a second normalization tier: expand each
metabolite over ChEBI's broader structural relations (tautomer/enantiomer and
the generic→specific children hop) bounded to the seed's InChIKey skeleton
(the connectivity block, ignoring charge and stereochemistry). This recovers the
generic↔stereospecific mismatches:

Tier	MTBLS1 metabolites in a Rhea reaction
Exact ChEBI id	8 / 64
+ protonation family	49 / 64
+ structure (skeleton)	58 / 64

The skeleton tier recovers all the BCAAs (isoleucine → L-isoleucine zwitterion), glutamate, glutamine, histidine, phenylalanine and citrulline. It
is deliberately stereo/charge-blind (a generic, stereo-unspecified ChEBI —
common for NMR — legitimately denotes either enantiomer), so it is reported as a
separate, more permissive tier. The 6 final residuals are acyl/conjugate
derivatives Rhea only represents in bound form (e.g. N-hexanoylglycine,
methyl N-acetylvalinate, the N-methylpyridone-carboxamides).

Three enrichments on the same metabolites: KEGG vs GO-MF vs GO-BP

With 58/64 metabolites connected, the bridge supports closure-aware
hypergeometric enrichment (BH-FDR), run three ways on the same metabolite
set with the same test, so the contribution of GO is directly visible:

KEGG-pathway ORA — the incumbent baseline
(kegg_baseline.py →
result): ChEBI→KEGG
compound→pathway, vs all KEGG pathway compounds.
GO molecular-function ORA — metabolite→Rhea→rhea2go→GO MF + closure
(go_enrichment.py →
result), vs the Rhea/GO
metabolic universe (7,174 metabolites).
GO biological-process ORA — lifted through the enzyme/gene layer:
metabolite→Rhea reaction→human Swiss-Prot enzyme (UniProt)→GO BP + closure
(go_bp_enrichment.py →
result), a gene-set
ORA of the 488 implicated enzymes vs all 4,052 human metabolic enzymes.

KEGG pathway (membership)	GO molecular function (activity)	GO biological process (enzyme layer)
Alanine, aspartate & glutamate metabolism	amino-acid transaminase activity	amino acid metabolic process (4.3×, FDR 9e-44)
Citrate cycle (TCA)	L-/D-amino-acid oxidase activity	dicarboxylic acid metabolic process (6.0×)
Pyruvate metabolism	amino-acid racemase activity	amino acid transport / transmembrane transport (5–7×)
Nicotinate & nicotinamide metabolism	primary methylamine oxidase activity	proteinogenic amino acid metabolic process (4.2×)
Glyoxylate & dicarboxylate metabolism	dicarboxylic-acid transporter activity	carboxylic acid metabolic process (2.4×)

All three recover the amino-acid / central-carbon biology expected for a T2D
urine study, validating the bridge. The value GO adds is the two right
columns: where KEGG gives broad pathway buckets, GO resolves the specific
molecular activities (transamination, amino-acid oxidation/racemisation,
methylamine oxidation — the methylamine hit tracks the urinary
dimethylamine/trimethylamine/sarcosine in the panel) and, through the enzyme
layer, the specific biological processes (amino-acid metabolism and
transport, dicarboxylic-acid metabolism), each aggregated by the ontology's
is_a/part_of closure. The BP lift is organism-specific (human) by
construction and currency metabolites are excluded by Rhea reaction-degree.

Why GO is not used for metabolomics today (the gap)

GO annotates gene products, not metabolites — so out of the box GO says
nothing about a list of m/z features or ChEBI IDs. Consequently the metabolomics
field uses metabolite-centric resources instead:

Pathway / set databases: KEGG pathways, SMPDB (Wishart group), HMDB
classes, RefMet/Lipid-class hierarchies.
Enrichment methods: Over-Representation Analysis (ORA) and Metabolite
Set Enrichment Analysis (MSEA) over those sets, and for untargeted
(MS1-peak) data the network method mummichog, which maps all plausible
formula matches onto a genome-scale metabolic network and looks for local
enrichment (true signal clusters; false matches scatter). These are packaged
in MetaboAnalyst (Xia/Wishart) — "MS Peaks to Pathways" is mummichog
re-implemented in R over five curated genome-scale models + ~21 KEGG-derived
organism models.

The gap this project targets: none of these use GO, so metabolomics
interpretation lives in a separate ontology universe from the gene-level GO
annotations this repository curates. Bridging them via ChEBI/Rhea/GO-CAM is the
opportunity.

See METABOLOMICS/TOOLS-LANDSCAPE.md for the
full survey of MAGI, MetaboAnalyst/mummichog, Metabolomics Workbench, MetaboLights
and the Wishart-group resources, with what is reusable vs what we would build.

Existing tools surveyed

Tool	Group	What it does	Relevance / reuse
MAGI	LBNL (Northen, Bowen, et al.)	Scores metabolite↔gene associations through a biochemical reaction network, boosting consensus between compound IDs and genome enzyme annotations	Closest in spirit: a metabolite-reaction-gene consensus score. We can attach GO semantics (rhea2go MF, GO BP) on top of MAGI-style associations
mummichog	Li (Emory)	MS1 peaks → all formula matches → genome-scale reaction network → local network enrichment; bypasses up-front metabolite ID	Reuse the permutation / null-model framework; swap/augment the reaction network with GO-CAM curated causal networks
MetaboAnalyst	Xia / Wishart	Web platform: ORA, MSEA, pathway analysis (KEGG/SMPDB), "MS Peaks to Pathways" (mummichog), multi-omics	Adopt the enrichment UX + statistics; add a GO-BP enrichment backend
SMPDB / HMDB	Wishart	Curated small-molecule pathways and the human metabolome DB with ChEBI/KEGG cross-refs	Cross-reference target; HMDB↔ChEBI gives the ID bridge into Rhea/GO
Metabolomics Workbench	NIH (UCSD)	Repository + RefMet standardized nomenclature (~500k annotations) + 174k-entry Metabolite DB cross-linked to ChEBI/KEGG/LIPID MAPS	RefMet → ChEBI harmonization is the data on-ramp for real study inputs
MetaboLights	EBI	Open metabolomics repository; curators assign metabolites ChEBI IDs during curation	Primary ChEBI-grounded study source for a demonstration dataset

Proposed approaches (to demonstrate value)

Approach A — GO biological-process enrichment of a metabolite set (Rhea-grounded)

A GO-native counterpart to SMPDB/KEGG enrichment:

Harmonize the study's significant metabolites to ChEBI (via RefMet /
MetaboLights mappings).
Normalize protonation state — expand each ChEBI compound into its
protonation family (is_protonated_form_of / is_deprotonated_form_of) so it
can match the charged form Rhea uses. The pilot shows this step is not
optional: it moves coverage from 0/26 to 22/26.
For each family, collect Rhea reactions in which any member is a
participant.
Map those reactions to GO MF enzyme-activity terms via rhea2go
(already tooled in the RHEA project), and to genes via GOA.
Lift to GO biological process (via the enzymes' BP annotations and/or
GO-CAM context) and run closure-aware ORA over the GO BP graph.

The argument for value: GO's is_a/part_of structure gives principled
closure (no arbitrary pathway boundaries), the same ontology is used for
transcriptomics/proteomics (multi-omics on one vocabulary), and it is
cross-species by construction.

Approach B — Locate the metabolite in the curated GO-CAM causal network

Use GO-CAM as the "full metabolic network" the user asked about:

A perturbed ChEBI metabolite is matched to GO-CAM activity inputs/outputs.
Trace upstream producing and downstream consuming activities along
the causal chain, including regulatory edges (which mummichog's
reaction-only network lacks).
Report the enzymes/genes and processes whose curated causal neighbourhood is
enriched for perturbed metabolites — a causal, direction-aware analogue of
mummichog's local-enrichment idea.

The Genetics 2023 study (human glucose-metabolism GO-CAMs → mouse, mapping
pathway perturbations to phenotypes) is direct precedent that GO-CAM metabolic
models support exactly this kind of perturbation→outcome reasoning.

Approach C — MAGI-style consensus, GO-annotated

Run a MAGI-style metabolite↔gene consensus over the reaction network, then
annotate the resulting associations with GO MF/BP so the output is
GO-interpretable and joinable to gene-level omics. This is the most direct
"adopt an existing tool and add GO" path.

What we can adopt vs build

Adopt: ChEBI/RefMet harmonization (Workbench RefMet, MetaboLights ChEBI),
the mummichog permutation/null framework, MetaboAnalyst's enrichment
statistics + UX patterns, MAGI's reaction-network consensus idea.
Build (in this repo): the ChEBI → Rhea → rhea2go GO MF → GOA gene → GO BP enrichment, reusing the RHEA project's rhea2go/ec2go
tooling; and a GO-CAM "trace the perturbed metabolite" demo. Both are small,
self-contained, and exercise data we already work with.

Relationship to other projects in this repo

This sits at the intersection of several existing efforts and reuses their
tooling rather than duplicating it:

RHEA → GO — provides the rhea2go/ec2go mappings and the
ChEBI-grounded reaction→GO machinery that Approaches A and C depend on. The
"rare project" the request alludes to.
Metabolic Model Analysis — genome-scale
metabolic models (GEMs) are exactly the networks mummichog uses; this project
found systematic GEM gene-EC representation issues that matter for any
network-based metabolite interpretation.
UniPathway and Enzyme Specificity
— pathway/reaction representation and substrate-specificity altitude, which
govern how cleanly a metabolite maps to a single GO MF term.
Reactome Gap Filling — Reactome is another
curated, ChEBI-grounded reaction source convertible to/from GO-CAM.

Generality across studies

The pipeline has been run on four MetaboLights studies spanning biofluid,
platform and disease (urine NMR T2D; serum LC-MS cardiovascular; urine LC-MS
physiology; serum LC-MS hepatocellular carcinoma) — see
CROSS-STUDY.md. Protonation+skeleton normalization
is decisive in every study (exact match captures only 8–39% of the normalized
coverage), and GO enrichment reads out each study's own chemistry: the urine
studies surface amino-acid/organic-acid metabolism, the serum LC-MS studies
surface lipid/fatty-acid metabolism (FDR to 3e-89). The clearest remaining gap is
complex lipids, which Rhea does not carry as discrete participants — the next
methodological target.

Limitations & honest caveats

The pipeline is a scoping demonstration, not a finished tool. The main caveats,
in one place:

Skeleton normalization is stereo/charge-blind. The second coverage tier
matches on InChIKey skeleton, so it can match a reported generic compound to
either enantiomer (e.g. an unspecified amino acid to both L- and D- Rhea
forms). This is appropriate when the reported ChEBI is itself stereo-unspecified
(common for NMR) but should not be read as stereospecific evidence; it is
reported as a separate, more permissive tier.
The GO-BP enrichment is organism-specific and inference-based. It routes
through human (taxon 9606) Swiss-Prot enzymes and their GOA BP annotations, so
it is only valid for human studies (configurable via --taxon) and inherits any
GOA annotation bias. A curated reaction→process source (Reactome) would be a
cleaner cross-check — see follow-ups.
Lipid coverage gap. Complex lipids (sphingomyelins, phosphatidylcholines,
triacylglycerols) are largely absent from Rhea as discrete participants, so
lipid-rich serum LC-MS studies connect at ~53–59% vs ~65–91% for polar-metabolite
urine studies. This is a real, localizable bridge gap, not noise.
Cofactor handling is a heuristic. Currency metabolites are excluded by Rhea
reaction-degree (default ≥500); the threshold is data-driven and reported, but
it is still a cutoff, not a curated cofactor list.
ORA, not topology. The enrichment is over-representation with ontology
closure; it does not (yet) use reaction-network topology (mummichog-style) or
causal direction (GO-CAM). That is Approach B.
KEGG baseline is illustrative only. Retained for the recognizable
side-by-side on MTBLS1, with a licensing note; KEGG carries redistribution
restrictions, so the going-forward pathway baseline is Reactome/SMPDB and KEGG
is not expanded to other studies.

Toward an interactive demo

The working probe/ pipeline is the engine for an
interactive demo where a user interprets their own metabolomics data with GO
(paste a metabolite list or a MetaboLights accession → coverage + three-way
enrichment). Real-time arbitrary input is too heavy for the static Pages site, so
the plan is a precomputed static showcase here plus a small FastAPI app in a
new repo (metabolomics-go-demo) reusing this engine. Full design, repo
strategy, phasing, and the KEGG-licensing caveat are in
DEMO-PLAN.md.

Open questions

How many MetaboLights/Workbench study metabolites actually resolve to ChEBI
IDs that participate in a rhea2go-mapped reaction (coverage of the bridge)?
Does GO-BP enrichment recover the same biology as SMPDB/KEGG enrichment on a
benchmark study, and where does the GO closure structure add or remove signal?
Do GO-CAM causal/regulatory edges materially change interpretation versus a
reaction-only network on the same metabolite set?
Can GO-CAM metabolic coverage (which processes have models) support a useful
fraction of typical metabolomics readouts, or is it currently too sparse?

Follow-Up Targets

Target	Status	Rationale
ChEBI→Rhea→`rhea2go` coverage probe + protonation normalization	DONE (probe)	Bridge connects 22/26 only after protonation normalization (0/26 exact)
Run the probe over a real MetaboLights study	DONE (MTBLS1)	64 curated metabolites: 8/64 exact → 49/64 normalized (6× uplift); `fetch_metabolights.py` pulls any accession
Add stereochemistry/anomer + generic-vs-specific resolution	DONE (skeleton tier)	InChIKey-skeleton normalization; recovers the MTBLS1 diabetes BCAAs (Ile/Leu/Val), 49→58/64
Closure-aware GO enrichment (ORA) + KEGG baseline	DONE (MF level)	GO MF enrichment vs KEGG baseline on MTBLS1, same test
Lift the enrichment from GO MF to GO BP	DONE	GO BP enrichment via Rhea→UniProt human enzymes→GOA BP; amino-acid metabolism/transport (FDR 9e-44)
Run the pipeline over more MetaboLights studies	DONE (4 studies)	Cross-study summary: MTBLS1/90/404/19 — normalization decisive everywhere; each study recovers its own biology (urine→amino/organic acid, serum→lipid)
Extend the bridge to complex lipids	TODO (next method gap)	Serum LC-MS residuals are lipids absent from Rhea as discrete participants (LIPID MAPS/SwissLipids → ChEBI)
Reactome as a curated BP cross-check	TODO	Reactome maps reactions→pathways→GO-BP with human curation (and is ChEBI-grounded); validate the enzyme-layer BP route against it
Metabolomics → Reactome black-box-event prioritization	TODO (high-leverage)	The bridge names the reactions a metabolite implicates; Reactome BBEs are reactions with unknown catalyst/transporter → feed the existing Reactome gap-filling collaboration
Approach B — GO-CAM causal trace	TODO	Locate a perturbed metabolite as a GO-CAM activity input/output and trace upstream/downstream (Reactome is the practical substrate)
Interactive demo (precomputed showcase → FastAPI app)	PLANNED	See DEMO-PLAN.md: static gallery here + app in a new repo reusing the probe engine
Inventory GO-CAM metabolic models + their ChEBI input/output compounds	TODO	Feasibility of Approach B (the "full network" path)
Glucose-metabolism GO-CAM perturbation worked example	TODO	Direct analogue of the Genetics 2023 precedent

References

MAGI — Erbilgin et al., MAGI: A Method for Metabolite Annotation and Gene
Integration, ACS Chem. Biol. 2019. PMID:30896917.
ACS
GO-CAM — Thomas et al., Gene Ontology Causal Activity Modeling (GO-CAM)
moves beyond GO annotations to structured descriptions of biological functions
and systems, Nat. Genet. 2019. PMID:31548717.
PMC
GO-CAM metabolic phenotypes — Biochemical pathways represented by GO
Causal Activity Models identify distinct phenotypes resulting from mutations in
pathways, Genetics 225(2):iyad152, 2023. PMID:37579192.
Genetics
mummichog — Li et al., Predicting Network Activity from High Throughput
Metabolomics, PLoS Comput. Biol. 2013.
PMC
MetaboAnalyst 6.0 — Pang et al., Nucleic Acids Res. 2024. PMID:38587201.
NAR
MetaboLights — Yurekten et al., MetaboLights: open data repository for
metabolomics, Nucleic Acids Res. 2024.
PMC
Metabolomics Workbench / RefMet —
Workbench ·
RefMet
gocam-py — GO-CAM Python tooling.
docs

Project Status

Started: 2026-06-21
Maturity: IN_PROGRESS — problem framed, tool landscape
surveyed, and a working end-to-end pipeline validated on four studies
(probe/): two-tier ChEBI normalization
(protonation + InChIKey-skeleton), the metabolite→Rhea→rhea2go→GO bridge, a
closure-aware GO ORA, and a like-for-like KEGG-pathway baseline — demonstrated
on a real study.
Computed live (OLS4 ChEBI, Rhea REST = 18,558 reactions / 14,251
participant ChEBIs, GO rhea2go = 7,746 mappings, go-basic.obo closure,
UniProt REST = 4,142 human enzymes with Rhea+GO, KEGG REST):
Coverage, MetaboLights MTBLS1 (64 curated metabolites): exact 8/64 →
+protonation 49/64 → +skeleton 58/64; demo set 0/26 → 22/26 → 25/26.
Three enrichments on MTBLS1, same hypergeometric test — KEGG pathways,
GO molecular function, and GO biological process (via the human
enzyme layer) — all recover amino-acid / central-carbon biology; GO
additionally resolves specific activities (transaminase, amino-acid
oxidase/racemase) and processes (amino-acid metabolism/transport, FDR 9e-44).
Generality across 4 MetaboLights studies (CROSS-STUDY.md):
coverage 53–91% after normalization; study-appropriate GO BP (urine→amino/
organic acid, serum→lipid/fatty-acid metabolism, FDR to 3e-89).
Key idea: ChEBI is the bridge from metabolite repositories (MetaboLights,
Workbench/RefMet) into Rhea reactions, rhea2go GO molecular functions, GOA
genes, and GO-CAM causal networks — giving a GO-native, closure-aware,
multi-omics-compatible alternative/complement to KEGG/SMPDB/mummichog
enrichment.