Cross-Study Generality — Four MetaboLights Studies

Supporting findings for the
Metabolomics × GO/GO-CAM project. Before investing in the
interactive demo, we checked that the
metabolite → Rhea → GO bridge and its enrichment hold up beyond the original
MTBLS1 pilot, across different biofluids, platforms, diseases and metabolite
classes. All numbers are computed live by the probe/
pipeline (fetch_metabolights.py → coverage_probe.py → go_enrichment.py /
go_bp_enrichment.py); per-study reports are linked below.

The four studies

Study	Biofluid / platform	Phenotype	Metabolites
MTBLS1	urine, NMR	type-2 diabetes	64
MTBLS90	serum, LC-MS	cardiovascular / ageing (PIVUS)	208
MTBLS404	urine, LC-MS	age / BMI / sex (Sacurine)	109
MTBLS19	serum, LC-MS	hepatocellular carcinoma	34

Coverage — normalization is essential in every study

Study	Exact	+ protonation	+ skeleton	Final %
MTBLS1	8/64	49/64	58/64	91%
MTBLS90	39/208	74/208	110/208	53%
MTBLS404	5/109	60/109	71/109	65%
MTBLS19	5/34	12/34	20/34	59%

Two robust patterns:

Protonation normalization is decisive everywhere — exact match captures
only 8–39% of what the two-tier normalization reaches; the protonation tier
alone multiplies coverage 1.9–12× in every study. The skeleton tier then adds
a further 5–18 metabolites each (the generic↔stereospecific amino-acid class).
The headline insight from MTBLS1 is not a one-off.
Final coverage tracks metabolite chemistry, not study quality. The
polar-metabolite studies (urine: 91%, 65%) connect better than the lipid-rich
serum LC-MS studies (53%, 59%). The serum residuals are dominated by
complex lipids (sphingomyelins, phosphatidylcholines, triacylglycerols)
that Rhea does not carry as discrete reaction participants — a real,
localisable gap in the Rhea/GO bridge for lipid metabolism, not a failure
of the method. This is itself a useful finding for where curation/representation
effort would pay off.

Enrichment recovers each study's own biology

GO biological-process enrichment (via the human enzyme layer; same hypergeometric
test throughout) returns sharply different, study-appropriate processes —
strong evidence the signal is real and not an artefact of the pipeline:

Study	Top GO biological processes (fold, FDR)
MTBLS1 (urine, T2D)	amino acid metabolic process (4.3×, 9e-44); dicarboxylic acid metabolic process (6.0×); amino acid transport (5–7×)
MTBLS404 (urine)	carboxylic/organic acid transport (4.2×, 1e-36); amino acid metabolic process (3.3×); oxoacid metabolic process (2.4×)
MTBLS90 (serum, CVD)	lipid metabolic process (2.9×, 3e-89); fatty acid metabolic process (3.9×); long-chain fatty acid metabolic process (5.5×)
MTBLS19 (serum, HCC)	lipid metabolic process (4.2×, 3e-63); lipid catabolic process (8.3×); glycerolipid catabolic process (15.0×)

The two urine studies surface amino-acid and organic-acid metabolism and
transport; the two serum LC-MS studies surface lipid and fatty-acid
metabolism. The pipeline does not impose a template — it reads out the chemistry
that is actually in each sample. (GO molecular-function enrichments per study:
MTBLS90,
MTBLS404,
MTBLS19.)

Conclusions for the demo

The bridge + two-tier normalization + GO enrichment generalise across
biofluid, platform and disease; the protonation insight is universal.
The lipid coverage gap in serum LC-MS studies is the clearest next
methodological target (extend the bridge to lipid classes — e.g. via Rhea's
generic lipid participants or LIPID MAPS/SwissLipids → ChEBI).
A demo should therefore include at least one urine and one serum/lipid
example so users see both the strong-coverage and the gap cases honestly.

Reproduce

for ACC in MTBLS1 MTBLS90 MTBLS404 MTBLS19; do
  uv run python fetch_metabolights.py $ACC
  uv run python coverage_probe.py    --chebi-file studies/$ACC.chebi.txt --out studies/$ACC-RESULTS.md          --source "$ACC"
  uv run python go_enrichment.py     --chebi-file studies/$ACC.chebi.txt --out studies/$ACC-GO-ENRICHMENT.md    --source "$ACC"
  uv run python go_bp_enrichment.py  --chebi-file studies/$ACC.chebi.txt --out studies/$ACC-GO-BP-ENRICHMENT.md --source "$ACC"
done