Interactive Demo — Plan & Architecture
Planning sub-doc for the
Metabolomics × GO/GO-CAM project. It answers one question:
what is the best path from the working probe/ pipeline to
an interactive demo that lets someone interpret their own metabolomics data with
GO? — and where that demo should live.
TL;DR recommendation
- The full interactive app does not belong on the static Pages site. Live,
arbitrary-input enrichment needs server-side compute and ~tens of MB of
reference indices (GO closure, the human-enzyme table, the Rhea network). A
static page cannot do that per request. - But a precomputed showcase does belong here, and is the right first
step: ship the MTBLS1 (and a few more) results as static JSON + a small
client-side viewer onai4curation.io/ai-gene-review/. Zero infra, immediate,
and it nails the UX before any server exists. - Build the interactive app in a new repo (
ai4curation/metabolomics-go-demo)
as a Streamlit app wrapping the existing engine (signed-off choice for
the MVP), deployed to Hugging Face Spaces. Pathway baseline =
Reactome/SMPDB (not KEGG). (A FastAPI JSON API can be added later if an
embeddable/headless interface is needed.) - Keep the engine here. The
probe/modules are the
reproducible scientific core; the app repo depends on them (pip install from a
pinned tag), so the science stays version-controlled with the curation work
and the app stays a thin, redeployable shell.
Why not a static page (the hard constraint)
The current pipeline, per request, needs:
| Need | Source | Why it blocks pure-static |
|---|---|---|
| ChEBI protonation/skeleton normalization | OLS4 (live, cached) | per-metabolite API calls; CORS + latency |
Rhea reaction network + rhea2go |
Rhea REST / GO | ~18.6k reactions; rebuilt index |
| GO closure | go-basic.obo (32 MB) |
too big to parse client-side per visit |
| Human enzyme → Rhea + GO-BP | UniProt REST | ~4.1k entries, paginated |
| KEGG pathway membership | KEGG REST | licensing (see Risks) + CORS |
| Hypergeometric ORA + BH-FDR | compute | fine anywhere, but needs the indices above |
Pyodide/WASM "run it all in the browser" is tempting but a trap here: the public
APIs do not all send permissive CORS headers, and shipping a 32 MB OBO + the
UniProt table to every visitor is slow. So: static for precomputed results,
a server for live arbitrary input.
What the demo should let a user do
Three input modes, increasing in ambition:
- Pick an example study from a dropdown (MTBLS1 + a handful) → instant
precomputed results. (works statically) - Paste a metabolite list — names or ChEBI ids, or a pasted MAF column →
live normalization + enrichment. (needs the app) - Enter a MetaboLights accession → fetch its MAF, extract ChEBI, run.
(needs the app)
Output panels (the actual value on screen):
- Coverage / normalization — per-metabolite tier (
exact→protonation→
skeleton→ miss) and the matched Rhea form. A small Sankey or stacked bar
makes the protonation insight (0→49→58 on MTBLS1) immediately legible. This
panel is the headline finding. - Three-way enrichment — side-by-side ranked tables: KEGG pathway, GO
molecular function, GO biological process, each with fold + FDR, GO/KEGG
ids linked out. The point lands when they sit next to each other. - Network view (stretch) — interactive metabolite → reaction → enzyme → GO
graph (Cytoscape.js), click a metabolite to highlight its enzymes and
enriched terms. - GO-CAM causal view (future, Approach B) — trace a perturbed metabolite as
an activity input/output through a curated causal model. - Shareable permalink — encode the input set in the URL so a result is
citable.
Architecture options
| Option | Where | Interactivity | Effort | Verdict |
|---|---|---|---|---|
| A. Precomputed static gallery | this repo / Pages | example studies only | Low (~1 day) | Do now (Phase 1) |
| B. Pyodide in-browser | Pages | full, client-side | Med–High | Avoid — CORS + payload size |
| C. Streamlit app | new repo / HF Spaces | full | Med (~3–5 days) | Good MVP if speed matters |
| D. FastAPI service + static frontend | new repo (API) + Pages (UI) | full, embeddable | Med–High (~1–2 wk) | Target (Phase 2) |
Recommendation: A now, then D (with C as an acceptable shortcut to a first
live MVP). D is preferred over C for longevity because the JSON API is reusable
(notebooks, other front-ends) and the frontend can be embedded directly in the
Pages site, resolving the "belongs on the page vs needs an app" tension: the UI
lives on ai4curation.io, calling a small hosted API.
Repository strategy
ai4curation/ai-gene-review (this repo — unchanged role)
projects/METABOLOMICS/ methodology, results, provenance
probe/ the ENGINE (reproducible reference)
pages/projects/METABOLOMICS/showcase/ Phase-1 precomputed static gallery
ai4curation/metabolomics-go-demo (NEW repo — the app)
engine: depends on the probe package (pinned tag) or vendors it
api/ FastAPI: /normalize /coverage /enrich /study/{acc}
web/ static frontend (also embeddable into the Pages site)
data/ prebuilt indices (built in CI, see below)
Dockerfile -> Hugging Face Spaces / Render / Fly
To make the engine installable, lightly refactor probe/ into
a package (metabolomics_go) with a clean function API
(normalize(), coverage(), enrich(aspect=…)) — the CLIs become thin wrappers.
That refactor is also good hygiene for this repo.
Data & caching strategy (the make-or-break detail)
Per-request latency must be small, so the heavy reference data is prebuilt
offline (a scheduled CI job) into compact artifacts shipped with the app:
| Artifact | Built from | ~size | Refresh |
|---|---|---|---|
rhea_participant_index |
Rhea REST | small | monthly |
reaction_to_go (rhea2go) |
GO external2go | small | monthly |
reaction_to_human_enzyme + enzyme_to_go_bp |
UniProt REST | ~MB | monthly |
go_bp_closure |
go-basic.obo |
~MB (pickled ancestors) | monthly |
kegg_pathway_compounds |
KEGG REST | small | see licensing |
chebi_normalization_cache |
OLS4 (warm) | grows | persistent disk |
Per request the app then only does: normalize the user's metabolites
(OLS4, cached → mostly hits) + stats (fast). A prebuilt SQLite/Parquet of
the ChEBI protonation+skeleton families for the few thousand common metabolites
makes the common case instant; arbitrary metabolites fall back to live OLS4 with
write-through caching.
Phased plan
Phase 1 — Precomputed static showcase (in this repo, ~1 day).
- Emit the probe results as showcase/<study>.json (coverage tiers + 3 enrichment
tables) for MTBLS1 + 2–3 more studies.
- A single showcase/index.html + vanilla JS renders the coverage bar and the
three side-by-side tables. Lives under pages/projects/METABOLOMICS/showcase/.
- Deliverable: a clickable demo on the existing site, no server. Validates UX.
Phase 2 — Live app MVP (new repo, ~1 week).
- Package the engine; build the prebuilt indices + CI refresh.
- FastAPI: /normalize, /coverage, /enrich, /study/{acc}.
- Frontend: paste-a-list + pick-a-study; render the same panels live.
- Deploy to HF Spaces (Docker). Embed the frontend into the Pages site via an
<iframe> or by hosting the static UI on Pages and pointing it at the API.
Phase 3 — Polish & depth (stretch).
- Cytoscape network view; permalinks; CSV/MAF upload; multiple organisms for BP.
- GO-CAM causal trace (Approach B) once the GO-CAM inventory exists.
Risks & constraints
- KEGG licensing. ChEBI, Rhea, GO, UniProt, MetaboLights are open
(CC-BY / CC0) and fine to cache and redistribute in a demo. KEGG is not —
KEGG REST is free for academic use but its data has redistribution
restrictions. For a public app: keep KEGG as an optional, live-fetched
baseline (don't bundle/redistribute KEGG tables), or swap the baseline to an
open pathway source (Reactome / SMPDB) to avoid the issue entirely.
Flag this prominently in the UI. - API rate limits / etiquette. A public app must not proxy every click to
OLS4/UniProt/Rhea. The prebuilt-index + persistent-cache design above is what
keeps it polite and fast; add a request cap + cache-by-input. - CORS. Several upstreams don't set permissive CORS, which is exactly why the
browser calls them through the FastAPI backend rather than directly. - BP is organism-specific. The GO-BP enrichment routes through human enzymes;
the app should expose an organism selector (default human) and say so. - Normalization latency on cold input. First-seen metabolites incur OLS4
round-trips; mitigate with the prebuilt common-metabolite table + a spinner +
write-through cache.
Decisions (signed off)
- Baseline → Reactome / SMPDB, not KEGG. Both are openly redistributable, so
the whole demo can be cached and shared without the KEGG restriction. The
currentkegg_baseline.pystays only as an internal cross-check; the app's
pathway baseline will be Reactome (ChEBI participants are native to Reactome,
so the ChEBI bridge applies directly) and/or SMPDB. - Frontend → Streamlit on Hugging Face Spaces. Fastest route to a live,
interactive MVP; single Python app reusing the engine. (Embedding into the
Pages site, if wanted later, can be an<iframe>to the Space.) - Hosting → Hugging Face Spaces (Docker or native Streamlit).
- Sequencing → validate generality first. Before building the app, run the
pipeline over several more MetaboLights studies (different biofluid / platform /
disease / organism) to confirm the coverage uplift and enrichment hold up. See
the cross-study summary linked from the project page.
Remaining open questions
- SMPDB vs Reactome (or both) as the concrete pathway baseline to implement.
- Scope of v1: is "paste a list / pick a study → coverage + 3-way enrichment"
the right MVP cut, deferring the network and GO-CAM views to Phase 3?