Reusing glycoscience resources for GO curation — strategy

Decision record for the Glycobiology project: how to reuse the external
glycoscience resources catalogued in GLYCOBIOLOGY-resources.md as
inputs to GO curation, rather than just describing them.

Framing

Most glycoscience resources reach GO only indirectly today (via UniProt / EC / InterPro /
Reactome). There is no official cazy2go or glyconnect2go, so "reuse via mappings" means the
mapping itself is the deliverable, but the repo already has the exact machinery for this: curated
SSSOM mapping sets validated with linkml-term-validator, exactly as for
RHEA (rhea2go), InterPro (interpro2go), and NCBIFam.

Modes of reuse — decisions

| Mode | Decision | Notes |
| --- | --- | --- |
| Forward propagation: cazy2go family→GO MF | DO | Same EC-masking situation as RHEA. The "risk" of producing over-general terms is not a risk: redundant rows are trivially closure-filtered against ec2go/interpro2go (drop terms those already supply); what remains is the genuine contribution. |
| Ontology alignment: GlycoCoO / GlycoEnzOnto → GO (SSSOM) | DO (cheap one-shot) | GlycoEnzOnto appears abandoned, but that doesn't matter: a one-shot SSSOM alignment salvages its glyco-enzyme MF terms into GO space at low cost. |
| Confirmatory cross-check (GlyGen join) | DO | UniProt → CAZy family + GlyConnect enzyme record → compare to the gene's GO MF. Read-only QC; catches the wrong-paralog / over-general errors the 7 exemplar reviews fixed. |
| GO-CAM / pathway modelling | PUNT (for now) | Glycan biosynthesis is a natural causal/multi-enzyme model, but it is deferred: not in current scope. |

cazy2go: the forward-propagation source

A CAZy family → GO molecular-function mapping, the glyco analogue of interpro2go. Seeded in
cazy2go.sssom.yaml from the project's five exemplar GT families, each row
backed by a completed gene review:

| CAZy family | GO MF | Predicate | Why |
| --- | --- | --- | --- |
| GT13 (GnT-I) | GO:0003827 | exactMatch | mono-specific family (EC 2.4.1.101) |
| GT65 (POFUT1) | GO:0046922 | exactMatch | animal GT65 = POFUT1 (EC 2.4.1.221) |
| GT29 (sialyl-Ts) | GO:0008373 | exactMatch | family scope == the sialyltransferase activity class |
| GT7 (β4Gal/GalNAc-Ts) | GO:0003831 | narrowMatch | term fits B4GALT-type members only → needs subfamily |
| GT31 (β3Gal/GalNAc/Fringe) | GO:0008376 | narrowMatch | highly heterogeneous family → needs subfamily |

Predicate semantics (mirrors interpro2go.sssom.yaml): exactMatch = holds for ~all (animal)
family members → ready to propagate; narrowMatch = the GO term describes a subfamily only, so
the family is too coarse and subfamily/EC resolution is required first.
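
The predicate semantics above can be read as a simple routing rule. The following is a minimal sketch, not the repo's code: the `MappingRow` class, the `CAZy:` prefix, and the bare predicate strings are assumptions for illustration (the actual SSSOM file may use full CURIEs), and the five rows are copied from the seed table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRow:
    subject_id: str    # CAZy family (the "CAZy:" prefix is an assumption)
    predicate_id: str  # "exactMatch" or "narrowMatch", as named in this document
    object_id: str     # GO molecular-function term


# The five seed rows from the table above.
SEED = [
    MappingRow("CAZy:GT13", "exactMatch", "GO:0003827"),
    MappingRow("CAZy:GT65", "exactMatch", "GO:0046922"),
    MappingRow("CAZy:GT29", "exactMatch", "GO:0008373"),
    MappingRow("CAZy:GT7", "narrowMatch", "GO:0003831"),
    MappingRow("CAZy:GT31", "narrowMatch", "GO:0008376"),
]


def split_by_readiness(rows):
    """Separate rows that can propagate at family level from those that cannot."""
    ready = [r for r in rows if r.predicate_id == "exactMatch"]
    needs_subfamily = [r for r in rows if r.predicate_id == "narrowMatch"]
    return ready, needs_subfamily


ready, needs_subfamily = split_by_readiness(SEED)
print([r.subject_id for r in ready])
print([r.subject_id for r in needs_subfamily])
```

Only the `ready` rows would feed forward propagation directly; the `needs_subfamily` rows wait for subfamily/EC resolution.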

The trivial redundancy filter

For each row, the family's EC(s) usually already reach the same GO term via ec2go
(GO_REF:0000003) or an InterPro entry reaches it via interpro2go (GO_REF:0000002). Those rows
are redundant (no marginal annotation) and are dropped by a closure filter — the same
EC-masking computation RHEA already runs (ec2go is fetchable from
current.geneontology.org/ontology/external2go/ec2go). The genuine cazy2go contribution is the
non-redundant remainder: families with no EC, no ec2go row, or no InterPro coverage. Redundancy
status is noted per row in the seed; computing it live across all families is the scale-up step.
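
One way to picture the closure filter described above is sketched here. It is an illustration under stated assumptions, not the RHEA EC-masking code: the function names, the `is_a` dictionary shape (child term to list of parents), and all identifiers in the toy data are hypothetical. A candidate row is treated as redundant when ec2go or interpro2go already supplies the same term or a more specific descendant for that family.

```python
def ancestors_or_self(term, is_a):
    """Reflexive transitive closure of `term` over is_a edges (child -> parents)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure_filter(candidates, already_supplied, is_a):
    """Keep (family, go_term) pairs that existing mappings do not already imply."""
    kept = []
    for family, go_term in candidates:
        supplied = already_supplied.get(family, set())
        # Redundant if a supplied term is go_term itself or one of its descendants.
        implied = any(go_term in ancestors_or_self(s, is_a) for s in supplied)
        if not implied:
            kept.append((family, go_term))
    return kept


# Toy ontology and mappings; every identifier here is a placeholder.
is_a = {"GO:specific": ["GO:generic"], "GO:generic": ["GO:root"]}
already_supplied = {"FAM1": {"GO:specific"}}  # e.g. via ec2go or interpro2go
candidates = [
    ("FAM1", "GO:generic"),   # implied by GO:specific, so dropped
    ("FAM1", "GO:other"),     # not implied, so kept
    ("FAM2", "GO:specific"),  # family has no existing coverage, so kept
]

kept = closure_filter(candidates, already_supplied, is_a)
print(kept)
```

What survives the filter corresponds to the non-redundant remainder: rows for families with no EC, no ec2go row, or no InterPro coverage.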

Beyond the seed — the generated full mapping

The 5-family hand-curated seed is now extended to all CAZy enzyme families derivable from public
data, reproducibly, by build_cazy2go.py, which writes cazy2go.generated.sssom.yaml. No row is
hand-typed: each is the live join

CAZy family --(reviewed Swiss-Prot CAZy↔EC xref, UniProt REST)--> EC --(ec2go)--> GO MF

Current build (2026-06): 3,183 reviewed UniProtKB CAZy+EC entries → 302 families → 702 mapping
rows over 283 families (19 families carry an EC with no ec2go term; these are gap candidates).
The same exact/narrow logic as the seed is applied automatically:

| Predicate | Families | Meaning |
| --- | --- | --- |
| exactMatch | 162 | family's ECs resolve to a single GO MF → mono-specific, ready to propagate |
| narrowMatch | 121 | poly-specific family; GO term applies to a subfamily → resolve subfamily/EC first |

The trivial redundancy filter is built in: 80 generic-parent rows (glycosyltransferase activity, hexosyltransferase activity, hydrolase acting on glycosyl bonds, …) were dropped where
the family also resolved to a specific child — so the output already prefers the specific term. The
generated rows are cross-consistent with the hand-curated seed (e.g. GT13→GO:0003827,
GT65→GO:0046922 both resolve to the same exactMatch the seed asserts).
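
The join, the generic-parent drop, and the exact/narrow assignment described above can be sketched together as follows. This is not the code of build_cazy2go.py: the function names, the input shapes, the one-term-per-EC simplification, and the choice to assign the predicate after dropping generic parents are all assumptions, and the toy identifiers are placeholders.

```python
def strict_ancestors(term, is_a):
    """All ancestors of `term` over is_a edges (child -> parents), excluding itself."""
    seen, stack = set(), [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_rows(family_to_ecs, ec_to_go, is_a):
    """Join family -> EC -> GO, prefer specific terms, and assign a predicate."""
    rows, gap_families = [], []
    for family in sorted(family_to_ecs):
        terms = {ec_to_go[ec] for ec in family_to_ecs[family] if ec in ec_to_go}
        if not terms:
            gap_families.append(family)  # EC present but no ec2go term
            continue
        # Drop any term that is an ancestor of another term the family resolves to.
        covered = set()
        for term in terms:
            covered |= strict_ancestors(term, is_a)
        specific = sorted(terms - covered)
        predicate = "exactMatch" if len(specific) == 1 else "narrowMatch"
        rows.extend((family, predicate, term) for term in specific)
    return rows, gap_families


# Toy inputs; all identifiers are placeholders.
is_a = {"GO:child1": ["GO:parent"], "GO:child2": ["GO:parent"]}
ec_to_go = {"EC:generic": "GO:parent", "EC:1": "GO:child1", "EC:2": "GO:child2"}
family_to_ecs = {
    "FAM_A": ["EC:generic", "EC:1"],  # generic parent dropped -> one term
    "FAM_B": ["EC:1", "EC:2"],        # two specific terms -> poly-specific
    "FAM_C": ["EC:unmapped"],         # no ec2go term -> gap candidate
}

rows, gap_families = build_rows(family_to_ecs, ec_to_go, is_a)
print(rows)
print(gap_families)
```

In this sketch a family with no resolvable term lands in `gap_families`, mirroring the 19 gap-candidate families reported for the current build.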

Caveat (the redundancy property): because every row is derived through an EC that already
reaches GO via ec2go, the rows are EC-reachable by construction. Their value is the same as
interpro2go's — annotating a protein from CAZy family membership when it lacks an EC/ec2go
annotation — but the marginal-vs-ec2go contribution still needs closure-filtering across organisms
(the RHEA EC-masking method). Finer subfamily resolution (for the narrowMatch
families) is the next refinement, seedable from the dbCAN-sub subfamily → EC table
(dbCAN3).

Did it yield anything new, or just confirm?

At the GO-term level it is confirmatory by construction — derived through ec2go, it cannot mint
a term ec2go lacks. But the join surfaced genuinely new, actionable signal:

cazy2go vs interpro2go (computed): the marginal-knowledge test, run via
compare_cazy_interpro.py (UniProt CAZy+EC+InterPro × ec2go ×
interpro2go). This is the opposite of the RHEA result: where RHEA was ~84% masked by EC2GO,
cazy2go is only 20% masked by interpro2go. The marginal (family, term) pairs partition as follows:

| Class | Pairs | Meaning |
| --- | --- | --- |
| DESCENDANT_MASKED (drop) | 15 (3%) | InterPro already gives G (the GO term in the cazy2go pair) or a more specific child → no gain |
| ALTITUDE_GAIN | 405 (74%) | InterPro gives only a generic ancestor; cazy2go adds the specific reaction-level MF |
| TRUE_GAP | 124 (23%) | InterPro is silent on G's branch for this family → genuinely new coverage |

So only 3% of the marginal is closure-masked; 97% is real (529 pairs = altitude + gap). The
TRUE_GAP set spans 98 families and includes biologically central activities InterPro2GO does not
supply for the family at all: GT27/CBM13 → polypeptide-GalNAc-T (mucin O-glyco initiators),
GH19 → lysozyme, GH14 → α-amylase, GH33 → sialidase, PL1 → pectin lyase,
GT43 → β-1,4-xylan synthase, GH32 → fructan fructosyltransferase. (A few are still generic,
e.g. GH72 → hexosyltransferase activity, or noisy multi-activity families like GH23 → peptidase.)

Caveat: the closure is is_a-only, and an ALTITUDE_GAIN whose only InterPro ancestor is near the
molecular_function root is a weak gain. The partition is also per (family, term): a TRUE_GAP in a
poly-specific family (e.g. GH1: 90 members → one plant glucosyltransferase) is real coverage but
still unsafe to propagate at family level; route it via subfamily.
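
The three classes in the table above can be expressed as a small decision function. This is a sketch of the stated definitions, not the code of compare_cazy_interpro.py: the function names, the `is_a` dictionary shape, and the toy term identifiers are hypothetical, and only is_a edges are followed, matching the caveat. The final lines recompute the reported shares from the reported pair counts.

```python
def ancestors_or_self(term, is_a):
    """Reflexive transitive closure of `term` over is_a edges (child -> parents)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classify(go_term, interpro_terms, is_a):
    """Classify one (family, go_term) pair against the family's interpro2go terms."""
    # InterPro gives go_term itself or a more specific descendant: no gain.
    if any(go_term in ancestors_or_self(t, is_a) for t in interpro_terms):
        return "DESCENDANT_MASKED"
    # InterPro gives only a more generic ancestor of go_term.
    if (ancestors_or_self(go_term, is_a) - {go_term}) & set(interpro_terms):
        return "ALTITUDE_GAIN"
    # InterPro has nothing on this branch.
    return "TRUE_GAP"


# Toy ontology; all identifiers are placeholders.
is_a = {"GO:sub": ["GO:leaf"], "GO:leaf": ["GO:mid"], "GO:mid": ["GO:top"]}

print(classify("GO:leaf", {"GO:sub"}, is_a))
print(classify("GO:leaf", {"GO:mid"}, is_a))
print(classify("GO:leaf", {"GO:elsewhere"}, is_a))

# Shares recomputed from the pair counts reported in the table.
counts = {"DESCENDANT_MASKED": 15, "ALTITUDE_GAIN": 405, "TRUE_GAP": 124}
total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}
print(total, shares)
```

Because the check is per (family, term) pair, a TRUE_GAP result says nothing about how many family members carry the activity, which is why poly-specific families still need subfamily routing.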

Bottom line: cazy2go is not mostly redundant with interpro2go, and closure does not
explain it away: 97% of the exact-match marginal survives closure as genuine altitude (74%) or
coverage (23%) gain. It adds reaction-level MF specificity for ~110 mono-specific families safely
(family-level) and more via subfamilies; 124 pairs across 98 families are activities InterPro2GO is
silent on entirely.

Status