Glycobiology Project

IN_PROGRESS BIOLOGY_DOMAIN PIPELINE

Species: human

Genes: B3GALNT2, LGALS3, PMM2, POFUT1, MGAT1, ST6GAL1, B4GALT1

Overview

Glycobiology studies the structure, biosynthesis, degradation, and biological
roles of glycans (sugars) and glycoconjugates (glycoproteins, glycolipids,
proteoglycans, GPI-anchored proteins). Glycosylation is the most common and
structurally diverse co-/post-translational modification: glycans decorate the
majority of the cell-surface and secreted proteome, mediate cell–cell and
host–pathogen recognition, and tune protein folding, trafficking, stability, and
signalling. Genetic defects in the glycosylation machinery cause the
congenital disorders of glycosylation (CDGs), a heterogeneous family of >130
Mendelian diseases, and aberrant glycosylation is a hallmark of cancer,
inflammation, and autoimmune disease.

This project has two complementary aims:

  1. GO-usage audit (animals). Survey how the glycobiology-related GO term
    landscape — glycosyltransferase/glycosidase activities, carbohydrate binding
    (lectins), and the N-/O-linked/GPI glycosylation biological processes — is
    actually used to annotate animal gene products (human, mouse, rat, worm), and
    identify systematic over-/under-/mis-annotation patterns, as the SPKW,
    RHEA, and INTERPRO source-audit projects do for their pipelines.
  2. External-resource landscape. Map the dedicated glycoscience databases,
    ontologies, and tools (the GlySpace Alliance — GlyGen, GlyCosmos,
    Glyco@Expasy/GlyConnect — plus GlyTouCan, UniCarbKB, CAZy, and the
    GlycoConjugate Ontology GlycoCoO) and assess how each could feed, cross-check,
    or extend GO-based curation of glycogenes. Reference dossier:
    GLYCOBIOLOGY-resources.md.

The two aims meet at one question: where does GO under-represent glycan biology
that the specialist glyco resources capture well, and where does GO
over-annotate glycogenes (e.g. by propagating downstream "guilt-by-substrate"
metabolic processes onto an enzyme that only adds one sugar)?

The glycobiology GO term landscape

Key entry points into the ontology (GO IDs/labels verified against QuickGO,
GO release current as of 2026-06). Closure (descendant) sets under these terms
define the working scope of the audit.

Molecular function (the enzymes and binders)

| GO ID | Label | Notes |
| --- | --- | --- |
| GO:0016757 | glycosyltransferase activity | builds glycosidic bonds from activated sugar donors; ~the CAZy GT families |
| GO:0016798 | hydrolase activity, acting on glycosyl bonds | glycosidases; ~the CAZy GH families (parent of glycosidase children) |
| GO:0030246 | carbohydrate binding | parent of the lectin activities |
| GO:0120153 | calcium-dependent carbohydrate binding | C-type lectin domain signature |
| GO:0097367 | carbohydrate derivative binding | nucleotide-sugar / activated-donor binding |

Biological process (where the sugars go)

| GO ID | Label | Notes |
| --- | --- | --- |
| GO:0070085 | glycosylation | broad parent process |
| GO:0006486 | protein glycosylation | protein-acceptor branch |
| GO:0006487 | protein N-linked glycosylation | Asn-linked; dolichol/OST pathway |
| GO:0006493 | protein O-linked glycosylation | Ser/Thr-linked |
| GO:0036066 | protein O-linked glycosylation via fucose | e.g. POFUT1/2 on EGF/TSR repeats (Notch) |
| GO:0180059 | protein O-linked glycosylation via glucose | e.g. POGLUT on EGF repeats |
| GO:0006505 | GPI anchor metabolic process | PIG-/PGAP- gene family |
| GO:0006506 | GPI anchor biosynthetic process | dolichol-phosphate / ER-luminal assembly |
| GO:0120574 | GPI anchor remodelling | post-attachment editing (PGAP genes) |

These anchor terms give a reproducible way to pull the animal glycogene set from
GOA for the usage audit (filter by aspect + closure under these IDs, restricted
to the animal taxa we curate).
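The closure-and-filter step described above can be sketched as follows. This is a minimal illustration, not the project's actual query: the `TOY_PARENTS` edges and the `rows` records are made up for the example (real edges would come from the ontology file and real rows from a GOA annotation file), and the function names are hypothetical. The taxon IDs are the NCBI identifiers for human, mouse, rat, and C. elegans, written without the `taxon:` prefix that annotation files carry.

```python
from collections import defaultdict

# Top-level anchor terms taken from the two tables above.
ANCHORS = {
    "GO:0016757", "GO:0016798", "GO:0030246", "GO:0097367",  # molecular function
    "GO:0070085", "GO:0006505",                              # biological process
}

# NCBI taxon IDs: human, mouse, rat, C. elegans.
ANIMAL_TAXA = {"9606", "10090", "10116", "6239"}

# Toy child -> parents edges, for illustration only (assumed, not read from GO).
TOY_PARENTS = {
    "GO:0006486": {"GO:0070085"},
    "GO:0006487": {"GO:0006486"},
    "GO:0006493": {"GO:0006486"},
    "GO:0036066": {"GO:0006493"},
    "GO:0006506": {"GO:0006505"},
    "GO:0120153": {"GO:0030246"},
}


def descendant_closure(anchors, parents):
    """Return the anchors plus every term below them in the child -> parent graph."""
    children = defaultdict(set)
    for child, parent_ids in parents.items():
        for parent in parent_ids:
            children[parent].add(child)
    closure, stack = set(anchors), list(anchors)
    while stack:
        term = stack.pop()
        for child in children[term]:
            if child not in closure:
                closure.add(child)
                stack.append(child)
    return closure


def select_glyco_annotations(rows, closure, taxa=ANIMAL_TAXA, aspects=("F", "P")):
    """Keep rows whose GO term is in scope, in an audited taxon, in a wanted aspect."""
    return [
        r for r in rows
        if r["go_id"] in closure and r["taxon"] in taxa and r["aspect"] in aspects
    ]


# Toy annotation rows (hypothetical; simplified from the GAF column layout).
rows = [
    {"gene": "POFUT1", "go_id": "GO:0036066", "aspect": "P", "taxon": "9606"},
    {"gene": "MGAT1", "go_id": "GO:0006487", "aspect": "P", "taxon": "9606"},
    {"gene": "LGALS3", "go_id": "GO:0005515", "aspect": "F", "taxon": "9606"},   # protein binding: out of scope
    {"gene": "yeast_gene", "go_id": "GO:0006487", "aspect": "P", "taxon": "559292"},  # not an audited taxon
]

closure = descendant_closure(ANCHORS, TOY_PARENTS)
selected = select_glyco_annotations(rows, closure)
print([r["gene"] for r in selected])
```

Computing the closure once and filtering by set membership keeps the per-row check cheap; which relations beyond is_a belong in the closure is a scoping decision this sketch does not make.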

External glycoscience resources (landscape)

Full dossier with URLs, identifier schemes, licences, and programmatic-access
notes: GLYCOBIOLOGY-resources.md.
Headline resources:

| Resource | Type | Role for GO curation |
| --- | --- | --- |
| GlyGen | Integrating portal (glycoprotein- + glycan-centric); REST API + SPARQL | one-stop cross-reference hub; harmonises GlyConnect, UniCarbKB, GlyTouCan, CAZy, UniProt |
| GlyTouCan | International glycan-structure repository (accessions) | the canonical glycan-structure identifier space |
| GlyCosmos | Web portal integrating glyco- with omics (JSCR) | gene/disease/pathway links; RDF |
| GlyConnect / Glyco@Expasy | Glycan structures, sites, biosynthetic enzymes | enzyme↔glycan↔site evidence to cross-check MF annotations |
| UniCarbKB | Curated glycan structures + glycoprotein sites | site-level glycosylation evidence |
| CAZy | Carbohydrate-active enzyme families (GT/GH/PL/CE/CBM) | sequence-family ↔ activity mapping; sanity-checks MF over-/under-annotation |
| GlycoCoO | GlycoConjugate Ontology | semantic model for glycoconjugate annotation; alignment target for GO |

The GlySpace Alliance (GlyGen + Glyco@Expasy + GlyCosmos) is the coordinating
umbrella; GlyTouCan is the shared structure-ID backbone they all link to.

Reusing these resources for curation

How the resources feed GO curation (forward cazy2go propagation, GlycoCoO→GO
alignment, and confirmatory GlyGen cross-checks; GO-CAM/pathway work deferred)
is recorded in the decision record GLYCOBIOLOGY-resource-reuse.md, alongside a
seeded cazy2go.sssom.yaml (CAZy family → GO molecular function, the glyco
analogue of interpro2go) built from the exemplar GT families.
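A rough sketch of what forward propagation from CAZy families to GO molecular function could look like. It does not reproduce the contents of cazy2go.sssom.yaml: the class-level mapping below is a deliberately coarse assumption based only on the term table above (GT families ≈ GO:0016757, GH families ≈ GO:0016798), the gene-to-family pairs are the exemplar GT assignments, and the `existing` annotations and the function name `propose_mf` are hypothetical.

```python
import re

# Assumed class-level mapping (coarse; a real cazy2go file would map per family).
CAZY_CLASS_TO_GO = {
    "GT": ("GO:0016757", "glycosyltransferase activity"),
    "GH": ("GO:0016798", "hydrolase activity, acting on glycosyl bonds"),
}

# Exemplar genes and their CAZy families.
GENE_TO_CAZY = {
    "B3GALNT2": "GT31",
    "POFUT1": "GT65",
    "MGAT1": "GT13",
    "ST6GAL1": "GT29",
    "B4GALT1": "GT7",
}


def propose_mf(gene_to_family, existing):
    """Return (gene, family, go_id, label, status) for each mappable family.

    status is "confirmed" when the gene already carries the mapped term and
    "candidate" otherwise. Families whose class has no mapping are skipped.
    """
    proposals = []
    for gene, family in gene_to_family.items():
        match = re.match(r"([A-Z]+)\d+$", family)
        if match is None or match.group(1) not in CAZY_CLASS_TO_GO:
            continue
        go_id, label = CAZY_CLASS_TO_GO[match.group(1)]
        status = "confirmed" if go_id in existing.get(gene, set()) else "candidate"
        proposals.append((gene, family, go_id, label, status))
    return proposals


# Hypothetical existing annotations, for illustration only.
existing = {"MGAT1": {"GO:0016757"}}

for row in propose_mf(GENE_TO_CAZY, existing):
    print(row)
```

The exact-match check is a simplification: a gene annotated to a more specific child of the mapped term would show up here as "candidate", so a real cross-check would test membership in the term's descendant closure instead.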

Exemplar reviews (calibration set)

Seven human genes were reviewed to calibrate the over-/under-annotation
hypotheses across the main functional axes of animal glycobiology. Phase 1
covered one glycosyltransferase, one lectin, and one CDG gene;
Phase 2 broadened the GT axis across four more transferase sub-types
(O-fucosyl-, GlcNAc-branching-, sialyl-, galactosyl-). Each Phase-2 gene was
reviewed with a FutureHouse Falcon deep-research report integrated from the
start, and the three Phase-1 genes were retrofitted with their Falcon findings.

| Gene | UniProt | Axis (CAZy) | Why chosen |
| --- | --- | --- | --- |
| B3GALNT2 | Q8NCR0 | β-1,3-GalNAc-T (GT31) | α-dystroglycan O-mannosyl (matriglycan) elongation; dystroglycanopathy (secondary CDG); RHEA-flagged GO gap |
| LGALS3 | P17931 | lectin (galectin) | chimera-type β-galactoside-binding galectin; large pleiotropic set: protein-binding + pleiotropy stress-test |
| PMM2 | O15305 | CDG (precursor) | phosphomannomutase 2; PMM2-CDG is the commonest CDG; "guilt-by-association" on a precursor-supply enzyme |
| POFUT1 | Q9H488 | O-fucosyl-T (GT65) | ER O-fucosyltransferase on EGF repeats; Notch O-fucosylation; ER (not Golgi) localization test |
| MGAT1 | P26572 | GlcNAc-T I (GT13) | medial-Golgi gatekeeper committing N-glycans to hybrid/complex processing |
| ST6GAL1 | P15907 | sialyl-T (GT29) | trans-Golgi α-2,6-sialyltransferase (CD22 ligand / CD75 epitope) |
| B4GALT1 | P15291 | β-1,4-Gal-T (GT7) | LacNAc synthase; bifunctional lactose synthase with α-lactalbumin; CDG-IId |

All seven validate clean (status: DRAFT).

Verdict distributions (authoritative, from the YAMLs)

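| Gene | N | ACCEPT | NON_CORE | OVER | MODIFY | REMOVE | NEW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B3GALNT2 | 16 | 5 | 2 | 2 | 7 | 0 | 0 |
| LGALS3 | 106 | 21 | 64 | 20 | 0 | 0 | 1 |
| PMM2 | 23 | 16 | 2 | 1 | 1 | 3 | 0 |
| POFUT1 | 21 | 15 | 2 | 3 | 1 | 0 | 0 |
| MGAT1 | 26 | 11 | 6 | 5 | 4 | 0 | 0 |
| ST6GAL1 | 35 | 23 | 5 | 2 | 4 | 1 | 0 |
| B4GALT1 | 76 | 44 | 21 | 6 | 5 | 0 | 0 |
| Total | 303 | 135 | 102 | 39 | 22 | 4 | 1 |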

Only 4/303 annotations are REMOVE (all high-throughput-interactome protein
binding), against 102 NON_CORE + 39 OVER + 22 MODIFY; that is, ~54% of
annotations are demoted or refined, but ~99% are retained in some form. The
mis-annotation signal is overwhelmingly altitude/specificity and pleiotropy,
not wrong functions, exactly as the project predicted.
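The percentages quoted above follow directly from the verdict table. A short check, with the counts copied from the table and the variable names chosen here for illustration:

```python
VERDICTS = ("ACCEPT", "NON_CORE", "OVER", "MODIFY", "REMOVE", "NEW")

# Per-gene verdict counts, copied from the table above (same column order).
COUNTS = {
    "B3GALNT2": (5, 2, 2, 7, 0, 0),
    "LGALS3": (21, 64, 20, 0, 0, 1),
    "PMM2": (16, 2, 1, 1, 3, 0),
    "POFUT1": (15, 2, 3, 1, 0, 0),
    "MGAT1": (11, 6, 5, 4, 0, 0),
    "ST6GAL1": (23, 5, 2, 4, 1, 0),
    "B4GALT1": (44, 21, 6, 5, 0, 0),
}

# Column totals across the seven genes.
totals = {v: sum(row[i] for row in COUNTS.values()) for i, v in enumerate(VERDICTS)}
n_total = sum(totals.values())

# "Demoted or refined" = NON_CORE + OVER + MODIFY; "retained" = everything but REMOVE.
demoted_or_refined = totals["NON_CORE"] + totals["OVER"] + totals["MODIFY"]
retained = n_total - totals["REMOVE"]

print(n_total)                               # 303
print(f"{demoted_or_refined / n_total:.1%}")  # 53.8%
print(f"{retained / n_total:.1%}")            # 98.7%
```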

What the exemplars confirmed (audit hypotheses → evidence)

Net: across 303 annotations the verdict skew (only 4 REMOVE, all
high-throughput-interactome protein binding; heavy NON_CORE + MODIFY) matches
the project's prediction: glycogene mis-annotation is dominated by altitude /
specificity and pleiotropy, not outright wrong functions.

Candidate animal genes already in the repo

The corpus already contains glyco-relevant gene reviews that can seed the audit
without new fetches — useful for calibrating annotation patterns before scaling:

A reproducible GOA closure query (per the term table above) will produce the
fuller candidate list; these are the already-curated anchors.

Curation considerations (hypotheses to test in the audit)

Open questions

Status