NCBIFAM / CDD → GO Contribution & Gap Project

SCOPING PIPELINE

Overview

This project audits the NCBI protein-family resources, NCBIFAM (the
PGAP/TIGRFAM HMM collection) and CDD (Conserved Domain Database), as sources of
GO annotations, in the same spirit as the RHEA, SPKW, and UniPathway
source-audit projects. It also develops the missing ncbifam2go / cdd2go
mappings the way RHEA develops rhea2go gap-fills.

The structural fact that organises everything: NCBIFAM and CDD are InterPro
member databases, so they reach GOA through exactly one pipeline: InterPro
integration → interpro2go → GO_REF:0000002 (evidence IEA,
assigned_by=InterPro). There is no public ncbifam2go or cdd2go
external2go file (verified: both return 403 on current.geneontology.org; only
interpro2go is served). So, like RHEA, the analysis runs in two directions:

  1. Contribution (forward). Where is a GO_REF:0000002 annotation actually
    carried by an NCBIFAM/CDD signature (versus Pfam, PROSITE, etc.), and is that
    contribution useful or over-/mis-propagated? — the SPKW-style audit, here
    complicated by InterPro masking the member DB.
  2. Gaps (reverse). Where does NCBIFAM/CDD assert a function — an NCBI-curated
    GO/EC term, or a precise equivalog family name — that never reaches GO,
    either because the signature is unintegrated into InterPro or because the
    integrated InterPro entry has no interpro2go row? These are gap-filling
    opportunities, not over-annotations.
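The two directions above can be sketched as a small classifier. This is a minimal illustration, not the project's probe: the function name, the return labels, and the placeholder entry ids are all assumptions made for the example.

```python
from typing import Optional


def classify_signature(
    has_curated_function: bool,
    interpro_entry: Optional[str],
    mapped_entries: set[str],
) -> str:
    """Classify one NCBIFAM/CDD signature by how (or whether) it reaches GO.

    has_curated_function: the model asserts a function (NCBI GO/EC or a
        precise equivalog name).
    interpro_entry: the InterPro entry it is integrated into, or None.
    mapped_entries: InterPro entries that have at least one interpro2go row.
    """
    if interpro_entry is None:
        # Never integrated: no route to GO_REF:0000002 at all.
        return "gap:unintegrated" if has_curated_function else "no-claim"
    if interpro_entry not in mapped_entries:
        # Integrated, but the InterPro entry has no interpro2go row.
        return "gap:integrated-unmapped" if has_curated_function else "no-claim"
    # Integrated and mapped: a forward contribution to audit.
    return "contribution"


# Placeholder entry ids, used only to exercise the three branches.
mapped = {"IPR_DEMO_MAPPED"}
print(classify_signature(True, None, mapped))
print(classify_signature(True, "IPR_DEMO_UNMAPPED", mapped))
print(classify_signature(True, "IPR_DEMO_MAPPED", mapped))
```

Only the last case feeds the forward (contribution) audit; the first two are the reverse-direction gap classes.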

See NCBIFAM-METHODOLOGY.md for the queries, the
reproducible probe (ncbifam_cdd_probe.py), and
the masking/closure caveats.

Key Findings (scoping pass)

NCBIFAM/CDD is mostly masked by InterPro

UniProt enzymes and families almost always carry multiple member signatures
(an NCBIFAM equivalog and a Pfam domain and a CDD model …), all collapsed into
one or a few InterPro entries. So an NCBIFAM/CDD signature's "real"
GO_REF:0000002 contribution is only the slice where it is the integrating /
distinguishing signature for the InterPro entry, and where no other member DB
already supplies the same term. Two consequences:

Un-masking: member-DB attribution on this repo's annotations

The masking claim is now measured, not just asserted. Every GO_REF:0000002
annotation in this repo's genes/**/*-goa.tsv was re-joined to its InterPro entry's
member_databases via the InterPro API
(interpro_member_attribution.py,
resumable). Across 5,549 resolved InterPro-citation rows (1,827 distinct entries):

| Member DB | Distinct entries | Annotation rows | Sole-signature rows |
|---|---|---|---|
| pfam | 708 (39%) | 2,374 (43%) | |
| panther | 459 (25%) | 1,082 (19%) | |
| ncbifam | 272 (15%) | 705 (13%) | 250 |
| prints / profile / smart | ~190 each | ~820 each | |
| cdd | 183 (10%) | 469 (8%) | 116 |
| hamap / pirsf / ssf / cathgene3d | 117–159 | 403–627 | |

NCBIFAM contributes a signature to 705 (13%) and CDD to 469 (8%) of the repo's
InterPro2GO annotations. That contribution is entirely invisible in GOA, which
shows only the InterPro:IPR… id. Stronger still, NCBIFAM is the sole integrated
signature for 250 rows and CDD for 116: annotations that exist purely because of
an NCBIFAM/CDD model, with no other member DB in the entry. And these are not exotic:
NCBIFAM-sole entries back annotations on GAPDH (IPR006424 → GO:0006006), RPS3
(IPR005703 → GO:0006412), ATP6V1A (IPR005725 → GO:0016887), and HMGCS2
(IPR010122 → GO:0008299), mainstream genes whose InterPro2GO term traces solely
to a TIGRFAM/NCBIFAM signature. This is the quantitative form of the masking
finding, and the join the sibling InterPro Mapping Review project needs to
attribute each interpro2go term to the member DB that earned it.
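The tally behind these numbers can be sketched as follows. This is an illustration, not the code of interpro_member_attribution.py: it assumes the member_databases value for each entry has already been fetched and has the shape {db_name: {signature_accession: name}}, and the signature accessions and the two IPR_DEMO entries are placeholders.

```python
from collections import Counter


def attribute_rows(rows, member_databases):
    """Tally annotation rows per member DB and rows where one DB is the sole source.

    rows: iterable of (gene, interpro_id, go_id) from GO_REF:0000002 annotations.
    member_databases: dict interpro_id -> {db_name: {signature_acc: name}}.
    """
    per_db = Counter()
    sole = Counter()
    unresolved = 0
    for _gene, ipr, _go in rows:
        dbs = member_databases.get(ipr)
        if not dbs:
            unresolved += 1  # entry not fetched yet, or no member signatures
            continue
        for db in dbs:
            per_db[db] += 1  # this DB contributes a signature to the row
        if len(dbs) == 1:
            sole[next(iter(dbs))] += 1  # the row exists purely because of this DB
    return per_db, sole, unresolved


rows = [
    ("GAPDH", "IPR006424", "GO:0006006"),
    ("RPS3", "IPR005703", "GO:0006412"),
    ("DEMO1", "IPR_DEMO_MULTI", "GO:0000000"),    # placeholder row
    ("DEMO2", "IPR_DEMO_MISSING", "GO:0000000"),  # placeholder, unresolved
]
member_databases = {
    "IPR006424": {"ncbifam": {"SIG_1": "placeholder"}},
    "IPR005703": {"ncbifam": {"SIG_2": "placeholder"}},
    "IPR_DEMO_MULTI": {"pfam": {"SIG_3": "placeholder"}, "cdd": {"SIG_4": "placeholder"}},
}
per_db, sole, unresolved = attribute_rows(rows, member_databases)
print(dict(per_db), dict(sole), unresolved)
```

A row counts toward every member DB in its entry but toward the sole-signature column only when the entry has exactly one member DB, which is why that column is much smaller than the annotation-row column.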

Specificity and quality cut by family_type

NCBIFAM's family_type is a built-in altitude/quality signal absent from CDD:

| family_type | N | Function-transfer safety |
|---|---|---|
| equivalog | 13,253 | High: all members one function; GO/EC transfer safe |
| domain | 10,974 | Low: domain, not whole-protein function |
| subfamily | 4,564 | Medium: clade-specific; check altitude |
| PfamEq / PfamAutoEq | 1,807 / 1,204 | Equivalent to a Pfam entry → likely already InterPro-covered |
| exception / hypoth_equivalog | 1,341 / 434 | Curated caveat / hypothetical; review individually |

CDD models, by contrast, are domain/architecture-oriented and lack this typing,
so CDD's forward contribution skews toward broad domain terms (the
protein binding / generic-domain altitude problem) and its reverse gap is
harder to curate than NCBIFAM's.
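The family_type cut can be applied mechanically to a PGAP-style table. The sketch below is a minimal illustration under stated assumptions: the column names accession, family_type, ec_numbers, and go_terms are assumed for this example (only the last three names appear in this document), the tier labels are invented shorthand for the table above, and the DEMO rows are not real models.

```python
import csv
import io
from collections import Counter

# Shorthand tiers for the function-transfer safety column above.
SAFETY = {
    "equivalog": "high",
    "subfamily": "medium",
    "domain": "low",
    "PfamEq": "likely-covered",
    "PfamAutoEq": "likely-covered",
    "exception": "review",
    "hypoth_equivalog": "review",
}


def triage(tsv_text):
    """Count models per safety tier and list equivalogs that carry GO or EC."""
    tiers = Counter()
    ready = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        tier = SAFETY.get(row["family_type"], "unknown")
        tiers[tier] += 1
        if tier == "high" and (row["go_terms"] or row["ec_numbers"]):
            ready.append(row["accession"])
    return tiers, ready


SAMPLE = "\n".join([
    "accession\tfamily_type\tec_numbers\tgo_terms",
    "DEMO_1\tequivalog\t2.5.1.16\tGO:0004766",
    "DEMO_2\tequivalog\t\t",
    "DEMO_3\tdomain\t\tGO:0005515",
    "DEMO_4\tsubfamily\t\t",
])
tiers, ready = triage(SAMPLE)
print(dict(tiers), ready)
```

CDD has no equivalent column, so no comparable first-pass filter is available for CDD models.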

Does CDD have its own GO mappings? No — for CDD-proper

It is worth pinning down, because the CDD search database superficially looks
like it carries GO. The answer separates two things CDD conflates:

Conclusion: the GO-bearing NCBI family resource is NCBIFAM, not CDD. CDD
"has GO" only as a search aggregator surfacing NCBIFAM/Pfam GO under cd/PRK/
TIGR accessions. This is why the curated mapping deliverable targets
ncbifam2go, and a cdd2go would be largely redundant with it (and with
Pfam→InterPro). Reproduce with ncbifam_cdd_probe.py and the FTP/Entrez checks in
NCBIFAM-METHODOLOGY.md.

Gaps Found (scoping)

| # | Gap | Size | What it is |
|---|---|---|---|
| G1 | InterPro masking | all GO_REF:0000002 rows | GOA hides which member DB fired → member contribution unattributable without re-join |
| G2 | NCBIFAM unintegrated | 11,064 / 18,511 (60%) | NCBIFAM signatures not in InterPro → contribute no GO |
| G3 | CDD unintegrated | 14,843 / 19,902 (75%) | CDD signatures not in InterPro → contribute no GO |
| G4 | NCBI-curated GO not ingested | 11,228 models w/ GO | NCBIFAM models carry NCBI GO that GO has no ncbifam2go to ingest |
| G5 | NCBI-curated EC not ingested | 6,417 models w/ EC | Could seed EC2GO-bridged GO mappings (the RHEA EC-bridge pattern) |
| G6 | Integrated-but-unmapped InterPro entries | staged | NCBIFAM/CDD integrated into an IPR entry that has no interpro2go row |

G4/G5 are the high-value half: an equivalog with a clean NCBI go_terms /
ec_numbers value is a ready-to-curate mapping; an ec_numbers-only equivalog
can be bridged through ec2go exactly as RHEA bridges reactions through
rhea2ec/ec2go.
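The EC bridge for ec_numbers-only equivalogs can be sketched as below. This is an illustration rather than the project's generator: it assumes ec2go follows the usual external2go line layout ("EC:… > GO:label ; GO:id", with "!" comment lines), the two sample lines are written in that layout for the example, and the DEMO models and function names are hypothetical.

```python
def parse_ec2go(lines):
    """Parse external2go-style lines into {ec_number: {go_id, ...}}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and header comments
        left, sep, right = line.partition(" > ")
        go_id = right.rpartition(" ; ")[2].strip()
        if not sep or not go_id.startswith("GO:"):
            continue  # not a mapping line
        ec = left.strip().removeprefix("EC:")
        mapping.setdefault(ec, set()).add(go_id)
    return mapping


def bridge_ec_only(models, ec2go):
    """Propose (accession, GO id) for models that have EC numbers but no GO terms."""
    proposals = []
    for model in models:
        if model["go_terms"]:
            continue  # already carries NCBI GO; not an EC-only case
        for ec in model["ec_numbers"]:
            for go_id in sorted(ec2go.get(ec, ())):
                proposals.append((model["accession"], go_id))
    return proposals


EC2GO_SAMPLE = [
    "! sample lines in external2go layout",
    "EC:2.5.1.16 > GO:spermidine synthase activity ; GO:0004766",
    "EC:3.5.2.3 > GO:dihydroorotase activity ; GO:0004151",
]
models = [
    {"accession": "DEMO_A", "ec_numbers": ["3.5.2.3"], "go_terms": []},
    {"accession": "DEMO_B", "ec_numbers": ["2.5.1.16"], "go_terms": ["GO:0004766"]},
]
ec2go = parse_ec2go(EC2GO_SAMPLE)
proposals = bridge_ec_only(models, ec2go)
print(proposals)
```

Proposals produced this way rest on a single resource (ec2go), so they would still need the altitude and obsolescence checks applied to the hand-reviewed rows.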

Curated new mappings (SSSOM)

The curation deliverable mirrors RHEA's rhea2go.sssom.yaml:
ncbifam2go.sssom.yaml records the NCBIFAM-family →
GO mapping we propose for ingestion, not a transcription of NCBI's
hmm_PGAP.tsv go_terms. Each is backed by the model's family_type,
product_name, EC, and PMIDs, plus the live UniProtKB propagation gain. A
28-mapping seed spanning all three GO aspects (MF, BP, CC; NCBIFAM is a
whole-protein family resource, not enzyme-only) is in place, with predicate classes
parallel to RHEA:

We suggest our own term where NCBI's was too broad, and that unmasks the real
gain. For five families NCBI's go_terms gave only a broad parent (in two cases
the near-root GO:0003824 catalytic activity: enoyl-CoA hydratase NF005804 and
spermidine synthase TIGR00417) even though a precise, EC-bridged child already
exists. Rather than record the useless broad term, the seed proposes the specific
child as an exactMatch (dGTPase→GO:0008832, enoyl-CoA hydratase→GO:0004300,
dihydroorotase→GO:0004151, spermidine synthase→GO:0004766, LL-DAP
aminotransferase→GO:0010285). This is not cosmetic: the broad parent is
near-universal, so its propagation gain looks ~0, but the specific term reveals
large gaps the parent masked: spermidine synthase 575, LL-DAP aminotransferase
1,185, dihydroorotase 491, and dGTPase 456 (incl. 13 reviewed/Swiss-Prot)
entries missing the precise activity. Proposing our own term is what turns these from
invisible into actionable gap-fills.
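Why the broad parent hides the gap can be shown with a toy closure-aware gain count. This is a sketch under stated assumptions, not ncbifam_go_gain.py: the three proteins are hypothetical, and the ontology path from GO:0004766 up to GO:0003824 is collapsed into a single parent edge for brevity.

```python
def ancestors(term, parents):
    """Return the term plus everything reachable by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def gain(term, members, annotations, parents):
    """Count family members whose annotation closure does not include the term."""
    missing = 0
    for protein in members:
        closure = set()
        for go_id in annotations.get(protein, ()):
            closure |= ancestors(go_id, parents)
        if term not in closure:
            missing += 1
    return missing


# Simplified: spermidine synthase activity -> catalytic activity.
parents = {"GO:0004766": ["GO:0003824"]}
members = ["P_DEMO_1", "P_DEMO_2", "P_DEMO_3"]
annotations = {
    "P_DEMO_1": ["GO:0004766"],  # already has the precise activity
    "P_DEMO_2": ["GO:0003824"],  # only the broad parent
    "P_DEMO_3": ["GO:0003824"],
}
broad_gain = gain("GO:0003824", members, annotations, parents)
specific_gain = gain("GO:0004766", members, annotations, parents)
print(broad_gain, specific_gain)
```

Every member satisfies the broad parent, either directly or through a descendant, so its gain is zero; only the specific term exposes the two proteins that lack the precise activity.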

…but more specific is not always right — the FtsX cell-division case. The
mirror-image judgement is TIGR00439 (permease-like cell division protein FtsX),
where chasing specificity would be over-annotation. NCBI assigned GO:0000910 cytokinesis; the ontology shows cytokinesis is part_of GO:0051301 cell division
(so NCBI's term is actually the narrower one — an earlier draft of this seed had
that backwards). Empirically, all 7 reviewed FtsX proteins carry GO:0051301 cell division but only 2/7 carry cytokinesis or the most specific GO:0043093 FtsZ-dependent cytokinesis. FtsX/FtsEX regulates septal peptidoglycan hydrolysis
and divisome assembly, so curators annotate the safe participation term, not the
constriction act. The gain numbers make the trap explicit: mapping to GO:0051301
gains only 22 (it is already near-universal — a confirmatory mapping), whereas
mapping to GO:0043093 would show a 3,304 apparent gap — but propagating
FtsZ-dependent cytokinesis to every FtsX would assert more than the family supports.
We therefore propose the curator-consensus GO:0051301 cell division as the
exactMatch and decline the higher-gain specific term. Specificity is the goal
only up to the altitude the evidence supports.
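The altitude judgement in the FtsX case can be expressed as a simple support rule. This is only a sketch: the 0.8 threshold and the function name are assumptions, and the support counts for the two narrower terms are each set to 2 as an upper bound taken from the 2/7 figure above, since per-term counts are not given here.

```python
def choose_altitude(chain, support, n_reviewed, min_fraction=0.8):
    """Pick the most specific term carried by enough reviewed proteins.

    chain: GO ids ordered from most specific to most general.
    support: GO id -> number of reviewed proteins carrying it.
    Returns None when no term in the chain reaches min_fraction.
    """
    for term in chain:
        if support.get(term, 0) / n_reviewed >= min_fraction:
            return term
    return None


# FtsZ-dependent cytokinesis -> cytokinesis -> cell division
chain = ["GO:0043093", "GO:0000910", "GO:0051301"]
support = {"GO:0043093": 2, "GO:0000910": 2, "GO:0051301": 7}
chosen = choose_altitude(chain, support, n_reviewed=7)
print(chosen)
```

Under these inputs the rule lands on GO:0051301; a rule that ranked terms by propagation gain instead would pick GO:0043093, which is the over-annotation the paragraph above warns against.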

Verification matters and catches real errors. One scoping-sample NCBIFAM
go_terms value (GO:0009448 on a GABA transaminase) is obsolete; a
diacylglycerol-kinase model (NF009874, EC 2.7.1.107) is tagged with the wrong
activity GO:0003951 NAD+ kinase activity (the correct GO:0004143 exists); and
several assignments sit at near-root altitude. The obsolete and wrong ids were
excluded; the broad ones were replaced by our own specific term (above) — so
the seed drops obsolete/incorrect ids, prefers specific children, and
EC-bridge-confirms enzymes. Every GO id/label was checked non-obsolete
against QuickGO (2026-06-20); every family id/name/type/EC is from hmm_PGAP.tsv;
and every EC→GO bridge was checked against the live ec2go. Validate by running
just validate-ncbifam-mappings, which performs SSSOM structural validation plus
GO term/label validation (object bound to the full GO graph, MF+BP+CC; generated
nested view ncbifam2go.terms.yaml). The seed passes validation.
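The kind of per-record check described above can be sketched as follows. This is not the validate-ncbifam-mappings recipe: it assumes records use the standard SSSOM slot names, the NCBIFAM: subject prefix and the justification values in the demo records are assumptions, and the obsolete set holds only the one id named above.

```python
# Slots that an SSSOM mapping record is expected to carry.
REQUIRED = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def validate(record, obsolete_ids):
    """Return a list of problems for one mapping record (empty list = passes)."""
    problems = [f"missing {slot}" for slot in REQUIRED if not record.get(slot)]
    obj = record.get("object_id", "")
    if obj and not obj.startswith("GO:"):
        problems.append("object_id is not a GO id")
    if obj in obsolete_ids:
        problems.append(f"{obj} is obsolete")
    return problems


obsolete_ids = {"GO:0009448"}
good = {
    "subject_id": "NCBIFAM:TIGR00417",  # prefix is an assumption
    "predicate_id": "skos:exactMatch",
    "object_id": "GO:0004766",
    "mapping_justification": "semapv:ManualMappingCuration",
}
bad = {
    "subject_id": "NCBIFAM:DEMO",  # hypothetical model
    "predicate_id": "skos:exactMatch",
    "object_id": "GO:0009448",
}
print(validate(good, obsolete_ids))
print(validate(bad, obsolete_ids))
```

A real run would draw the obsolete set and labels from the GO graph rather than a hand-written set.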

A cdd2go set is not planned: per the CDD section,
CDD-proper has no native GO, and the GO surfaced through CDD belongs to NCBIFAM
(captured here) or Pfam (InterPro-routed) — a cdd2go would be largely redundant.

Scaling the seed to the whole collection (EC-bridge candidates)

The 28-row seed is hand-reviewed; the EC bridge lets us scale the same evidence
standard to the whole collection with no per-row human judgement, because the
agreement of two independent curated resources (NCBI's go_terms and GO's ec2go)
is the verification. ncbifam2go_candidates.py
walks every NCBIFAM model and emits each (model, GO) where ec2go(model's EC)
confirms one of the model's own NCBI go_terms. The live funnel:

| Stage | Count |
|---|---|
| GO-bearing NCBIFAM models | 11,228 |
| …with both an EC and a GO term | 3,782 |
| …where ec2go(EC) confirms a model GO → exactMatch candidates | 2,455 (2,503 rows) |
| …where ec2go(EC) would refine NCBI's broader/absent GO (the spermidine-synthase pattern, at scale) | 843 |
| …candidates already in the reviewed seed (cross-check) | 17 |

The generated set is ncbifam2go.candidates.tsv
(2,503 rows, clearly marked generated; mapping_justification would be
semapv:CompositeMatching). The 17 rows that coincide with the reviewed seed are
exactly the seed's EC-bridge enzyme rows — an automatic confirmation that the
generator agrees with manual curation where they overlap. These 2,455 are
AMR-rich (trimethoprim-resistant dihydrofolate reductases → GO:0004146,
β-lactamases → GO:0008800, aminoglycoside 6′-N-acetyltransferases → GO:0047663,
…) and are the natural ready-to-add core of a real ncbifam2go. The 843
"refine" models are the scaled version of the five hand-fixed altitude rows:
NCBIFAM gave a broad/near-root term but ec2go supplies the specific child. This
is a second, also-automatable candidate class (propose ec2go's term), pending
the same altitude/over-annotation check the FtsX case shows is still needed.
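The split between the "confirm" and "refine" classes can be sketched as below. This is a simplified illustration, not ncbifam2go_candidates.py: it treats any ec2go hit that matches none of the model's go_terms as a refine candidate without checking the ontology relationship, the EC number attached to TIGR00417 is the standard spermidine synthase number and is assumed for the example, and DEMO_DHFR is a hypothetical model.

```python
def classify_models(models, ec2go):
    """Split (accession, GO id) pairs into confirm and refine candidates.

    confirm: ec2go(EC) matches one of the model's own go_terms.
    refine:  ec2go(EC) yields a term the model does not carry.
    Models whose EC numbers have no ec2go entry are skipped.
    """
    confirm, refine = [], []
    for model in models:
        bridged = set()
        for ec in model["ec_numbers"]:
            bridged |= ec2go.get(ec, set())
        if not bridged:
            continue
        agreed = bridged & set(model["go_terms"])
        if agreed:
            confirm.extend((model["accession"], go_id) for go_id in sorted(agreed))
        else:
            refine.extend((model["accession"], go_id) for go_id in sorted(bridged))
    return confirm, refine


ec2go = {"1.5.1.3": {"GO:0004146"}, "2.5.1.16": {"GO:0004766"}}
models = [
    # Hypothetical model where NCBI's GO and ec2go agree.
    {"accession": "DEMO_DHFR", "ec_numbers": ["1.5.1.3"], "go_terms": ["GO:0004146"]},
    # NCBI gave only the near-root term; EC number assumed for the example.
    {"accession": "TIGR00417", "ec_numbers": ["2.5.1.16"], "go_terms": ["GO:0003824"]},
    # Hypothetical model whose EC has no ec2go entry.
    {"accession": "DEMO_NOHIT", "ec_numbers": ["9.9.9.9"], "go_terms": ["GO:0003824"]},
]
confirm, refine = classify_models(models, ec2go)
print(confirm, refine)
```

The confirm class is self-verifying because two resources agree; the refine class rests on ec2go alone, which is why it stays pending altitude review.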

Methods

The interpro2go characterisation, InterPro member-integration counts, and the
NCBIFAM PGAP GO/EC coverage are computed live by
ncbifam_cdd_probe.py (stdlib only; no go-db, no
auth). The masking evidence (0 member signatures in WITH/FROM; 1,160 NCBIfam +
2,174 CDD DR lines) is computed from this repo's genes/**/*-goa.tsv and
*-uniprot.txt. The annotation-gain numbers are computed live against the
UniProtKB REST API by ncbifam_go_gain.py
(closure-aware go: query; see
NCBIFAM-ANNOTATION-GAIN.md). The CDD-own-GO
check uses the NCBI CDD FTP (cddannot*.dat, cddid_all.tbl) and Entrez cdd
records. The member-DB attribution (which member DB backs each GO_REF:0000002
row in the repo) is computed live by
interpro_member_attribution.py against the
InterPro API (resumable, cached). The forward closure-filtered cross-organism
contribution table reuses the UniPathway/RHEA uniqueness
query and is staged pending the go-db DuckDBs (absent in the web container). Full
queries and caveats: NCBIFAM-METHODOLOGY.md.

How this differs from RHEA, SPKW, and UniPathway

| | SPKW (GO_REF:0000043) | RHEA (GO_REF:0000116) | NCBIFAM/CDD (GO_REF:0000002) |
|---|---|---|---|
| GO aspect | mostly BP | MF (enzyme activity) | MF + BP + CC (family/domain models) |
| Provenance in GOA | direct (keyword visible) | direct (assigned_by=RHEA) | masked: only the integrated InterPro:IPR… shows |
| Dedicated *2go file | yes | yes | no (ncbifam2go/cdd2go do not exist) |
| Dominant failure mode | process conflation | parent/child altitude; wrong substrate | domain-altitude (CDD); unintegrated coverage gap |
| Built-in quality signal | none | reaction precision | NCBIFAM family_type (equivalog) |
| Curation emphasis | over-annotation removal | gap-filling | both, but attribution-first then equivalog gap-fill |

NCBIFAM is unusual among these sources in carrying its own curated GO/EC that
GO does not ingest, so — like RHEA — the expected verdict skew is toward
NEW / gap-filling (for equivalog models) on the reverse side, with the
forward side dominated by the attribution/masking problem rather than outright
over-annotation.

Curation Recommendations (preliminary)

  1. Attribute before auditing. A GO_REF:0000002 annotation cannot be praised
    or blamed on NCBIFAM/CDD until GOA is re-joined to InterPro member integration;
    build that join first.
  2. Mine NCBIFAM equivalog GO/EC as a mapping source. Among the 13,253
    equivalogs, those with NCBI go_terms/ec_numbers are the cleanest gap-fill
    substrate; start the ncbifam2go.sssom.yaml here.
  3. EC-bridge where only EC is given. An equivalog with ec_numbers but no
    go_terms can be mapped through ec2go, the RHEA EC-bridge pattern.
  4. Treat CDD as domain-altitude-risky. Without family_type, CDD's forward
    contribution skews broad; prefer the specific child the curated entry supports.
  5. Unintegrated signatures with a real function are proposed_new_terms /
    InterPro-integration requests, not silent gaps.

Follow-Up Targets

| Target | Rationale |
|---|---|
| ✅ GOA × InterPro member-integration re-join | Done (attribution section): NCBIFAM backs 13% / CDD 8% of the repo's InterPro2GO rows (sole signature for 250 / 116). |
| Forward closure-filtered cross-organism scan | UniPathway-style uniqueness for member-attributed rows; needs go-db DuckDBs. Now seeded by the member-attribution join above. |
| Promote the 2,455 EC-bridge candidates | Altitude/obsolete-check ncbifam2go.candidates.tsv and fold the clean rows into the reviewed SSSOM → a near-complete ingestible ncbifam2go. |
| Build the 843 "refine" class | Auto-propose ec2go's specific term where NCBI's go_terms is broad/absent (the spermidine-synthase pattern), then altitude-review as FtsX shows is needed. |
| Non-EC families (defense/secretion/transport) | The high-gain non-enzyme equivalogs (transposases, anti-phage, T4SS, encapsulins) have no EC bridge → need a different verification (literature/SPARCLE), curated like the seed's CC/BP rows. |
| Full-collection gain run | Replace the 60-model gain sample with the complete equivalog set for a definitive reviewed-vs-TrEMBL gain figure. |
| Exemplar gene reviews | Pick 2–3 genes whose only MF/BP support is an NCBIFAM equivalog (e.g. an anti-phage or secretion family) and run the full review workflow. |

Project Status