NCBIFAM / CDD → GO Contribution & Gap Project

Overview

This project audits the NCBI protein-family resources, NCBIFAM (the
PGAP/TIGRFAM HMM collection) and CDD (Conserved Domain Database), as sources of
GO annotations, in the same spirit as the RHEA, SPKW, and
UniPathway source-audit projects. It also develops the missing
ncbifam2go mapping the way RHEA develops rhea2go gap-fills, and assesses
whether a separate cdd2go is warranted.

The structural fact that organises everything is that NCBIFAM and CDD are InterPro
member databases, so they reach GOA through exactly one pipeline: InterPro
integration → interpro2go → GO_REF:0000002 (evidence IEA,
assigned_by=InterPro). There is no public ncbifam2go or cdd2go
external2go file (verified: both return HTTP 403 on current.geneontology.org; only
interpro2go is served). So, like RHEA, the analysis runs in two directions:

Contribution (forward). Where is a GO_REF:0000002 annotation actually
carried by an NCBIFAM/CDD signature (versus Pfam, PROSITE, etc.), and is that
contribution useful or over-/mis-propagated? — the SPKW-style audit, here
complicated by InterPro masking the member DB.
Gaps (reverse). Where does NCBIFAM/CDD assert a function — an NCBI-curated
GO/EC term, or a precise equivalog family name — that never reaches GO,
either because the signature is unintegrated into InterPro or because the
integrated InterPro entry has no interpro2go row? These are gap-filling
opportunities, not over-annotations.

See NCBIFAM-METHODOLOGY.md for the queries, the
reproducible probe (ncbifam_cdd_probe.py), and
the masking/closure caveats.

Key Findings (scoping pass)

One route in, and a dedicated mapping file is missing. NCBIFAM/CDD GO
reaches GOA only via interpro2go / GO_REF:0000002; there is no
ncbifam2go or cdd2go (403). interpro2go is sizeable — 30,200 mapping
rows over 14,799 distinct InterPro entries (GO release 2026-04-28) — but a
member signature contributes GO only if it is integrated into one of those
entries.
NCBIFAM/CDD are masked by InterPro in GOA, and the contribution is now
measured. A GO_REF:0000002 row's WITH/FROM names the integrated
InterPro:IPRnnnnnn entry, never the member signature (0 TIGR…/NF…/cd… ids
in any *-goa.tsv WITH/FROM, vs 1,160 DR NCBIfam + 2,174 DR CDD lines in the
proteins' UniProt records). Re-joining each annotation to its InterPro entry's
member databases shows NCBIFAM backs 705 (13%) and CDD 469 (8%) of the repo's
5,549 InterPro2GO rows, and that they are the sole integrated signature for 250
and 116 rows respectively. None of this is visible in GOA (see the attribution
section). This is the NCBIFAM/CDD analog of RHEA being "masked by EC".
NCBIFAM carries its own NCBI-curated GO/EC that GO does not ingest. The PGAP
HMM metadata (hmm_PGAP.tsv, 34,351 models) assigns function directly:
11,228 models (33%) carry GO terms (3,622 distinct GO ids) and 6,417
(19%) carry EC numbers. None of this propagates except where the model is
integrated into InterPro — a large, curated, unused mapping source.
equivalog makes the gap-fill tractable. 13,253 NCBIFAM models are typed
equivalog (all members share one function → safe GO/EC transfer), the direct
analog of RHEA's reviewed enzymes backing each curated mapping. domain /
subfamily models (10,974 / 4,564) need altitude care.
The coverage gap is large — most signatures are unintegrated. Computed live
from the InterPro API: NCBIFAM 7,447 / 18,511 integrated (40%); CDD 5,059 /
19,902 integrated (25%). The 11,064 unintegrated NCBIFAM and 14,843
unintegrated CDD signatures contribute zero GO by construction — the
NCBIFAM/CDD analog of RHEA's "no rhea2go line" reactions.
CDD-proper has no native GO of its own; NCBIFAM does. Checked directly
(see CDD section): the
NCBI-curated cd… models — what InterPro calls the "cdd" member DB — carry no
GO in any bulk file (cddannot*.dat, cddid_all.tbl → 0 GO) or in the NCBI
Entrez cdd record (no GO field). Where the broad CDD search database appears
to "have GO", that GO is borrowed from the non-cd models CDD bundles —
chiefly the 13,669 NCBIFAM models (PRK 7,039 + TIGR 4,488 + NF 2,142) inside
CDD, plus Pfam/COG — not CDD-curated. So the GO-bearing source is NCBIFAM;
CDD-proper reaches GO only via InterPro.
Ingesting NCBIFAM GO would gain mostly TrEMBL, modestly Swiss-Prot. A live
UniProtKB propagation probe (closure-aware; see
NCBIFAM-ANNOTATION-GAIN.md) over a 60-model
equivalog sample finds Σ gain = 19 reviewed (Swiss-Prot) vs 26,578
all-UniProtKB — the opposite emphasis from RHEA (whose gain was reviewed-heavy),
because NCBIFAM is prokaryote-heavy and curated entries already carry the term.
The gap concentrates in mobile elements (IS630 transposase: 18,874 entries
missing the term), conjugation/secretion (VirB5 T4SS, conjugative ATPases),
CRISPR/anti-phage defense, sporulation, encapsulins, and cell division (FtsX) —
the newer microbial biology where InterPro integration lags.

NCBIFAM/CDD is mostly masked by InterPro

UniProt enzymes and families almost always carry multiple member signatures
(an NCBIFAM equivalog and a Pfam domain and a CDD model …), all collapsed into
one or a few InterPro entries. So an NCBIFAM/CDD signature's "real"
GO_REF:0000002 contribution is only the slice where it is the integrating /
distinguishing signature for the InterPro entry, and where no other member DB
already supplies the same term. Two consequences:

Attribution requires re-joining GOA to InterPro member integration (API
entry/integrated/{db}/); GOA alone cannot tell you which member DB fired.
The interesting NCBIFAM cases are the equivalog-grounded ones, where the
member model is more specific/curated than the Pfam-style domain it shares an
InterPro entry with. There NCBIFAM resolves a function that the broader members
lump together (mirroring RHEA resolving a cofactor split that EC hides).
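The masking check behind these two consequences can be sketched as follows. This is a minimal illustration, not the repo's probe: it assumes WITH/FROM cells are available as pipe-separated strings, and the accession patterns (`TIGR` + 5 digits, `NF` + 6 digits, `cd` + 5 digits, with an optional database prefix) are assumptions about identifier shape.

```python
import re
from collections import Counter

# Assumed identifier shapes; adjust to whatever the real files contain.
INTERPRO_RE = re.compile(r"^InterPro:IPR\d{6}$")
MEMBER_RE = re.compile(r"^(?:\w+:)?(?:TIGR\d{5}|NF\d{6}|cd\d{5})$")

def classify_with_from(cell):
    """Count InterPro ids vs. member-signature ids in one WITH/FROM cell."""
    counts = Counter()
    for token in (t.strip() for t in cell.split("|")):
        if not token:
            continue
        if INTERPRO_RE.match(token):
            counts["interpro"] += 1
        elif MEMBER_RE.match(token):
            counts["member"] += 1
        else:
            counts["other"] += 1
    return counts

def masking_summary(cells):
    """Aggregate the per-cell counts over many annotation rows."""
    total = Counter()
    for cell in cells:
        total.update(classify_with_from(cell))
    return total

# Illustrative cells only (the IPR ids are ones cited in this document).
sample_cells = ["InterPro:IPR006424", "InterPro:IPR005703|InterPro:IPR005725", ""]
summary = masking_summary(sample_cells)
print(summary["interpro"], summary["member"])
```

A member count of zero alongside a non-zero InterPro count is the masking signature: the member DB that fired can only be recovered by a re-join.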

Un-masking: member-DB attribution on this repo's annotations

The masking claim is now measured, not just asserted. Every GO_REF:0000002
annotation in this repo's genes/**/*-goa.tsv was re-joined to its InterPro entry's
member_databases via the InterPro API
(interpro_member_attribution.py,
resumable). Across 5,549 resolved InterPro-citation rows (1,827 distinct entries):

Member DB	distinct entries	annotation rows	sole signature
pfam	708 (39%)	2,374 (43%)	—
panther	459 (25%)	1,082 (19%)	—
ncbifam	272 (15%)	705 (13%)	250
prints / profile / smart	~190 each	~820 each	—
cdd	183 (10%)	469 (8%)	116
hamap / pirsf / ssf / cathgene3d	117–159	403–627	—

NCBIFAM contributes a signature to 705 (13%) and CDD to 469 (8%) of the repo's
InterPro2GO annotations — a contribution entirely invisible in GOA, which shows
only the InterPro:IPR… id. Stronger still, NCBIFAM is the sole integrated
signature for 250 rows and CDD for 116 — annotations that exist purely because of
an NCBIFAM/CDD model, with no other member DB in the entry. And these are not exotic:
NCBIFAM-sole entries back annotations on GAPDH (IPR006424→GO:0006006), RPS3
(IPR005703→GO:0006412), ATP6V1A (IPR005725→GO:0016887), and HMGCS2
(IPR010122→GO:0008299) — mainstream genes whose InterPro2GO term traces solely to a
TIGRFAM/NCBIFAM signature. This is the quantitative form of the masking finding, and
the join the sibling InterPro Mapping Review project needs to attribute
each interpro2go term to the member DB that earned it.
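The shape of the re-join can be sketched in a few lines. This is not interpro_member_attribution.py itself: it assumes the InterPro lookups have already been reduced to a dictionary from InterPro entry to the set of member-database names, and every identifier in the toy data is a hypothetical placeholder.

```python
from collections import Counter

def attribute_rows(goa_rows, entry_members):
    """Attribute annotation rows to member databases.

    goa_rows: iterable of (gene, interpro_id, go_id) tuples.
    entry_members: dict mapping interpro_id -> set of member-DB names.
    Returns (backed, sole): rows each DB contributes a signature to,
    and rows where that DB is the only signature in the entry.
    """
    backed, sole = Counter(), Counter()
    for _gene, interpro_id, _go_id in goa_rows:
        members = entry_members.get(interpro_id, set())
        for db in members:
            backed[db] += 1
        if len(members) == 1:
            sole[next(iter(members))] += 1
    return backed, sole

# Toy data: placeholder ids, not real InterPro content.
entry_members = {
    "IPR_TOY_1": {"ncbifam"},
    "IPR_TOY_2": {"pfam", "ncbifam", "cdd"},
    "IPR_TOY_3": {"cdd"},
}
goa_rows = [
    ("geneA", "IPR_TOY_1", "GO:toy1"),
    ("geneB", "IPR_TOY_2", "GO:toy2"),
    ("geneC", "IPR_TOY_2", "GO:toy3"),
    ("geneD", "IPR_TOY_3", "GO:toy4"),
]
backed, sole = attribute_rows(goa_rows, entry_members)
print(dict(backed), dict(sole))
```

Because a row is counted once for every member DB in its entry, the per-DB row counts overlap and their percentages can sum to more than 100%, as in the table above.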

Specificity and quality cut by family_type

NCBIFAM's family_type is a built-in altitude/quality signal absent from CDD:

family_type	N	Function-transfer safety
`equivalog`	13,253	High — all members one function; GO/EC transfer safe
`domain`	10,974	Low — domain, not whole-protein function
`subfamily`	4,564	Medium — clade-specific; check altitude
`PfamEq` / `PfamAutoEq`	1,807 / 1,204	Equivalent to a Pfam entry → likely already InterPro-covered
`exception` / `hypoth_equivalog`	1,341 / 434	Curated caveat / hypothetical — review individually

CDD models, by contrast, are domain/architecture-oriented and lack this typing,
so CDD's forward contribution skews toward broad domain terms (the
protein binding / generic-domain altitude problem) and its reverse gap is
harder to curate than NCBIFAM's.
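As a sketch of how the family_type signal could be applied mechanically, the table above can be encoded as a lookup and tallied over a tab-separated metadata file. The `family_type` column name comes from this document; the `accession` column, the toy rows, and the short safety labels are assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Encodes the table above; the label strings are this sketch's own shorthand.
SAFETY = {
    "equivalog": "high",
    "subfamily": "medium",
    "domain": "low",
    "PfamEq": "likely-covered",
    "PfamAutoEq": "likely-covered",
    "exception": "review",
    "hypoth_equivalog": "review",
}

def transfer_safety(family_type):
    """Return the function-transfer safety class; unknown types need review."""
    return SAFETY.get(family_type, "review")

def tally_safety(tsv_text):
    """Count models per safety class from TSV text with a family_type column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(transfer_safety(row["family_type"]) for row in reader)

# Toy rows with hypothetical accessions.
toy_tsv = (
    "accession\tfamily_type\n"
    "M1\tequivalog\n"
    "M2\tdomain\n"
    "M3\tequivalog\n"
    "M4\tPfamEq\n"
    "M5\tsuperfamily\n"
)
tally = tally_safety(toy_tsv)
print(dict(tally))
```

Defaulting unlisted types to "review" is a deliberately conservative choice: a type the table does not cover should not be treated as safe for transfer.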

Does CDD have its own GO mappings? No — for CDD-proper

This question is worth pinning down because the CDD search database superficially
looks like it carries GO. The answer separates two things that the name "CDD" conflates:

CDD-proper (the NCBI-curated cd… models) — the 21,955 cd models, which
is exactly what InterPro ingests as its "cdd" member database — has no native
GO. Verified three ways: (1) the NCBI bulk annotation files carry only residue
features, not GO — cddannot.dat / cddannot_generic.dat contain 0 GO:
lines, and cddid_all.tbl has none; (2) the NCBI Entrez cdd record for a cd
model (e.g. cd00009 AAA) exposes an abstract, PSSM, and site descriptions but
no GO field; (3) GO serves no cdd2go file (HTTP 403). So CDD-proper
reaches GO only through InterPro integration → interpro2go.
The CDD search database is a superset of 74,122 models drawn from many
sources (cd 21,955 · pfam 19,637 · PRK 7,039 · COG 5,137 · KOG 4,825 ·
TIGR 4,488 · NF 2,142 · …). Where a CDD hit does come with GO, that GO is
borrowed from the source resource — overwhelmingly the 13,669 NCBIFAM
models (PRK + TIGR + NF) bundled inside CDD, whose GO is the
hmm_PGAP.tsv go_terms we mine here, plus Pfam/COG (themselves InterPro-routed).

Conclusion: the GO-bearing NCBI family resource is NCBIFAM, not CDD. CDD
"has GO" only as a search aggregator surfacing NCBIFAM/Pfam GO under cd/PRK/
TIGR accessions. This is why the curated mapping deliverable targets
ncbifam2go, and a cdd2go would be largely redundant with it (and with
Pfam→InterPro). Reproduce with ncbifam_cdd_probe.py and the FTP/Entrez checks in
NCBIFAM-METHODOLOGY.md.
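The two checks described in this section (which source a CDD accession comes from, and whether a bulk file carries any GO ids) reduce to small text operations. The sketch below is illustrative only: it assumes accessions start with the source prefixes listed above and works on in-memory lines rather than the real FTP files, whose exact layout is not reproduced here.

```python
import re
from collections import Counter

# Source prefixes named in this section; anything else is grouped as "other".
SOURCE_PREFIXES = ("cd", "pfam", "PRK", "COG", "KOG", "TIGR", "NF")
GO_ID_RE = re.compile(r"\bGO:\d{7}\b")

def source_of(accession):
    """Classify a CDD search-database accession by its alphabetic prefix."""
    match = re.match(r"[A-Za-z]+", accession)
    prefix = match.group(0) if match else ""
    return prefix if prefix in SOURCE_PREFIXES else "other"

def count_go_lines(lines):
    """Number of lines that mention at least one GO id."""
    return sum(1 for line in lines if GO_ID_RE.search(line))

# Illustrative accessions and lines, not an extract of the real files.
accessions = ["cd00009", "pfam00004", "TIGR00439", "NF005804", "COG0001", "smart00382"]
by_source = Counter(source_of(acc) for acc in accessions)
toy_lines = ["cd00009\tATP binding site", "note mentioning GO:0005524"]
print(dict(by_source), count_go_lines(toy_lines))
```

Run over the real cd-model annotation files, a GO-line count of zero is what supports the "no native GO" conclusion.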

Gaps Found (scoping)

#	Gap	Size	What it is
G1	InterPro masking	all `GO_REF:0000002` rows	GOA hides which member DB fired → member contribution unattributable without re-join
G2	NCBIFAM unintegrated	11,064 / 18,511 (60%)	NCBIFAM signatures not in InterPro → contribute no GO
G3	CDD unintegrated	14,843 / 19,902 (75%)	CDD signatures not in InterPro → contribute no GO
G4	NCBI-curated GO not ingested	11,228 models w/ GO	NCBIFAM models carry NCBI GO that GO has no `ncbifam2go` to ingest
G5	NCBI-curated EC not ingested	6,417 models w/ EC	Could seed EC2GO-bridged GO mappings (the RHEA EC-bridge pattern)
G6	Integrated-but-unmapped InterPro entries	staged	NCBIFAM/CDD integrated into an IPR entry that has no `interpro2go` row

G4/G5 are the high-value half: an equivalog with a clean NCBI go_terms /
ec_numbers value is a ready-to-curate mapping; an ec_numbers-only equivalog
can be bridged through ec2go exactly as RHEA bridges reactions through
rhea2ec/ec2go.
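A minimal sketch of the EC bridge for an EC-only equivalog follows. It assumes the ec2go file uses the usual external2go line layout (`EC:x.x.x.x > GO:label ; GO:nnnnnnn`, with `!` comment lines) and that a model is represented as a dict with `family_type`, `ec_numbers`, and `go_terms` keys; the dict shape and the toy model are assumptions, while the three EC→GO pairs are the ones quoted in this document.

```python
import re

GO_ID_RE = re.compile(r"^GO:\d{7}$")

def parse_external2go(lines):
    """Parse external2go-style lines into {source_id: {go_id, ...}}."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and header/comment lines
        left, sep, right = line.partition(" > ")
        if not sep:
            continue
        source_id = left.split()[0]
        go_id = right.rsplit(";", 1)[-1].strip()
        if GO_ID_RE.match(go_id):
            mapping.setdefault(source_id, set()).add(go_id)
    return mapping

def bridge_ec_only(model, ec2go):
    """Propose GO terms for an equivalog that has EC numbers but no GO terms."""
    if model["family_type"] != "equivalog" or model["go_terms"]:
        return []
    proposed = set()
    for ec in model["ec_numbers"]:
        proposed |= ec2go.get(f"EC:{ec}", set())
    return sorted(proposed)

toy_ec2go_lines = [
    "! toy header line",
    "EC:3.5.1.49 > GO:formamidase activity ; GO:0004328",
    "EC:3.5.2.6 > GO:beta-lactamase activity ; GO:0008800",
    "EC:2.7.1.48 > GO:uridine kinase activity ; GO:0004849",
]
ec2go = parse_external2go(toy_ec2go_lines)
# Hypothetical model record, for illustration only.
toy_model = {"family_type": "equivalog", "ec_numbers": ["3.5.1.49"], "go_terms": []}
print(bridge_ec_only(toy_model, ec2go))
```

Restricting the bridge to equivalogs mirrors the safety argument above: only when all members share one function is it reasonable to transfer the EC-derived term to the whole family.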

Curated new mappings (SSSOM)

The curation deliverable mirrors RHEA's rhea2go.sssom.yaml:
ncbifam2go.sssom.yaml records the NCBIFAM-family →
GO mapping we propose for ingestion — not a transcription of NCBI's
hmm_PGAP.tsv go_terms. Each is backed by the model's family_type,
product_name, EC, and PMIDs, plus the live UniProtKB propagation gain. A
28-mapping seed spanning all three GO aspects (MF, BP, CC — NCBIFAM is a
whole-protein family resource, not enzyme-only) is in place, with predicate classes
parallel to RHEA:

skos:exactMatch (27 rows): the GO term is the family's function; a
ready-to-add ncbifam2go row. The enzyme majority are EC-bridge supported
(verified live: ec2go(EC) = this GO term, e.g. formamidase EC
3.5.1.49→GO:0004328, β-lactamase EC 3.5.2.6→GO:0008800, uridine kinase EC
2.7.1.48→GO:0004849). Spans AMR (two distinct β-lactamase families → one GO
term, the family→GO many-to-one analog of RHEA's reactions→activity), central
metabolism, phosphonate/arsenate/cobalamin pathways, plus the highest-gap case
IS630 transposase → GO:0004803 (18,874 entries missing it), an encapsulin-shell
CC term, and an anti-phage defense BP term. Five of these rows propose our own
specific term to replace an NCBI value that was too broad (see below).
skos:broadMatch (1 row) — reserved for the case where no more-specific GO
term exists to adopt: VirB5 → type IV secretion system complex (a subunit
part_of the whole complex, no VirB5-specific CC term).

We suggest our own term where NCBI's was too broad, and that unmasks the real
gain. For five families NCBI's go_terms gave only a broad parent (in two cases the
near-root GO:0003824 catalytic activity: enoyl-CoA hydratase NF005804 and
spermidine synthase TIGR00417) even though a precise, EC-bridged child already
exists. Rather than record the uninformative broad term, the seed proposes the specific
child as an exactMatch (dGTPase→GO:0008832, enoyl-CoA hydratase→GO:0004300,
dihydroorotase→GO:0004151, spermidine synthase→GO:0004766, LL-DAP
aminotransferase→GO:0010285). This is not cosmetic: the broad parent is
near-universal so its propagation gain looks ~0, but the specific term reveals
large gaps the parent masked — spermidine synthase 575, LL-DAP aminotransferase
1,185, dihydroorotase 491, and dGTPase 456 (incl. 13 reviewed/Swiss-Prot)
entries missing the precise activity. Proposing our own term is what turns these from
invisible into actionable gap-fills.

…but more specific is not always right — the FtsX cell-division case. The
mirror-image judgement is TIGR00439 (permease-like cell division protein FtsX),
where chasing specificity would be over-annotation. NCBI assigned GO:0000910 cytokinesis; the ontology shows cytokinesis is part_of GO:0051301 cell division
(so NCBI's term is actually the narrower one — an earlier draft of this seed had
that backwards). Empirically, all 7 reviewed FtsX proteins carry GO:0051301 cell division but only 2/7 carry cytokinesis or the most specific GO:0043093 FtsZ-dependent cytokinesis. FtsX/FtsEX regulates septal peptidoglycan hydrolysis
and divisome assembly, so curators annotate the safe participation term, not the
constriction act. The gain numbers make the trap explicit: mapping to GO:0051301
gains only 22 (it is already near-universal — a confirmatory mapping), whereas
mapping to GO:0043093 would show a 3,304 apparent gap — but propagating
FtsZ-dependent cytokinesis to every FtsX would assert more than the family supports.
We therefore propose the curator-consensus GO:0051301 cell division as the
exactMatch and decline the higher-gain specific term. Specificity is the goal
only up to the altitude the evidence supports.
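The closure-aware gain that drives this judgement can be sketched on toy data. The example is not ncbifam_go_gain.py: it assumes a simple parent map in which FtsZ-dependent cytokinesis sits under cytokinesis and cytokinesis under cell division (treating part_of like is_a for propagation), and the seven member annotation sets are invented to mirror the 7/7 vs 2/7 pattern described above.

```python
def closure(term, parents):
    """Return the term plus all its ancestors under the parent map."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def gain(target, member_annotations, parents):
    """Count members that lack the target even after upward propagation."""
    missing = 0
    for terms in member_annotations.values():
        covered = set()
        for term in terms:
            covered |= closure(term, parents)
        if target not in covered:
            missing += 1
    return missing

# Assumed toy edges: child -> parents.
parents = {
    "GO:0043093": ["GO:0000910"],  # FtsZ-dependent cytokinesis -> cytokinesis
    "GO:0000910": ["GO:0051301"],  # cytokinesis -> cell division
}
# Invented annotation sets for seven hypothetical reviewed members.
members = {
    "P1": {"GO:0051301"}, "P2": {"GO:0051301"}, "P3": {"GO:0051301"},
    "P4": {"GO:0051301"}, "P5": {"GO:0051301"},
    "P6": {"GO:0051301", "GO:0000910"},
    "P7": {"GO:0043093"},
}
for target in ("GO:0051301", "GO:0000910", "GO:0043093"):
    print(target, gain(target, members, parents))
```

On this toy set the broad term shows no gain and the most specific term shows the largest, which is exactly why a large apparent gap is a prompt for an altitude review and not, by itself, a reason to map to the narrower term.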

Verification matters and catches real errors. One scoping-sample NCBIFAM
go_terms value (GO:0009448 on a GABA transaminase) is obsolete; a
diacylglycerol-kinase model (NF009874, EC 2.7.1.107) is tagged with the wrong
activity GO:0003951 NAD+ kinase activity (the correct GO:0004143 exists); and
several assignments sit at near-root altitude. The obsolete and wrong ids were
excluded; the broad ones were replaced by our own specific term (above) — so
the seed drops obsolete/incorrect ids, prefers specific children, and
EC-bridge-confirms enzymes. Every GO id/label was checked as non-obsolete
against QuickGO (2026-06-20); every family id/name/type/EC is taken from hmm_PGAP.tsv;
and every EC→GO bridge was checked against the live ec2go. Validate
with the command just validate-ncbifam-mappings, which runs SSSOM structural
validation plus GO term/label validation (object bound to the full GO graph,
MF+BP+CC; generated nested view ncbifam2go.terms.yaml). The seed
passes validation.
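For orientation, the kind of row-level checks such a validation performs can be sketched as below. This is not the implementation behind validate-ncbifam-mappings: the slot names are standard SSSOM slots, but the `NCBIFAM:` subject prefix, the `semapv:ManualMappingCuration` justification, the toy subjects, and the hard-coded obsolete set are assumptions made for the example.

```python
import re

REQUIRED_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")
ALLOWED_PREDICATES = {"skos:exactMatch", "skos:broadMatch"}
GO_ID_RE = re.compile(r"^GO:\d{7}$")

def check_rows(rows, obsolete=frozenset()):
    """Return a list of (row_index, problem) tuples; empty means all rows pass."""
    problems = []
    for index, row in enumerate(rows):
        for slot in REQUIRED_SLOTS:
            if not row.get(slot):
                problems.append((index, f"missing {slot}"))
        predicate = row.get("predicate_id", "")
        if predicate and predicate not in ALLOWED_PREDICATES:
            problems.append((index, f"unexpected predicate {predicate}"))
        object_id = row.get("object_id", "")
        if object_id and not GO_ID_RE.match(object_id):
            problems.append((index, f"malformed GO id {object_id}"))
        elif object_id in obsolete:
            problems.append((index, f"obsolete object {object_id}"))
    return problems

# Toy rows; only the TIGR00417 -> GO:0004766 pairing comes from this document.
rows = [
    {"subject_id": "NCBIFAM:TIGR00417", "predicate_id": "skos:exactMatch",
     "object_id": "GO:0004766", "mapping_justification": "semapv:ManualMappingCuration"},
    {"subject_id": "NCBIFAM:TOY0001", "predicate_id": "skos:exactMatch",
     "object_id": "GO:0009448", "mapping_justification": "semapv:ManualMappingCuration"},
    {"subject_id": "NCBIFAM:TOY0002", "predicate_id": "skos:closeMatch",
     "object_id": "GO:12", "mapping_justification": ""},
]
problems = check_rows(rows, obsolete={"GO:0009448"})
for problem in problems:
    print(problem)
```

In practice the obsolete set would come from the ontology itself; passing it in as a parameter keeps the check testable without network access.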

A cdd2go set is not planned: per the CDD section,
CDD-proper has no native GO, and the GO surfaced through CDD belongs to NCBIFAM
(captured here) or Pfam (InterPro-routed) — a cdd2go would be largely redundant.

Scaling the seed to the whole collection (EC-bridge candidates)

The 28-row seed is hand-reviewed; the EC bridge lets us scale the same evidence
standard to the whole collection with no per-row human judgement, because the
agreement of two independent curated resources (NCBI's go_terms and GO's ec2go)
is the verification. ncbifam2go_candidates.py
walks every NCBIFAM model and emits each (model, GO) where ec2go(model's EC)
confirms one of the model's own NCBI go_terms. The live funnel:

Stage	Count
GO-bearing NCBIFAM models	11,228
…with both an EC and a GO term	3,782
…where `ec2go(EC)` confirms a model GO → exactMatch candidates	2,455 (2,503 rows)
…where `ec2go(EC)` would refine NCBI's broader/absent GO (the spermidine-synthase pattern, at scale)	843
…candidates already in the reviewed seed (cross-check)	17

The generated set is ncbifam2go.candidates.tsv
(2,503 rows, clearly marked generated; mapping_justification would be
semapv:CompositeMatching). The 17 rows that coincide with the reviewed seed are
exactly the seed's EC-bridge enzyme rows — an automatic confirmation that the
generator agrees with manual curation where they overlap. These 2,455 are
AMR-rich (trimethoprim-resistant dihydrofolate reductases → GO:0004146,
β-lactamases → GO:0008800, aminoglycoside 6′-N-acetyltransferases → GO:0047663,
…) and are the natural ready-to-add core of a real ncbifam2go. The 843
"refine" models are the scaled version of the five hand-fixed altitude rows: NCBIFAM
gave a broad/near-root term but ec2go supplies the specific child — a second,
also-automatable candidate class (propose ec2go's term), pending the same
altitude/over-annotation check the FtsX case shows is still needed.
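The confirm/refine split can be sketched as a single classification function. This is an illustration, not ncbifam2go_candidates.py: it assumes ec2go is already loaded as a dict keyed by `EC:`-prefixed ids, that an ancestor map is available for the bridged terms, and it adds a "conflict" outcome of its own for cases where NCBI's term is neither confirmed nor an ancestor of the bridged term. The toy mappings pair ids mentioned in this document; the EC number 2.5.1.16 for spermidine synthase is supplied here and is not taken from it.

```python
def classify_model(go_terms, ec_numbers, ec2go, ancestors):
    """Classify one model as confirm / refine / conflict / no_bridge.

    Returns (label, go_ids) where go_ids are the candidate terms.
    """
    go_terms = set(go_terms)
    bridged = set()
    for ec in ec_numbers:
        bridged |= ec2go.get(f"EC:{ec}", set())
    if not bridged:
        return "no_bridge", []
    confirmed = sorted(go_terms & bridged)
    if confirmed:
        return "confirm", confirmed
    # Refine: NCBI gave nothing, or gave an ancestor of the bridged term.
    refinable = sorted(
        term for term in bridged
        if not go_terms or go_terms & ancestors.get(term, set())
    )
    if refinable:
        return "refine", refinable
    return "conflict", sorted(bridged)

# Toy inputs (assumed for illustration).
ec2go = {
    "EC:3.5.2.6": {"GO:0008800"},
    "EC:2.5.1.16": {"GO:0004766"},
    "EC:2.7.1.107": {"GO:0004143"},
}
ancestors = {
    "GO:0008800": {"GO:0003824"},
    "GO:0004766": {"GO:0003824"},
    "GO:0004143": {"GO:0003824"},
}
print(classify_model(["GO:0008800"], ["3.5.2.6"], ec2go, ancestors))
print(classify_model(["GO:0003824"], ["2.5.1.16"], ec2go, ancestors))
print(classify_model(["GO:0003951"], ["2.7.1.107"], ec2go, ancestors))
```

Separating "conflict" from "refine" keeps mis-tagged models, such as the diacylglycerol-kinase case discussed earlier, out of the auto-proposable class so they can be reviewed by hand.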

Methods

The interpro2go characterisation, InterPro member-integration counts, and the
NCBIFAM PGAP GO/EC coverage are computed live by
ncbifam_cdd_probe.py (stdlib only; no go-db, no
auth). The masking evidence (0 member signatures in WITH/FROM; 1,160 NCBIfam +
2,174 CDD DR lines) is computed from this repo's genes/**/*-goa.tsv and
*-uniprot.txt. The annotation-gain numbers are computed live against the
UniProtKB REST API by ncbifam_go_gain.py
(closure-aware go: query; see
NCBIFAM-ANNOTATION-GAIN.md). The CDD-own-GO
check uses the NCBI CDD FTP (cddannot*.dat, cddid_all.tbl) and Entrez cdd
records. The member-DB attribution (which member DB backs each GO_REF:0000002
row in the repo) is computed live by
interpro_member_attribution.py against the
InterPro API (resumable, cached). The forward closure-filtered cross-organism
contribution table reuses the UniPathway/RHEA uniqueness
query and is staged pending the go-db DuckDBs (absent in the web container). Full
queries and caveats: NCBIFAM-METHODOLOGY.md.

How this differs from RHEA, SPKW, and UniPathway

	SPKW (`GO_REF:0000043`)	RHEA (`GO_REF:0000116`)	NCBIFAM/CDD (`GO_REF:0000002`)
GO aspect	mostly BP	MF (enzyme activity)	MF + BP + CC (family/domain models)
Provenance in GOA	direct (keyword visible)	direct (`assigned_by=RHEA`)	masked — only the integrated `InterPro:IPR…` shows
Dedicated `*2go` file	yes	yes	no (`ncbifam2go`/`cdd2go` do not exist)
Dominant failure mode	process conflation	parent/child altitude; wrong substrate	domain-altitude (CDD); unintegrated coverage gap
Built-in quality signal	none	reaction precision	NCBIFAM `family_type` (`equivalog`)
Curation emphasis	over-annotation removal	gap-filling	both, but attribution-first then equivalog gap-fill

NCBIFAM is unusual among these sources in carrying its own curated GO/EC that
GO does not ingest, so — like RHEA — the expected verdict skew is toward
NEW / gap-filling (for equivalog models) on the reverse side, with the
forward side dominated by the attribution/masking problem rather than outright
over-annotation.

Curation Recommendations (preliminary)

Attribute before auditing. A GO_REF:0000002 annotation cannot be praised
or blamed on NCBIFAM/CDD until GOA is re-joined to InterPro member integration;
build that join first.
Mine NCBIFAM equivalog GO/EC as a mapping source. Among the 13,253 equivalogs,
those with NCBI go_terms/ec_numbers are the cleanest gap-fill substrate; start
the ncbifam2go.sssom.yaml here.
EC-bridge where only EC is given. An equivalog with ec_numbers but no
go_terms can be mapped through ec2go, the RHEA EC-bridge pattern.
Treat CDD as domain-altitude-risky. Without family_type, CDD's forward
contribution skews broad; prefer the specific child the curated entry supports.
Unintegrated signatures with a real function are proposed_new_terms /
InterPro-integration requests, not silent gaps.

Follow-Up Targets

Target	Rationale
✅ GOA × InterPro member-integration re-join	Done (attribution section): NCBIFAM backs 13% / CDD 8% of the repo's InterPro2GO rows (sole signature for 250 / 116).
Forward closure-filtered cross-organism scan	UniPathway-style uniqueness for member-attributed rows; needs go-db DuckDBs. Now seeded by the member-attribution join above.
Promote the 2,455 EC-bridge candidates	Altitude/obsolete-check `ncbifam2go.candidates.tsv` and fold the clean rows into the reviewed SSSOM → a near-complete ingestible `ncbifam2go`.
Build the 843 "refine" class	Auto-propose `ec2go`'s specific term where NCBI's `go_terms` is broad/absent (the spermidine-synthase pattern), then altitude-review as FtsX shows is needed.
Non-EC families (defense/secretion/transport)	The high-gain non-enzyme equivalogs (transposases, anti-phage, T4SS, encapsulins) have no EC bridge → need a different verification (literature/SPARCLE), curated like the seed's CC/BP rows.
Full-collection gain run	Replace the 60-model gain sample with the complete equivalog set for a definitive reviewed-vs-TrEMBL gain figure.
Exemplar gene reviews	Pick 2–3 genes whose only MF/BP support is an NCBIFAM equivalog (e.g. an anti-phage or secretion family) and run the full review workflow.

Project Status

Started: 2026-06-20
Maturity: SCOPING — pipeline identified, masking demonstrated on the repo
gene set, NCBIFAM GO/EC source and the integration coverage gap characterised
live, CDD-own-GO question resolved, annotation gain measured, a validated
28-row ncbifam2go seed in place, a 2,455-model EC-bridge candidate set
generated at collection scale, and the member-DB attribution re-join done on the
repo's annotations.
Computed live (via NCBIFam/ncbifam_cdd_probe.py
and ncbifam_go_gain.py):
interpro2go = 30,200 rows / 14,799 InterPro ids (GO 2026-04-28); NCBIFAM PGAP
= 34,351 models, 11,228 (33%) with GO, 6,417 (19%) with EC, 13,253 equivalogs;
InterPro integration NCBIFAM 7,447/18,511 (40%), CDD 5,059/19,902 (25%); CDD-proper
carries 0 native GO (FTP + Entrez); 60-model gain Σ = 19 reviewed / 26,578
all-UniProtKB; member-DB attribution = NCBIFAM backs 705 (13%) / CDD 469 (8%) of
5,549 repo InterPro2GO rows (sole signature 250 / 116); masking verified from this
repo's *-goa.tsv / *-uniprot.txt.
Curated mappings: NCBIFam/ncbifam2go.sssom.yaml
— 28 verified SSSOM rows (27 exactMatch ready-to-add, incl. 5 proposing our own
specific term over NCBI's broad one and 1 — FtsX — declining a too-specific term;
1 broadMatch, VirB5, where no specific term exists), spanning MF/BP/CC, each with
live propagation gain; passes just validate-ncbifam-mappings.
Scaled candidates: NCBIFam/ncbifam2go.candidates.tsv
— 2,503 generated EC-bridge-confirmed rows (2,455 models; ncbifam2go_candidates.py),
AMR-rich; 17 coincide with the reviewed seed as a cross-check, plus 843 "refine" models
where ec2go supplies a specific term over NCBI's broad one.
Current conclusion: NCBIFAM/CDD reach GO only through InterPro, which
masks their contribution in GOA and leaves the majority of signatures
unintegrated. CDD-proper has no native GO (it is NCBIFAM, bundled inside
CDD, that carries it). The highest-value work is (a) re-attributing
GO_REF:0000002 rows to the firing member DB, and (b) ingesting NCBIFAM's own
curated equivalog GO via the ncbifam2go SSSOM mapping (seeded here) — the
RHEA pattern applied to a family resource; the gain is large but TrEMBL-weighted,
concentrated in mobile-element/defense/secretion biology.