NCBIFAM / CDD → GO Contribution & Gap Project
Overview
This project audits the NCBI protein-family resources — NCBIFAM (the
PGAP/TIGRFAM HMM collection) and CDD (Conserved Domain Database) — as sources of
GO annotations, in the same spirit as the RHEA, SPKW, and
UniPathway source-audit projects, and develops the missing
ncbifam2go / cdd2go mappings the way RHEA develops rhea2go gap-fills.
The structural fact that organises everything: NCBIFAM and CDD are InterPro
member databases, so they reach GOA through exactly one pipeline — InterPro
integration → interpro2go → GO_REF:0000002 (evidence IEA,
assigned_by=InterPro). There is no public ncbifam2go or cdd2go
external2go file (verified: both 403 on current.geneontology.org; only
interpro2go is served). So like RHEA, the analysis runs in two directions:
- Contribution (forward). Where is a GO_REF:0000002 annotation actually
  carried by an NCBIFAM/CDD signature (versus Pfam, PROSITE, etc.), and is that
  contribution useful or over-/mis-propagated? This is the SPKW-style audit, here
  complicated by InterPro masking the member DB.
- Gaps (reverse). Where does NCBIFAM/CDD assert a function (an NCBI-curated
  GO/EC term, or a precise equivalog family name) that never reaches GO,
  either because the signature is unintegrated into InterPro or because the
  integrated InterPro entry has no interpro2go row? These are gap-filling
  opportunities, not over-annotations.
See NCBIFAM-METHODOLOGY.md for the queries, the
reproducible probe (ncbifam_cdd_probe.py), and
the masking/closure caveats.
Key Findings (scoping pass)
- One route in, and a dedicated mapping file is missing. NCBIFAM/CDD GO
  reaches GOA only via interpro2go / GO_REF:0000002; there is no
  ncbifam2go or cdd2go (403). interpro2go is sizeable, with 30,200 mapping
  rows over 14,799 distinct InterPro entries (GO release 2026-04-28), but a
  member signature contributes GO only if it is integrated into one of those
  entries.
- NCBIFAM/CDD are masked by InterPro in GOA, and the contribution is now
  measured. A GO_REF:0000002 row's WITH/FROM names the integrated
  InterPro:IPRnnnnnn entry, never the member signature (0 TIGR…/NF…/cd… ids
  in any *-goa.tsv WITH/FROM, vs 1,160 DR NCBIfam + 2,174 DR CDD in the
  proteins' UniProt records). Re-joining each annotation to its InterPro entry's
  member databases shows NCBIFAM backs 705 (13%) and CDD 469 (8%) of the repo's
  5,549 InterPro2GO rows, and that NCBIFAM / CDD is the sole integrated signature
  for 250 / 116 of them, all invisible in GOA (the attribution section).
  This is the NCBIFAM/CDD analog of RHEA being "masked by EC".
- NCBIFAM carries its own NCBI-curated GO/EC that GO does not ingest. The PGAP
  HMM metadata (hmm_PGAP.tsv, 34,351 models) assigns function directly:
  11,228 models (33%) carry GO terms (3,622 distinct GO ids) and 6,417
  (19%) carry EC numbers. None of this propagates except where the model is
  integrated into InterPro, which leaves a large, curated mapping source unused.
- equivalog makes the gap-fill tractable. 13,253 NCBIFAM models are typed
  equivalog (all members share one function → safe GO/EC transfer), the direct
  analog of RHEA's reviewed enzymes backing each curated mapping. domain /
  subfamily models (10,974 / 4,564) need altitude care.
- The coverage gap is large: most signatures are unintegrated. Computed live
  from the InterPro API: NCBIFAM 7,447 / 18,511 integrated (40%); CDD 5,059 /
  19,902 integrated (25%). The 11,064 unintegrated NCBIFAM and 14,843
  unintegrated CDD signatures contribute zero GO by construction, the
  NCBIFAM/CDD analog of RHEA's "no rhea2go line" reactions.
- CDD-proper has no native GO of its own; NCBIFAM does. Checked directly
  (see CDD section): the NCBI-curated cd… models, which InterPro calls the
  "cdd" member DB, carry no GO in any bulk file (cddannot*.dat, cddid_all.tbl
  → 0 GO) or in the NCBI Entrez cdd record (no GO field). Where the broad CDD
  search database appears to "have GO", that GO is borrowed from the non-cd
  models CDD bundles, chiefly the 13,669 NCBIFAM models (PRK 7,039 + TIGR 4,488
  + NF 2,142) inside CDD, plus Pfam/COG; it is not CDD-curated. So the
  GO-bearing source is NCBIFAM; CDD-proper reaches GO only via InterPro.
- Ingesting NCBIFAM GO would gain mostly TrEMBL, modestly Swiss-Prot. A live
  UniProtKB propagation probe (closure-aware; see
  NCBIFAM-ANNOTATION-GAIN.md) over a 60-model equivalog sample finds
  Σ gain = 19 reviewed (Swiss-Prot) vs 26,578 all-UniProtKB. This is the opposite
  emphasis from RHEA (whose gain was reviewed-heavy), because NCBIFAM is
  prokaryote-heavy and curated entries already carry the term. The gap
  concentrates in mobile elements (IS630 transposase: 18,874 entries missing the
  term), conjugation/secretion (VirB5 T4SS, conjugative ATPases), CRISPR/anti-phage
  defense, sporulation, encapsulins, and cell division (FtsX): the newer
  microbial biology where InterPro integration lags.
NCBIFAM/CDD is mostly masked by InterPro
UniProt enzymes and families almost always carry multiple member signatures
(an NCBIFAM equivalog and a Pfam domain and a CDD model …), all collapsed into
one or a few InterPro entries. So an NCBIFAM/CDD signature's "real"
GO_REF:0000002 contribution is only the slice where it is the integrating /
distinguishing signature for the InterPro entry, and where no other member DB
already supplies the same term. Two consequences:
- Attribution requires re-joining GOA to InterPro member integration (API
  entry/integrated/{db}/); GOA alone cannot tell you which member DB fired.
- The interesting NCBIFAM cases are the equivalog-grounded ones, where the
  member model is more specific/curated than the Pfam-style domain it shares an
  InterPro entry with. That is where NCBIFAM resolves a function the broader
  members lump (mirroring RHEA resolving a cofactor split EC hides).
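The masking claim can be checked mechanically. The following is a minimal sketch, not the project's probe: it assumes WITH/FROM values are pipe-separated strings and that member signatures follow the TIGR / NF / cd accession shapes named in this document. The function name and the sample values are illustrative.

```python
import re

# Accession shapes are assumptions based on the id styles named in the text.
INTERPRO_ID = re.compile(r"InterPro:(IPR\d{6})")
MEMBER_ID = re.compile(r"\b(TIGR\d{5}|NF\d{6}|cd\d{5})\b")

def scan_with_from(values):
    """Count InterPro entry ids and member-signature ids across WITH/FROM values."""
    interpro_hits = sum(len(INTERPRO_ID.findall(v)) for v in values)
    member_hits = sum(len(MEMBER_ID.findall(v)) for v in values)
    return interpro_hits, member_hits

# Illustrative WITH/FROM values; a real run would read them from *-goa.tsv.
sample = ["InterPro:IPR006424", "InterPro:IPR005703|InterPro:IPR005725", ""]
print(scan_with_from(sample))
```

If masking holds, the second count stays at zero however many rows are scanned, which is why attribution needs the separate re-join.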
Un-masking: member-DB attribution on this repo's annotations
The masking claim is now measured, not just asserted. Every GO_REF:0000002
annotation in this repo's genes/**/*-goa.tsv was re-joined to its InterPro entry's
member_databases via the InterPro API
(interpro_member_attribution.py,
resumable). Across 5,549 resolved InterPro-citation rows (1,827 distinct entries):
| Member DB | distinct entries | annotation rows | sole signature |
|---|---|---|---|
| pfam | 708 (39%) | 2,374 (43%) | — |
| panther | 459 (25%) | 1,082 (19%) | — |
| ncbifam | 272 (15%) | 705 (13%) | 250 |
| prints / profile / smart | ~190 each | ~820 each | — |
| cdd | 183 (10%) | 469 (8%) | 116 |
| hamap / pirsf / ssf / cathgene3d | 117–159 | 403–627 | — |
NCBIFAM contributes a signature to 705 (13%) and CDD to 469 (8%) of the repo's
InterPro2GO annotations — a contribution entirely invisible in GOA, which shows
only the InterPro:IPR… id. Stronger still, NCBIFAM is the sole integrated
signature for 250 rows and CDD for 116 — annotations that exist purely because of
an NCBIFAM/CDD model, with no other member DB in the entry. And these are not exotic:
NCBIFAM-sole entries back annotations on GAPDH (IPR006424→GO:0006006), RPS3
(IPR005703→GO:0006412), ATP6V1A (IPR005725→GO:0016887), and HMGCS2
(IPR010122→GO:0008299) — mainstream genes whose InterPro2GO term traces solely to a
TIGRFAM/NCBIFAM signature. This is the quantitative form of the masking finding, and
the join the sibling InterPro Mapping Review project needs to attribute
each interpro2go term to the member DB that earned it.
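The re-join itself reduces to a lookup and two counters. This is a hedged sketch under stated assumptions, not the logic of interpro_member_attribution.py: the entry ids are placeholders, the member-DB sets are supplied as a plain dict rather than fetched from the InterPro API, and "sole signature" is approximated as "the only member DB present in the entry".

```python
from collections import Counter

def attribute(rows, members_by_entry):
    """rows: one InterPro entry id per GO_REF:0000002 annotation row.
    members_by_entry: entry id -> set of member DB names integrated into it."""
    backed = Counter()  # rows where the DB contributes any signature
    sole = Counter()    # rows where the DB is the only member DB in the entry
    for entry in rows:
        dbs = members_by_entry.get(entry, set())
        for db in dbs:
            backed[db] += 1
        if len(dbs) == 1:
            sole[next(iter(dbs))] += 1
    return backed, sole

# Placeholder entries (not real InterPro ids) to show the counting.
members = {
    "IPR_A": {"ncbifam"},
    "IPR_B": {"pfam", "ncbifam", "cdd"},
    "IPR_C": {"cdd"},
}
rows = ["IPR_A", "IPR_A", "IPR_B", "IPR_C"]
backed, sole = attribute(rows, members)
print(dict(backed), dict(sole))
```

Because one entry usually integrates several member DBs, the "backed" counts overlap and do not sum to the row total; only the "sole" counts are exclusive.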
Specificity and quality cut by family_type
NCBIFAM's family_type is a built-in altitude/quality signal absent from CDD:
| family_type | N | Function-transfer safety |
|---|---|---|
| equivalog | 13,253 | High: all members one function; GO/EC transfer safe |
| domain | 10,974 | Low: domain, not whole-protein function |
| subfamily | 4,564 | Medium: clade-specific; check altitude |
| PfamEq / PfamAutoEq | 1,807 / 1,204 | Equivalent to a Pfam entry → likely already InterPro-covered |
| exception / hypoth_equivalog | 1,341 / 434 | Curated caveat / hypothetical; review individually |
CDD models, by contrast, are domain/architecture-oriented and lack this typing,
so CDD's forward contribution skews toward broad domain terms (the
protein binding / generic-domain altitude problem) and its reverse gap is
harder to curate than NCBIFAM's.
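As an illustration of how family_type can act as a filter, the sketch below selects equivalog models that carry a GO term or an EC number. The column names are assumed to match the hmm_PGAP.tsv fields mentioned in this document, and the accessions and values in the sample are placeholders rather than real models.

```python
import csv
import io

# Transfer-safety tiers taken from the table above; unlisted types need review.
SAFETY = {"equivalog": "high", "subfamily": "medium", "domain": "low"}

# Placeholder rows; accessions and values are illustrative only.
SAMPLE_TSV = (
    "ncbi_accession\tfamily_type\tgo_terms\tec_numbers\n"
    "MODEL_A\tequivalog\tGO:0004766\t2.5.1.16\n"
    "MODEL_B\tdomain\tGO:0005515\t\n"
    "MODEL_C\tequivalog\t\t\n"
    "MODEL_D\tsubfamily\t\t3.5.2.6\n"
)

def gap_fill_substrate(tsv_text):
    """Accessions of equivalog models that carry a GO term or an EC number."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["ncbi_accession"]
        for row in reader
        if SAFETY.get(row["family_type"]) == "high"
        and (row["go_terms"] or row["ec_numbers"])
    ]

print(gap_fill_substrate(SAMPLE_TSV))
```

Subfamily and domain models are left out here by design; they would go to an altitude review rather than straight into a mapping.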
Does CDD have its own GO mappings? No — for CDD-proper
It is worth pinning down, because the CDD search database superficially looks
like it carries GO. The answer separates two things CDD conflates:
- CDD-proper (the NCBI-curated cd… models): the 21,955 cd models, which are
  exactly what InterPro ingests as its "cdd" member database, have no native
  GO. Verified three ways: (1) the NCBI bulk annotation files carry only residue
  features, not GO: cddannot.dat / cddannot_generic.dat contain 0 "GO:"
  lines, and cddid_all.tbl has none; (2) the NCBI Entrez cdd record for a cd
  model (e.g. cd00009 AAA) exposes an abstract, PSSM, and site descriptions but
  no GO field; (3) GO serves no cdd2go file (HTTP 403). So CDD-proper
  reaches GO only through InterPro integration → interpro2go.
- The CDD search database is a superset of 74,122 models drawn from many
  sources (cd 21,955 · pfam 19,637 · PRK 7,039 · COG 5,137 · KOG 4,825 ·
  TIGR 4,488 · NF 2,142 · …). Where a CDD hit does come with GO, that GO is
  borrowed from the source resource, overwhelmingly the 13,669 NCBIFAM
  models (PRK + TIGR + NF) bundled inside CDD, whose GO is the
  hmm_PGAP.tsv go_terms we mine here, plus Pfam/COG (themselves InterPro-routed).
Conclusion: the GO-bearing NCBI family resource is NCBIFAM, not CDD. CDD
"has GO" only as a search aggregator surfacing NCBIFAM/Pfam GO under cd/PRK/
TIGR accessions. This is why the curated mapping deliverable targets
ncbifam2go, and a cdd2go would be largely redundant with it (and with
Pfam→InterPro). Reproduce with ncbifam_cdd_probe.py and the FTP/Entrez checks in
NCBIFAM-METHODOLOGY.md.
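The bulk-file part of that check amounts to counting lines that contain a GO id. A minimal sketch, assuming the files have been downloaded and are read as text lines; the sample lines below are invented layouts for illustration, not excerpts from the real files.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")

def count_go_lines(lines):
    """Number of lines that mention at least one GO id."""
    return sum(1 for line in lines if GO_ID.search(line))

# Invented line layouts, for illustration only.
cdd_like = ["cd00009\tAAA\tATP binding site", "cd00009\tAAA\tWalker A motif"]
pgap_like = ["MODEL_A\tequivalog\tGO:0004766", "MODEL_B\tdomain\t"]
print(count_go_lines(cdd_like), count_go_lines(pgap_like))
```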
Gaps Found (scoping)
| # | Gap | Size | What it is |
|---|---|---|---|
| G1 | InterPro masking | all GO_REF:0000002 rows | GOA hides which member DB fired → member contribution unattributable without re-join |
| G2 | NCBIFAM unintegrated | 11,064 / 18,511 (60%) | NCBIFAM signatures not in InterPro → contribute no GO |
| G3 | CDD unintegrated | 14,843 / 19,902 (75%) | CDD signatures not in InterPro → contribute no GO |
| G4 | NCBI-curated GO not ingested | 11,228 models w/ GO | NCBIFAM models carry NCBI GO that GO has no ncbifam2go to ingest |
| G5 | NCBI-curated EC not ingested | 6,417 models w/ EC | Could seed EC2GO-bridged GO mappings (the RHEA EC-bridge pattern) |
| G6 | Integrated-but-unmapped InterPro entries | staged | NCBIFAM/CDD integrated into an IPR entry that has no interpro2go row |
G4/G5 are the high-value half: an equivalog with a clean NCBI go_terms /
ec_numbers value is a ready-to-curate mapping; an ec_numbers-only equivalog
can be bridged through ec2go exactly as RHEA bridges reactions through
rhea2ec/ec2go.
Curated new mappings (SSSOM)
The curation deliverable mirrors RHEA's rhea2go.sssom.yaml:
ncbifam2go.sssom.yaml records the NCBIFAM-family →
GO mapping we propose for ingestion — not a transcription of NCBI's
hmm_PGAP.tsv go_terms. Each is backed by the model's family_type,
product_name, EC, and PMIDs, plus the live UniProtKB propagation gain. A
28-mapping seed spanning all three GO aspects (MF, BP, CC — NCBIFAM is a
whole-protein family resource, not enzyme-only) is in place, with predicate classes
parallel to RHEA:
- skos:exactMatch (26 rows): the GO term is the family's function; a
  ready-to-add ncbifam2go row. The enzyme majority are EC-bridge supported
  (verified live: ec2go(EC) = this GO term, e.g. formamidase EC
  3.5.1.49 → GO:0004328, β-lactamase EC 3.5.2.6 → GO:0008800, uridine kinase EC
  2.7.1.48 → GO:0004849). Spans AMR (two distinct β-lactamase families → one GO
  term, the family→GO many-to-one analog of RHEA's reactions→activity), central
  metabolism, phosphonate/arsenate/cobalamin pathways, plus the highest-gap case
  IS630 transposase → GO:0004803 (18,874 entries missing it), an encapsulin-shell
  CC term, and an anti-phage defense BP term. Five of these rows propose our own
  specific term to replace an NCBI value that was too broad (see below).
- skos:broadMatch (1 row): reserved for the case where no more-specific GO
  term exists to adopt: VirB5 → type IV secretion system complex (a subunit
  part_of the whole complex, no VirB5-specific CC term).
We suggest our own term where NCBI's was too broad — and that unmasks the real
gain. For five families NCBI's go_terms gave only a broad parent — twice the
ontology near-root GO:0003824 catalytic activity (enoyl-CoA hydratase NF005804,
spermidine synthase TIGR00417) — even though a precise, EC-bridged child already
exists. Rather than record the useless broad term, the seed proposes the specific
child as an exactMatch (dGTPase→GO:0008832, enoyl-CoA hydratase→GO:0004300,
dihydroorotase→GO:0004151, spermidine synthase→GO:0004766, LL-DAP
aminotransferase→GO:0010285). This is not cosmetic: the broad parent is
near-universal so its propagation gain looks ~0, but the specific term reveals
large gaps the parent masked — spermidine synthase 575, LL-DAP aminotransferase
1,185, dihydroorotase 491, and dGTPase 456 (incl. 13 reviewed/Swiss-Prot)
entries missing the precise activity. Proposing our own term is what turns these from
invisible into actionable gap-fills.
…but more specific is not always right — the FtsX cell-division case. The
mirror-image judgement is TIGR00439 (permease-like cell division protein FtsX),
where chasing specificity would be over-annotation. NCBI assigned GO:0000910
cytokinesis; the ontology shows cytokinesis is part_of GO:0051301 cell division
(so NCBI's term is actually the narrower one — an earlier draft of this seed had
that backwards). Empirically, all 7 reviewed FtsX proteins carry GO:0051301 cell
division but only 2/7 carry cytokinesis or the most specific GO:0043093
FtsZ-dependent cytokinesis. FtsX/FtsEX regulates septal peptidoglycan hydrolysis
and divisome assembly, so curators annotate the safe participation term, not the
constriction act. The gain numbers make the trap explicit: mapping to GO:0051301
gains only 22 (it is already near-universal — a confirmatory mapping), whereas
mapping to GO:0043093 would show a 3,304 apparent gap — but propagating
FtsZ-dependent cytokinesis to every FtsX would assert more than the family supports.
We therefore propose the curator-consensus GO:0051301 cell division as the
exactMatch and decline the higher-gain specific term. Specificity is the goal
only up to the altitude the evidence supports.
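The gain arithmetic behind both cases can be sketched as a closure-aware count: a family member counts as "missing" a term only if it carries neither the term nor any descendant. This is an illustrative sketch, not ncbifam_go_gain.py; the three-term graph is a simplification of the relationships described above, and the proteins P1 to P4 are invented.

```python
def descendants(term, children):
    """The term plus everything below it in a parent -> children map."""
    seen, stack = {term}, [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def gain(members, annotations, term, children):
    """Members carrying neither `term` nor any descendant of it."""
    closure = descendants(term, children)
    return sum(1 for m in members if not (annotations.get(m, set()) & closure))

# Simplified graph: cell division > cytokinesis > FtsZ-dependent cytokinesis.
children = {"GO:0051301": ["GO:0000910"], "GO:0000910": ["GO:0043093"]}
# Invented family members and their existing annotations.
annotations = {
    "P1": {"GO:0051301"},
    "P2": {"GO:0043093"},
    "P3": {"GO:0000910"},
    "P4": set(),
}
members = list(annotations)
print(gain(members, annotations, "GO:0051301", children))
print(gain(members, annotations, "GO:0043093", children))
```

In this toy family the broad term gains one entry and the specific term gains three. The count alone cannot say whether the larger number is a real gap or an over-annotation, which is the judgement the FtsX case requires.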
Verification matters and catches real errors. One scoping-sample NCBIFAM
go_terms value (GO:0009448 on a GABA transaminase) is obsolete; a
diacylglycerol-kinase model (NF009874, EC 2.7.1.107) is tagged with the wrong
activity GO:0003951 NAD+ kinase activity (the correct GO:0004143 exists); and
several assignments sit at near-root altitude. The obsolete and wrong ids were
excluded; the broad ones were replaced by our own specific term (above) — so
the seed drops obsolete/incorrect ids, prefers specific children, and
EC-bridge-confirms enzymes. Every GO id/label was checked non-obsolete
against QuickGO (2026-06-20); every family id/name/type/EC is from hmm_PGAP.tsv;
every EC→GO bridge against the live ec2go. Validate
with just validate-ncbifam-mappings — SSSOM structural validation plus GO
term/label validation (object bound to the full GO graph, MF+BP+CC; generated
nested view ncbifam2go.terms.yaml). The seed
passes validation.
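For orientation, the kind of per-row check involved can be sketched as follows. This is not the just validate-ncbifam-mappings recipe: it assumes rows are dicts keyed by standard SSSOM slot names, the NCBIFam: subject prefix is a guess, and the obsolete set is supplied by hand.

```python
import re

GO_CURIE = re.compile(r"^GO:\d{7}$")
ALLOWED_PREDICATES = {"skos:exactMatch", "skos:broadMatch"}
REQUIRED = ("subject_id", "predicate_id", "object_id", "mapping_justification")

def validate_row(row, obsolete):
    """Return a list of problems for one mapping row (empty list = passes)."""
    problems = [f"missing {slot}" for slot in REQUIRED if not row.get(slot)]
    if row.get("predicate_id") and row["predicate_id"] not in ALLOWED_PREDICATES:
        problems.append("unexpected predicate")
    obj = row.get("object_id", "")
    if obj and not GO_CURIE.match(obj):
        problems.append("object is not a GO id")
    if obj in obsolete:
        problems.append("object is obsolete")
    return problems

# Example rows; the subject prefix is an assumption.
good = {
    "subject_id": "NCBIFam:TIGR00439",
    "predicate_id": "skos:exactMatch",
    "object_id": "GO:0051301",
    "mapping_justification": "semapv:ManualMappingCuration",
}
bad = dict(good, object_id="GO:0009448")
print(validate_row(good, {"GO:0009448"}), validate_row(bad, {"GO:0009448"}))
```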
A cdd2go set is not planned: per the CDD section,
CDD-proper has no native GO, and the GO surfaced through CDD belongs to NCBIFAM
(captured here) or Pfam (InterPro-routed) — a cdd2go would be largely redundant.
Scaling the seed to the whole collection (EC-bridge candidates)
The 28-row seed is hand-reviewed; the EC bridge lets us scale the same evidence
standard to the whole collection with no per-row human judgement, because the
agreement of two independent curated resources (NCBI's go_terms and GO's ec2go)
is the verification. ncbifam2go_candidates.py
walks every NCBIFAM model and emits each (model, GO) where ec2go(model's EC)
confirms one of the model's own NCBI go_terms. The live funnel:
| Stage | Count |
|---|---|
| GO-bearing NCBIFAM models | 11,228 |
| …with both an EC and a GO term | 3,782 |
| …where ec2go(EC) confirms a model GO → exactMatch candidates | 2,455 (2,503 rows) |
| …where ec2go(EC) would refine NCBI's broader/absent GO (the spermidine-synthase pattern, at scale) | 843 |
| …candidates already in the reviewed seed (cross-check) | 17 |
The generated set is ncbifam2go.candidates.tsv
(2,503 rows, clearly marked generated; mapping_justification would be
semapv:CompositeMatching). The 17 rows that coincide with the reviewed seed are
exactly the seed's EC-bridge enzyme rows — an automatic confirmation that the
generator agrees with manual curation where they overlap. These 2,455 are
AMR-rich (trimethoprim-resistant dihydrofolate reductases → GO:0004146,
β-lactamases → GO:0008800, aminoglycoside 6′-N-acetyltransferases → GO:0047663,
…) and are the natural ready-to-add core of a real ncbifam2go. The 843
"refine" models are the scaled version of the five hand-fixed altitude rows: NCBIFAM
gave a broad/near-root term but ec2go supplies the specific child — a second,
also-automatable candidate class (propose ec2go's term), pending the same
altitude/over-annotation check the FtsX case shows is still needed.
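The two candidate classes can be expressed as a small decision rule. The sketch below is an assumption about the shape of that rule, not the code of ncbifam2go_candidates.py: the ec2go excerpt is a hand-typed three-entry dict, and pairing EC 2.5.1.16 with spermidine synthase comes from general EC knowledge rather than from this document.

```python
def classify(model_go, model_ec, ec2go):
    """Return ('confirm', terms), ('refine', terms) or ('none', set())."""
    bridged = set()
    for ec in model_ec:
        bridged |= ec2go.get(ec, set())
    confirmed = bridged & set(model_go)
    if confirmed:
        return "confirm", confirmed  # ec2go agrees with one of NCBI's own terms
    if bridged:
        return "refine", bridged     # ec2go offers a term NCBI did not assign
    return "none", set()

# Tiny illustrative excerpt of an EC -> GO bridge.
ec2go = {
    "3.5.2.6": {"GO:0008800"},
    "2.7.1.48": {"GO:0004849"},
    "2.5.1.16": {"GO:0004766"},
}
print(classify({"GO:0008800"}, {"3.5.2.6"}, ec2go))
print(classify({"GO:0003824"}, {"2.5.1.16"}, ec2go))
```

A "refine" result is only a proposal: it still needs the altitude review described above before it becomes a mapping.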
Methods
The interpro2go characterisation, InterPro member-integration counts, and the
NCBIFAM PGAP GO/EC coverage are computed live by
ncbifam_cdd_probe.py (stdlib only; no go-db, no
auth). The masking evidence (0 member signatures in WITH/FROM; 1,160 NCBIfam +
2,174 CDD DR lines) is computed from this repo's genes/**/*-goa.tsv and
*-uniprot.txt. The annotation-gain numbers are computed live against the
UniProtKB REST API by ncbifam_go_gain.py
(closure-aware go: query; see
NCBIFAM-ANNOTATION-GAIN.md). The CDD-own-GO
check uses the NCBI CDD FTP (cddannot*.dat, cddid_all.tbl) and Entrez cdd
records. The member-DB attribution (which member DB backs each GO_REF:0000002
row in the repo) is computed live by
interpro_member_attribution.py against the
InterPro API (resumable, cached). The forward closure-filtered cross-organism
contribution table reuses the UniPathway/RHEA uniqueness
query and is staged pending the go-db DuckDBs (absent in the web container). Full
queries and caveats: NCBIFAM-METHODOLOGY.md.
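To show what the interpro2go characterisation involves, here is a minimal sketch of counting mapping rows and distinct entries. It assumes the usual external2go line layout ("InterPro:IPRnnnnnn name > GO:label ; GO:id", with "!" comment lines); the sample lines use placeholder ids and labels, and the function is illustrative rather than taken from ncbifam_cdd_probe.py.

```python
def summarise_interpro2go(lines):
    """Return (mapping rows, distinct InterPro entries) for external2go lines."""
    rows, entries = 0, set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and header comments
        left, sep, right = line.partition(" > ")
        if not sep:
            continue
        entry = left.split()[0]
        go_id = right.rsplit(";", 1)[-1].strip()
        if entry.startswith("InterPro:") and go_id.startswith("GO:"):
            rows += 1
            entries.add(entry)
    return rows, len(entries)

# Placeholder lines in the assumed layout.
sample_lines = [
    "!comment line",
    "InterPro:IPR999991 example entry > GO:example term ; GO:0000001",
    "InterPro:IPR999991 example entry > GO:another term ; GO:0000002",
    "InterPro:IPR999992 other entry > GO:example term ; GO:0000001",
]
print(summarise_interpro2go(sample_lines))
```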
How this differs from RHEA, SPKW, and UniPathway
| | SPKW (GO_REF:0000043) | RHEA (GO_REF:0000116) | NCBIFAM/CDD (GO_REF:0000002) |
|---|---|---|---|
| GO aspect | mostly BP | MF (enzyme activity) | MF + BP + CC (family/domain models) |
| Provenance in GOA | direct (keyword visible) | direct (assigned_by=RHEA) | masked: only the integrated InterPro:IPR… shows |
| Dedicated *2go file | yes | yes | no (ncbifam2go/cdd2go do not exist) |
| Dominant failure mode | process conflation | parent/child altitude; wrong substrate | domain-altitude (CDD); unintegrated coverage gap |
| Built-in quality signal | none | reaction precision | NCBIFAM family_type (equivalog) |
| Curation emphasis | over-annotation removal | gap-filling | both, but attribution-first then equivalog gap-fill |
NCBIFAM is unusual among these sources in carrying its own curated GO/EC that
GO does not ingest, so — like RHEA — the expected verdict skew is toward
NEW / gap-filling (for equivalog models) on the reverse side, with the
forward side dominated by the attribution/masking problem rather than outright
over-annotation.
Curation Recommendations (preliminary)
- Attribute before auditing. A GO_REF:0000002 annotation cannot be credited to
  or blamed on NCBIFAM/CDD until GOA is re-joined to InterPro member integration;
  build that join first.
- Mine NCBIFAM equivalog GO/EC as a mapping source. The 13,253 equivalogs
  with NCBI go_terms / ec_numbers are the cleanest gap-fill substrate; start
  the ncbifam2go.sssom.yaml here.
- EC-bridge where only EC is given. An equivalog with ec_numbers but no
  go_terms can be mapped through ec2go, the RHEA EC-bridge pattern.
- Treat CDD as domain-altitude-risky. Without family_type, CDD's forward
  contribution skews broad; prefer the specific child the curated entry supports.
- Unintegrated signatures with a real function are proposed_new_terms /
  InterPro-integration requests, not silent gaps.
Follow-Up Targets
| Target | Rationale |
|---|---|
| ✅ GOA × InterPro member-integration re-join | Done (attribution section): NCBIFAM backs 13% / CDD 8% of the repo's InterPro2GO rows (sole signature for 250 / 116). |
| Forward closure-filtered cross-organism scan | UniPathway-style uniqueness for member-attributed rows; needs go-db DuckDBs. Now seeded by the member-attribution join above. |
| Promote the 2,455 EC-bridge candidates | Altitude/obsolete-check ncbifam2go.candidates.tsv and fold the clean rows into the reviewed SSSOM → a near-complete ingestible ncbifam2go. |
| Build the 843 "refine" class | Auto-propose ec2go's specific term where NCBI's go_terms is broad/absent (the spermidine-synthase pattern), then altitude-review as FtsX shows is needed. |
| Non-EC families (defense/secretion/transport) | The high-gain non-enzyme equivalogs (transposases, anti-phage, T4SS, encapsulins) have no EC bridge → need a different verification (literature/SPARCLE), curated like the seed's CC/BP rows. |
| Full-collection gain run | Replace the 60-model gain sample with the complete equivalog set for a definitive reviewed-vs-TrEMBL gain figure. |
| Exemplar gene reviews | Pick 2–3 genes whose only MF/BP support is an NCBIFAM equivalog (e.g. an anti-phage or secretion family) and run the full review workflow. |
Project Status
- Started: 2026-06-20
- Maturity: SCOPING. Pipeline identified, masking demonstrated on the repo
  gene set, NCBIFAM GO/EC source and the integration coverage gap characterised
  live, CDD-own-GO question resolved, annotation gain measured, a validated
  28-row ncbifam2go seed in place, a 2,455-model EC-bridge candidate set
  generated at collection scale, and the member-DB attribution re-join done on the
  repo's annotations.
- Computed live (via NCBIFam/ncbifam_cdd_probe.py and ncbifam_go_gain.py):
  interpro2go = 30,200 rows / 14,799 InterPro ids (GO 2026-04-28); NCBIFAM PGAP
  = 34,351 models, 11,228 (33%) with GO, 6,417 (19%) with EC, 13,253 equivalogs;
  InterPro integration NCBIFAM 7,447/18,511 (40%), CDD 5,059/19,902 (25%); CDD-proper
  carries 0 native GO (FTP + Entrez); 60-model gain Σ = 19 reviewed / 26,578
  all-UniProtKB; member-DB attribution = NCBIFAM backs 705 (13%) / CDD 469 (8%) of
  5,549 repo InterPro2GO rows (sole signature 250 / 116); masking verified from this
  repo's *-goa.tsv / *-uniprot.txt.
- Curated mappings: NCBIFam/ncbifam2go.sssom.yaml, with 28 verified SSSOM rows
  (27 exactMatch ready-to-add, incl. 5 proposing our own specific term over NCBI's
  broad one and 1, FtsX, declining a too-specific term; 1 broadMatch, VirB5, where
  no specific term exists), spanning MF/BP/CC, each with live propagation gain;
  passes just validate-ncbifam-mappings.
- Scaled candidates: NCBIFam/ncbifam2go.candidates.tsv, with 2,503 generated
  EC-bridge-confirmed rows (2,455 models; ncbifam2go_candidates.py), AMR-rich;
  17 coincide with the reviewed seed as a cross-check, plus 843 "refine" models
  where ec2go supplies a specific term over NCBI's broad one.
- Current conclusion: NCBIFAM/CDD reach GO only through InterPro, which
  masks their contribution in GOA and leaves the majority of signatures
  unintegrated. CDD-proper has no native GO (it is NCBIFAM, bundled inside
  CDD, that carries it). The highest-value work is (a) re-attributing
  GO_REF:0000002 rows to the firing member DB, and (b) ingesting NCBIFAM's own
  curated equivalog GO via the ncbifam2go SSSOM mapping (seeded here), which is
  the RHEA pattern applied to a family resource; the gain is large but
  TrEMBL-weighted, concentrated in mobile-element/defense/secretion biology.