Headroom for New, More-Specific Pfam → GO Mappings
Auto-generated by headroom_analysis.py. Re-run to refresh. See parent project. This measures the opportunity to author new per-Pfam mappings; it does not invent any mapping.
Provenance
- InterPro release (membership + types): 109.0 (11-JUN-26)
- Pfam universe: Pfam-A.clans.tsv, 30,134 families
- GO depth = shortest is_a/part_of path to an aspect root; “general” = depth ≤ 3.
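The depth definition above can be sketched as a breadth-first search over parent links. This is a minimal illustration, not the code in headroom_analysis.py: the function names, the `parents` mapping (child term to its is_a/part_of parents) and the toy term IDs are all assumptions made for the example.

```python
from collections import deque

def go_depth(term, parents, roots):
    """Shortest is_a/part_of distance from `term` to any aspect root.

    `parents` maps a term to an iterable of its direct parents (assumed input
    shape). Returns None if no root is reachable from the term.
    """
    if term in roots:
        return 0
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        node, dist = queue.popleft()
        for parent in parents.get(node, ()):
            if parent in roots:
                # BFS visits nodes in order of distance, so the first root
                # reached is at the shortest distance.
                return dist + 1
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None

def is_general(depth, cutoff=3):
    """Apply the report's "general" rule: depth <= 3."""
    return depth is not None and depth <= cutoff

# Toy ontology with made-up IDs; "C" has a long path via "B" and a direct link to the root.
toy_roots = {"ROOT"}
toy_parents = {
    "A": ["ROOT"],
    "B": ["A"],
    "C": ["B", "ROOT"],
    "D": ["B"],
    "E": ["D"],
}
toy_depths = {t: go_depth(t, toy_parents, toy_roots) for t in "ABCDE"}
```

Because the shortest path is used, a term with several parents is assigned the depth of its shallowest route, so "general" is a lenient label: a term counts as general if any route to the root is three steps or fewer.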
A. Coverage gap — Pfam families with no GO via InterPro
A new Pfam→GO mapping for these families adds annotation where InterPro2GO currently provides nothing.
- Total Pfam-A families: 30,134
- Integrated into an InterPro entry: 29,105 (96.6%)
- Of those, the entry carries ≥1 GO term (covered): 5,246 (17.4% of all families)
- Integrated but entry has no GO term: 23,859
- Not integrated into any InterPro entry: 1,029
- Total families with zero GO via InterPro: 24,888 (82.6%)
- Of the zero-GO families, domains of unknown function (DUF / 'unknown function'): 6,529 (26.2%); real biology gaps, not mapping gaps
- Of the zero-GO families, named, tractable targets for new mappings: 18,359 (73.8%)
These are the pure-coverage candidates (full list: data/pfam_no_go_coverage_gap.tsv, git-ignored). Examples with informative descriptions:
| Pfam | name | reason | description |
|---|---|---|---|
| PF00007 | Cys_knot | entry_has_no_go | Cystine-knot domain |
| PF00008 | EGF | entry_has_no_go | EGF-like domain |
| PF00011 | HSP20 | entry_has_no_go | Hsp20/alpha crystallin family |
| PF00017 | SH2 | entry_has_no_go | SH2 domain |
| PF00021 | UPAR_LY6 | entry_has_no_go | u-PAR/Ly-6 domain |
| PF00022 | Actin | entry_has_no_go | Actin |
| PF00024 | PAN_1 | entry_has_no_go | PAN domain |
| PF00026 | Asp | entry_has_no_go | Eukaryotic aspartyl protease |
| PF00027 | cNMP_binding | entry_has_no_go | Cyclic nucleotide-binding domain |
| PF00029 | Connexin | entry_has_no_go | Connexin |
| PF00030 | Crystall | entry_has_no_go | Beta/Gamma crystallin |
| PF00035 | dsrm | entry_has_no_go | Double-stranded RNA binding motif |
| PF00037 | Fer4 | entry_has_no_go | 4Fe-4S binding domain |
| PF00038 | Filament | entry_has_no_go | Intermediate filament protein |
| PF00040 | fn2 | entry_has_no_go | Fibronectin type II domain |
| PF00043 | GST_C | entry_has_no_go | Glutathione S-transferase, C-terminal domain |
| PF00045 | Hemopexin | entry_has_no_go | Hemopexin |
| PF00047 | ig | entry_has_no_go | Immunoglobulin domain |
| PF00051 | Kringle | entry_has_no_go | Kringle domain |
| PF00052 | Laminin_B | entry_has_no_go | Laminin B (Domain IV) |
(Note: 6,529 of the 24,888 uncovered families (26.2%) are domains of unknown function (DUFs); those are genuine knowledge gaps, not mapping gaps.)
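The bucketing behind the counts in this section can be sketched as follows. This is an illustrative reconstruction under assumptions, not the script itself: the function names, the input dictionaries, the `not_integrated` and `covered` labels and the placeholder IDs `PF_X1`, `PF_X2`, `IPR_X1` and `GO:toy` are hypothetical. Only `entry_has_no_go` and the SH2 accessions appear in the report.

```python
from collections import Counter

def classify_family(pfam_id, pfam_to_entry, entry_to_go):
    """Assign a Pfam family to one of three coverage buckets."""
    entry = pfam_to_entry.get(pfam_id)
    if entry is None:
        return "not_integrated"      # hypothetical label
    if entry_to_go.get(entry):
        return "covered"             # hypothetical label
    return "entry_has_no_go"         # label used in the report's table

def looks_like_duf(name, description):
    """Heuristic DUF test: a DUF-style name or 'unknown function' in the text."""
    text = f"{name} {description}".lower()
    return name.upper().startswith("DUF") or "unknown function" in text

# Toy inputs: SH2 (PF00017, IPR000980, no GO) is taken from the report;
# the PF_X*/IPR_X* identifiers are placeholders.
toy_pfam_to_entry = {"PF00017": "IPR000980", "PF_X1": "IPR_X1"}
toy_entry_to_go = {"IPR_X1": ["GO:toy"]}
toy_families = ["PF00017", "PF_X1", "PF_X2"]

toy_counts = Counter(
    classify_family(f, toy_pfam_to_entry, toy_entry_to_go) for f in toy_families
)
# Families with zero GO via InterPro = integrated-without-GO + not integrated.
toy_zero_go = toy_counts["entry_has_no_go"] + toy_counts["not_integrated"]
```

The DUF check is deliberately crude string matching; a family whose description merely mentions "unknown function" in passing would be misfiled, so any real use would want a manual spot check.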
B. Splitting headroom — lumped entries sharing one GO term
InterPro entries with ≥2 Pfam members and ≥1 GO term: every member inherits the same GO term, so functionally distinct members are candidates for more specific (descendant) per-Pfam terms.
- Multi-Pfam GO-bearing entries: 76
- Pfam families living in them (share a GO term with ≥1 sibling): 184
- Of those entries, GO term is general (depth ≤ 3): 51 entries, 124 Pfam families
- Multi-Pfam GO entries that are Family / Homologous superfamily (divergent groupings, strongest case for splitting): 4
Entry-type breakdown of multi-Pfam GO-bearing entries:
| InterPro entry type | entries |
|---|---|
| Domain | 67 |
| Repeat | 4 |
| Family | 4 |
| Conserved_site | 1 |
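Selecting the lumped entries counted above amounts to grouping Pfam families by InterPro entry and keeping the groups with at least two members and at least one GO term. The sketch below assumes a simple Pfam-to-entry dictionary; the function name and the `IPR_X*` / `Fam_*` placeholders are hypothetical, while IPR004031 and its two members come from the report's table.

```python
from collections import defaultdict

def lumped_entries(pfam_to_entry, entry_to_go, min_members=2):
    """Return {entry: sorted member families} for GO-bearing multi-Pfam entries."""
    members = defaultdict(list)
    for pfam, entry in pfam_to_entry.items():
        members[entry].append(pfam)
    return {
        entry: sorted(fams)
        for entry, fams in members.items()
        if len(fams) >= min_members and entry_to_go.get(entry)
    }

toy_membership = {
    "PMP22_Claudin": "IPR004031",   # from the report
    "Claudin_2": "IPR004031",       # from the report
    "Fam_A": "IPR_X1",              # placeholder: single-member entry with GO
    "Fam_B": "IPR_X2",              # placeholder: two members but no GO
    "Fam_C": "IPR_X2",
}
toy_go = {"IPR004031": ["membrane"], "IPR_X1": ["GO:toy"]}

toy_lumped = lumped_entries(toy_membership, toy_go)
toy_family_count = sum(len(fams) for fams in toy_lumped.values())
```

In this toy input only IPR004031 qualifies: the single-member entry has nothing to split, and the entry without GO belongs to the coverage gap in section A instead.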
Highest-headroom entries (most Pfam members, most general shared GO)
Family / Homologous-superfamily entries are the best candidates for per-Pfam refinement, most of all when they have many members and a general GO term. The example member families show whether the lumped families are functionally heterogeneous.
| InterPro | type | members | GO depth | shared GO | example member families |
|---|---|---|---|---|---|
| IPR010392 | Family | 2 | 1 | structural molecule activity; viral capsid | TNV_CP; Potex_coat |
| IPR004031 | Family | 2 | 2 | membrane | PMP22_Claudin; Claudin_2 |
| IPR006628 | Family | 2 | 5 | RNA polymerase II transcription regulatory region sequence-specific DNA binding; purine-rich negative regulatory element binding | PurA; DUF3276 |
| IPR002494 | Family | 2 | 7 | keratin filament | Keratin_B2; Keratin_B2_2 |
Full ranked list: lumped_entries_headroom.tsv.
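One plausible reading of the ranking, inferred from the heading (most Pfam members first, then most general shared GO) and not confirmed against the script, is a two-key sort. The row dictionaries and the `rank_by_headroom` name are assumptions; the values are the four rows of the table above.

```python
def rank_by_headroom(rows):
    """Sort by member count (descending), then GO depth (ascending = more general).

    The InterPro accession is used as a final tie-breaker so the order is stable
    across runs (an assumption; the report does not state a tie-break rule).
    """
    return sorted(rows, key=lambda r: (-r["members"], r["go_depth"], r["interpro"]))

# Rows copied from the table, deliberately shuffled.
table_rows = [
    {"interpro": "IPR002494", "members": 2, "go_depth": 7},
    {"interpro": "IPR004031", "members": 2, "go_depth": 2},
    {"interpro": "IPR006628", "members": 2, "go_depth": 5},
    {"interpro": "IPR010392", "members": 2, "go_depth": 1},
]
ranked_ids = [r["interpro"] for r in rank_by_headroom(table_rows)]
```

With these inputs the sort reproduces the order shown in the table, which is consistent with this reading but does not prove it, since all four entries have the same member count.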
Interpretation
- Splitting gives almost no precision headroom. Only 76 GO-bearing InterPro entries have ≥2 Pfam members at all, and just 4 of them are Family/Homologous-superfamily. The other 72 are Domain (67), Repeat (4) and Conserved_site (1) entries, whose multiple Pfam members are typically redundant HMM signatures of the same domain (e.g. the GNAT acyltransferase or EF-hand calcium-binding entries list several alternative Pfam models), so a per-Pfam term would be identical, not finer. The premise that InterPro lumps functionally distinct Pfams under one general GO term rarely holds for GO-bearing entries: those entries are overwhelmingly 1 Pfam = 1 entry, so InterPro2GO already sits at Pfam granularity.
- Coverage is the real, large opportunity, and for well-characterised families it is also a precision opportunity. 82.6% of Pfam families (24,888) get no GO via InterPro. About a quarter of those (6,529) are DUFs (real biology gaps); the remaining 18,359 named families are tractable. Crucially, InterPro2GO is conservative: even canonical, well-understood domains are left unmapped (e.g. SH2 IPR000980, EGF IPR000742, Kringle IPR000001, Actin IPR004000 all have no interpro2go term). For these a new mapping is not constrained by any existing general InterPro term, so it can be assigned directly at a specific level, e.g. SH2 → phosphotyrosine residue binding (GO:0001784). Here coverage and precision coincide.
- So new mappings can beat InterPro2GO most where InterPro abstained. They cannot beat it by splitting its lumped entries (there is almost nothing to split), but they can (a) annotate the 18,359 named uncovered families, often at a specific level; and (b) go below the family level (Pfam clan subfamilies, domain architectures, active-site/residue signatures, AlphaFold structure) to refine the 5,246 already-covered families. Neither is delivered by pfam2go (a copy of InterPro2GO); both must be authored by curation or grounded model prediction and then validated. This script scopes and prioritizes that work.