InterPro Mapping Review Project
Overview
This project reviews InterPro2GO mappings — the GO annotations a protein receives
because it matched an InterPro signature. In GOA these are the IEA annotations with
REFERENCE = GO_REF:0000002, WITH/FROM = InterPro:IPRxxxxxx, and ASSIGNED BY =
InterPro. They are produced by a single curated mapping file (InterPro2GO)
that attaches GO terms to an InterPro entry; every protein matching that signature
then inherits all of the entry's terms, with no subfamily resolution.
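The blanket propagation described above can be illustrated with a toy sketch. The protein names and the `propagate` helper are hypothetical and exist only for illustration; the two GO terms are the ones this document lists for IPR000719. This is not the GOA pipeline, only a picture of why every matching protein ends up with the same terms.

```python
# Entry -> GO terms attached by the mapping file (terms for IPR000719 as cited
# in this document: ATP binding, protein phosphorylation).
interpro2go = {
    "IPR000719": ["GO:0005524", "GO:0006468"],
}

# Hypothetical proteins and the signatures they match.
matches = {
    "active_kinase": ["IPR000719"],
    "pseudokinase": ["IPR000719"],
    "unrelated_protein": [],
}


def propagate(matches, interpro2go):
    """Give each protein every GO term of every entry it matches."""
    annotations = {}
    for protein, entries in matches.items():
        terms = set()
        for entry in entries:
            terms.update(interpro2go.get(entry, []))
        annotations[protein] = sorted(terms)
    return annotations


annotations = propagate(matches, interpro2go)
for protein, terms in annotations.items():
    print(protein, terms)
```

Because the rule has no subfamily resolution, the hypothetical pseudokinase receives exactly the same terms as the catalytically active kinase.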
Two complementary workstreams:
- Review the flagged mappings. Every gene review already adjudicates its
InterPro2GO annotations. We aggregate those verdicts to find which InterPro entries
systematically over-annotate (and to surface individual suspect mappings for
correction). This is the InterPro analogue of the IBA Annotation Review
and Over-Annotation Patterns projects.
- Deep-research the families themselves. For PANTHER, just fetch-gene already
auto-caches the family into interpro/panther/<FAM>/, and gene deep research is a
generated process (just deep-research-perplexity ...). There was no equivalent
deep-research process for InterPro entries; this project closes that gap by adding a
just deep-research-interpro-family recipe (see Family-level deep research below).
Why InterPro2GO over-annotates
InterPro2GO is a blanket rule per entry. The recurrent failure modes:
- Fold ≠ function. A signature of type: domain or homologous_superfamily
identifies a module, not the whole-protein function. A whole-protein GO term mapped to
that module propagates to every protein that merely contains the module.
- No subfamily resolution. A term that is true only for a subfamily is applied to
the whole entry (the classic neo-functionalization / pseudo-enzyme problem, here
without even the phylogenetic structure that IBA has).
- Generic terms. Entries routinely carry low-information terms (ATP binding,
metal ion binding, membrane) that are true but uninformative as core functions.
- Process/component terms on broadly distributed domains. A process term can leak
into taxa where the pathway is absent.
Current state of the evidence
Across all reviewed genes (to regenerate the underlying TSVs, see Reproducibility):
- 2732 reviewed gene files scanned
- 3652 InterPro2GO (GO_REF:0000002) annotations reviewed
- 1732 (47%) flagged suspect, i.e. the reviewer did not plain-ACCEPT them
- across 1826 distinct InterPro entries
Action breakdown of the reviewed InterPro2GO annotations:
| Action | Count | Meaning |
|---|---|---|
| ACCEPT | 1870 | mapping endorsed as-is |
| MODIFY | 609 | true but too broad → more specific term proposed |
| KEEP_AS_NON_CORE | 523 | true but not a core function |
| MARK_AS_OVER_ANNOTATED | 391 | over-annotation of this gene |
| REMOVE | 194 | not correct for this gene |
| NEW / PENDING / UNDECIDED | 65 | other |
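A minimal sketch of how such tallies and a per-entry worklist could be derived from per-annotation verdicts. The `(interpro_id, action)` row shape, the toy rows, and the `summarize` helper are assumptions for illustration; the real extractor and TSV columns may differ. The sketch applies the rule stated above, that a mapping is suspect when its action is anything other than a plain ACCEPT.

```python
from collections import Counter, defaultdict

# Hypothetical (interpro_id, action) pairs, not real project data.
rows = [
    ("IPR000719", "ACCEPT"),
    ("IPR000719", "MODIFY"),
    ("IPR000719", "KEEP_AS_NON_CORE"),
    ("IPR012724", "REMOVE"),
    ("IPR001128", "ACCEPT"),
]


def summarize(rows):
    """Return (action counts, entries sorted by suspect count, descending)."""
    rows = list(rows)  # allow any iterable, since we pass over it twice
    action_counts = Counter(action for _, action in rows)
    per_entry = defaultdict(lambda: {"reviewed": 0, "suspect": 0})
    for entry, action in rows:
        per_entry[entry]["reviewed"] += 1
        if action != "ACCEPT":  # assumed rule: anything but plain ACCEPT is suspect
            per_entry[entry]["suspect"] += 1
    # Most suspect mappings first; ties broken by entry ID for stable output.
    worklist = sorted(per_entry.items(), key=lambda kv: (-kv[1]["suspect"], kv[0]))
    return action_counts, worklist


action_counts, worklist = summarize(rows)
for entry, counts in worklist:
    print(entry, counts["reviewed"], counts["suspect"])
```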
Highest-impact InterPro entries
Top entries by number of suspect mappings (interpro_family_priorities.tsv). These are
the deep-research worklist:
| InterPro | Name | Reviewed | Suspect | Typical issue |
|---|---|---|---|---|
| IPR000719 | Protein kinase domain | 112 | 50 | protein kinase activity→ MODIFY to ser/thr or receptor-kinase child; ATP binding KEEP_AS_NON_CORE |
| IPR008271 | Ser/Thr kinase, active site | 55 | 34 | same kinase-domain pattern |
| IPR001128 / IPR036396 | Cytochrome P450 | 44 | 26 | generic monooxygenase/oxidoreductase, heme/iron binding non-core |
| IPR001424 / IPR036423 | Cu/Zn superoxide dismutase | 20 | 16 | superoxide metabolic process over-annotated on copper chaperones (e.g. Ccs) that don't themselves dismutate |
| IPR001046 | NRAMP / SLC11 metal transporter | 27 | 15 | generic metal ion transport → MODIFY to substrate-specific |
| IPR012724 | Chaperone DnaJ | 24 | 13 | broad chaperone process terms |
| IPR000276 | GPCR, rhodopsin-like | 21 | 12 | GPCR signaling pathway KEEP_AS_NON_CORE on receptors with a more specific role |
Concrete example (kinase domain): plant BAK1 carried GO:0004672 protein kinase
activity from IPR000719; reviewed as MODIFY → GO:0004675 (transmembrane receptor
ser/thr kinase activity), and its GO:0005524 ATP binding as KEEP_AS_NON_CORE.
Family-level deep research
Family deep research is a generated process, the same as gene deep research — not a
hand-written file. It judges whether each InterPro2GO term is appropriate for all
members of the entry, accounting for subfamily divergence, pseudo-enzymes, and taxonomic
scope. (The recipe and its wiring are in Reproducibility.)
A worked starting point (IPR000719, Protein kinase domain) is already cached. Its
InterPro2GO terms are GO:0005524 ATP binding and GO:0006468 protein phosphorylation
— both generic, which is exactly why genes matching it are repeatedly MODIFY'd to
specific kinase-activity children or marked non-core.
Proposed interpro2go edits (SSSOM)
The family verdicts are captured as proposed edits to the InterPro2GO mapping file in
SSSOM YAML (the same format the RHEA and AMR projects use for proposing mappings):
INTERPRO/interpro2go.sssom.yaml. One row per
(InterPro entry, GO term) reviewed, with a 3-way predicate convention:
- skos:exactMatch: the interpro2go mapping is sound (holds for ~all members);
- skos:broadMatch: over-broad; retain but demote to a child entry or treat as non-core;
- skos:exactMatch with predicate_modifier set to Not: remove (factually incorrect for the entry).
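The 3-way convention can be sketched as a small lookup that turns a family verdict into an SSSOM-style row. The verdict keys (`sound`, `over_broad`, `remove`) and the `sssom_row` helper are hypothetical names chosen for this example; `subject_id`, `predicate_id`, `object_id`, and `predicate_modifier` are standard SSSOM slots, but the actual layout of interpro2go.sssom.yaml may carry additional fields.

```python
# Verdict -> (predicate_id, predicate_modifier), following the convention above.
PREDICATES = {
    "sound": ("skos:exactMatch", None),
    "over_broad": ("skos:broadMatch", None),
    "remove": ("skos:exactMatch", "Not"),
}


def sssom_row(interpro_id, go_id, verdict):
    """Build one mapping row for an (InterPro entry, GO term) pair."""
    predicate_id, modifier = PREDICATES[verdict]
    row = {
        "subject_id": f"InterPro:{interpro_id}",
        "predicate_id": predicate_id,
        "object_id": go_id,
    }
    if modifier is not None:
        row["predicate_modifier"] = modifier
    return row


# Example: the DnaJ verdict from this document (ATP binding is REMOVE on IPR012724).
row = sssom_row("IPR012724", "GO:0005524", "remove")
print(row)
```

Encoding removal as a negated exactMatch keeps all three outcomes in one mapping set, so a single validator can check every row.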
Reproducibility
Supporting material under INTERPRO/ (extractor, generated TSVs,
and the SSSOM mapping set):
- INTERPRO/suspect_interpro_mappings.tsv: per-annotation verdicts.
- INTERPRO/interpro_family_priorities.tsv: per-entry worklist.
- INTERPRO/interpro2go.sssom.yaml: proposed edits.
# regenerate the per-annotation / per-entry worklist TSVs
uv run python projects/INTERPRO/extract_suspect_interpro_mappings.py
# validate the SSSOM mapping set (structure + GO label check on interpro2go.terms.yaml)
just validate-interpro-mappings
Family deep research (the InterPro analogue of just deep-research-perplexity) is wired as:
- templates/interpro_family_research.md: the deep-research prompt (judges whether each
InterPro2GO term holds for all members; subfamily divergence, pseudo-enzymes, taxon scope).
- scripts/deep_research_interpro_family.py: wrapper that loads the cached
interpro/<db>/<ID>/<ID>-metadata.yaml as context and runs the provider, writing
interpro/<db>/<ID>/<ID>-deep-research-<provider>.md.
# provider defaults to falcon (Edison)
just deep-research-interpro-family IPR000719
just deep-research-interpro-family IPR001128 openai --fallback perplexity-lite
just deep-research-interpro-family PTHR10314 perplexity --database panther
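The input and output locations follow the path pattern given above. A small sketch of that pattern, assuming a hypothetical `family_paths` helper (this is not the wrapper's actual code, and the default values simply mirror the defaults named in this section):

```python
from pathlib import PurePosixPath


def family_paths(entry_id, provider="falcon", database="interpro"):
    """Return the cached metadata path and the report path for one entry."""
    base = PurePosixPath("interpro") / database / entry_id
    return {
        "metadata": base / f"{entry_id}-metadata.yaml",
        "report": base / f"{entry_id}-deep-research-{provider}.md",
    }


paths = family_paths("IPR000719")
print(paths["metadata"])  # interpro/interpro/IPR000719/IPR000719-metadata.yaml
print(paths["report"])    # interpro/interpro/IPR000719/IPR000719-deep-research-falcon.md
```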
STATUS
Workstream 1 — review flagged mappings
- [x] Extractor that aggregates InterPro2GO verdicts across all reviews
- [x] Per-entry priority worklist (1826 entries; 265 with ≥3 suspect mappings)
- [ ] Work down the top entries: confirm the suspect verdicts, propose canonical
replacement terms, and feed corrections back to the gene reviews
- [ ] Summarize per-entry recommendations for InterPro2GO curators
Workstream 2 — family deep research
- [x] Confirmed the metadata fetcher supports interpro (IPR) entries
- [x] Generated deep-research process: templates/interpro_family_research.md,
scripts/deep_research_interpro_family.py, and the
just deep-research-interpro-family recipe
- [x] Seed example cached (interpro/interpro/IPR000719/)
- [x] First family deep research generated with falcon/Edison
(IPR000719-deep-research-falcon.md): verdict that both InterPro2GO terms
(ATP binding, protein phosphorylation) over-annotate the domain because it
also matches pseudokinases; REMOVE at the domain level, restrict to catalytic
children (GO:0004674 / GO:0004713)
- [x] Batch 1: 5 more top entries researched with falcon/Edison (P450, Cu/Zn SOD,
GPCR, NRAMP/SLC11, DnaJ); see the table below
- [x] Began feeding verdicts back into gene reviews (DnaJ family): the reviews are
strongly concordant with the family research. All 7 DnaJ genes with the
InterPro2GO ATP binding annotation already flagged it (5 REMOVE, 1 MODIFY to
ATPase-activator, 1 over-annotated). Hardened the one soft outlier (yeast/YDJ1,
MARK_AS_OVER_ANNOTATED → REMOVE) and attached the IPR012724 family report as
corroborating evidence.
- [ ] Per family, hunt for the higher-value case: a gene that currently ACCEPTs the
flagged term but is actually one of the verdict's exception members (pseudokinase,
copper chaperone, atypical chemokine/orphan receptor, non-catalytic P450), i.e. a genuine
missed over-annotation rather than a soft-vs-hard mismatch
- [x] Captured the family verdicts as proposed interpro2go edits in SSSOM YAML
(INTERPRO/interpro2go.sssom.yaml, 17 mappings over the 6 entries): the
consortium-facing deliverable, validated via just validate-interpro-mappings
- [ ] Continue down the worklist (interpro_family_priorities.tsv)
Family deep-research verdicts (falcon/Edison)
| InterPro | Family | Entry type | InterPro2GO verdict |
|---|---|---|---|
| IPR000719 | Protein kinase domain | domain | ATP binding + protein phosphorylation → REMOVE at domain level (captures pseudokinases); restrict to catalytic children (GO:0004674 / GO:0004713) |
| IPR001128 | Cytochrome P450 | family | heme binding + iron ion binding universal → keep; monooxygenase activity + oxidoreductase activity over-annotate (819+ functionally diverse families) |
| IPR001424 | Cu/Zn superoxide dismutase domain | domain | superoxide metabolic process → REMOVE (BP term on a structural module; copper-chaperone members don't dismutate); metal ion binding → KEEP_AS_NON_CORE |
| IPR000276 | GPCR, rhodopsin-like (Class A) | family | GPCR activity + GPCR signaling pathway → MARK_AS_OVER_ANNOTATED / MODIFY (atypical chemokine + orphan receptors lack canonical G-protein coupling); membrane → KEEP_AS_NON_CORE |
| IPR001046 | NRAMP / SLC11 metal transporter | family | metal ion transmembrane transporter activity + metal ion transport → ACCEPT as broad family terms; membrane → KEEP_AS_NON_CORE; do not add more specific terms at family level |
| IPR012724 | Chaperone DnaJ (J-domain) | family | ATP binding → REMOVE (factually wrong — the Hsp70 partner binds ATP, not DnaJ); protein folding → ACCEPT; response to heat → KEEP_AS_NON_CORE (only heat-inducible subfamilies) |
| IPR007197 | Radical SAM | domain | catalytic activity → ACCEPT despite being the MF root term — see note below; iron-sulfur cluster binding → ACCEPT (defining [4Fe-4S] cofactor) |
| IPR020849 | Small GTPase, Ras-type | family | GTP binding → ACCEPT; ADD GTPase activity (GO:0003924) — proposed new mapping (annotation gain); signal transduction → demote to subfamily (GO:0007265); membrane → MARK_AS_OVER_ANNOTATED |
| IPR002100 | Transcription factor, MADS-box | domain | DNA binding + protein dimerization activity → ACCEPT (both domain-intrinsic). Notably do NOT add DNA-binding TF activity — TF function is a whole-protein property (K/C domains + complex), not the MADS domain |
- [ ] Run just deep-research-interpro-family <IPR> (falcon/Edison default) for the next entries
Last updated: 2026-06-20
NOTES
2026-06-20
Project creation. Scoped the InterPro2GO (GO_REF:0000002) review. Built the
extractor and the per-entry priority worklist from all 2732 reviewed genes: 3652
InterPro2GO annotations, 47% flagged suspect across 1826 InterPro entries. Broad
domain/superfamily signatures dominate the suspect list (protein kinase domain, P450,
Cu/Zn SOD, GPCR), confirming the "fold ≠ function" failure mode as the main driver.
Closed the PANTHER-vs-InterPro deep-research gap. Gene deep research is a generated
process (just deep-research-<provider>); there was no equivalent for the InterPro
entries behind InterPro2GO annotations. Added the InterPro-family analogue —
templates/interpro_family_research.md, scripts/deep_research_interpro_family.py, and
the just deep-research-interpro-family <IPR> [provider] recipe (provider defaults to
falcon/Edison) — so families are researched by the same generated pipeline (output:
interpro/<db>/<ID>/<ID>-deep-research-<provider>.md), with IPR000719 cached as a
seed.
Batch 2 + first proposed new mapping. Researched 6 more families; 3 grounded cleanly
(Radical SAM, Ras-type small GTPase, MADS-box) and are in the SSSOM (now 25 mappings).
Notable findings:
- Genericity ≠ wrongness (Radical SAM, IPR007197). I predicted the MF root term
catalytic activity would be a REMOVE. The research says ACCEPT: because the
superfamily catalyzes >100 mechanistically different reactions, the only universally
true MF really is "is an enzyme", so the maximally generic term is the correct
family-level annotation; replacing it with anything more specific would over-annotate.
- First annotation-gain proposal (Ras-type, IPR020849). The entry maps GTP binding
but not GTPase activity (GO:0003924), even though the GTP-hydrolysis machinery
(P-loop, Switch II/Gln61, Mg²⁺) is universal. We therefore propose ADDING it (an
exactMatch row, flagged for curator confirmation since intrinsic hydrolysis is
GAP-accelerated).
- Domain-intrinsic vs whole-protein (MADS-box, IPR002100). I expected to add
DNA-binding TF activity; the research argues against it. The MADS domain provides
DNA binding + dimerization, but being a transcription factor is a whole-protein
property (K/C domains, complex context), so adding it would over-annotate the domain.
- QC catch. 3 of the 6 runs (sigma-54, pseudouridine synthase, GAPDH) silently
returned ungrounded reports (exit 0, file written, but "no contexts were retrieved
… not grounded in evidence", zero real citations). The first suspicion was Edison
retrieval throttling under 6-way parallel load, but a sequential re-run also failed fast
(~3 s each, no retrieval), which points to a transient Edison retrieval-backend outage
rather than a load issue. These 3 entries are deferred (metadata cached) for re-running
when the backend recovers, and are excluded from the SSSOM. (Detect with: grep for
"no contexts were retrieved" or citation_count/zero real refs; a groundedness guard
worth adding to the wrapper.)
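A minimal sketch of the groundedness guard suggested in the QC note. The marker string comes from the failed reports quoted above; the `looks_ungrounded` helper and the citation heuristic (counting URLs, DOIs, and PMIDs) are assumptions for illustration and would need tuning to the actual report format before being wired into the wrapper.

```python
import re

# Phrase observed in the ungrounded reports described above.
UNGROUNDED_MARKER = "no contexts were retrieved"

# Assumed heuristic for "real citations": URLs, DOIs, or PMIDs in the text.
CITATION_PATTERN = re.compile(
    r"https?://\S+|\bdoi:\s*\S+|\bPMID:?\s*\d+", flags=re.IGNORECASE
)


def looks_ungrounded(report_text, min_citations=1):
    """Return True if a report shows signs of having no retrieved evidence."""
    if UNGROUNDED_MARKER in report_text.lower():
        return True
    return len(CITATION_PATTERN.findall(report_text)) < min_citations


failed = "No contexts were retrieved, so this answer is not grounded in evidence."
grounded = "J-domain proteins stimulate Hsp70 ATPase activity (PMID: 12345678)."
print(looks_ungrounded(failed))    # True
print(looks_ungrounded(grounded))  # False
```

A wrapper could call such a check after each run and exit non-zero, so that an ungrounded report is not silently treated as a success.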
Batch 1 of family deep research (falcon/Edison). Ran five more top entries: P450
(IPR001128), Cu/Zn SOD (IPR001424), GPCR Class A (IPR000276), NRAMP/SLC11
(IPR001046), and DnaJ (IPR012724). See the verdict table under Workstream 2. A
recurring, independently-reached pattern: cofactor/binding terms (heme binding, metal
ion binding) and broad transport terms hold family-wide, but whole-protein activity
and process terms attached to a structural module over-annotate — most sharply for
IPR012724, where Edison flags ATP binding as factually wrong on DnaJ (the Hsp70 partner
binds ATP), and for IPR001424, where superoxide metabolic process mis-annotates
copper-chaperone members that do not dismutate.