InterPro Mapping Review Project

SCOPING PIPELINE

Overview

This project reviews InterPro2GO mappings — the GO annotations a protein receives
because it matched an InterPro signature. In GOA these are the IEA annotations with
REFERENCE = GO_REF:0000002, WITH/FROM = InterPro:IPRxxxxxx, and ASSIGNED BY = InterPro. They are produced by a single curated mapping file (InterPro2GO)
that attaches GO terms to an InterPro entry; every protein matching that signature
then inherits all of the entry's terms, with no subfamily resolution.
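
The selection criteria above can be sketched as a filter over a GOA file in GAF 2.x format. This is a minimal illustration, not the project's extractor: the column positions follow the standard GAF 2.x layout, while the function names and the two sample rows are hypothetical and synthetic.

```python
import csv
import io

# 0-based positions of the GAF 2.x columns this filter reads.
COL_GO_ID, COL_REFERENCE, COL_EVIDENCE, COL_WITH_FROM, COL_ASSIGNED_BY = 4, 5, 6, 7, 14


def is_interpro2go(row):
    """True if a GAF row is an IEA annotation derived from InterPro2GO."""
    if len(row) <= COL_ASSIGNED_BY:
        return False
    references = row[COL_REFERENCE].split("|")
    with_from = row[COL_WITH_FROM].split("|")
    return (
        row[COL_EVIDENCE] == "IEA"
        and "GO_REF:0000002" in references
        and any(value.startswith("InterPro:IPR") for value in with_from)
        and row[COL_ASSIGNED_BY] == "InterPro"
    )


def interpro2go_rows(handle):
    """Yield matching rows from an open GAF file, skipping '!' header lines."""
    data_lines = (line for line in handle if not line.startswith("!"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if is_interpro2go(row):
            yield row


# Synthetic rows for illustration only (not real annotations).
row_a = ["UniProtKB", "P00000", "GENE1", "enables", "GO:0005524", "GO_REF:0000002",
         "IEA", "InterPro:IPR000719", "F", "", "", "protein", "taxon:9606",
         "20240101", "InterPro", "", ""]
row_b = ["UniProtKB", "P00000", "GENE1", "enables", "GO:0004672", "GO_REF:0000033",
         "IBA", "PANTHER:PTN000000000", "F", "", "", "protein", "taxon:9606",
         "20240101", "GO_Central", "", ""]
sample = "\n".join(["!gaf-version: 2.2", "\t".join(row_a), "\t".join(row_b)])

hits = list(interpro2go_rows(io.StringIO(sample)))
print(len(hits), hits[0][COL_GO_ID])  # 1 GO:0005524
```

Only the first row passes, because it satisfies all four conditions at once; the second is an IBA annotation assigned by a different source.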

Two complementary workstreams:

  1. Review the flagged mappings. Every gene review already adjudicates its
    InterPro2GO annotations. We aggregate those verdicts to find which InterPro entries
    systematically over-annotate (and to surface individual suspect mappings for
    correction). This is the InterPro analogue of the IBA Annotation Review
    and Over-Annotation Patterns projects.
  2. Deep-research the families themselves. For PANTHER, just fetch-gene already
    auto-caches the family into interpro/panther/<FAM>/, and gene deep research is a
    generated process (just deep-research-perplexity ...). There was no equivalent
    deep-research process for InterPro entries; this project closes that gap by
    adding a just deep-research-interpro-family recipe (see Family deep research
    below).

Why InterPro2GO over-annotates

InterPro2GO is a blanket rule per entry. The recurrent failure modes:

Current state of the evidence

Across all reviewed genes (regenerate the underlying TSVs — see
Reproducibility):

Action breakdown of the reviewed InterPro2GO annotations:

| Action | Count | Meaning |
| --- | --- | --- |
| ACCEPT | 1870 | mapping endorsed as-is |
| MODIFY | 609 | true but too broad → more specific term proposed |
| KEEP_AS_NON_CORE | 523 | true but not a core function |
| MARK_AS_OVER_ANNOTATED | 391 | over-annotation of this gene |
| REMOVE | 194 | not correct for this gene |
| NEW / PENDING / UNDECIDED | 65 | other |
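
As a quick arithmetic check, the counts above can be totalled and the suspect share derived from them. The sketch assumes that "suspect" means the four actions MODIFY, KEEP_AS_NON_CORE, MARK_AS_OVER_ANNOTATED, and REMOVE; the extractor's actual definition is not shown here, though this grouping reproduces the 3652 annotations and 47% suspect rate recorded in the notes.

```python
# Counts copied from the action breakdown table.
ACTION_COUNTS = {
    "ACCEPT": 1870,
    "MODIFY": 609,
    "KEEP_AS_NON_CORE": 523,
    "MARK_AS_OVER_ANNOTATED": 391,
    "REMOVE": 194,
    "NEW / PENDING / UNDECIDED": 65,
}

# Assumption: these four actions are what the project counts as "suspect".
SUSPECT_ACTIONS = {"MODIFY", "KEEP_AS_NON_CORE", "MARK_AS_OVER_ANNOTATED", "REMOVE"}

total = sum(ACTION_COUNTS.values())
suspect = sum(ACTION_COUNTS[action] for action in SUSPECT_ACTIONS)
suspect_share = suspect / total

print(f"{suspect} of {total} suspect ({suspect_share:.0%})")  # 1717 of 3652 suspect (47%)
```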

Highest-impact InterPro entries

Top entries by number of suspect mappings (interpro_family_priorities.tsv). These are
the deep-research worklist:

| InterPro | Name | Reviewed | Suspect | Typical issue |
| --- | --- | --- | --- | --- |
| IPR000719 | Protein kinase domain | 112 | 50 | protein kinase activity → MODIFY to ser/thr or receptor-kinase child; ATP binding KEEP_AS_NON_CORE |
| IPR008271 | Ser/Thr kinase, active site | 55 | 34 | same kinase-domain pattern |
| IPR001128 / IPR036396 | Cytochrome P450 | 44 | 26 | generic monooxygenase/oxidoreductase, heme/iron binding non-core |
| IPR001424 / IPR036423 | Cu/Zn superoxide dismutase | 20 | 16 | superoxide metabolic process over-annotated on copper chaperones (e.g. Ccs) that don't themselves dismutate |
| IPR001046 | NRAMP / SLC11 metal transporter | 27 | 15 | generic metal ion transport → MODIFY to substrate-specific |
| IPR012724 | Chaperone DnaJ | 24 | 13 | broad chaperone process terms |
| IPR000276 | GPCR, rhodopsin-like | 21 | 12 | GPCR signaling pathway KEEP_AS_NON_CORE on receptors with a more specific role |

Concrete example (kinase domain): plant BAK1 carried GO:0004672 protein kinase activity from IPR000719; reviewed as MODIFY → GO:0004675 (transmembrane receptor
ser/thr kinase activity), and its GO:0005524 ATP binding as KEEP_AS_NON_CORE.
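
The per-entry ranking behind this worklist can be sketched as a simple aggregation over (InterPro entry, review action) pairs. This is an illustration under assumptions: the set of suspect actions, the function name, and the toy records are hypothetical, and the real extractor and the column layout of interpro_family_priorities.tsv may differ.

```python
from collections import defaultdict

# Assumption: these actions count as "suspect" for ranking purposes.
SUSPECT_ACTIONS = {"MODIFY", "KEEP_AS_NON_CORE", "MARK_AS_OVER_ANNOTATED", "REMOVE"}


def rank_entries(records):
    """Aggregate (interpro_id, action) pairs into (id, reviewed, suspect) rows.

    Rows are sorted by suspect count, then reviewed count (both descending),
    with the InterPro ID as a stable tie-breaker.
    """
    reviewed = defaultdict(int)
    suspect = defaultdict(int)
    for interpro_id, action in records:
        reviewed[interpro_id] += 1
        if action in SUSPECT_ACTIONS:
            suspect[interpro_id] += 1
    rows = [(ipr, reviewed[ipr], suspect[ipr]) for ipr in reviewed]
    return sorted(rows, key=lambda row: (-row[2], -row[1], row[0]))


# Toy records for illustration only (not real review data).
records = [
    ("IPR000719", "MODIFY"),
    ("IPR000719", "KEEP_AS_NON_CORE"),
    ("IPR000719", "ACCEPT"),
    ("IPR001128", "ACCEPT"),
    ("IPR001128", "MODIFY"),
    ("IPR002100", "ACCEPT"),
]

ranking = rank_entries(records)
for interpro_id, n_reviewed, n_suspect in ranking:
    print(interpro_id, n_reviewed, n_suspect)
```

Sorting on the absolute suspect count rather than the suspect fraction favours entries that affect many genes, which is the ordering the worklist describes.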

Family-level deep research

Family deep research is a generated process, the same as gene deep research — not a
hand-written file. It judges whether each InterPro2GO term is appropriate for all
members of the entry, accounting for subfamily divergence, pseudo-enzymes, and taxonomic
scope. (The recipe and its wiring are in Reproducibility.)

A worked starting point (IPR000719, Protein kinase domain) is already cached. Its
InterPro2GO terms are GO:0005524 ATP binding and GO:0006468 protein phosphorylation
— both generic, which is exactly why genes matching it are repeatedly MODIFY'd to
specific kinase-activity children or marked non-core.

Proposed interpro2go edits (SSSOM)

The family verdicts are captured as proposed edits to the InterPro2GO mapping file in
SSSOM YAML, the same format the RHEA and AMR projects use for proposing mappings:
INTERPRO/interpro2go.sssom.yaml. One row per (InterPro entry, GO term) reviewed, with a
3-way predicate convention:

Reproducibility

Supporting material under INTERPRO/ (extractor, generated TSVs,
and the SSSOM mapping set):

# regenerate the per-annotation / per-entry worklist TSVs
uv run python projects/INTERPRO/extract_suspect_interpro_mappings.py

# validate the SSSOM mapping set (structure + GO label check on interpro2go.terms.yaml)
just validate-interpro-mappings

Family deep research (the InterPro analogue of just deep-research-perplexity) is wired as:

# provider defaults to falcon (Edison)
just deep-research-interpro-family IPR000719
just deep-research-interpro-family IPR001128 openai --fallback perplexity-lite
just deep-research-interpro-family PTHR10314 perplexity --database panther
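
To locate a report after a run, the output location can be derived from the pattern recorded in the notes (interpro/<db>/<ID>/<ID>-deep-research-<provider>.md). A minimal sketch, assuming the <db> directory name equals the value passed to --database; the helper name is hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath


def deep_research_path(db, entry_id, provider):
    """Build the expected report path for one family deep-research run.

    Assumption: the <db> directory matches the --database value
    (e.g. "panther", as in interpro/panther/<FAM>/).
    """
    filename = f"{entry_id}-deep-research-{provider}.md"
    return PurePosixPath("interpro") / db / entry_id / filename


report = deep_research_path("panther", "PTHR10314", "perplexity")
print(report)  # interpro/panther/PTHR10314/PTHR10314-deep-research-perplexity.md
```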

STATUS

Workstream 1 — review flagged mappings

Workstream 2 — family deep research

Family deep-research verdicts (falcon/Edison)

| InterPro | Family | Entry type | InterPro2GO verdict |
| --- | --- | --- | --- |
| IPR000719 | Protein kinase domain | domain | ATP binding + protein phosphorylation → REMOVE at domain level (captures pseudokinases); restrict to catalytic children (GO:0004674 / GO:0004713) |
| IPR001128 | Cytochrome P450 | family | heme binding + iron ion binding universal → keep; monooxygenase activity + oxidoreductase activity over-annotate (819+ functionally diverse families) |
| IPR001424 | Cu/Zn superoxide dismutase domain | domain | superoxide metabolic process → REMOVE (BP term on a structural module; copper-chaperone members don't dismutate); metal ion binding → KEEP_AS_NON_CORE |
| IPR000276 | GPCR, rhodopsin-like (Class A) | family | GPCR activity + GPCR signaling pathway → MARK_AS_OVER_ANNOTATED / MODIFY (atypical chemokine + orphan receptors lack canonical G-protein coupling); membrane → KEEP_AS_NON_CORE |
| IPR001046 | NRAMP / SLC11 metal transporter | family | metal ion transmembrane transporter activity + metal ion transport → ACCEPT as broad family terms; membrane → KEEP_AS_NON_CORE; do not add more specific terms at family level |
| IPR012724 | Chaperone DnaJ (J-domain) | family | ATP binding → REMOVE (factually wrong: the Hsp70 partner binds ATP, not DnaJ); protein folding → ACCEPT; response to heat → KEEP_AS_NON_CORE (only heat-inducible subfamilies) |
| IPR007197 | Radical SAM | domain | catalytic activity → ACCEPT despite being the MF root term (see note below); iron-sulfur cluster binding → ACCEPT (defining [4Fe-4S] cofactor) |
| IPR020849 | Small GTPase, Ras-type | family | GTP binding → ACCEPT; ADD GTPase activity (GO:0003924) as a proposed new mapping (annotation gain); signal transduction → demote to subfamily (GO:0007265); membrane → MARK_AS_OVER_ANNOTATED |
| IPR002100 | Transcription factor, MADS-box | domain | DNA binding + protein dimerization activity → ACCEPT (both domain-intrinsic). Notably do NOT add DNA-binding TF activity: TF function is a whole-protein property (K/C domains + complex), not a property of the MADS domain |

Last updated: 2026-06-20

NOTES

2026-06-20

Project creation. Scoped the InterPro2GO (GO_REF:0000002) review. Built the
extractor and the per-entry priority worklist from all 2732 reviewed genes: 3652
InterPro2GO annotations, 47% flagged suspect across 1826 InterPro entries. Broad
domain/superfamily signatures dominate the suspect list (protein kinase domain, P450,
Cu/Zn SOD, GPCR), confirming the "fold ≠ function" failure mode as the main driver.

Closed the PANTHER-vs-InterPro deep-research gap. Gene deep research is a generated
process (just deep-research-<provider>); there was no equivalent for the InterPro
entries behind InterPro2GO annotations. Added the InterPro-family analogue —
templates/interpro_family_research.md, scripts/deep_research_interpro_family.py, and
the just deep-research-interpro-family <IPR> [provider] recipe (provider defaults to
falcon/Edison) — so families are researched by the same generated pipeline (output:
interpro/<db>/<ID>/<ID>-deep-research-<provider>.md), with IPR000719 cached as a
seed.

Batch 2 + first proposed new mapping. Researched 6 more families; 3 grounded cleanly
(Radical SAM, Ras-type small GTPase, MADS-box) and are in the SSSOM (now 25 mappings).
Notable findings:

Batch 1 of family deep research (falcon/Edison). Ran five more top entries: P450
(IPR001128), Cu/Zn SOD (IPR001424), GPCR Class A (IPR000276), NRAMP/SLC11
(IPR001046), and DnaJ (IPR012724). See the verdict table under Workstream 2. A
recurring, independently reached pattern: cofactor/binding terms (heme binding, metal
ion binding) and broad transport terms hold family-wide, but whole-protein activity
and process terms attached to a structural module over-annotate. This is sharpest for
IPR012724, where Edison flags ATP binding as factually wrong on DnaJ (the Hsp70 partner
binds ATP), and for IPR001424, where superoxide metabolic process mis-annotates
copper-chaperone members that do not dismutate.