SwissProt Keywords (SPKW) Unique Terms Project

Overview

This project reviews genes that have GO annotations derived solely from UniProt Keywords (SPKW) via GO_REF:0000043, with no corroborating evidence from experimental, computational, or curator sources. The goal is to identify systematic over-annotation patterns and distinguish legitimate SPKW contributions from problematic mappings.
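
The selection criterion above can be sketched in Python. This is only an illustration, not the project's actual query (the SQL lives in SPKW-METHODOLOGY.md). The `Annotation` tuple and the function name `spkw_only_annotations` are hypothetical, the gene names and the PMID are placeholders, and the input is assumed to be already-parsed annotation rows.

```python
from collections import defaultdict
from typing import NamedTuple

SPKW_REF = "GO_REF:0000043"  # reference carried by keyword-derived annotations


class Annotation(NamedTuple):
    """Hypothetical minimal annotation row (not a real GAF parser)."""
    gene: str
    go_term: str
    reference: str


def spkw_only_annotations(annotations):
    """Return sorted (gene, go_term) pairs whose only reference is SPKW_REF."""
    refs_by_pair = defaultdict(set)
    for ann in annotations:
        refs_by_pair[(ann.gene, ann.go_term)].add(ann.reference)
    return sorted(
        pair for pair, refs in refs_by_pair.items() if refs == {SPKW_REF}
    )


# Placeholder data: GENE_B has a second, non-SPKW reference, so it is excluded.
rows = [
    Annotation("GENE_A", "GO:0006915", SPKW_REF),
    Annotation("GENE_B", "GO:0006915", SPKW_REF),
    Annotation("GENE_B", "GO:0006915", "PMID:0000000"),
]
print(spkw_only_annotations(rows))
```

This exact-match version treats an annotation as corroborated only when another source uses the same GO term; it does not account for annotations to more specific terms.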

Key Findings

Cumulative Results

| Subproject | Organism | Total Genes | Reviewed | Issue Rate | Main Pattern |
|---|---|---|---|---|---|
| Apoptosis | Human | 280 | 23 | 87% | Regulatory conflation |
| Rhythmic Process | Human | 146 | 5 | 100% | Expression ≠ function |
| Autophagy | Human | 123 | 14 | 79% | Signaling over-extension |
| ANOGA | A. gambiae | 5,812 | 22 | Mixed | D7 toxin = 100%, immune = 17% |
| SCHPO | S. pombe | 1,963 | 7 | 100% | ATG-meiosis conflation |
| DROME | D. melanogaster | 2,753 | 4 | 50% | Mixed patterns |
| PSEPK | P. putida | 1,098 | 4 | 25% | RT defense keyword |
| ARATH | A. thaliana | 8,433 | 4 | 75% | Subclade divergence |
| BPT4 | Phage T4 | ~300 | 3 | 100% | Eukaryote-centric terms |
| ECO57 | E. coli O157 | ~74,000 | 2 | 50% | Toxin vs effector |

Methods

See SPKW-METHODOLOGY.md for detailed SQL queries and explanation of closure-based filtering (which reduces false positives by 70%+ in well-curated organisms).
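
One plausible reading of closure-based filtering is sketched below; the project's actual queries are in SPKW-METHODOLOGY.md and may differ. The assumption here is that an SPKW annotation to a term counts as corroborated when another source annotates the same gene to that term or to any of its descendants. The function names, the `parents` mapping, and the toy term IDs are all hypothetical.

```python
def ancestors(term, parents):
    """Return the term plus all of its ancestors (reflexive closure)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure_filtered(spkw_pairs, other_pairs, parents):
    """Keep SPKW (gene, term) pairs not implied by any non-SPKW annotation."""
    implied = set()
    for gene, term in other_pairs:
        # An annotation to a term also implies every ancestor of that term.
        for anc in ancestors(term, parents):
            implied.add((gene, anc))
    return [pair for pair in spkw_pairs if pair not in implied]


# Toy ontology (assumed for illustration): T:child is_a T:parent.
parents = {"T:child": ["T:parent"]}
spkw = [("G1", "T:parent"), ("G2", "T:parent")]
other = [("G1", "T:child")]
print(closure_filtered(spkw, other, parents))
```

In this toy case G1 is dropped because its non-SPKW annotation to the child term already implies the parent term, while G2 remains as a candidate for review.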

Over-Annotation Patterns Identified

| Pattern | Description | Examples | Action |
|---|---|---|---|
| Process conflation | Gene active during process X gets annotated to X | ATG genes → meiosis (S. pombe) | REMOVE |
| Regulatory conflation | Gene regulates X, annotated to X | AIMP2 → apoptotic process | MODIFY to regulatory term |
| Caspase substrate | Cleaved by caspases, annotated to apoptosis | AIMP1, BCAP31 | REMOVE |
| Signaling over-extension | 4+ steps from direct function | Sin1 → apoptosis | REMOVE |
| Eukaryote-centric terms | Immune/defense terms for phage-bacteria | T4 DAM → innate immune | REMOVE |
| Toxin vs effector | Effectors incorrectly called toxins | NleB1 (E. coli) | REMOVE |
| Subclade divergence | Family keyword ignores subfunctionalization | LCR1 (Arabidopsis DEFL) | REMOVE |
| Kratagonist ≠ toxin | Sequestration ≠ toxin activity | D7 proteins (mosquito) | MODIFY |

Legitimate SPKW Contributions

Not all SPKW-unique annotations are over-annotations:

Project Status

Phase 1 (Original)

Subprojects

Curation Recommendations

  1. Check regulatory vs participatory - many genes regulate processes but don't participate IN them
  2. Consider organism biology - same GO term can have different validity across taxa
  3. Distinguish toxins from effectors - direct cytotoxicity vs signaling modulation
  4. Validate family-level keywords - subfunctionalization can invalidate family annotations
  5. Expression ≠ function - upregulation during a process doesn't mean functional involvement

Swiss-Prot vs TrEMBL Analysis

Key finding: Keywords on Swiss-Prot entries are manually assigned by curators, not by ARBA/UniRule automatic systems. This means:

| Organism | Swiss-Prot % | Implication |
|---|---|---|
| Human | 99.6% | Over-annotations reflect manual curator keyword choices |
| T4 Phage | 99.6% | Same: curators chose these keywords |
| E. coli O157 | 88.6% | Mostly manual |
| P. putida | 32.2% | Mixed manual/automatic |
| A. gambiae | 3.8% | Mostly automatic keyword assignment |

Of the 71 genes with over-annotation issues, 70 (99%) are Swiss-Prot entries.

This confirms the problem is in the KW→GO mapping layer, not keyword assignment. See SPKW-METHODOLOGY.md for stratification queries.
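
The Swiss-Prot percentages in this section are simple proportions. A minimal sketch of the stratification, assuming each entry is available as a (group, is_reviewed) pair, might look like the following; the function name and the input shape are hypothetical and are not taken from the methodology file.

```python
from collections import defaultdict


def swissprot_percent_by_group(entries):
    """Percentage of reviewed (Swiss-Prot) entries per group, one decimal."""
    totals = defaultdict(int)
    reviewed = defaultdict(int)
    for group, is_reviewed in entries:
        totals[group] += 1
        reviewed[group] += bool(is_reviewed)
    return {g: round(100.0 * reviewed[g] / totals[g], 1) for g in totals}


# The 70-of-71 figure quoted in this section, expressed as entries.
flagged = [("flagged", True)] * 70 + [("flagged", False)]
print(swissprot_percent_by_group(flagged))  # {'flagged': 98.6}
```

The unrounded value is 98.6%, which the text reports as 99%.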


Session Notes

2026-02-04