SwissProt Keywords (SPKW) Unique Terms Project

Overview

This project reviews genes that have GO annotations derived solely from UniProt Keywords (SPKW) via GO_REF:0000043, with no corroborating evidence from experimental, computational, or curator sources. The goal is to identify systematic over-annotation patterns and distinguish legitimate SPKW contributions from problematic mappings.
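
The selection criterion above can be sketched in a few lines of Python. This is only an illustration, not the project's actual query: it assumes annotations are available as `(gene, go_term, reference)` tuples, and the gene names and the non-SPKW reference string are hypothetical placeholders.

```python
from collections import defaultdict

# Reference used for UniProt Keyword (SPKW) derived GO annotations.
SPKW_REF = "GO_REF:0000043"

def spkw_only_annotations(annotations):
    """Return sorted (gene, term) pairs whose only supporting reference is SPKW_REF.

    `annotations` is an iterable of (gene, go_term, reference) tuples
    (an assumed input shape, not the project's real schema).
    """
    refs_by_pair = defaultdict(set)
    for gene, term, ref in annotations:
        refs_by_pair[(gene, term)].add(ref)
    # A pair qualifies only if SPKW is its sole source of support.
    return sorted(pair for pair, refs in refs_by_pair.items() if refs == {SPKW_REF})

# Hypothetical example data: GENE_A has corroborating evidence, GENE_B does not.
example = [
    ("GENE_A", "GO:0006915", "GO_REF:0000043"),
    ("GENE_A", "GO:0006915", "OTHER_REF:placeholder"),
    ("GENE_B", "GO:0006915", "GO_REF:0000043"),
]
print(spkw_only_annotations(example))
```

Only `GENE_B` is reported here, because its annotation has no support other than the keyword mapping.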

Key Findings

Cumulative Results

| Subproject | Organism | Total Genes | Reviewed | Issue Rate | Main Pattern |
|---|---|---|---|---|---|
| Apoptosis | Human | 280 | 23 | 87% | Regulatory conflation |
| Rhythmic Process | Human | 146 | 5 | 100% | Expression ≠ function |
| Autophagy | Human | 123 | 14 | 79% | Signaling over-extension |
| ANOGA | A. gambiae | 5,812 | 22 | Mixed | D7 toxin=100%, immune=17% |
| SCHPO | S. pombe | 1,963 | 7 | 100% | ATG-meiosis conflation |
| DROME | D. melanogaster | 2,753 | 4 | 50% | Mixed patterns |
| PSEPK | P. putida | 1,098 | 4 | 25% | RT defense keyword |
| ARATH | A. thaliana | 8,433 | 4 | 75% | Subclade divergence |
| Virus clades | Viral taxa | 54,131 | 11 | 55% | Host-context mismatch, specificity |
| PLANTS | Non-ARATH plants | 4,117 | 38 | 15% Tier-A | Term-tiering; GOA retired SPKW |
| BPT4 | Phage T4 | ~300 | 3 | 100% | Eukaryote-centric terms |
| ECO57 | E. coli O157 | ~74,000 | 2 | 50% | Toxin vs effector |

Methods

See SPKW-METHODOLOGY.md for detailed SQL queries and explanation of closure-based filtering (which reduces false positives by 70%+ in well-curated organisms).
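
As a rough illustration of what closure-based filtering could look like, the sketch below treats an SPKW annotation as corroborated when any non-SPKW annotation on the same gene points at the same term or one of its descendants. This is an assumed reading, not the queries in SPKW-METHODOLOGY.md; the `parents` mapping and the `T:` term identifiers are hypothetical stand-ins for the real ontology.

```python
def ancestors(term, parents):
    """Return `term` plus all of its ancestors (reflexive transitive closure)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def closure_filtered(spkw_terms, other_terms, parents):
    """Keep SPKW terms for one gene that no non-SPKW annotation implies.

    A non-SPKW annotation to a term implies every ancestor of that term,
    so an SPKW term found in that closure is not truly SPKW-unique.
    """
    implied = set()
    for term in other_terms:
        implied |= ancestors(term, parents)
    return {term for term in spkw_terms if term not in implied}

# Hypothetical toy ontology: child -> parent -> root, other -> root.
parents = {
    "T:child": ["T:parent"],
    "T:parent": ["T:root"],
    "T:other": ["T:root"],
}
# The gene has an SPKW annotation to T:parent and T:other,
# plus a non-SPKW annotation to T:child.
unique = closure_filtered({"T:parent", "T:other"}, {"T:child"}, parents)
print(unique)
```

In this toy case `T:parent` is dropped because the non-SPKW annotation to `T:child` already implies it, leaving `T:other` as the only SPKW-unique term. An exact-match comparison would have flagged both, which is the kind of false positive the closure step is meant to remove.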

Over-Annotation Patterns Identified

| Pattern | Description | Examples | Action |
|---|---|---|---|
| Process conflation | Gene active during process X gets annotated to X | ATG genes → meiosis (S. pombe) | REMOVE |
| Regulatory conflation | Gene regulates X, annotated to X | AIMP2 → apoptotic process | MODIFY to regulatory term |
| Caspase substrate | Cleaved by caspases, annotated to apoptosis | AIMP1, BCAP31 | REMOVE |
| Signaling over-extension | 4+ steps from direct function | Sin1 → apoptosis | REMOVE |
| Eukaryote-centric terms | Immune/defense terms for phage-bacteria | T4 DAM → innate immune | REMOVE |
| Viral host-context mismatch | Same host-pathogen term valid in one viral clade but wrong in another | Phage AcrF8 → innate immune | MODIFY |
| Toxin vs effector | Effectors incorrectly called toxins | NleB1 (E. coli) | REMOVE |
| Subclade divergence | Family keyword ignores subfunctionalization | LCR1 (Arabidopsis DEFL) | REMOVE |
| Kratagonist ≠ toxin | Sequestration ≠ toxin activity | D7 proteins (mosquito) | MODIFY |
| Enzyme-class keyword → bare process | An activity keyword maps to a generic, substrate-less process term; substrate specificity lives on the MF branch | Methyltransferase → methylation (plant MTases: MET1A, EZ1, CCOAOMT, COQ5) | MARK_OVER / MODIFY |

Legitimate SPKW Contributions

Not all SPKW-unique annotations are over-annotations:

Project Status

Phase 1 (Original)

Subprojects

Curation Recommendations

  1. Check regulatory vs participatory - many genes regulate processes but don't participate IN them
  2. Consider organism biology - same GO term can have different validity across taxa
  3. Distinguish toxins from effectors - direct cytotoxicity vs signaling modulation
  4. Validate family-level keywords - subfunctionalization can invalidate family annotations
  5. Expression ≠ function - upregulation during a process doesn't mean functional involvement
  6. Check viral host context - phage-bacterium interactions need different terms from eukaryotic viral immune evasion

Swiss-Prot vs TrEMBL Analysis

Key finding: Keywords on Swiss-Prot entries are manually assigned by curators, not by ARBA/UniRule automatic systems. This means:

| Organism | Swiss-Prot % | Implication |
|---|---|---|
| Human | 99.6% | Over-annotations reflect manual curator keyword choices |
| T4 Phage | 99.6% | Same - curators chose these keywords |
| E. coli O157 | 88.6% | Mostly manual |
| P. putida | 32.2% | Mixed manual/automatic |
| Virus (all) | 13.1% | Mostly TrEMBL; errors may reflect mapping or automatic keyword assignment |
| A. gambiae | 3.8% | Mostly automatic keyword assignment |

Of the 71 genes with over-annotation issues, 70 (99%) are Swiss-Prot entries.

For reviewed high-confidence organism batches, this confirms the problem is usually in the KW→GO mapping layer, not keyword assignment. Virus-wide analysis is different because most candidates are TrEMBL. See SPKW-METHODOLOGY.md and SPKW-VIRUS.md for stratification queries.
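
The stratification itself is simple arithmetic. A minimal Python sketch follows, assuming each candidate is a record with a boolean `reviewed` flag (true for Swiss-Prot, false for TrEMBL). The record shape and the accession strings are hypothetical, and this is not the stratification query from the methodology files.

```python
def swissprot_share(entries):
    """Return the percentage of entries flagged as reviewed (Swiss-Prot)."""
    total = len(entries)
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty batch
    reviewed = sum(1 for entry in entries if entry["reviewed"])
    return 100.0 * reviewed / total

# Hypothetical records mirroring the counts above:
# 71 flagged genes, of which 70 are Swiss-Prot and 1 is TrEMBL.
flagged = [{"accession": f"ACC{i}", "reviewed": i != 0} for i in range(71)]
print(round(swissprot_share(flagged)))  # 99
```

Running the same function per organism batch would give the kind of per-organism percentages shown in the table, provided the reviewed flag is taken from the UniProt entry status.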


Session Notes

2026-05-30

2026-05-29

2026-05-21

2026-02-04

2026-05-21