SwissProt Keywords (SPKW) Unique Terms Project
Overview
This project reviews genes that have GO annotations derived solely from UniProt Keywords (SPKW) via GO_REF:0000043, with no corroborating evidence from experimental, computational, or curator sources. The goal is to identify systematic over-annotation patterns and distinguish legitimate SPKW contributions from problematic mappings.
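The selection criterion above can be sketched in a few lines. This is a minimal illustration, not the project's actual query (those live in SPKW-METHODOLOGY.md): the flat `(gene, go_term, reference)` tuple layout, the function name `spkw_unique`, and the placeholder gene and reference names are all assumptions made for the example.

```python
from collections import defaultdict

SPKW_REF = "GO_REF:0000043"  # reference cited by keyword-derived annotations


def spkw_unique(annotations):
    """Return (gene, term) pairs supported only by GO_REF:0000043.

    `annotations` is an iterable of (gene, go_term, reference) tuples;
    this flat layout is an assumption made for illustration.
    """
    refs_by_pair = defaultdict(set)
    for gene, term, ref in annotations:
        refs_by_pair[(gene, term)].add(ref)
    # A pair is SPKW-unique when the keyword mapping is its only reference.
    return sorted(pair for pair, refs in refs_by_pair.items() if refs == {SPKW_REF})


# Placeholder genes and a placeholder second reference, not real records.
example = [
    ("GENE_A", "GO:0006915", "GO_REF:0000043"),
    ("GENE_A", "GO:0006915", "OTHER_REF:1"),     # corroborated, so excluded
    ("GENE_B", "GO:0006915", "GO_REF:0000043"),  # keyword-only, so kept
]
print(spkw_unique(example))
```

The sketch matches on exact (gene, term) pairs only; it does not model the closure-based filtering described in SPKW-METHODOLOGY.md.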
Key Findings
- Over-annotation rates vary dramatically by organism and term
- Eukaryotic BP terms (apoptosis, meiosis, autophagy, rhythmic process) show 79-100% over-annotation in the reviewed samples
- Bacterial annotations are mostly accurate (~5% issues for P. putida)
- Common patterns: process conflation, regulatory vs participatory confusion, caspase substrates
Cumulative Results
| Subproject | Organism | Total Genes | Reviewed | Issue Rate | Main Pattern |
|---|---|---|---|---|---|
| Apoptosis | Human | 280 | 23 | 87% | Regulatory conflation |
| Rhythmic Process | Human | 146 | 5 | 100% | Expression ≠ function |
| Autophagy | Human | 123 | 14 | 79% | Signaling over-extension |
| ANOGA | A. gambiae | 5,812 | 22 | Mixed | D7 toxin=100%, immune=17% |
| SCHPO | S. pombe | 1,963 | 7 | 100% | ATG-meiosis conflation |
| DROME | D. melanogaster | 2,753 | 4 | 50% | Mixed patterns |
| PSEPK | P. putida | 1,098 | 4 | 25% | RT defense keyword |
| ARATH | A. thaliana | 8,433 | 4 | 75% | Subclade divergence |
| BPT4 | Phage T4 | ~300 | 3 | 100% | Eukaryote-centric terms |
| ECO57 | E. coli O157 | ~74,000 | 2 | 50% | Toxin vs effector |
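For reading the table, the arithmetic behind the Issue Rate column and the review coverage can be sketched as below. The function names are hypothetical, and the issue count of 20 for Apoptosis is back-calculated from the reported 87% of 23 reviewed genes; it is not taken from the source data.

```python
def issue_rate(issues, reviewed):
    """Percentage of reviewed genes flagged with an over-annotation issue."""
    if reviewed == 0:
        raise ValueError("no genes reviewed")
    return round(100 * issues / reviewed)


def coverage(reviewed, total):
    """Percentage of a subproject's genes that were actually reviewed."""
    if total == 0:
        raise ValueError("subproject has no genes")
    return round(100 * reviewed / total, 1)


# Apoptosis row: 23 of 280 genes reviewed; 20 issues is an inferred count.
print(issue_rate(20, 23))  # 87
print(coverage(23, 280))   # 8.2
```

Showing coverage next to the issue rate makes it easier to see that several rates rest on small samples (for example, 2 to 5 reviewed genes).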
Methods
See SPKW-METHODOLOGY.md for detailed SQL queries and an explanation of closure-based filtering, which reduces false positives by 70% or more in well-curated organisms.
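As a rough illustration of what closure-based filtering could look like, the sketch below treats an SPKW annotation as corroborated when any other source annotates the same gene to the same term or to one of its descendants. That reading of "closure" is an assumption; the authoritative definition and the real SQL are in SPKW-METHODOLOGY.md. The function names, the child-adjacency mapping, and the toy term names are hypothetical.

```python
def descendants(term, children):
    """Return `term` plus everything below it in a child-adjacency mapping."""
    seen, stack = {term}, [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def closure_filter(spkw_pairs, other_pairs, children):
    """Keep SPKW (gene, term) pairs with no non-SPKW support in the term's closure."""
    other_by_gene = {}
    for gene, term in other_pairs:
        other_by_gene.setdefault(gene, set()).add(term)
    kept = []
    for gene, term in spkw_pairs:
        supported = descendants(term, children) & other_by_gene.get(gene, set())
        if not supported:
            kept.append((gene, term))
    return kept


# Toy hierarchy with made-up term names (not real GO identifiers).
children = {"process": ["sub_process"], "sub_process": ["leaf_process"]}
spkw = [("GENE_A", "process"), ("GENE_B", "process")]
other = [("GENE_A", "leaf_process")]  # a more specific term from another source
print(closure_filter(spkw, other, children))
```

Here GENE_A is dropped because another source already supports a more specific term, which is the kind of false positive an exact-match comparison would miss.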
Over-Annotation Patterns Identified
| Pattern | Description | Examples | Action |
|---|---|---|---|
| Process conflation | Gene active during process X gets annotated to X | ATG genes → meiosis (S. pombe) | REMOVE |
| Regulatory conflation | Gene regulates X, annotated to X | AIMP2 → apoptotic process | MODIFY to regulatory term |
| Caspase substrate | Cleaved by caspases, annotated to apoptosis | AIMP1, BCAP31 | REMOVE |
| Signaling over-extension | Annotated process is 4+ signaling steps removed from the gene's direct function | Sin1 → apoptosis | REMOVE |
| Eukaryote-centric terms | Immune/defense terms for phage-bacteria | T4 DAM → innate immune | REMOVE |
| Toxin vs effector | Effectors incorrectly called toxins | NleB1 (E. coli) | REMOVE |
| Subclade divergence | Family keyword ignores subfunctionalization | LCR1 (Arabidopsis DEFL) | REMOVE |
| Kratagonist ≠ toxin | Sequestration ≠ toxin activity | D7 proteins (mosquito) | MODIFY |
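The pattern-to-action column above can be captured as a small lookup for use when triaging reviewed genes. This is only a sketch: the names `RECOMMENDED_ACTION` and `recommend` are hypothetical, and the `"REVIEW"` fallback for patterns not in the table is an assumption rather than part of the project's workflow.

```python
# Actions copied from the pattern table; keys are the table's pattern labels.
RECOMMENDED_ACTION = {
    "Process conflation": "REMOVE",
    "Regulatory conflation": "MODIFY",  # modify to a regulatory term
    "Caspase substrate": "REMOVE",
    "Signaling over-extension": "REMOVE",
    "Eukaryote-centric terms": "REMOVE",
    "Toxin vs effector": "REMOVE",
    "Subclade divergence": "REMOVE",
    "Kratagonist ≠ toxin": "MODIFY",
}


def recommend(pattern):
    """Return the table's action, or 'REVIEW' (assumed fallback) if unlisted."""
    return RECOMMENDED_ACTION.get(pattern, "REVIEW")


print(recommend("Caspase substrate"))  # REMOVE
print(recommend("Some new pattern"))   # REVIEW
```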
Legitimate SPKW Contributions
Not all SPKW-unique annotations are over-annotations:
- Antimicrobial peptides/lysozymes (D. melanogaster) - "killing of cells" is correct
- Arsenic/antibiotic resistance genes (P. putida) - direct functional annotations
- Conserved functions (Ced-12/ELMO in D. melanogaster) - SPKW captures known biology missing from experimental annotations
- Core circadian genes (ELF4 in Arabidopsis) - accurate but redundant with specific terms
Project Status
- Started: 2025-12-23
- Last updated: 2026-02-04
- Total genes reviewed: 95 across 8 organisms
- Compiled data: spkw_reviewed_genes.csv
Phase 1 (Original)
Subprojects
- [x] Apoptosis - 23/280 reviewed
- [x] Rhythmic Process - 5/146 reviewed
- [x] Autophagy - 14/123 reviewed
- [x] ANOGA - D7 + immune genes
- [x] SCHPO - ATG-meiosis pattern
- [x] DROME - Case studies
- [x] PSEPK - Bacterial control
- [x] ARATH - Plant patterns
- [x] BPT4 - Phage semantics
- [x] ECO57 - Toxin/effector
Curation Recommendations
- Check regulatory vs participatory roles - many genes regulate a process without participating in it
- Consider organism biology - same GO term can have different validity across taxa
- Distinguish toxins from effectors - direct cytotoxicity vs signaling modulation
- Validate family-level keywords - subfunctionalization can invalidate family annotations
- Expression ≠ function - upregulation during a process doesn't mean functional involvement
Swiss-Prot vs TrEMBL Analysis
Key finding: Keywords on Swiss-Prot entries are manually assigned by curators, not by the ARBA/UniRule automatic systems. The Swiss-Prot share for each organism therefore indicates how much of its keyword assignment is manual:
| Organism | Swiss-Prot % | Implication |
|---|---|---|
| Human | 99.6% | Over-annotations reflect manual curator keyword choices |
| T4 Phage | 99.6% | Same - curators chose these keywords |
| E. coli O157 | 88.6% | Mostly manual |
| P. putida | 32.2% | Mixed manual/automatic |
| A. gambiae | 3.8% | Mostly automatic keyword assignment |
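Of the 71 genes with over-annotation issues, 70 (99%) are Swiss-Prot entries.
Because nearly all problem cases sit on manually curated entries, automatic keyword assignment is not the source. This points to the KW→GO mapping layer, rather than keyword assignment itself, as the place where over-annotation arises. See SPKW-METHODOLOGY.md for stratification queries.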
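The stratification itself is a per-organism proportion. A minimal sketch follows, assuming each entry is available as an `(organism, is_reviewed)` pair where `is_reviewed` marks Swiss-Prot (reviewed) versus TrEMBL (unreviewed); that layout, the function name, and the toy organisms are illustrative assumptions, and the project's real stratification queries are in SPKW-METHODOLOGY.md.

```python
from collections import Counter


def swissprot_share(entries):
    """Percent of entries per organism that are reviewed (Swiss-Prot).

    `entries` is an iterable of (organism, is_reviewed) pairs; the layout
    is an assumption made for illustration.
    """
    total, reviewed = Counter(), Counter()
    for organism, is_reviewed in entries:
        total[organism] += 1
        reviewed[organism] += bool(is_reviewed)
    return {org: round(100 * reviewed[org] / total[org], 1) for org in total}


# Toy data with placeholder organism codes, not real counts.
toy = [("ORG_A", True), ("ORG_A", True), ("ORG_A", False), ("ORG_B", False)]
print(swissprot_share(toy))
```

The same arithmetic applied to the issue set gives 70/71, about 98.6%, which the text rounds to 99%.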
Session Notes
2026-02-04
- Researched UniProt keyword assignment process (confirmed: Swiss-Prot = manual)
- Created spkw_reviewed_genes.csv compiling 95 reviewed genes
- Added Swiss-Prot vs TrEMBL stratification to methodology
- Cross-species analysis: issue rates 10-40%, all Swiss-Prot dominated