Over-Annotation Patterns Project
Overview
This project documents systematic patterns of over-annotation discovered through AI-assisted gene review. Over-annotation occurs when assigned GO terms are technically correct but add little functional insight, or are too broad or generic to help explain gene function. Several of the patterns below also cover automated annotations that experimental evidence contradicts.
These patterns emerge from multiple sources:
- High-throughput screens that generate generic annotations
- Domain-based IEA annotations that don't reflect actual activity
- IBA annotations that over-generalize from distantly related proteins
- Keyword-based mappings that assign parent terms unnecessarily
Source: Presented at Gene Ontology Consortium Meeting, October 2025, Cambridge UK. See ai4curation/ai-gene-review.
Categories of Over-Annotation
1. Generic "Protein Binding" (GO:0005515)
The Problem: High-throughput interactome studies generate thousands of IPI annotations to "protein binding" that provide no functional information.
Examples from Reviews:
- PHYKPL: 4 protein binding annotations from HTP screens showing interactions with POT1, USO1, VAC14, LNX2 - none related to its metabolic function
- UBA7: 6 protein binding annotations from interactome studies - UBA7 obviously binds proteins (ISG15, UBE2L6) but the generic term adds nothing
- Epe1: Protein binding annotation when specific HP1/Swi6 binding and SAGA complex binding are more informative
Recommended Action: REMOVE generic protein binding when more specific functional annotations exist or when interactions are from HTP screens without validation.
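The rule above can be sketched as a small filter. This is a minimal illustration, not the project's actual review tooling: the `Annotation` record, the `flag_generic_protein_binding` helper, and the `htp_references` set of references known to be high-throughput screens are all hypothetical names, and the references and evidence codes in the example are placeholders rather than real annotation data. It also assumes that "a more specific functional annotation exists" can be approximated by "the gene has any other molecular function term", which is a simplification.

```python
from dataclasses import dataclass

PROTEIN_BINDING = "GO:0005515"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record (not a real GAF/GPAD parser)."""
    gene: str
    term_id: str
    evidence: str
    reference: str
    aspect: str = "F"  # "F" = molecular function


def flag_generic_protein_binding(annotations, htp_references):
    """Return protein binding annotations that are candidates for REMOVE.

    A protein binding annotation is flagged when its reference is in the
    caller-supplied set of HTP references, or when the same gene carries
    any other molecular function term (a simplifying assumption).
    """
    genes_with_other_mf = {
        a.gene
        for a in annotations
        if a.aspect == "F" and a.term_id != PROTEIN_BINDING
    }
    flagged = []
    for a in annotations:
        if a.term_id != PROTEIN_BINDING:
            continue
        if a.reference in htp_references or a.gene in genes_with_other_mf:
            flagged.append(a)
    return flagged


# Placeholder records: references and evidence codes are illustrative only.
records = [
    Annotation("UBA7", PROTEIN_BINDING, "IPI", "REF:htp-placeholder"),
    Annotation("UBA7", "GO:0019782", "IDA", "REF:placeholder-1"),
    Annotation("GENE_X", PROTEIN_BINDING, "IPI", "REF:placeholder-2"),
]
flagged = flag_generic_protein_binding(records, {"REF:htp-placeholder"})
for a in flagged:
    print(a.gene, a.term_id, a.reference)
```

Flagged annotations are only candidates: a curator would still need to confirm that the interaction has no independent validation before removing it.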
2. Overly Broad Enzymatic Terms
The Problem: Generic enzymatic terms (hydrolase activity, oxidoreductase activity, ligase activity) assigned when more specific terms exist.
Examples:
- LPL1: GO:0016787 (hydrolase activity) when GO:0102545 (phospholipase B activity) is more specific
- Epe1: GO:0016491 (oxidoreductase activity) assigned despite protein lacking catalytic activity
- UBA7: GO:0016874 (ligase activity) when GO:0019782 (ISG15 activating enzyme activity) is specific
Recommended Action: REMOVE the generic term, or MODIFY it to a more specific child term.
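One way to detect this pattern is to check whether a gene carries both a term and one of that term's ancestors. The sketch below assumes a hypothetical `PARENTS` map from each term to its direct parents. The single edge shown is a toy that collapses the real GO hierarchy between the two UBA7 terms into one step; real use would load the full ontology instead.

```python
def ancestors(term_id, parents):
    """Collect every ancestor of term_id by walking the parent map."""
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def redundant_broad_terms(gene_terms, parents):
    """Return terms that are ancestors of another term on the same gene."""
    gene_terms = set(gene_terms)
    redundant = set()
    for term in gene_terms:
        redundant |= ancestors(term, parents) & gene_terms
    return redundant


# Toy parent map (assumption): intermediate GO terms are omitted.
PARENTS = {"GO:0019782": ["GO:0016874"]}

uba7_terms = {"GO:0016874", "GO:0019782"}
print(redundant_broad_terms(uba7_terms, PARENTS))
```

With the toy map, the generic ligase term is reported as redundant because the more specific ISG15 activating enzyme term is annotated to the same gene. Cases such as Epe1, where the broad term is wrong rather than redundant, need biochemical evidence and cannot be caught this way.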
3. Domain-Based Predictions Without Validation
The Problem: IEA annotations from domain presence (InterPro, Pfam) that don't reflect actual biochemical activity.
Examples:
- Epe1: JmjC domain → histone demethylase activity, dioxygenase activity, metal ion binding (ALL INCORRECT - pseudo-enzyme)
- PHYKPL: Aminotransferase domain → transaminase activity (INCORRECT - functions as phospho-lyase)
Recommended Action: REMOVE when biochemical evidence contradicts domain prediction.
4. Indirect Downstream Process Annotations
The Problem: Genes annotated to broad biological processes based on indirect effects rather than direct function.
Pattern: Gene affects X → X affects Y → Gene annotated to Y
Examples:
- Kinase that phosphorylates one transcription factor annotated to "regulation of cell proliferation"
- Enzyme in metabolic pathway annotated to disease process it indirectly affects
Recommended Action: Use more proximal process terms; annotate to direct function, not downstream consequences.
5. Duplicate IEA Annotations
The Problem: Multiple pipelines assign the same term to the same gene, creating redundant entries.
Examples:
- GND1: Same term (GO:0004616) from both IBA and IEA sources
- UBA7: Cytoplasm annotation from IBA, IEA, and IDA sources
Note: This pattern is less problematic than the others, since concordant evidence codes can increase confidence in a term, but the duplicates still add clutter.
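Duplicates of this kind are easy to surface by grouping annotations on the (gene, term) pair and collecting the evidence codes. The following is a minimal sketch with a hypothetical `duplicated_annotations` helper and a simplified tuple layout. GO:0005737 is the GO identifier for cytoplasm.

```python
from collections import defaultdict


def duplicated_annotations(annotations):
    """Map (gene, term_id) to its sorted evidence codes when more than one exists.

    annotations: iterable of (gene, term_id, evidence_code) tuples.
    """
    grouped = defaultdict(set)
    for gene, term_id, evidence in annotations:
        grouped[(gene, term_id)].add(evidence)
    return {
        key: sorted(codes)
        for key, codes in grouped.items()
        if len(codes) > 1
    }


# The two examples listed above, written as simplified tuples.
example = [
    ("GND1", "GO:0004616", "IBA"),
    ("GND1", "GO:0004616", "IEA"),
    ("UBA7", "GO:0005737", "IBA"),
    ("UBA7", "GO:0005737", "IEA"),
    ("UBA7", "GO:0005737", "IDA"),
]
duplicates = duplicated_annotations(example)
for (gene, term_id), codes in duplicates.items():
    print(gene, term_id, codes)
```

Keeping the evidence codes together makes it possible to retain the strongest evidence for a term while collapsing the rest.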
6. Predicted Localization Conflicts
The Problem: Automated transmembrane predictions lead to incorrect membrane annotations.
Examples:
- LPL1: GO:0016020 (membrane) from transmembrane prediction, but the protein localizes to lipid droplets, which are bounded by a phospholipid monolayer rather than a bilayer membrane
Recommended Action: REMOVE when experimental localization data contradicts prediction.
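A review queue for this pattern could be built by listing IEA cellular component annotations for genes that also have an experimentally supported localization to a different term. This is a rough sketch under stated assumptions: `predicted_localization_conflicts` is a hypothetical helper, the IDA evidence code in the LPL1 example is assumed for illustration, and GO:0005811 is the GO identifier for lipid droplet. A differing term is not automatically a contradiction (it may be a parent or child of the experimental term), so the output is a list of candidates for curator review, not of removals.

```python
# GO experimental evidence codes.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def predicted_localization_conflicts(cc_annotations):
    """Return (gene, predicted_term, experimental_terms) review candidates.

    cc_annotations: iterable of (gene, term_id, evidence_code) tuples,
    restricted to cellular component terms.
    """
    cc_annotations = list(cc_annotations)  # allow two passes over the input

    # First pass: collect experimentally supported locations per gene.
    experimental = {}
    for gene, term_id, evidence in cc_annotations:
        if evidence in EXPERIMENTAL_CODES:
            experimental.setdefault(gene, set()).add(term_id)

    # Second pass: flag IEA terms that no experiment supports for that gene.
    candidates = []
    for gene, term_id, evidence in cc_annotations:
        if evidence != "IEA" or gene not in experimental:
            continue
        if term_id not in experimental[gene]:
            candidates.append((gene, term_id, sorted(experimental[gene])))
    return candidates


# LPL1 example; the IDA evidence code is an assumption for illustration.
lpl1 = [
    ("LPL1", "GO:0016020", "IEA"),
    ("LPL1", "GO:0005811", "IDA"),
]
conflicts = predicted_localization_conflicts(lpl1)
print(conflicts)
```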
Genes Exemplifying Patterns
| Gene | Species | Over-Annotation Pattern | Status |
|---|---|---|---|
| PHYKPL | human | Protein binding, transaminase (wrong mechanism) | COMPLETE |
| UBA7 | human | Protein binding, generic ligase | COMPLETE |
| Epe1 | pombe | Domain-based demethylase (pseudo-enzyme) | COMPLETE |
| LPL1 | CANAL | Generic hydrolase, membrane localization | COMPLETE |
Recommended Curation Principles
- Specificity over breadth: Always prefer the most specific accurate term
- Remove uninformative annotations: Generic protein binding from HTP screens
- Validate domain predictions: Especially for enzymatic activity
- Distinguish direct vs. indirect: Annotate to proximal function
- Consider pseudo-enzymes: Domains don't guarantee activity
Impact on Annotation Quality
These over-annotation patterns:
- Dilute the signal from informative annotations
- Create false impressions of functional understanding
- Complicate enrichment analyses
- Propagate through IBA to other species
STATUS
Documented Patterns
- [x] Generic protein binding
- [x] Overly broad enzymatic terms
- [x] Domain-based predictions without validation
- [x] Indirect downstream processes
- [x] Predicted localization conflicts
Genes Analyzed
- [x] human/PHYKPL - transaminase vs phospho-lyase
- [x] human/UBA7 - protein binding from HTP, generic ligase
- [x] pombe/Epe1 - pseudo-demethylase
- [x] CANAL/LPL1 - hydrolase, membrane prediction
Last updated: 2026-01-22
NOTES
2026-01-22
Project Creation
Documented systematic over-annotation patterns discovered through AI review.
Key Insight: The most common over-annotation is generic "protein binding" (GO:0005515) from high-throughput interactome studies. These annotations:
- Provide no functional information
- Often represent false positives or indirect interactions
- Should be removed when specific functional annotations exist
Curation Recommendation: Consider flagging or filtering HTP-derived protein binding annotations during curation review.