UniProt CAUTION Note Project

MATURE PIPELINE, FLAGSHIP

Overview

UniProtKB records can carry one or more free-text CAUTION comments in the
flat file:

CC   -!- CAUTION: In contrast to other JHDM1 histone demethylases, it lacks the
CC       iron catalytic His in position 370 which is replaced by a Tyr residue
CC       and has no histone demethylase activity in vitro (PubMed:16362057). It
CC       therefore may not be functional in vivo. {ECO:0000305|PubMed:16362057}.

A CAUTION comment is a curator's explicit warning to the reader: a function is
contested, an activity was mis-attributed, a supporting paper was
retracted, a domain is degenerate (pseudo-enzyme), or a feature is a
likely artifact. These are exactly the situations where automated GO
annotation (IEA/IBA, domain-to-function mappings) is most likely to be wrong,
and where AI-assisted review adds the most value.

This project systematically harvests CAUTION comments, categorizes them, and
uses them as a prioritized worklist for annotation review — a CAUTION note is
a strong signal that one or more existing GO annotations on that gene deserve
scrutiny.

Yes, this is a real UniProt feature. CAUTION is a standard UniProtKB
"General annotation (Comments)" topic
(https://www.uniprot.org/help/caution). We extract the free-text
CC -!- CAUTION: comments and exclude the separate, structured
CC -!- SEQUENCE CAUTION: block (erroneous initiation, frameshift, and
predicted-gene-model issues), which concerns sequence construction rather
than function.
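The extraction rule above can be sketched as a small parser. This is a minimal illustration, not the project's extract_caution_notes.py: the function name and the sample record are assumptions, and continuation lines are simply joined with a space, which can leave a stray space where the flat file wrapped a line after a hyphen.

```python
def extract_caution_comments(flat_text):
    """Return free-text CAUTION comments from one UniProtKB flat-file record."""
    comments = []
    current = None  # pieces of the CAUTION comment being collected, if any

    def close():
        # Finish the comment currently being collected (if there is one).
        nonlocal current
        if current is not None:
            comments.append(" ".join(current))
            current = None

    for line in flat_text.splitlines():
        if not line.startswith("CC"):
            close()  # leaving the CC block ends any open comment
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            close()  # a new comment topic starts here
            topic = body[3:].strip()
            # "SEQUENCE CAUTION:" does not start with "CAUTION:", so it is skipped.
            if topic.startswith("CAUTION:"):
                current = [topic[len("CAUTION:"):].strip()]
        elif body.startswith("---"):
            close()  # separator line that precedes the copyright notice
        elif current is not None:
            current.append(body)  # continuation line of the open comment
    close()
    return comments


# Sample record: the CAUTION text is the example quoted in the Overview;
# the FUNCTION and SEQUENCE CAUTION lines are placeholders for illustration.
SAMPLE = """\
CC   -!- FUNCTION: Placeholder function text.
CC   -!- CAUTION: In contrast to other JHDM1 histone demethylases, it lacks the
CC       iron catalytic His in position 370 which is replaced by a Tyr residue
CC       and has no histone demethylase activity in vitro (PubMed:16362057). It
CC       therefore may not be functional in vivo. {ECO:0000305|PubMed:16362057}.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=XXXXXXXX.1; Type=Frameshift; Evidence={ECO:0000305};
"""

notes = extract_caution_comments(SAMPLE)
```

Matching on the topic prefix rather than on the substring "CAUTION" is what keeps the SEQUENCE CAUTION block out of the result.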

Headline numbers

Database-wide (Swiss-Prot / reviewed, 2026-06-16):

| Category | Notes (DB-wide) | Curation value |
|---|---|---|
| contested-function | 3,035 | high |
| reclassified-function | 2,476 | high |
| degenerate-domain (pseudo-enzyme) | 1,366 | high |
| retracted-reference | 279 | high |
| possible-artifact | 63 | high |
| other | 7,605 | mixed (keyword classifier residual) |
| lacks-conserved-residue | 6 | low |

Distribution by organism (top): Human (2,298 entries), S. cerevisiae (895),
Mouse (844), A. thaliana (714), E. coli K12 (334), Rat (329), Bovine (206),
Dictyostelium (192), Drosophila (160), Rice (150), C. elegans (122),
Zebrafish (109).

Local corpus (genes already fetched in this repo): 209 CAUTION notes
across 201 cached *-uniprot.txt records.

| Category | Count | Curation value | Description |
|---|---|---|---|
| contested-function | 44 | high | function is controversial / disputed / "however ..." |
| reclassified-function | 37 | high | "was originally/initially thought to be ..." |
| degenerate-domain | 16 | high | pseudo-enzyme; "lacks the catalytic/active-site residue" |
| retracted-reference | 12 | high | a supporting paper has been retracted |
| possible-artifact | 4 | high | result may be an experimental artifact |
| other | 38 | mixed | residual curatorial notes not matched by keywords |
| wgs-preliminary | 39 | low | boilerplate: sequence from a preliminary WGS entry |
| lacks-conserved-residue | 19 | low | boilerplate feature-propagation warning |
| Total | 209 | | across 201 records |

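The low-value boilerplate that is prominent in the local corpus (39
wgs-preliminary and 19 lacks-conserved-residue notes) essentially vanishes in
reviewed entries: wgs-preliminary does not appear in the DB-wide table at all,
and lacks-conserved-residue drops to 6 notes. This indicates the boilerplate
was a TrEMBL/unreviewed artifact rather than a curatorial signal.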
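The keyword categorization behind these counts can be sketched as an ordered list of patterns where the first match wins. The patterns and their order below are assumptions derived from the Description column and from UniProt's standard boilerplate wording; they are not the project's actual rules and would not reproduce its exact counts.

```python
import re

# (category, pattern) pairs checked in order; the first match wins.
# Boilerplate is tested first so it never lands in a high-value bucket.
RULES = [
    ("wgs-preliminary", r"whole genome shotgun|preliminary data"),
    ("lacks-conserved-residue", r"lacks conserved residue"),
    ("retracted-reference", r"\bretract(?:ed|ion)\b"),
    ("possible-artifact", r"\bartifact|\bartefact"),
    ("degenerate-domain",
     r"lacks (?:the )?(?:\w+ ){0,3}(?:catalytic|active[- ]site)|pseudo-?enzyme"),
    ("reclassified-function",
     r"was (?:originally|initially) (?:thought|described|reported)"),
    ("contested-function", r"controversial|disputed|\bhowever\b"),
]
COMPILED = [(name, re.compile(pattern, re.IGNORECASE)) for name, pattern in RULES]


def categorize(note):
    """Return the first matching category for a CAUTION note, else 'other'."""
    for name, pattern in COMPILED:
        if pattern.search(note):
            return name
    return "other"


example = categorize(
    "In contrast to other JHDM1 histone demethylases, it lacks the iron "
    "catalytic His in position 370 which is replaced by a Tyr residue"
)
```

A first-match design keeps each note in exactly one bucket, so category counts sum to the total number of notes; the price is that the rule order becomes part of the classifier and needs to be reviewed alongside the patterns.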

Why this matters for GO review

The high-value categories map directly onto well-known GO over-annotation
failure modes:

Cross-referencing the DB-wide survey against the accessions we have already
fetched (148 of our local genes overlap the reviewed-CAUTION set) yields
4,046 high-value, not-yet-reviewed candidates: reclassified-function (2,401),
degenerate-domain (1,342), retracted-reference (255), possible-artifact (48).
Human alone contributes 95 pseudo-enzyme (degenerate-domain) and 111
retracted-reference candidates, e.g.:

Findings

Deep dive batch 1 (degenerate-domain)

Fetched 5 human degenerate-domain candidates (RHBDF1, SUMF2, PANK4, DPYSL5,
NAALADL2) and checked their curated GO molecular functions against the CAUTION
note:

Systematic over-annotation queries (local corpus, 2,675 genes)

Two CAUTION-driven detectors, run over the locally-fetched corpus (GO ancestry
via oaklib; no network).

Query A: negated-child / positive-parent conjunction. Within a gene, a
molecular-function term X is annotated positively while a more specific
descendant Y is annotated NOT. 33 hits across 22 genes (24 of the hits have an
electronic IEA/IBA parent). A triage column marks a hit STRONG when no
experimental positive annotation in GOA supports the parent term (the DPYSL5
pattern):

| Gene | Positive parent | NOT-ed child | Note |
|---|---|---|---|
| DPYSL5 | hydrolase activity (IEA) | dihydropyrimidinase (IBA) | confirmed over-annotation → REMOVE (done) |
| CPT1C | (acyl)transferase activity (IEA) | carnitine O-palmitoyltransferase (ISS) | CAUTION: little/no CPT activity in vivo; likely over-annotation |
| ENDOU | hydrolase activity (IEA) | serine-type peptidase (IDA) | parent legit (it's a ribonuclease), child correctly NOT-ed |
| pmp20 | peroxidase activity (IEA) | glutathione peroxidase (IDA) | false positive: genuine peroxiredoxin, peroxidase only IEA-annotated |

Query A also surfaces 9 DIRECT same-term conflicts — a term annotated both
positively and NOT on the same gene (genuine GOA inconsistencies worth
reporting upstream), e.g. EDEM1/EDEM2 mannosyl-oligosaccharide 1,2-α-mannosidase
(TAS vs NOT IDA), ENDOU serine-type peptidase (IDA vs NOT IDA), PARK7
protein deglycase (IDA vs NOT IDA), CYB5R4 NAD(P)H oxidase (IDA vs NOT IDA).
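Query A, including the DIRECT same-term case, can be sketched as follows. This is an illustration under stated assumptions: a hand-written toy hierarchy stands in for the oaklib-derived GO ancestry, the gene names and annotations are made up, and the STRONG rule is one reading of the triage column (a parent counts as "supported" if an experimentally evidenced positive annotation sits on it or on one of its descendants).

```python
# Toy is_a hierarchy (child -> parents); a stand-in for real GO ancestry.
PARENTS = {
    "dihydropyrimidinase activity": {"hydrolase activity"},
    "hydrolase activity": {"catalytic activity"},
    "catalytic activity": set(),
    "protein binding": set(),
}
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def ancestors(term, parents=PARENTS):
    """All strict ancestors of a term in the child -> parents map."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def query_a(annotations, parents=PARENTS):
    """annotations: iterable of (gene, term, evidence, negated) tuples."""
    by_gene = {}
    for gene, term, evidence, negated in annotations:
        by_gene.setdefault(gene, []).append((term, evidence, negated))
    hits = []
    for gene, rows in sorted(by_gene.items()):
        positive_terms = sorted({t for t, _, neg in rows if not neg})
        negated_terms = sorted({t for t, _, neg in rows if neg})
        # Terms backed by an experimental positive annotation, plus their ancestors.
        supported = set()
        for term, evidence, neg in rows:
            if not neg and evidence in EXPERIMENTAL:
                supported |= {term} | ancestors(term, parents)
        for child in negated_terms:
            child_ancestors = ancestors(child, parents)
            for parent in positive_terms:
                if parent == child:
                    hits.append((gene, parent, child, "DIRECT"))
                elif parent in child_ancestors:
                    triage = "supported" if parent in supported else "STRONG"
                    hits.append((gene, parent, child, triage))
    return hits


# Hypothetical annotations: GENE1 mirrors the DPYSL5 pattern from the table.
ANNOTATIONS = [
    ("GENE1", "hydrolase activity", "IEA", False),
    ("GENE1", "dihydropyrimidinase activity", "IBA", True),
    ("GENE2", "catalytic activity", "IEA", False),
    ("GENE2", "hydrolase activity", "IDA", False),
    ("GENE2", "dihydropyrimidinase activity", "IDA", True),
    ("GENE3", "protein binding", "IDA", False),
    ("GENE3", "protein binding", "IDA", True),
]
hits = query_a(ANNOTATIONS)
```

Here GENE1 yields a STRONG hit, GENE2 yields two "supported" hits because an experimental annotation backs the parents, and GENE3 yields a DIRECT conflict.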

Query B — CAUTION PMID cited positively, never negated. Flags genes where a
UniProt CAUTION cites a PMID, a GO annotation is made to that same PMID, and
there is no NOT annotation citing it. 69 flags / 39 genes. Highest-value
molecular-function hits:

Many other Query B hits are localization debates (AGK, AIFM2, ASAH2, C1QBP,
ATP13A1, Dnajb11) where the positive annotation may have been deliberately
retained — these are flags for a curator's eye, not automatic removals.
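The Query B condition can be sketched in a few lines. The function names and the annotation tuple layout are assumptions for illustration; the CAUTION text is the example from the Overview, and the annotations are hypothetical rather than taken from GOA.

```python
import re

PMID_RE = re.compile(r"PubMed:(\d+)")


def caution_pmids(caution_text):
    """PubMed IDs cited anywhere in a CAUTION comment (text or evidence tag)."""
    return set(PMID_RE.findall(caution_text))


def query_b(caution_text, annotations):
    """Flag CAUTION-cited PMIDs used for positive annotations but never for a NOT.

    annotations: iterable of (term, reference, negated), reference like 'PMID:123'.
    Returns {pmid: [positively annotated terms]}.
    """
    cited = caution_pmids(caution_text)
    positive, negated = {}, set()
    for term, reference, is_not in annotations:
        if not reference.startswith("PMID:"):
            continue  # GO_REF and other non-PubMed references are ignored
        pmid = reference[len("PMID:"):]
        if pmid not in cited:
            continue
        if is_not:
            negated.add(pmid)
        else:
            positive.setdefault(pmid, []).append(term)
    return {pmid: terms for pmid, terms in positive.items() if pmid not in negated}


CAUTION = (
    "In contrast to other JHDM1 histone demethylases, it lacks the iron "
    "catalytic His in position 370 which is replaced by a Tyr residue and has "
    "no histone demethylase activity in vitro (PubMed:16362057). It therefore "
    "may not be functional in vivo. {ECO:0000305|PubMed:16362057}."
)
# Hypothetical annotations, for illustration only.
flags = query_b(CAUTION, [
    ("example term X", "PMID:16362057", False),
    ("example term Y", "PMID:99999999", False),
])
```

As the localization examples above show, a flag only means the cautioned paper supports a positive annotation with no matching NOT; whether that annotation is wrong remains a curator's call.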

Validation against existing curated reviews

Because the local corpus is already curated, the query hits double as a test of
the method — each flag is joined to the action the existing review assigned:

Net: the two genes spot-checked, CPT1C and CHMP1A, were both already correctly
curated. The CPT1C review REMOVEs the unsupported transferase parents and keeps
catalytic activity, because the real palmitoyl-hydrolase activity sits under
it, exactly matching the STRONG-vs-supported split. The CHMP1A review REMOVEs
the metalloprotease/zinc terms that derive from a mistranslated ORF. No edits
were needed: the queries reproduced the experts' decisions, which is the
validation we wanted before scaling UniProt-wide.

UniProt-wide scaling (QuickGO) — net-new pseudoenzyme families

We pulled molecular-function GOA from QuickGO for all 14,513 reviewed CAUTION
accessions (MF annotations found for 12,194) and ran both queries
database-wide, flagging genes not yet in this repo (net_new):

The STRONG net-new hits land squarely on known pseudoenzyme families, recovered
automatically and extended across orthologs — strong external validation:

| Gene(s) | Lost activity (NOT) | Persisting electronic over-annotation |
|---|---|---|
| DPYSL2/CRMP2, DPYSL3, DPYSL4, CRMP1 (+ mouse/rat/chicken/bovine orthologs) | dihydropyrimidinase (GO:0004157) | hydrolase activity (GO:0016787/0016810) IEA; the exact DPYSL5 pattern across the whole CRMP family |
| ILK (integrin-linked kinase) | protein Ser/Thr kinase (GO:0004674) | protein kinase activity IEA; classic pseudokinase |
| ROR1 (+ orthologs, lin-18) | receptor tyrosine kinase (GO:0004714) | protein kinase activity IEA; pseudokinase RTK |
| CASP12 (+ CASP13 bovine) | cysteine-type endopeptidase (GO:0004197) | cysteine-type peptidase activity IEA; pseudo-caspase |
| AZIN2 (+ orthologs) | ornithine/arginine decarboxylase (GO:0004586/0008792) | catalytic activity IEA; dead ODC paralog (antizyme inhibitor) |
| Cpt1c (mouse/rat) | carnitine O-palmitoyltransferase (GO:0004095) | acyltransferase activity IEA; same as the human CPT1C we audited |

That the method independently rediscovers ILK, ROR1, CASP12, AZIN2 and the CRMP
family (canonical pseudoenzymes) from nothing but "CAUTION text + GO NOT +
GO ancestry" is the validation that matters. Immediate human review targets (not
yet in this repo): DPYSL2 (Q16555), DPYSL3 (Q14195), DPYSL4 (O14531), CRMP1
(Q14194), ILK (Q13418), ROR1 (Q01973), CASP12 (Q6UXS9), AZIN2.

Lessons learned

Reproducibility

All numbers above are produced by reproducible scripts under
UNIPROT_CAUTION_NOTE/; the multi-MB GOA dumps are
gitignored but regenerable. Rerun after fetching new genes.

| Step | Command | Outputs |
|---|---|---|
| Local extraction | extract_caution_notes.py | caution_notes.tsv, caution_notes.md |
| DB-wide survey (REST API cc_caution) | uniprot_api_survey.py (--organism 9606 for human) | caution_uniprot_reviewed.tsv, api_survey.md |
| Prioritized worklist | shortlist_candidates.py | candidates_high_value.tsv, candidates.md |
| Local over-annotation queries (A/B) | caution_conjunction_queries.py | caution_conjunction.md, conjunction_hits.tsv, caution_pmid_unnegated.tsv |
| Validate queries vs reviews | audit_queries_vs_reviews.py | audit_queries_vs_reviews.md |
| UniProt-wide scaling (QuickGO) | uniprot_wide_queries.py | uniprot_wide_queries.md, uniprot_wide_conjunctions.tsv, uniprot_wide_pmid.tsv |

The deep-dive write-up is in deep_dive_batch1.md. To run UniProt-wide, the
queries need GOA (evidence codes and NOT qualifiers) per accession, which the
cc_caution API survey does not carry; the QuickGO pull supplies it.
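A per-accession QuickGO pull could look roughly like the sketch below. It is not taken from uniprot_wide_queries.py: the endpoint and parameter names reflect the public QuickGO annotation search service as I understand it, and the response field names (geneProductId, goId, goEvidence, reference, qualifier) and the "NOT|enables" qualifier form are assumptions to verify against the service documentation. No request is made here; the payload is hand-written.

```python
from urllib.parse import urlencode

QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def quickgo_mf_url(accession, page=1, limit=100):
    """Build a molecular-function annotation search URL for one accession."""
    params = {
        "geneProductId": accession,
        "aspect": "molecular_function",
        "limit": limit,
        "page": page,
    }
    return QUICKGO_SEARCH + "?" + urlencode(params)


def parse_quickgo(payload):
    """Reduce a decoded JSON response to the fields Queries A and B need."""
    rows = []
    for item in payload.get("results", []):
        qualifier = item.get("qualifier", "")
        rows.append({
            "accession": item.get("geneProductId", "").split(":")[-1],
            "go_id": item.get("goId"),
            "evidence": item.get("goEvidence"),
            "reference": item.get("reference"),
            # Assumed convention: negation appears as a "NOT" qualifier part.
            "negated": "NOT" in qualifier.split("|"),
        })
    return rows


url = quickgo_mf_url("Q16555")

# Hand-written payload in the assumed response shape (not real GOA content).
FAKE_PAYLOAD = {
    "results": [
        {"geneProductId": "UniProtKB:EXAMPLE1", "goId": "GO:0016787",
         "goEvidence": "IEA", "reference": "GO_REF:0000002",
         "qualifier": "enables"},
        {"geneProductId": "UniProtKB:EXAMPLE1", "goId": "GO:0004157",
         "goEvidence": "IBA", "reference": "GO_REF:0000033",
         "qualifier": "NOT|enables"},
    ]
}
rows = parse_quickgo(FAKE_PAYLOAD)
```

Keeping URL construction and response parsing separate from the network call makes both testable offline, which matters when the full pull covers thousands of accessions.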

Workflow

  1. (Re)generate the worklist: run extract_caution_notes.py.
  2. Prioritize the high-value categories (contested / reclassified /
    degenerate-domain / retracted / possible-artifact).
  3. For each gene that already has a *-ai-review.yaml, check whether its
    existing annotations are consistent with the CAUTION note; if a retracted
    PMID is an original_reference_id, flag it.
  4. For genes without a review yet, treat the CAUTION note as a strong reason to
    prioritize a full review (just fetch-gene <org> <gene>).
  5. Record findings using reference_review (retracted → is_invalid),
    finding_review (DISPUTED), and appropriate annotation actions.
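The retracted-reference check in step 3 can be sketched as below. Only original_reference_id comes from the workflow text; the function names, the "term" key, and the list-of-dicts shape of the parsed review are hypothetical, and the notes and PMIDs are invented for illustration.

```python
import re


def retracted_pmids(caution_notes):
    """Collect PubMed IDs cited by CAUTION notes that mention a retraction."""
    pmids = set()
    for note in caution_notes:
        if re.search(r"\bretract", note, re.IGNORECASE):
            pmids.update(
                "PMID:" + digits for digits in re.findall(r"PubMed:(\d+)", note)
            )
    return pmids


def flag_retracted_references(annotations, retracted):
    """Return annotations whose original_reference_id is a retracted PMID."""
    return [a for a in annotations if a.get("original_reference_id") in retracted]


# Hypothetical notes and annotations (not real records).
NOTES = [
    "The article reporting this activity has been retracted (PubMed:11111111).",
    "Was originally thought to be a kinase (PubMed:22222222).",
]
ANNOTATIONS = [
    {"term": "example activity A", "original_reference_id": "PMID:11111111"},
    {"term": "example activity B", "original_reference_id": "PMID:22222222"},
    {"term": "example activity C", "original_reference_id": "GO_REF:0000033"},
]
flagged = flag_retracted_references(ANNOTATIONS, retracted_pmids(NOTES))
```

A note that mentions a retraction may also cite the retraction notice or unrelated papers, so this over-collects by design; each flagged annotation still needs a curator to confirm which PMID was actually retracted.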

STATUS

Done

Pending

Last updated: 2026-06-16

NOTES

2026-06-16

Project creation. Verified the premise before building: UniProt CAUTION is a
real, documented comment topic, and grep -c '\-!- CAUTION:' across the cached
*-uniprot.txt files returns 209 matches in 201 records. Built a parser that
correctly handles multi-line CAUTION comments and excludes the structured
SEQUENCE CAUTION block. Keyword categorizer separates curation-meaningful
notes (contested / reclassified / degenerate-domain / retracted / artifact) from
automatic boilerplate (WGS-preliminary, lacks-conserved-residue). The
degenerate-domain bucket overlaps heavily with existing CONTESTED_FUNCTION and
PSEUDOENZYMES work, confirming CAUTION notes are a good signal for finding
over-annotated catalytic functions.

API survey added (same day). Per request, queried the live UniProt REST API
on the cc_caution field (distinct from cc_sequence_caution) to get a
database-wide distribution without fetching genes: 14,513 reviewed entries,
14,830 notes, ~7,200 high-signal. Human dominates (2,298). Cross-referencing
against the 148 reviewed-CAUTION genes we already have yields a 4,046-entry
high-value worklist — the basis for a prioritized deep dive (starting with human
pseudo-enzymes flagged by degenerate-domain cautions).
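For reference, a survey request of the kind described in this note could be built as follows. This is a sketch, not the contents of uniprot_api_survey.py: the query syntax, the returned field list, and the page size are assumptions based on the public UniProt REST search interface, and result paging is not shown.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def caution_survey_url(organism_id=None, size=500):
    """Build a search URL for reviewed entries that carry a CAUTION comment."""
    query = "(reviewed:true) AND (cc_caution:*)"
    if organism_id is not None:
        # Optional taxon filter, e.g. 9606 for human.
        query += f" AND (organism_id:{organism_id})"
    params = {
        "query": query,
        "fields": "accession,gene_names,organism_name,cc_caution",
        "format": "tsv",
        "size": size,
    }
    return UNIPROT_SEARCH + "?" + urlencode(params)


human_url = caution_survey_url(organism_id=9606)
all_url = caution_survey_url()
```

Querying cc_caution rather than cc_sequence_caution mirrors the free-text versus structured distinction used for local extraction, so the two counts stay comparable.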