EVIDENCE_SOURCE_SUFFICIENCY
Which lower-effort evidence sources are sufficient to support an ACCEPT decision for an existing GO annotation — review papers, abstracts alone, or deep-research reports?
Motivation
Reviewing an existing GO annotation costs curator (or agent) effort, and that
cost is dominated by which document you have to read. Reading a full-text
primary paper is expensive; reading an abstract, a narrative review, or a
pre-synthesized deep-research report is cheap. If cheap sources are routinely
sufficient to confirm an annotation, the review pipeline can be made much
faster and cheaper — especially for the modal action, ACCEPT.
This project tests four hypotheses about evidence-source sufficiency:
- H-a (reviews): For ACCEPT, a published review article is usually
sufficient — and reviews tend to carry the phylogenetic / comparative
reasoning that justifies a conserved-function call. - H-b (abstracts): For ACCEPT, the abstract alone is often sufficient.
- H-c (deep research): A deep-research report has sufficient depth to
support ACCEPT. - H-d (narrative sections): The background (Introduction / Literature
Review) and Discussion / Conclusions sections are a rich but under-used
vein of supporting text.
Why ACCEPT
ACCEPT is both the modal action and the one where cheap confirmation is most
plausible: confirming an already-asserted function is easier than overturning
or refining it. The opposite end (REMOVE / MODIFY) plausibly needs deeper,
primary evidence, and the baseline census below already hints at this
(REMOVE support is 92.8% primary research and only 16.7% abstract-only). So the
payoff of "cheap sources suffice" is concentrated in ACCEPT, and that is where
we scope the question.
The key methodological pitfall: usage ≠ sufficiency
The corpus records what curators cited, not what was required. This is a
census of observed usage, and it is confounded by habit and ordering:
Absence of review (or deep-research) support in a review does NOT mean a
review was unable to support the annotation. A curator who reaches for a
primary paper first, and stops once the annotation is justified, leaves no
trace that a review would have served equally well.
Therefore the census can only ever give a lower bound on how often a source
could suffice, and it cannot establish sufficiency or necessity on its own.
Any real test of H-a–H-d needs an ablation / counterfactual design (below),
not just a tally of the existing supported_by blocks.
What has been built
A reusable measurement layer (PR on branch
claude/go-annotation-sources-rqwmfl):
Reference.publication_type(PublicationTypeEnum:PRIMARY_RESEARCH,
REVIEW,SYSTEMATIC_REVIEW,META_ANALYSIS,DEEP_RESEARCH,DATABASE,
BIOINFORMATICS, …). For PMIDs it is inferred from PubMed's MEDLINEPT
metadata; forGO_REF/Reactome/file:refs from the identifier scheme.ai-gene-review analyze-evidence-sources— cross-tabulates
publication_type × reference_section_type × review.actionacross an
organism and writesreports/evidence_sources/<organism>_evidence_sources.md
plus per-snippet / per-annotation / per-reference TSVs.- The existing
reference_section_typeslot on every supporting-text
snippet already records which manuscript section a quote came from
(TITLE / ABSTRACT / INTRODUCTION / RESULTS / DISCUSSION / CONCLUSIONS / …).
Baseline census (human, 933 reviews) — and its caveats
From the first run (30.8k references, 70.2k evidence snippets, 52.6k reviewed
annotations):
| Hypothesis | Census signal | Read with caveat |
|---|---|---|
| H-a reviews | Reviews are cited rarely: 1.5% of refs; ≤3.8% of annotations have any review support; primary research dominates every action. Phylogenetic reasoning enters via the IBA / GO_REF:0000033 pipeline, not narrative reviews. |
Usage ≠ ability. Says nothing about whether a review could support ACCEPT. PubMed under-tags reviews, so REVIEW is a lower bound. |
| H-b abstracts | Of section-tagged snippets, 61.5% are title/abstract. Abstract-only support by action: ACCEPT 70.8%, MARK_AS_OVER_ANNOTATED 81.2% vs MODIFY 46.9%, REMOVE 16.7%. | Strongest result, and it points the right way for ACCEPT — but measured on the ~5% of snippets that carry a section tag. |
| H-c deep research | 723 deep-research refs across 556 genes support 6,914 annotations — far more used than reviews. | Heavy use ≠ sufficient depth; depth/correctness not yet scored. |
| H-d narrative | Narrative sections (Intro / Lit-review / Discussion / Conclusions) are only 5.1% of tagged snippets; CONCLUSIONS appears in 2. Abstract (59%) + Results (23%) dominate. | Either an untapped vein or evidence it isn't needed — cannot tell from a usage census. |
Limitations to fix before drawing conclusions
- PubMed under-tags reviews. The MEDLINE
PTfield misses many review
articles (e.g. PMID:24268103, a KCTD review, is only "Journal Article"). Add
journal-name heuristics (Nat Rev, Trends, Annu Rev, WIREs, Curr Opin, "…
Reviews"), MeSH, and/or a lightweight content classifier; backfill
publication_type. Until then REVIEW counts are a floor. - Section tagging is ~5% populated. H-b and H-d are measured on a small,
possibly biased subset. Raise coverage (validator nudge and/or a backfill
pass that assignsreference_section_typeto existing snippets) before
trusting the section distribution. - Cited ≠ required. The census measures observed usage. It cannot, by
itself, answer the sufficiency/necessity questions H-a–H-d pose.
Proposed methodology: ablation re-review (the real test)
To move from "what was cited" to "what suffices", run a counterfactual on a
sample of ACCEPT annotations that today carry full-text / primary support:
- Build restricted evidence bundles per annotation:
(i) abstract-only, (ii) review-only, (iii) deep-research-only. - Re-review under each restriction, blinded to the original decision, and
record whether ACCEPT is still reached with adequate justification. - Sufficiency estimate = agreement rate with the original ACCEPT under each
restricted bundle. High agreement ⇒ that source class is sufficient for ACCEPT
in that stratum. - Necessity probe (complementary): where primary/full-text was cited, judge
whether an abstract or review conveys the same fact — LLM-assisted with human
spot-check.
Stratify by aspect (MF / BP / CC) and by whether the term is a conserved /
phylogenetically-inferred function (where H-a's "reviews carry the phylogenetic
reasoning" should bite hardest).
A pre-registered protocol operationalizing this for ACCEPT decisions is in
EVIDENCE_SOURCE_SUFFICIENCY/PROTOCOL.md;
the reproducible stratified sample (30 genes, 484 ACCEPT annotations, seed
20260614) is produced by EVIDENCE_SOURCE_SUFFICIENCY/sample/sample_genes.py.
Status
- [x]
publication_typeschema field + PubMed-PT inference + cache - [x]
analyze-evidence-sourcesCLI and first human census report - [x] Pre-registered study protocol + reproducible stratified-by-aspect sampler
- [x] Scoring harness (
sample/../score.py): estimands + gene-clustered bootstrap CIs + blind calibration - [x] Objective auto-labeling pass (
autolabel.py, cited-snippet provenance) + preliminary results: for ACCEPT annotations citing a full-text-cached pub, the justifying quote is in the abstract 90.6% of the time (artifact-free fair test) - [x] Blind-ablation validation (
build_blind_bundles.py+ blinded reviewers): conditional on availability, abstract 85.7% ≫ deep research 40% > review-alone 14.3%; retrospective abstract number is optimistic by only +7 pts - [ ] Full-text reading pass (fixes H-d; upgrades fact_in_abstract to semantic)
- [ ] Broaden REVIEW_ONLY bundle + scale sample to tighten blind CIs
- [ ] Better review detection (journal/MeSH heuristics) +
publication_typebackfill - [ ] Raise
reference_section_typecoverage above the current ~5% - [ ] Ablation re-review harness for ACCEPT annotations (sufficiency test)
- [ ] Deep-research depth scoring (H-c)
Next steps
- Improve review detection so H-a is tested against a real review set, not a
PubMed-PT floor. - Backfill section tags so the H-b / H-d distributions are representative.
- Stand up the ablation re-review harness on a pilot of ACCEPT annotations,
starting with MF terms where abstract-level confirmation looks strongest.
Relationship to existing projects
ASSAY_TO_FUNCTION.md— complementary axis: that project asks which
experimental readouts license an annotation; this one asks which document
sources are enough to confirm one.AD_HOC_BIOINFORMATICS.md— bioinformatics analyses are themselves a
publication_type: BIOINFORMATICSevidence source tracked by this analysis.