EVIDENCE_SOURCE_SUFFICIENCY

MATURE PIPELINEFLAGSHIP

EVIDENCE_SOURCE_SUFFICIENCY

Which lower-effort evidence sources are sufficient to support an ACCEPT decision for an existing GO annotation — review papers, abstracts alone, or deep-research reports?

Motivation

Reviewing an existing GO annotation costs curator (or agent) effort, and that
cost is dominated by which document you have to read. Reading a full-text
primary paper is expensive; reading an abstract, a narrative review, or a
pre-synthesized deep-research report is cheap. If cheap sources are routinely
sufficient to confirm an annotation, the review pipeline can be made much
faster and cheaper — especially for the modal action, ACCEPT.

This project tests four hypotheses about evidence-source sufficiency:

Why ACCEPT

ACCEPT is both the modal action and the one where cheap confirmation is most
plausible: confirming an already-asserted function is easier than overturning
or refining it. The opposite end (REMOVE / MODIFY) plausibly needs deeper,
primary evidence, and the baseline census below already hints at this
(REMOVE support is 92.8% primary research and only 16.7% abstract-only). So the
payoff of "cheap sources suffice" is concentrated in ACCEPT, and that is where
we scope the question.

The key methodological pitfall: usage ≠ sufficiency

The corpus records what curators cited, not what was required. This is a
census of observed usage, and it is confounded by habit and ordering:

Absence of review (or deep-research) support in a review does NOT mean a
review was unable to support the annotation.
A curator who reaches for a
primary paper first, and stops once the annotation is justified, leaves no
trace that a review would have served equally well.

Therefore the census can only ever give a lower bound on how often a source
could suffice, and it cannot establish sufficiency or necessity on its own.
Any real test of H-a–H-d needs an ablation / counterfactual design (below),
not just a tally of the existing supported_by blocks.

What has been built

A reusable measurement layer (PR on branch
claude/go-annotation-sources-rqwmfl):

Baseline census (human, 933 reviews) — and its caveats

From the first run (30.8k references, 70.2k evidence snippets, 52.6k reviewed
annotations):

Hypothesis Census signal Read with caveat
H-a reviews Reviews are cited rarely: 1.5% of refs; ≤3.8% of annotations have any review support; primary research dominates every action. Phylogenetic reasoning enters via the IBA / GO_REF:0000033 pipeline, not narrative reviews. Usage ≠ ability. Says nothing about whether a review could support ACCEPT. PubMed under-tags reviews, so REVIEW is a lower bound.
H-b abstracts Of section-tagged snippets, 61.5% are title/abstract. Abstract-only support by action: ACCEPT 70.8%, MARK_AS_OVER_ANNOTATED 81.2% vs MODIFY 46.9%, REMOVE 16.7%. Strongest result, and it points the right way for ACCEPT — but measured on the ~5% of snippets that carry a section tag.
H-c deep research 723 deep-research refs across 556 genes support 6,914 annotations — far more used than reviews. Heavy use ≠ sufficient depth; depth/correctness not yet scored.
H-d narrative Narrative sections (Intro / Lit-review / Discussion / Conclusions) are only 5.1% of tagged snippets; CONCLUSIONS appears in 2. Abstract (59%) + Results (23%) dominate. Either an untapped vein or evidence it isn't needed — cannot tell from a usage census.

Limitations to fix before drawing conclusions

  1. PubMed under-tags reviews. The MEDLINE PT field misses many review
    articles (e.g. PMID:24268103, a KCTD review, is only "Journal Article"). Add
    journal-name heuristics (Nat Rev, Trends, Annu Rev, WIREs, Curr Opin, "…
    Reviews"), MeSH, and/or a lightweight content classifier; backfill
    publication_type. Until then REVIEW counts are a floor.
  2. Section tagging is ~5% populated. H-b and H-d are measured on a small,
    possibly biased subset. Raise coverage (validator nudge and/or a backfill
    pass that assigns reference_section_type to existing snippets) before
    trusting the section distribution.
  3. Cited ≠ required. The census measures observed usage. It cannot, by
    itself, answer the sufficiency/necessity questions H-a–H-d pose.

Proposed methodology: ablation re-review (the real test)

To move from "what was cited" to "what suffices", run a counterfactual on a
sample of ACCEPT annotations that today carry full-text / primary support:

  1. Build restricted evidence bundles per annotation:
    (i) abstract-only, (ii) review-only, (iii) deep-research-only.
  2. Re-review under each restriction, blinded to the original decision, and
    record whether ACCEPT is still reached with adequate justification.
  3. Sufficiency estimate = agreement rate with the original ACCEPT under each
    restricted bundle. High agreement ⇒ that source class is sufficient for ACCEPT
    in that stratum.
  4. Necessity probe (complementary): where primary/full-text was cited, judge
    whether an abstract or review conveys the same fact — LLM-assisted with human
    spot-check.

Stratify by aspect (MF / BP / CC) and by whether the term is a conserved /
phylogenetically-inferred function (where H-a's "reviews carry the phylogenetic
reasoning" should bite hardest).

A pre-registered protocol operationalizing this for ACCEPT decisions is in
EVIDENCE_SOURCE_SUFFICIENCY/PROTOCOL.md;
the reproducible stratified sample (30 genes, 484 ACCEPT annotations, seed
20260614) is produced by EVIDENCE_SOURCE_SUFFICIENCY/sample/sample_genes.py.

Status

Next steps

  1. Improve review detection so H-a is tested against a real review set, not a
    PubMed-PT floor.
  2. Backfill section tags so the H-b / H-d distributions are representative.
  3. Stand up the ablation re-review harness on a pilot of ACCEPT annotations,
    starting with MF terms where abstract-level confirmation looks strongest.

Relationship to existing projects