COSCIENTIST

IN_PROGRESS PIPELINE

Species: human, ECOLI, MYCTU, METJA, SCHPO, worm

Using an autonomous AI "co-scientist" (OpenScientist) as an independent
bioinformatician to test specific gene-function hypotheses that the literature
cannot settle, and wiring the verdicts back into curated GO reviews.

📊 Slides: COSCIENTIST-slides
(Marp; regenerate the PDF with just gen-project-slides COSCIENTIST).

Motivation

Most AI assistance in curation is literature synthesis: read the papers, summarize,
cite. That is valuable, but it cannot answer questions the literature never asked:
"does this orphan fold imply a function?", "are the catalytic residues actually
present?", "is this annotation a phylogenetic over-propagation?" Those require
running analyses, not just reading.

OpenScientist is an autonomous research agent that
can execute code (structure fetches, Foldseek searches, sequence/active-site
analysis, plotting) across several iterations and emit a cited report plus
provenance artifacts. This project treats it as a blinded, independent
bioinformatics scientist exploring a single hypothesis ("gene G has function F"), then
compares its conclusion against held-out local analyses and curator judgment, and
folds the result into the gene's -ai-review.yaml.

Operational conventions live in the openscientist-hypothesis skill. The driver is
the following just recipe (defaults: 3 iterations, 2 h job timeout):

    just gene-hypothesis-research openscientist <ORG> <GENE> --focus-type free-text --hypothesis "…"
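When several hypotheses are queued, the driver arguments can be assembled programmatically. The sketch below only builds and prints the command line and does not execute anything. The helper name build_driver_command and the example hypothesis wording are illustrative assumptions, and it assumes the recipe accepts its arguments in the order shown above.

```python
import shlex

def build_driver_command(org, gene, hypothesis):
    """Return the argv list for one hypothesis run (hypothetical helper; nothing is executed)."""
    return [
        "just", "gene-hypothesis-research", "openscientist",
        org, gene,
        "--focus-type", "free-text",
        "--hypothesis", hypothesis,
    ]

# Illustrative hypothesis text; the wording is an assumption, not a recorded prompt.
cmd = build_driver_command(
    "METJA", "MJ1511",
    "MJ1511 retains AhpD oxidoreductase activity (GO:0016491)",
)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

Keeping the command as a list rather than a single string avoids quoting problems when the hypothesis contains spaces or parentheses.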

Where it adds the most value

Across runs, the highest-value, least-redundant results share a signature: the
question is not answerable from papers alone. When OpenScientist is asked a
literature-settleable core-vs-non-core question it tends to (correctly) reproduce
the existing review; when asked a compute-requiring question it produces genuinely
new evidence. So we deliberately steer it toward compute-requiring questions.

A recurring, high-value pattern is catching a systematic mis-annotation: an
error that has propagated from a reference protein to a whole family or to a
paralog, not just the gene under review.

Completed runs

Compute-requiring batch (structure / sequence)

Gene | Hypothesis | Verdict (compute-driven) | Review action
METJA/MJ1511 | Does it retain AhpD oxidoreductase activity (GO:0016491, IBA) or is it a pseudoenzyme? | Pseudoenzyme. AlphaFold: the two cysteines are 36.5 Å apart with zero histidines, so no CXXC and no proton relay. Paralog MJ0742 carries the same erroneous annotation. | GO:0016491 UNDECIDED → REMOVE; flagged systematic methanogen-CMD mis-annotation
ECOLI/yrhB | Is the ISS Imm35 immunity / peptidase-inhibitor function real? | Fold correct, function over-annotated. Imm35 fold confirmed (AlphaFold pLDDT 95.2, Foldseek), but no Imm35 family member has experimental immunity evidence and there is no adjacent toxin gene; direct assays show chaperone activity (applies to K12, which is 100% identical to BL21). | Immunity tempered; added experimentally-grounded GO:0044183 (protein folding chaperone) + GO:0042026; core function reassigned
MYCTU/Rv0898c | Can the DUF2630 orphan fold be assigned a molecular function? | Partially supported; keep ND. Fold is classifiable (two-helix hairpin; weak uL29 hit at 27% identity, twilight zone) but function is not inferable (CATH 1.10.287 spans >600 superfamilies; motif CWDLLRQRR has no characterized match). | GO:0003674 ND retained, rationale strengthened
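The MJ1511 verdict rests on simple geometry and composition checks of a predicted model: cysteine SG-SG distance, histidine count, and presence of a CXXC motif. A minimal sketch of such checks follows, assuming a standard fixed-column PDB file. The function names are hypothetical, and the toy coordinates and sequence are synthetic stand-ins chosen to make the arithmetic obvious; they are not the MJ1511 model.

```python
import math
import re

def atom_line(serial, name, resname, resnum, x, y, z):
    """Format one fixed-column PDB ATOM record (chain A) for the toy model."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:3s} A{resnum:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def atoms(pdb_text, atom_name, resname):
    """Yield (residue number, (x, y, z)) for ATOM records matching atom and residue name."""
    for line in pdb_text.splitlines():
        if (line.startswith("ATOM") and line[12:16].strip() == atom_name
                and line[17:20] == resname):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield int(line[22:26]), xyz

def cys_sg_distances(pdb_text):
    """Pairwise SG-SG distances (in angstroms) between all cysteines."""
    sg = list(atoms(pdb_text, "SG", "CYS"))
    return {(a, b): math.dist(pa, pb)
            for i, (a, pa) in enumerate(sg) for (b, pb) in sg[i + 1:]}

# Synthetic toy model: two cysteines and one glycine (NOT real coordinates).
toy_pdb = "\n".join([
    atom_line(1, "CA", "CYS", 10, 0.0, 1.0, 0.0),
    atom_line(2, "SG", "CYS", 10, 0.0, 0.0, 0.0),
    atom_line(3, "CA", "GLY", 11, 1.5, 1.0, 0.0),
    atom_line(4, "CA", "CYS", 40, 3.0, 5.0, 0.0),
    atom_line(5, "SG", "CYS", 40, 3.0, 4.0, 0.0),
])
toy_seq = "MKCAGTLLCS"  # synthetic sequence, independent of toy_pdb

distances = cys_sg_distances(toy_pdb)                   # keyed by residue-number pair
n_his = sum(1 for _ in atoms(toy_pdb, "CA", "HIS"))     # one CA per residue
has_cxxc = re.search(r"C..C", toy_seq) is not None      # C, any two residues, C
print(distances, n_his, has_cxxc)
```

On a real AlphaFold model the same functions would be applied to the downloaded file's text; a large SG-SG distance together with no CXXC motif and no histidines is the pattern the MJ1511 row describes.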

Earlier hypothesis runs (core-vs-non-core / over-annotation)

Gene | Hypothesis | Verdict | Note
SCO1 | GO:0016531 copper chaperone = core MF | Supported (high) | flagged SCO2 paralog over-annotation; IEA→IMP upgrade lead
SCHPO/pmp20 | GO:0008379 thioredoxin peroxidase | Over-annotated → remove | reference refutes activity; GO logic conflict (NOT on parent)
IL21 | GO:0042102 pos. reg. T-cell proliferation = core | Keep as non-core | B-cell/Tfh axis is the signature function
STAT3 | GO:0030335 pos. reg. cell migration = core | Non-core | bidirectional migration effect = downstream, not core
CFAP300 | scaffold/adaptor/chaperone in dynein preassembly? | Unresolvable; add BP | "protein binding" confirmed uninformative; structural compute

Non-structural batch (topology / regulatory / motif)

A second batch deliberately targeted questions decidable by non-structural data
to test whether the agent exercises those data types as well as it does structure.

Gene | Data type | Verdict (compute/reasoning-driven) | Review action
worm/skn-1 | regulatory / domain | Over-annotated. bZIP/CNC transcription factor with no RNA-binding or translation-factor domain; the IEA GO:0006417 (regulation of translation) comes from UniProt keyword KW-0810, conflating SKN-1 being activated by translation inhibition with directly regulating it. | GO:0006417 UNDECIDED → REMOVE
CLCN7 | localization / sorting-signal | Over-annotated (second systematic catch). ClC-7 is an endolysosomal antiporter (N-terminal dileucine + acidic-cluster sorting motifs, absent from plasma-membrane paralogs CLCNKA/KB); GO:0030321 traces to a ComplexPortal family-level intro sentence propagated by PANTHER IBA to ~1,198 ortholog annotations. | GO:0030321 → REMOVE
ASCL1 | TF / ChIP / motif | Over-annotated. ASCL1 binds DNA as a functionally obligate class II bHLH heterodimer with class I E-proteins (TCF3/E2A, TCF4/E2-2, TCF12/HEB); the IEA GO:0042802 (identical protein binding, implying homodimer) is an Ensembl-Compara transfer; homodimers form only in vitro. | GO:0042802 → MODIFY → GO:0046982 heterodimerization

Execution behaviour differs by question type. The structural runs executed code
(fetched AlphaFold models, ran Foldseek, computed distances/composition) and emitted
provenance artifacts. The non-structural runs reached equally strong, correct
verdicts but mostly reasoned over the relevant data (domain architecture,
sorting motifs, ChIP/regulon/E-box signatures, InterPro/PANTHER families) rather
than executing topology tools or querying ChIP-Atlas; skn-1 and CLCN7 produced no
provenance artifacts and ASCL1 emitted only an evidence-summary figure. So
OpenScientist's code-execution mode appears triggered chiefly by structural
questions; for topology/regulatory questions it behaves as an expert reasoner over
sequence features and curated databases. The verdicts are still high-value: CLCN7
delivered a second systematic-mis-annotation catch (after MJ1511/MJ0742).

The prompt template can shift this, as validated by an A/B test. After tweaking
templates/gene_hypothesis_deep_research.md to ask the agent to execute
hypothesis-matched analyses (and save provenance), the CLCN7 topology question was
re-run with only the template changed. The original run produced zero provenance;
the re-run computed a Kyte–Doolittle hydropathy profile from the UniProt sequence,
aligned the 10 TM helices to UniProt topology, and localized the lysosomal sorting
motifs (saved as provenance) while reaching the identical over-annotation verdict
and honestly labelling the computation "supportive provenance rather than novel
evidence" (no fabricated web-only DeepTMHMM result). So for topology / pure-sequence
questions the behaviour gap is largely promptable. ChIP/expression questions that
depend on web-only resources (ChIP-Atlas, DeepTMHMM web server) remain limited by
tool access, not prompting.
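For reference, a Kyte-Doolittle profile and a dileucine-motif scan of the kind described in the re-run can be computed offline in a few lines. The sketch below is not the agent's code. It assumes the conventional 19-residue window and 1.6 threshold for transmembrane candidates and the [DE]XXXL[LI] dileucine consensus, and it runs on a synthetic toy sequence rather than the CLCN7 sequence.

```python
import re

# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window; index i covers seq[i:i+window]."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_windows(profile, threshold=1.6):
    """Window start positions whose mean hydropathy reaches the threshold."""
    return [i for i, value in enumerate(profile) if value >= threshold]

def dileucine_motifs(seq):
    """0-based start positions of [DE]XXXL[LI] matches (overlaps included)."""
    return [m.start() for m in re.finditer(r"(?=[DE]...L[LI])", seq)]

# Synthetic toy sequence: polar N-terminus with one motif, a 19-residue
# hydrophobic stretch, and a charged tail. NOT a real protein.
toy_seq = "MSDEKTPLLRK" + "LIVALLGAVFLIAGVLLAL" + "RKDDE"

profile = hydropathy_profile(toy_seq)
tm_starts = candidate_tm_windows(profile)
motif_starts = dileucine_motifs(toy_seq)
print(tm_starts, motif_starts)
```

A window-average profile only nominates hydrophobic segments; it does not assign orientation, which is why the re-run aligned its helices against the UniProt topology annotation rather than treating the profile as standalone evidence.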

Evidence integrity. Across all runs, cited PMIDs were real, on-target, and
verbatim-quotable; no hallucinated citations were found. Every supporting_text
wired into a review is checked as a verbatim substring of its source.
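The verbatim-substring rule is easy to express as a check. The sketch below assumes each evidence entry is a dict with reference and supporting_text keys and that source texts are available in a dict keyed by reference; those shapes, the function name, and the placeholder identifiers are illustrative assumptions, not the review schema.

```python
def failed_quotes(entries, sources):
    """Return entries whose supporting_text is empty or not an exact substring of its source."""
    failures = []
    for entry in entries:
        quote = entry["supporting_text"]
        source_text = sources.get(entry["reference"], "")
        # Exact, case-sensitive match: no whitespace or punctuation normalization.
        if not quote or quote not in source_text:
            failures.append(entry)
    return failures

# Placeholder data for illustration only (not real references or quotes).
sources = {"REF:example-1": "The protein showed chaperone activity in vitro."}
entries = [
    {"reference": "REF:example-1", "supporting_text": "showed chaperone activity"},
    {"reference": "REF:example-1", "supporting_text": "shows chaperone activity"},
    {"reference": "REF:missing", "supporting_text": "anything"},
]

bad = failed_quotes(entries, sources)
print(len(bad), "of", len(entries), "entries failed the verbatim check")
```

Treating a missing source as a failure, rather than skipping it, keeps an unverifiable quote from passing silently.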

Beyond structure: other informatics that could decide a call

The batch above leans heavily on AlphaFold/Foldseek because those questions were
structural. But the same "ask a question the literature can't settle, then compute"
approach applies to other data types. Candidate genes where a non-structural
analysis would be the deciding evidence:

A. Disputed membrane topology / transmembrane proteins

Resolvable by sequence-topology tools: DeepTMHMM, SignalP, Phobius, positive-inside / orientation analysis.

Gene | Disputed point | Method | Rank
CLCN7 | GO:0030321 transepithelial chloride transport: endolysosomal antiporter vs plasma-membrane epithelial channel | DeepTMHMM/Phobius topology + transporter-vs-channel mechanism | HIGH
WFS1 | multi-pass ER membrane topology; EF-hand / GO:0005516 calmodulin binding (UNDECIDED) | DeepTMHMM 9-TM topology, signal-anchor orientation, EF-hand motif scan | MED
SORL1 | disputed subcellular localizations (GO:0005641 nuclear envelope lumen, UNDECIDED; secretory-granule/MVB) | signal peptide + domain-architecture topology; localization-prediction consensus | MED

B. Transcription factors / DNA-binding with disputed function

Resolvable by motif analysis (JASPAR) + ChIP/regulatory databases (ChIP-Atlas, ReMap) + co-expression/regulon inference (ARACNe/GENIE3), not structure.

Gene | Disputed point | Method | Rank
worm/skn-1 | isoform-specific regulons & GO:0006417 translation regulation (UNDECIDED); A/B/C isoforms differ | isoform-resolved ChIP/CUT&RUN + tissue co-expression to partition regulons | HIGH
ASCL1 | pioneer-factor claim; direct vs indirect targets; neuronal vs SCLC E-box preference | ChIP-Atlas peaks + motif enrichment + enhancer/ATAC co-localization | HIGH
CTBP1 | corepressor vs context-specific coactivator (GO:0003713); which regulons | ReMap/ChIP-Atlas target sets + differential co-expression by cell type | MED

C. Expression-data-resolvable annotations

Resolvable by tissue/cell-type specificity and co-expression atlases: Bgee, Expression Atlas, single-cell references, GTEx.

Gene | Disputed point | Method | Rank
WFS1 | GO:0031016 pancreas development: developmental role vs β-cell degeneration (ER-stress apoptosis) | developmental vs adult β-cell expression timecourse; single-cell ER-stress signatures | MED
worm/skn-1 | are isoform modules (ASI chemosensory / intestinal detox / ER stress) distinct or context-induced | tissue-specific + developmental-stage expression; condition-induction profiles | HIGH
STAT3 | already-resolved migration call could be generalized: which processes are direct vs co-expression-driven | regulon/co-expression to separate direct targets from convergent phenotypes | LOW

The three HIGH leads (skn-1, ASCL1, CLCN7) have now been run (see the
non-structural batch above); all three returned actionable over-annotation
verdicts. The remaining MED rows (WFS1, SORL1, CTBP1) are still open leads.

Operational lessons

Status & next steps