COSCIENTIST
Using an autonomous AI "co-scientist" (OpenScientist) as an independent
bioinformatician to test specific gene-function hypotheses that the literature
cannot settle, and wiring the verdicts back into curated GO reviews.
Slides: COSCIENTIST-slides
(Marp; regenerate the PDF with `just gen-project-slides COSCIENTIST`).
Motivation
Most AI assistance in curation is literature synthesis: read the papers, summarize,
cite. That is valuable, but it cannot answer questions the literature never asked:
"does this orphan fold imply a function?", "are the catalytic residues actually
present?", "is this annotation a phylogenetic over-propagation?" Those require
running analyses, not just reading.
OpenScientist is an autonomous research agent that
can execute code (structure fetches, Foldseek searches, sequence/active-site
analysis, plotting) across several iterations and emit a cited report plus
provenance artifacts. This project treats it as a blinded, independent
bioinformatics scientist exploring a single hypothesis ("gene G has function F"),
then compares its conclusion against held-out local analyses and curator judgment,
and folds the result into the gene's `-ai-review.yaml`.
Operational conventions live in the openscientist-hypothesis skill. The driver is
`just gene-hypothesis-research openscientist <ORG> <GENE> --focus-type free-text
--hypothesis "…"` (defaults: 3 iterations, 2 h job timeout).
Where it adds the most value
Across runs, the highest-value, least-redundant results share a signature: the
question is not answerable from papers alone. When OpenScientist is asked a
literature-settleable core-vs-non-core question it tends to (correctly) reproduce
the existing review; when asked a compute-requiring question it produces genuinely
new evidence. So we deliberately steer it toward:
- uncharacterized / orphan proteins with a predicted-but-unverified function;
- suspected over-annotations propagated by IEA/IBA/ISS inference;
- "protein binding"-style uninformative molecular functions;
- pseudoenzyme / missing-catalytic-residue questions;
- paralog-discrimination by sequence or structure.
A recurring, high-value pattern is catching a systematic mis-annotation: an
error that has propagated from a reference protein to a whole family or to a
paralog, not just the gene under review.
Completed runs
Compute-requiring batch (structure / sequence)
| Gene | Hypothesis | Verdict (compute-driven) | Review action |
|---|---|---|---|
| METJA/MJ1511 | Does it retain AhpD oxidoreductase activity (GO:0016491, IBA) or is it a pseudoenzyme? | Pseudoenzyme. AlphaFold: the two cysteines are 36.5 Å apart with zero histidines, so no CXXC and no proton relay. Paralog MJ0742 carries the same erroneous annotation. | GO:0016491 UNDECIDED → REMOVE; flagged systematic methanogen-CMD mis-annotation |
| ECOLI/yrhB | Is the ISS Imm35 immunity / peptidase-inhibitor function real? | Fold correct, function over-annotated. Imm35 fold confirmed (AlphaFold pLDDT 95.2, Foldseek), but no Imm35 family member has experimental immunity evidence and there is no adjacent toxin gene; direct assays show chaperone activity (applies to K12, which is 100% identical to BL21). | Immunity tempered; added experimentally-grounded GO:0044183 (protein folding chaperone) + GO:0042026; core function reassigned |
| MYCTU/Rv0898c | Can the DUF2630 orphan fold be assigned a molecular function? | Partially supported; keep ND. Fold is classifiable (two-helix hairpin; weak uL29 hit at 27% identity, twilight zone) but function is not inferable (CATH 1.10.287 spans >600 superfamilies; motif CWDLLRQRR has no characterized match). | GO:0003674 ND retained, rationale strengthened |
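
The residue-geometry check behind the MJ1511 verdict (distance between the two cysteine SG atoms, plus a histidine count) can be sketched in a few lines of Python. This is a minimal illustration, not the code OpenScientist ran: the function names are hypothetical, and it assumes a single-model, fixed-column PDB file such as an AlphaFold download.

```python
import math
from itertools import combinations


def cysteine_sg_coords(pdb_text):
    """Map (chain, residue number) -> (x, y, z) for every cysteine SG atom.

    Assumes standard fixed-column PDB ATOM records and a single model.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        if res_name == "CYS" and atom_name == "SG":
            key = (line[21], int(line[22:26]))
            coords[key] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords


def pairwise_sg_distances(pdb_text):
    """Return {(cys_a, cys_b): distance in angstroms} for all cysteine pairs."""
    coords = cysteine_sg_coords(pdb_text)
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


def count_residues(pdb_text, res_name):
    """Count distinct residues of one type (e.g. "HIS") in the ATOM records."""
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[17:20].strip() == res_name:
            seen.add((line[21], int(line[22:26])))
    return len(seen)
```

A CXXC-type redox pair needs the two sulfurs within a few ångströms of each other, so a pairwise distance in the tens of ångströms together with `count_residues(pdb_text, "HIS") == 0` is the kind of evidence the table row summarizes.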
Earlier hypothesis runs (core-vs-non-core / over-annotation)
| Gene | Hypothesis | Verdict | Note |
|---|---|---|---|
| SCO1 | GO:0016531 copper chaperone = core MF | Supported (high) | flagged SCO2 paralog over-annotation; IEA→IMP upgrade lead |
| SCHPO/pmp20 | GO:0008379 thioredoxin peroxidase | Over-annotated; remove | reference refutes activity; GO logic conflict (NOT on parent) |
| IL21 | GO:0042102 pos. reg. T-cell proliferation = core | Keep as non-core | B-cell/Tfh axis is the signature function |
| STAT3 | GO:0030335 pos. reg. cell migration = core | Non-core | bidirectional migration effect = downstream, not core |
| CFAP300 | scaffold/adaptor/chaperone in dynein preassembly? | Unresolvable; add BP | "protein binding" confirmed uninformative; structural compute |
Non-structural batch (topology / regulatory / motif)
A second batch deliberately targeted questions decidable by non-structural data
to test whether the agent exercises those data types as well as it does structure.
| Gene | Data type | Verdict (compute/reasoning-driven) | Review action |
|---|---|---|---|
| worm/skn-1 | regulatory / domain | Over-annotated. bZIP/CNC transcription factor with no RNA-binding or translation-factor domain; the IEA GO:0006417 (regulation of translation) comes from UniProt keyword KW-0810, conflating SKN-1 being activated by translation inhibition with directly regulating it. | GO:0006417 UNDECIDED → REMOVE |
| CLCN7 | localization / sorting-signal | Over-annotated; second systematic catch. ClC-7 is an endolysosomal antiporter (N-terminal dileucine + acidic-cluster sorting motifs, absent from plasma-membrane paralogs CLCNKA/KB); GO:0030321 traces to a ComplexPortal family-level intro sentence propagated by PANTHER IBA to ~1,198 ortholog annotations. | GO:0030321 → REMOVE |
| ASCL1 | TF / ChIP / motif | Over-annotated. ASCL1 binds DNA as a functionally obligate class II bHLH heterodimer with class I E-proteins (TCF3/E2A, TCF4/E2-2, TCF12/HEB); the IEA GO:0042802 (identical protein binding, implying homodimer) is an Ensembl-Compara transfer; homodimers form only in vitro. | GO:0042802 → MODIFY → GO:0046982 heterodimerization |
Execution behaviour differs by question type. The structural runs executed code
(fetched AlphaFold models, ran Foldseek, computed distances/composition) and emitted
provenance artifacts. The non-structural runs reached equally strong, correct
verdicts but mostly reasoned over the relevant data (domain architecture,
sorting motifs, ChIP/regulon/E-box signatures, InterPro/PANTHER families) rather
than executing topology tools or querying ChIP-Atlas; skn-1 and CLCN7 produced no
provenance artifacts and ASCL1 emitted only an evidence-summary figure. So
OpenScientist's code-execution mode appears to be triggered chiefly by structural
questions; for topology/regulatory questions it behaves as an expert reasoner over
sequence features and curated databases. The verdicts are still high-value: CLCN7
delivered a second systematic-mis-annotation catch (after MJ1511/MJ0742).
The prompt template can shift this, as validated by an A/B test. After tweaking
`templates/gene_hypothesis_deep_research.md` to ask the agent to execute
hypothesis-matched analyses (and save provenance), the CLCN7 topology question was
re-run with only the template changed. The original run produced zero provenance;
the re-run computed a Kyte-Doolittle hydropathy profile from the UniProt sequence,
aligned the 10 TM helices to UniProt topology, and localized the lysosomal sorting
motifs (all saved as provenance), while reaching the identical over-annotation
verdict and honestly labelling the computation "supportive provenance rather than
novel evidence" (no fabricated web-only DeepTMHMM result). So for topology /
pure-sequence questions the behaviour gap is largely promptable. ChIP/expression
questions that depend on web-only resources (ChIP-Atlas, DeepTMHMM web server)
remain limited by tool access, not prompting.
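
For readers unfamiliar with the hydropathy analysis mentioned in the A/B re-run, a sliding-window Kyte-Doolittle profile can be computed offline from a bare sequence. The sketch below is illustrative only: the function names are hypothetical, and the 19-residue window and 1.6 cutoff are the commonly used heuristic for membrane-spanning segments, assumed here rather than taken from the run's provenance.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full-length window along the sequence.

    Raises KeyError on non-standard residues (X, U, ...), so ambiguous
    sequences fail loudly instead of being scored silently.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        return []
    scores = [KD[aa] for aa in seq]
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]  # slide the window by one
        profile.append(total / window)
    return profile


def candidate_tm_segments(profile, window=19, threshold=1.6):
    """Merge consecutive above-threshold windows into 1-based (start, end) spans."""
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start + 1, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start + 1, len(profile) - 1 + window))
    return segments
```

The merged spans are broader than the true helices because every window that overlaps a hydrophobic stretch contributes, which is one reason such a profile is better treated as supportive provenance to compare against curated topology than as a standalone prediction.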
Evidence integrity. Across all runs, cited PMIDs were real, on-target, and
verbatim-quotable; no hallucinated citations were found. Every supporting_text
wired into a review is checked as a verbatim substring of its source.
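
The verbatim-substring rule is simple to state in code. The following is a minimal sketch of the idea, not the project's actual validator: the function names are hypothetical, and the whitespace-tolerant variant is an added assumption for sources whose line wrapping differs from the quote.

```python
def is_verbatim(supporting_text, source_text):
    """True only if the quote appears character-for-character in the source."""
    quote = supporting_text.strip()
    return bool(quote) and quote in source_text


def _collapse_ws(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def is_verbatim_modulo_whitespace(supporting_text, source_text):
    """Looser check (an assumption, not the documented rule): ignore re-wrapping."""
    quote = _collapse_ws(supporting_text)
    return bool(quote) and quote in _collapse_ws(source_text)


def unverified_quotes(quotes, source_text):
    """Return the quotes that fail the strict check, preserving input order."""
    return [q for q in quotes if not is_verbatim(q, source_text)]
```

An empty or whitespace-only quote is rejected deliberately, since the empty string is trivially a substring of any source.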
Beyond structure: other informatics that could decide a call
The batch above leans heavily on AlphaFold/Foldseek because those questions were
structural. But the same "ask a question the literature can't settle, then compute"
approach applies to other data types. Candidate genes where a non-structural
analysis would be the deciding evidence:
A. Disputed membrane topology / transmembrane proteins
Resolvable by sequence-topology tools: DeepTMHMM, SignalP, Phobius, positive-inside / orientation analysis.
| Gene | Disputed point | Method | Rank |
|---|---|---|---|
| CLCN7 | GO:0030321 transepithelial chloride transport: endolysosomal antiporter vs plasma-membrane epithelial channel | DeepTMHMM/Phobius topology + transporter-vs-channel mechanism | HIGH |
| WFS1 | multi-pass ER membrane topology; EF-hand / GO:0005516 calmodulin binding (UNDECIDED) | DeepTMHMM 9-TM topology, signal-anchor orientation, EF-hand motif scan | MED |
| SORL1 | disputed subcellular localizations (GO:0005641 nuclear envelope lumen, UNDECIDED; secretory-granule/MVB) | signal peptide + domain-architecture topology; localization-prediction consensus | MED |
B. Transcription factors / DNA-binding with disputed function
Resolvable by motif analysis (JASPAR) + ChIP/regulatory databases (ChIP-Atlas, ReMap) + co-expression/regulon inference (ARACNe/GENIE3), not structure.
| Gene | Disputed point | Method | Rank |
|---|---|---|---|
| worm/skn-1 | isoform-specific regulons & GO:0006417 translation regulation (UNDECIDED); A/B/C isoforms differ | isoform-resolved ChIP/CUT&RUN + tissue co-expression to partition regulons | HIGH |
| ASCL1 | pioneer-factor claim; direct vs indirect targets; neuronal vs SCLC E-box preference | ChIP-Atlas peaks + motif enrichment + enhancer/ATAC co-localization | HIGH |
| CTBP1 | corepressor vs context-specific coactivator (GO:0003713); which regulons | ReMap/ChIP-Atlas target sets + differential co-expression by cell type | MED |
C. Expression-data-resolvable annotations
Resolvable by tissue/cell-type specificity and co-expression atlases: Bgee, Expression Atlas, single-cell references, GTEx.
| Gene | Disputed point | Method | Rank |
|---|---|---|---|
| WFS1 | GO:0031016 pancreas development: developmental role vs β-cell degeneration (ER-stress apoptosis) | developmental vs adult β-cell expression timecourse; single-cell ER-stress signatures | MED |
| worm/skn-1 | are isoform modules (ASI chemosensory / intestinal detox / ER stress) distinct or context-induced | tissue-specific + developmental-stage expression; condition-induction profiles | HIGH |
| STAT3 | already-resolved migration call could be generalized: which processes are direct vs co-expression-driven | regulon/co-expression to separate direct targets from convergent phenotypes | LOW |
The three HIGH leads (skn-1, ASCL1, CLCN7) have now been run; see the
non-structural batch above; all three returned actionable over-annotation
verdicts. The remaining MED rows (WFS1, SORL1, CTBP1) are still open leads.
Operational lessons
- Give jobs ample time, then scope. Real 3-iteration runs take ~50-90 min;
  structural runs routinely exceed the upstream 3600 s default and get cancelled
  mid-analysis. The `just` recipes now inject `--param timeout=7200` (the API
  ceiling, `le=7200`) and keep the subprocess wall (`--timeout-seconds`, default
  8100 s) above it. When a job still hits the 7200 s cap (common for human proteins
  asked a multi-faceted question), the fix is scope, not time: re-run with one
  decisive analysis and/or `max_iterations=2` (C18orf21: 7200 s timeout → 2708 s
  when narrowed to Foldseek-only).
- A job can exit 0 yet write nothing. Always confirm `openscientist.md` (and the
  `openscientist_artifacts/` dir) exist. Two empty-output failure modes: timeout
  (`… timed out after Ns … cancelled` → raise the timeout) and a transient
  server-side LLM error (`… Request ID: req_…` → re-run; if it recurs for the same
  gene, treat as a persistent upstream issue and report rather than retry forever).
- Keep local bioinformatics held out. Do not feed existing
  `*-bioinformatics/RESULTS.md` into the prompt; compare against it after the run.
- Verify quotes. Every `supporting_text` added to a review must be a verbatim
  substring of its cited source; this is enforced for both publication PMIDs and the
  `file:…/openscientist.md` report itself.
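
The "exit 0 yet write nothing" lesson lends itself to an automated post-run check. The sketch below is one possible shape for it, under stated assumptions: `classify_run` and its return labels are hypothetical, the run directory layout is assumed to hold `openscientist.md` and `openscientist_artifacts/` side by side, and the log patterns are loose matches for the two messages quoted above rather than the exact upstream wording.

```python
import re
from pathlib import Path


def classify_run(run_dir, log_text=""):
    """Classify a finished job by what it wrote, not by its exit code.

    Returns one of: "ok", "ok_no_artifacts", "timeout",
    "upstream_llm_error", "empty_unknown".
    """
    run_dir = Path(run_dir)
    report = run_dir / "openscientist.md"
    artifacts = run_dir / "openscientist_artifacts"

    has_report = report.is_file() and report.stat().st_size > 0
    has_artifacts = artifacts.is_dir() and any(artifacts.iterdir())

    if has_report:
        # Reasoning-only runs can be valid without provenance artifacts.
        return "ok" if has_artifacts else "ok_no_artifacts"

    # No report: tell the two known empty-output failure modes apart.
    if re.search(r"timed out after \d+\s*s", log_text):
        return "timeout"  # raise the timeout or narrow the scope
    if "Request ID: req_" in log_text:
        return "upstream_llm_error"  # re-run; report if it recurs
    return "empty_unknown"
```

Returning a label rather than raising lets a batch driver decide per label whether to retry, narrow the question, or escalate.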
Status & next steps
- [x] Compute-requiring structural batch (MJ1511, yrhB, Rv0898c) run, reviewed, and
  wired into `-ai-review.yaml`.
- [x] Timeout/failure-mode handling hardened in the skill + justfile.
- [x] HIGH non-structural leads (skn-1, ASCL1, CLCN7) run, reviewed, and wired in;
  all three returned actionable over-annotation verdicts; established that code
  execution is triggered mainly by structural questions.
- [x] Tweaked the prompt template (`templates/gene_hypothesis_deep_research.md`) to
  encourage executing hypothesis-matched analyses (hydropathy/topology +
  targeting motifs; active-site/motif residue checks; binding-domain/PWM;
  domain/orthology) and saving the computed result as provenance, with a hard
  "never fabricate / inconclusive is fine / say so if web-only" clause.
- [x] Validated the template tweak with a CLCN7 A/B re-run: same question, only the
  template changed; behaviour shifted from zero provenance to a computed
  Kyte-Doolittle hydropathy + sorting-motif analysis, identical verdict, honest
  labelling. Confirms topology behaviour is promptable; ChIP/expression remain
  tool-access-limited.
- [x] Executable-bioinformatics + Proteostasis-Network batch run and wired in
  (Rv0311 intein REMOVE; NPLOC4 pseudo-DUB; HSPA12A/HSPA12B pseudo-chaperones,
  no GO:0140662; AARSD1 AlaX editing confirmed; RvY_17310 left UNDECIDED after
  verifying the paralog claims; C18orf21 → RMP24/RNase MRP recorded as a
  verification-gated lead). DNAJC28 pending its re-run.
- [ ] Formalize C18orf21 → RNase MRP (GO:0000172) once the 2025-26 primary papers
  are fetched/cached and verified.
- [ ] Run the MED non-structural leads (WFS1, SORL1, CTBP1).
- [ ] Write up the systematic-mis-annotation cases (MJ1511/MJ0742; CLCN7's ~1,198
  propagated annotations; pmp20 family) as examples of family-level error
  propagation caught by compute.