Supplemental benchmark and source-availability details for the BioReason-Pro comparison

This supplement documents analyses that are useful for reproducibility but are not part of the main paper's primary BioReason-Pro benchmark story. The main manuscript uses ARGO139 for RL narrative review and ARGO95 for SFT GO-term review, while ESR-ECOLI-DET-Mini is the separate Expert Synthetic Review recap positive control. The views below explain why earlier drafts used mixed SFT denominators and preserve those results for reproducibility.

S1. Cohort accounting

The main RL benchmark is ARGO139, a fixed 139-gene set listed in ../genes.csv. The main SFT term benchmark is ARGO95, the 95-gene ARGO139 subset present in the HuggingFace wanglab/protein_catalogue SFT download.

Table S1. Cohorts emitted by write_benchmark_sidecars.py.

| Cohort | Genes | Predictions | Role |
| --- | --- | --- | --- |
| argo139_rl_narrative | 139 | - | Main RL narrative benchmark |
| argo95_sft_terms | 95 | 955 | Main HF-catalogue SFT term benchmark |
| supplement_sft_terms_argo139_mixed_sources | 139 | 10,697 | Mixed-source ARGO139 diagnostic; not a primary benchmark |
| supplement_sft_terms_web_export_44 | 44 | 9,742 | ARGO139 genes absent from HF; web source includes ancestor hierarchy |
| supplement_sft_narrative_hf | 45 | - | SFT narrative cross-check |
| supplement_sft_terms_hf_catalogue_all | 154 | 1,358 | Full HF catalogue view |
| supplement_sft_terms_union_all | 198 | 11,100 | ARGO139 plus 59 HF-only genes |
| supplement_gogpt_overlap_300 | 300 | 8,910 | Separate GO-GPT overlap review |

The key availability issue is simple: the HuggingFace wanglab/protein_catalogue SFT download contained 95/139 ARGO139 genes. The remaining 44 ARGO139 genes were not present in that download. We do not fill those 44 into the primary SFT analysis, because the BioReason-Pro SFT web exports expose a much larger ancestor-rich term panel and are not comparable to the HF catalogue source.
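The cohort arithmetic behind Table S1 can be checked with a short sketch. It uses only the gene and prediction counts printed in the table and assumes, as the text states, that ARGO95 and the 44 web-export genes are disjoint and that ARGO95 lies entirely inside the full HF catalogue. The helper name `combine` is illustrative and is not taken from write_benchmark_sidecars.py.

```python
# (genes, predictions) copied from Table S1.
cohorts = {
    "argo95_sft_terms": (95, 955),
    "supplement_sft_terms_web_export_44": (44, 9_742),
    "supplement_sft_terms_argo139_mixed_sources": (139, 10_697),
    "supplement_sft_terms_hf_catalogue_all": (154, 1_358),
    "supplement_sft_terms_union_all": (198, 11_100),
}


def combine(*names):
    """Sum genes and predictions over cohorts assumed to be disjoint."""
    genes = sum(cohorts[name][0] for name in names)
    preds = sum(cohorts[name][1] for name in names)
    return genes, preds


# ARGO95 (HF) plus the 44 web-export genes should give the mixed ARGO139 view.
mixed = combine("argo95_sft_terms", "supplement_sft_terms_web_export_44")
assert mixed == cohorts["supplement_sft_terms_argo139_mixed_sources"]

# HF-only genes are the full HF catalogue minus ARGO95.
hf_all = cohorts["supplement_sft_terms_hf_catalogue_all"]
argo95 = cohorts["argo95_sft_terms"]
hf_only = (hf_all[0] - argo95[0], hf_all[1] - argo95[1])

# The all-source union is mixed ARGO139 plus the HF-only genes.
union = (mixed[0] + hf_only[0], mixed[1] + hf_only[1])
assert union == cohorts["supplement_sft_terms_union_all"]

print("HF-only (genes, predictions):", hf_only)
print("Union (genes, predictions):", union)
```

Under these assumptions the HF-only remainder is 59 genes, matching the count quoted for the full HF catalogue and union views.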

S2. Supplemental SFT term views

Table S2. ARGO95 SFT assessment distribution, repeated from the main paper.

| Benchmark | Genes | Terms | CNN | NPI | COR | LSP | REP | UNC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARGO95 (HF catalogue) | 95 | 955 | 645 (67.5%) | 101 (10.6%) | 52 (5.4%) | 37 (3.9%) | 21 (2.2%) | 99 (10.4%) |

For comparison, the mixed-source ARGO139 view is retained as a source-diagnostic table, not as a primary SFT benchmark.

Table S3. Supplemental mixed-source ARGO139 SFT assessment distribution.

| Source | Genes | Terms | CNN | NPI | COR | LSP | REP | UNC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HF catalogue / ARGO95 | 95 | 955 | 645 (67.5%) | 101 (10.6%) | 52 (5.4%) | 37 (3.9%) | 21 (2.2%) | 99 (10.4%) |
| Web export | 44 | 9,742 | 2,321 (23.8%) | 42 (0.4%) | 7 (0.1%) | 388 (4.0%) | 1 (0.0%) | 6,983 (71.7%) |
| Mixed-source ARGO139 total | 139 | 10,697 | 2,966 (27.7%) | 143 (1.3%) | 59 (0.6%) | 425 (4.0%) | 22 (0.2%) | 7,082 (66.2%) |
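The following sketch pools the two source rows of Table S3 and recomputes the mixed-source shares. It is meant only to illustrate why the mixed denominator is a source diagnostic: the web export contributes roughly 91% of all terms, so the pooled distribution mostly mirrors that source. The counts are copied from the table; the function names are illustrative.

```python
CATEGORIES = ["CNN", "NPI", "COR", "LSP", "REP", "UNC"]

# Assessment counts copied from Table S3.
hf_counts = dict(zip(CATEGORIES, [645, 101, 52, 37, 21, 99]))
web_counts = dict(zip(CATEGORIES, [2321, 42, 7, 388, 1, 6983]))


def pool(*sources):
    """Add per-category counts across sources."""
    return {c: sum(src[c] for src in sources) for c in CATEGORIES}


def shares(counts):
    """Convert counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {c: round(100 * n / total, 1) for c, n in counts.items()}


mixed_counts = pool(hf_counts, web_counts)
mixed_shares = shares(mixed_counts)

# Fraction of all mixed-source terms that come from the web export.
web_weight = sum(web_counts.values()) / sum(mixed_counts.values())

for c in CATEGORIES:
    print(f"{c}: {mixed_counts[c]:>5} ({mixed_shares[c]}%)")
print(f"web export weight: {web_weight:.3f}")
```

The pooled UNC share (66.2%) therefore sits close to the web-export value (71.7%) and far from the ARGO95 value (10.4%), even though ARGO95 covers more genes.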

Table S4. Terms per gene in the SFT source views.

| Source | Mean terms/gene | Median terms/gene | Max terms/gene |
| --- | --- | --- | --- |
| ARGO95 / HF catalogue | 10.1 | 7.0 | 38 |
| Web export | 221.4 | 212.5 | 598 |
| Mixed-source ARGO139 total | 77.0 | 12.0 | 598 |

The all-HF view is still useful as the broadest single-source HF view, but it is not the main benchmark because 59 of those genes are outside ARGO139.

Table S5. Supplemental full HF catalogue view: 1,358 terms across 154 genes.

| Assessment | Count | % |
| --- | --- | --- |
| CNN | 884 | 65.1 |
| UNC | 186 | 13.7 |
| NPI | 154 | 11.3 |
| COR | 59 | 4.3 |
| LSP | 50 | 3.7 |
| REP | 25 | 1.8 |

The all-source union is the broadest source-availability view, but it combines ARGO139 with 59 HF-only genes and is therefore not a paired benchmark.

Table S6. Supplemental all-source union: 11,100 terms across ARGO139 plus 59 HF-only genes.

| Assessment | Count | % |
| --- | --- | --- |
| CNN | 3,205 | 28.9 |
| NPI | 196 | 1.8 |
| UNC | 7,169 | 64.6 |
| COR | 66 | 0.6 |
| LSP | 438 | 3.9 |
| REP | 26 | 0.2 |

S3. CAFA-style retrospective GOA agreement

We computed a retrospective CAFA-style agreement score for ARGO95 SFT GO-term predictions using current local GOA as the reference. This is not a true CAFA benchmark: ARGO95 is retrospective, there is no temporal holdout, and the BioReason-Pro SFT files do not contain model confidence scores. The score therefore treats predictions as an unranked single-threshold set and reports propagated precision/recall/F1 rather than $F_{\max}$. Both predictions and reference GOA annotations are propagated over is_a and part_of ancestors from go-basic.obo, excluding the three GO aspect roots. The mixed-source ARGO139 rows are retained only as diagnostics.

Table S7. Propagated all-aspect agreement against current GOA.

| Source | Genes | Scored direct predictions | Direct GOA terms | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ARGO95 / HF catalogue | 95 | 952 | 2,382 | 0.864 | 0.476 | 0.614 |
| Web export | 44 | 9,730 | 3,885 | 0.777 | 0.533 | 0.632 |
| Mixed-source ARGO139 total | 139 | 10,682 | 6,267 | 0.808 | 0.510 | 0.625 |
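A minimal sketch of set-based propagated scoring for a single gene is given below. It is not the pipeline that produced Table S7: the toy ontology, the term identifiers, and the function names are hypothetical, and how per-gene results were aggregated into the table values is not specified here. The sketch only illustrates the idea of propagating both sets to their ancestors, dropping the roots, and then computing precision, recall, and F1 on the propagated sets.

```python
# Hypothetical toy ontology: child -> parents via is_a / part_of.
PARENTS = {
    "GO:A1": {"GO:A"},
    "GO:A2": {"GO:A"},
    "GO:A": {"GO:ROOT"},
    "GO:B1": {"GO:B"},
    "GO:B": {"GO:ROOT"},
    "GO:ROOT": set(),
}
ROOTS = {"GO:ROOT"}  # stands in for the three GO aspect roots


def propagate(terms):
    """Return the terms plus all of their ancestors, excluding roots."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(PARENTS.get(term, ()))
    return seen - ROOTS


def propagated_prf(predicted, reference):
    """Unranked precision, recall, and F1 on propagated term sets."""
    pred = propagate(predicted)
    ref = propagate(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One predicted leaf that is not itself annotated, but shares a parent
# with an annotated sibling term.
p, r, f1 = propagated_prf({"GO:A1"}, {"GO:A2", "GO:B1"})
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case the prediction has no exact match in the reference yet still earns credit through the shared ancestor, which is the behaviour that lets propagated agreement reward a term that a reviewer might classify differently.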

The score shows why aggregate GOA agreement is useful but incomplete. In the HF catalogue subset, 37/122 terms classified by AI-AUGR as NPI or REP are exact matches to current GOA, and 92/122 have propagated overlap with current GOA. A GOA-agreement metric would reward some of these predictions despite evidence-grounded review classifying them as wrong or frequency-biased.

Figure. CAFA-style propagated F1 by aspect for ARGO95 SFT terms, with mixed-source diagnostics.

Full derived tables are in ../cafa-style/.

S4. SFT narrative cross-check

The HuggingFace SFT narrative sample contains 45 proteins, 44 of which have parseable 1-5 correctness/completeness scores. It is not paired to ARGO139 and is not used as a main result. It remains a useful cross-check: SFT narrative scores are lower than RL (correctness 2.9/5 vs. 3.7/5; completeness 2.7/5 vs. 2.9/5), and 7/45 SFT outputs contained fabricated "UniProt Summary" prose for proteins that UniProt describes only as uncharacterized.

S5. GO-GPT overlap review

The main paper removes the GO-GPT section because it is a separate 300-gene synthetic review of the upstream GO-GPT predictor, not a paired ARGO139 BioReason-Pro result. The review remains useful for showing how much apparent agreement changes when the reference set moves from raw GOA to AIGR core biology.

Table S8. GO-GPT prediction overlap at three reference levels (300 genes).

| Reference level | Terms in reference | Predictions overlapping | % of 8,910 predictions |
| --- | --- | --- | --- |
| Raw GOA | 2,967 | 1,046 | 11.7 |
| Post-AIGR-review | 2,913 | 971 | 10.9 |
| AIGR core functions only | 933 | 210 | 2.4 |

Figure. GO-GPT prediction overlap at three reference levels.

GO-GPT emitted 8,910 predictions across 300 genes (mean 29.7 per gene). Raw GOA agreement was 11.7%; agreement with AIGR core functions was only 2.4%. This is a useful illustration of the CAFA-style scoring gap, but it is not used as a main BioReason-Pro benchmark result.
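The percentages in Table S8 follow directly from the overlap counts, as the short sketch below shows. All numbers are copied from the table; the variable names are illustrative.

```python
TOTAL_PREDICTIONS = 8_910
GENES = 300

# Predictions overlapping each reference level, copied from Table S8.
overlaps = {
    "Raw GOA": 1_046,
    "Post-AIGR-review": 971,
    "AIGR core functions only": 210,
}

# Share of all GO-GPT predictions that overlap each reference level.
pct = {
    level: round(100 * n / TOTAL_PREDICTIONS, 1)
    for level, n in overlaps.items()
}
mean_per_gene = round(TOTAL_PREDICTIONS / GENES, 1)

# How many times more predictions overlap raw GOA than AIGR core functions.
raw_to_core = round(overlaps["Raw GOA"] / overlaps["AIGR core functions only"], 1)

for level, value in pct.items():
    print(f"{level}: {value}%")
print(f"mean predictions per gene: {mean_per_gene}")
print(f"raw GOA vs. core overlap ratio: {raw_to_core}")
```

Roughly five times as many predictions overlap raw GOA as overlap the AIGR core functions.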

S6. Reproducibility files