Agentic evaluation of function prediction tools yields qualitative insights into systematic errors

Motivation

Assigning functions to gene products is one of the foundational tasks of molecular biology: downstream analyses, from enrichment testing to pathway reconstruction to therapeutic target identification, all inherit the assumptions encoded in functional annotations. At the scale of modern sequence databases, purely experimental characterisation is not tractable, and functional knowledge is instead assembled from two complementary sources: (i) literature-based expert curation, in which curators read papers about experimentally tractable model organisms and record structured annotations in resources such as the Gene Ontology (GO); and (ii) computational prediction, which propagates function to the long tail of uncharacterised proteins.

Although the literature is increasingly dominated by deep-learning and foundation-model approaches to computational prediction, the methods actually deployed by most annotation databases remain comparatively conservative: HMM- and family-based pipelines such as InterPro2GO, PANTHER phylogenetic inference (IBA), and orthology-transfer tools. These pipelines are attractive to curators because a hit to a well-defined protein family lands in a curated mapping (family → GO term, or ancestral-node → descendant in a PANTHER tree), giving a traceable, deterministic chain from the sequence evidence to the resulting annotation. This creates a recurring, concrete question for resources like GO: when is a new, less rule-like method good enough to trust in production?

The Critical Assessment of Functional Annotation (CAFA) series [1–3] has been indispensable for tracking aggregate progress, using temporal holdout against GOA snapshots and reporting per-protein GOA-agreement scores. However, aggregate agreement with a GOA snapshot answers a somewhat different question from the one a database lead actually faces, which is: does this method, on the proteins in my organism of interest, produce annotations that a human curator would sign off on? In practice, newly published deep-learning methods often post strong headline scores yet fail biological spot-checks in exactly the ways curators notice: wrong paralog subfamily, wrong compartment, wrong pathway context, activity that exists in vitro but not in vivo.

A second, and increasingly acute, problem is that modern agentic and foundation-model predictors no longer emit only GO terms. BioReason-Pro [6], for example, outputs a free-text functional summary and a chain-of-thought reasoning trace in addition to a predicted term list. Much of the scientific content — the proposed mechanism, the cited domain evidence, the identification (or not) of a pseudoenzyme, the placement of the protein in a pathway — lives in the narrative, not in the term list. A GOA-agreement metric, by construction, can only grade the term-list projection of such an output. It has no way to say whether the narrative is biologically coherent, whether it correctly names the organism, or whether the reasoning trace actually supports the terms the model ultimately emits.

The most systematic recent demonstration of this gap is de Crécy-Lagard et al. (2025, G3) [7], who manually reviewed 453 DeepECTransformer EC predictions for uncharacterised E. coli proteins. Only 3/453 predictions were genuinely novel and correct. The remainder fell into reproducible error classes that are lost in an aggregate metric but immediately obvious to an expert reading the gene dossier: paralog-incorrect (yciO is predicted as a threonylcarbamoyl-AMP synthase, but is a TsaC paralog with ~10⁴-fold weaker activity); non-paralog-incorrect (yjhQ is predicted as a mycothiol synthase, but the mycothiol pathway is absent from E. coli; yrhB's predicted activity is already encoded by QueD); in-vitro-not-in-vivo (yjdM has assayable phosphonoacetate hydrolase activity but no in vivo phenotype or genetic evidence); and frequency-biased repetition (fepE predicted as a histidine kinase with no sequence similarity to the HK family). Each rejection required the curator to synthesise domain architecture, paralog subfamily context, pathway presence/absence in the organism, genetic evidence, and orthogonal literature — not simple label-checking.

This manual synthesis is the bottleneck. With new foundation-model and agentic predictors appearing monthly, no database team can scale expert spot-checking by hand. We show that the synthesis itself can be partially automated, by casting the curator's workflow as an LLM agent pipeline grounded in per-gene dossiers (UniProt, GOA, full-text literature, InterPro, deep research reports) and GO best-practices [4,5]. We have implemented this as the AI Gene Review (AIGR) framework, in which curator-agents produce structured reviews with per-annotation actions, a curated core-function summary, and per-claim supporting quotes. AIGR is intended as a complement to — not a replacement for — CAFA-style benchmarks: it reviews narrative and reasoning content outside aggregate term metrics, and it flags the systematic failure modes that aggregate summaries hide. Here we apply AIGR to two case studies: a large-scale evaluation of BioReason-Pro, and ESR-ECOLI-DET-Mini, a 7-gene Expert Synthetic Review recap of the de Crécy-Lagard error taxonomy.

Methods

The AI Gene Review (AIGR) pipeline. AIGR is an agentic curation system in which an LLM curator-agent works through a per-gene dossier and emits a structured review that conforms to a LinkML schema. For each gene, the pipeline first assembles the dossier from canonical resources (UniProt record, the full GO annotation table from QuickGO, InterPro domain architecture, cached full-text publications for every PMID cited in the annotation table, and an orthogonal "deep research" report produced by a separate literature-retrieval agent). The curator-agent then proceeds through three phases: (i) annotation-level review, in which every existing GO annotation is assigned an action (ACCEPT, KEEP_AS_NON_CORE, MODIFY, REMOVE, MARK_AS_OVER_ANNOTATED, or UNDECIDED) together with a supporting-text quote drawn verbatim from one of the cached publications; (ii) core-function synthesis, in which the agent writes a free-text summary of the gene's core molecular and biological function and proposes any missing GO terms; and (iii) prediction review (optional), in which a separate predictions-review.yaml scores computational or LLM-generated predictions that are not already in GOA, using the de Crécy-Lagard error taxonomy (COR/CNN/LSP/PLI/NPI/REP/UNC) together with structured error-type tags such as PARALOG_OVERANNOTATION, PATHWAY_CONTEXT_IGNORED, FREQUENCY_BIAS, and IN_VITRO_NOT_IN_VIVO. All outputs are validated against the schema and against a suite of best-practice consistency checks (e.g., every supporting quote must be literally present in a cached publication). The gene reviews, supporting data, and validator are open source at github.com/ai4curation/ai-gene-review.
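The quote-presence consistency check described above can be illustrated with a minimal sketch. This is not the AIGR validator itself: the class and function names (AnnotationReview, unsupported_quotes), the "PMID:..." cache key format, and the whitespace normalisation are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Annotation-level actions named in the Methods text."""
    ACCEPT = "ACCEPT"
    KEEP_AS_NON_CORE = "KEEP_AS_NON_CORE"
    MODIFY = "MODIFY"
    REMOVE = "REMOVE"
    MARK_AS_OVER_ANNOTATED = "MARK_AS_OVER_ANNOTATED"
    UNDECIDED = "UNDECIDED"


@dataclass
class AnnotationReview:
    """Hypothetical record for one reviewed GO annotation."""
    go_id: str            # GO term under review
    action: Action        # curator-agent decision
    reference_id: str     # key into the publication cache (format assumed)
    supporting_text: str  # quote claimed to come from that publication


def normalise(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching (an assumption)."""
    return " ".join(text.split())


def unsupported_quotes(reviews, cached_publications):
    """Return the reviews whose quote is empty or absent from the cited cached text."""
    failures = []
    for review in reviews:
        full_text = normalise(cached_publications.get(review.reference_id, ""))
        quote = normalise(review.supporting_text)
        if not quote or quote not in full_text:
            failures.append(review)
    return failures
```

A review that cites a publication missing from the cache fails the check, because the quote is compared against an empty string. Whether the real validator tolerates whitespace differences is not stated in the text, so the normalisation step should be read as one possible design choice.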

Case 1: BioReason-Pro evaluation on ARGO139 and ARGO95. We selected 139 proteins (ARGO139) spanning well-characterised model-organism genes, genes from non-MOD or less-specialised species such as Bacillus subtilis, and harder edge cases (pseudoenzymes, paralog families, sporulation sigma factors, organism-specific regulators). For each ARGO139 gene we obtained the BioReason-Pro RL functional summary and reasoning trace from the public BioReason web app [6], and we used the curated AIGR review as ground truth. For SFT GO-term analysis we used ARGO95, the 95-gene subset of ARGO139 that has predictions in HuggingFace wanglab/protein_catalogue. A dedicated comparison agent then scored the RL output along two axes (Correctness 1–5, Completeness 1–5) with a rubric requiring supporting-quote evidence, and wrote a qualitative comparison against the InterPro2GO pipeline (GO_REF:0000002) as a domain-based baseline.
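As a rough illustration of how two-axis rubric scores of this kind could be recorded and aggregated per organism, consider the sketch below. The ComparisonScore record, its field names, and the per_organism_means helper are hypothetical, and the values in any usage are toy numbers rather than results from this study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ComparisonScore:
    """Hypothetical record for one gene scored by the comparison agent."""
    gene: str
    organism: str
    correctness: int     # rubric axis, 1-5
    completeness: int    # rubric axis, 1-5
    evidence_quote: str  # the rubric requires supporting-quote evidence

    def __post_init__(self):
        # Enforce the 1-5 range of both rubric axes.
        for axis in ("correctness", "completeness"):
            value = getattr(self, axis)
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be in 1..5, got {value}")
        # Reject scores that come without any supporting quote.
        if not self.evidence_quote.strip():
            raise ValueError("a supporting quote is required")


def per_organism_means(scores):
    """Map organism -> (mean correctness, mean completeness)."""
    grouped = defaultdict(list)
    for score in scores:
        grouped[score.organism].append(score)
    return {
        organism: (
            mean(s.correctness for s in items),
            mean(s.completeness for s in items),
        )
        for organism, items in grouped.items()
    }
```

Rejecting a score without a quote at construction time is one way to make the evidence requirement hard to bypass. The actual rubric enforcement in the comparison agent is not described in the text.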

Case 2: ESR-ECOLI-DET-Mini recap. de Crécy-Lagard et al. (2025, G3) [7] hand-categorised 453 DeepECTransformer EC predictions for uncharacterised E. coli proteins into COR, CNN, LSP, PLI, NPI, REP, and UNC classes. We sampled 7 genes spanning all classes (ygfF, yciO, yegV, yjhQ, yrhB, yjdM, fepE) and used AIGR to produce a full gene review and a structured predictions-review.yaml under the same taxonomy. We call this quick-check benchmark ESR-ECOLI-DET-Mini (alias ESR-ECOLI-DET-7; dataset ID 10.5281/zenodo.20751016). We first ran it as a retrospective positive control: because the project artifacts include the published labels and rationales, this run is not a blinded validation. We then archived an answer-key-withheld recapitulation in which the de Crécy-Lagard source paper and published rationales were excluded but primary literature and in-house bioinformatics were allowed.

Results

BioReason-Pro. Overall correctness was 3.7/5 and completeness 2.9/5. Aggregated per-organism scores ranged from 4.7 (mouse) to 2.8 (S. pombe) and correlated with the richness of InterPro family-level names, suggesting much of the model's apparent skill is an echo of InterPro2GO. The agentic review surfaced seven reproducible failure modes that are not represented in Fmax-style metrics:

Pseudoenzyme blind spot. BioReason confidently assigns ancestral catalytic activity to Epe1 (JmjC demethylase-fold with a degenerate active site), cts2 (chitinase-fold missing the catalytic glutamate), and pmp20 (peroxiredoxin that has lost its resolving cysteine and functions as a chaperone). Literature-grounded review refutes all three; legacy GOA would not.
Cytoplasmic localisation default. BioReason assigns the cytoplasm to periplasmic (Skp, CpxP, Spy), ER-membrane (ETR1, IRE1), mitochondrial (HSP60, alo1), and secreted (fibrolase) proteins whenever InterPro names do not explicitly mention the compartment.
Paralog indistinguishability. Fyn ≡ Src (mouse); sigF ≡ sigG ≡ sigK (B. subtilis sporulation); Hspa5 ≡ Hspa8 (rat Hsp70) receive interchangeable summaries with no gene-specific biology.
Organism-specific biology absent. daf-16 is summarised as a generic FOXO factor with no IIS/longevity role; atfs-1 as a generic bZIP factor rather than the UPRmt master regulator; the aprE summary misses Phr-peptide quorum sensing.
Neo-functionalisation / moonlighting missed. Examples are the Nmnat chaperone role, LysB as a digestive lysozyme, and the non-glycolytic functions of GAPDH.
Narrative–GO disconnect. The model's prose summary and its emitted GO terms are internally inconsistent (for RidA the narrative is correct, but the emitted term is protein binding rather than deaminase activity).
Cross-kingdom fold bias. aprE (subtilisin) annotated with human blood-coagulation processes; PGRPLB (Anopheles) labelled a "fruit fly" protein.

Across the 139 genes, the dominant mode is narrative restatement of InterPro2GO: BioReason adds genuinely new biology only where multi-domain architectures are diagnostic (TOR1, NOTCH1, PTEN, EGFR, spo0A). This novel-vs-restatement distinction is exactly the question a database lead must answer when deciding whether to deploy a new method, and it is not answered by a GOA-agreement score.
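A very simplified version of the novel-versus-restatement question can be posed at the GO-term level, as in the sketch below. It assumes exact GO identifier matching against the InterPro2GO baseline; a realistic comparison would also need the ontology graph, so that a predicted ancestor of a baseline term counts as a restatement. The function names are hypothetical, and "novel" here only means "not in the baseline", not "correct".

```python
def split_novel_vs_restated(predicted_terms, baseline_terms):
    """Partition predicted GO IDs by whether the baseline already contains them.

    Exact-ID matching only (an assumption); ontology ancestors are ignored.
    """
    predicted = set(predicted_terms)
    baseline = set(baseline_terms)
    return {
        "restated": sorted(predicted & baseline),
        "novel": sorted(predicted - baseline),
    }


def restatement_fraction(predicted_terms, baseline_terms):
    """Fraction of distinct predicted terms already present in the baseline.

    Returns None when nothing was predicted, to avoid dividing by zero.
    """
    predicted = set(predicted_terms)
    if not predicted:
        return None
    return len(predicted & set(baseline_terms)) / len(predicted)
```

Terms flagged as novel by such a filter would still need literature-grounded review before being treated as new biology, and the narrative content discussed above is outside the reach of any term-level comparison.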

Expert-taxonomy reproduction. On ESR-ECOLI-DET-Mini, AIGR reproduced all 7 error classifications from de Crécy-Lagard et al. [7], together with mechanistic rationales that match the paper's analysis (yciO as a TsaC paralog with ~10⁴-fold weaker activity; yjhQ assigned to a mycothiol pathway absent from E. coli; fepE a frequency-biased pseudo-histidine-kinase call for what is in fact a Wzz O-antigen length regulator). This shows that the AIGR schema and review workflow can encode expert taxonomic error analysis, but it should not be interpreted as a blinded test of independent recovery. In the answer-key-withheld recapitulation, the agent recovered 4/7 exact labels: it found the high-value incorrect-call patterns for fepE, yciO, yjhQ, and yrhB, but was conservative on yegV and ygfF and too severe on yjdM. This suggests current agentic review is useful as a triage/smell-test layer for sequence-AI outputs, but not as a replacement for expert judgment.
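The exact-label recovery figure can be computed with a comparison of the kind sketched below. The function name and return shape are assumptions, and the gene identifiers and labels in any usage are placeholders; the published per-gene labels are not reproduced here.

```python
# Class codes of the de Crécy-Lagard taxonomy as listed in the Methods.
TAXONOMY = {"COR", "CNN", "LSP", "PLI", "NPI", "REP", "UNC"}


def exact_label_recovery(gold, predicted):
    """Compare agent labels with expert labels, gene by gene.

    gold and predicted map gene name -> class code. A gene missing from
    predicted counts as a mismatch. Returns (matches, total, mismatches),
    where mismatches maps gene -> (gold label, predicted label or None).
    """
    for label in list(gold.values()) + list(predicted.values()):
        if label not in TAXONOMY:
            raise ValueError(f"unknown class code: {label}")
    mismatches = {}
    for gene, gold_label in gold.items():
        predicted_label = predicted.get(gene)
        if predicted_label != gold_label:
            mismatches[gene] = (gold_label, predicted_label)
    total = len(gold)
    return total - len(mismatches), total, mismatches
```

Returning the mismatched pairs alongside the count keeps the direction of each disagreement visible, which a bare match count does not show.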

Conclusion

Annotation databases need a practical way to decide when a new computational method is good enough to deploy, and aggregate CAFA-style scores answer only part of that question. Agentic gene review is a complement — not a replacement — for CAFA-style benchmarks: it grades the narrative and reasoning outputs that modern predictors emit, it surfaces systematic failure modes (pseudoenzymes, paralog conflation, localisation defaults, cross-kingdom bias) that aggregate metrics hide, and it distinguishes genuine novel insight from a restatement of existing InterPro2GO logic. ESR-ECOLI-DET-Mini shows both the promise and boundary of AIGR: expert-like failure-mode triage is partially automatable, but expert-level nuance is not yet automatic. ARGO139 and ARGO95 supply the main linked BioReason-Pro benchmarks. We release both resources for the Function-COSI community, and we recommend that future method evaluations report a CAFA-style score alongside an agentic biological-validity assessment.

References

[1] Radivojac P, et al. A large-scale evaluation of computational protein function prediction. Nat Methods 10, 221–227 (2013).
[2] Jiang Y, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 17, 184 (2016).
[3] Zhou N, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 20, 244 (2019).
[4] Gaudet P, Dessimoz C. Gene Ontology: pitfalls, biases, and remedies. In The Gene Ontology Handbook, Methods Mol Biol 1446, 189–205 (2017).
[5] The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
[6] Fallahpour A, et al. BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning. bioRxiv 10.64898/2026.03.19.712954 (2026).
[7] de Crécy-Lagard V, et al. Limitations of current machine learning models in predicting enzymatic functions for uncharacterized proteins. G3 15(10), jkaf169 (2025) (PMID:40703034).