ProtNLM2 Evaluation

Evaluation of Google's ProtNLM2 GO term predictions against expert-curated AIGR gene reviews and existing GOA annotations, using the ARGO-ProtNLM-50 benchmark (50 proteins, 14 taxonomic groups, 75 GO predictions).

Interactive prediction evaluation table: filterable, sortable HTML view of all 50 proteins and 75 prediction assessments.

Key findings

- ProtNLM2 is strongest for uncharacterized proteins: 12/23 (52%) of NOT_IN_GOA predictions were correct and novel, identifying genuine functions such as DNA binding for a KilA-N domain protein (A2FPI7), ECM organization for OLFML2A (A0A8C9H4D2), and nuclear localization for MCM-4 (A0A061AL94).
- "Exact" matches are mostly less precise, not novel: 13/19 (68%) are parent terms of more specific existing annotations (e.g., predicting "cytoplasm" when "clathrin-coated vesicle" is already annotated). ProtNLM2 captures broad functional categories but lacks resolution.
- Frequency bias is the dominant error mode (13 of 22 error annotations): the model over-predicts common GO terms (membrane, transferase activity, phosphorylation) without biological specificity, which is especially problematic for proteins with unusual functions.
- Cross-kingdom errors persist despite taxon filtering: animal-specific terms (neuronal cell body, protein antigen binding) were predicted for plant proteins, indicating that the Evidencer's taxon constraint checking has gaps.
- Paralog discrimination is a weakness: the model conflates catalytically active and inactive family members (MTMR9 pseudophosphatase, RIC7 lacking a kinase domain). This is a Type 6 error that sequence similarity methods inherently struggle with.
- Expert review substantially revises the mechanical assessment: the automated category mapping (EXACT→CNN, NO_OVERLAP→UNC) was heavily corrected once protein biology was considered (e.g., EXACT→LSP, MORE_SPECIFIC→NPI).

Aggregate results (75 predictions, 39 proteins)

Assessment categories follow de Crécy-Lagard et al. 2025 (PMID:40703034).

Category	Code	CS	Count	%
Correct novel	COR	2	17	23%
Correct not novel	CNN	2	11	15%
Less precise	LSP	2	18	24%
Uncertain	UNC	1	13	17%
Nonparalog incorrect	NPI	0	14	19%
Paralog incorrect	PLI	0	2	3%
Total			75	100%

Mean CS: 1.40/2.0 (70%)

Concordant (CS=2): 46/75 (61%) | Uncertain (CS=1): 13/75 (17%) | Discordant (CS=0): 16/75 (21%)
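
The mean concordance score and the concordant/uncertain/discordant split can be recomputed from the category counts. The following is a minimal sketch, assuming only the CS values and counts in the table; the variable and function names are illustrative and are not taken from the project notebooks.

```python
# Category code -> (concordance score, count), copied from the aggregate table.
# Names here are illustrative, not taken from the evaluation notebooks.
CATEGORIES = {
    "COR": (2, 17),
    "CNN": (2, 11),
    "LSP": (2, 18),
    "UNC": (1, 13),
    "NPI": (0, 14),
    "PLI": (0, 2),
}


def summarize(categories):
    """Return (total predictions, counts grouped by CS, mean CS)."""
    total = sum(n for _, n in categories.values())
    by_cs = {0: 0, 1: 0, 2: 0}
    for cs, n in categories.values():
        by_cs[cs] += n
    mean_cs = sum(cs * n for cs, n in categories.values()) / total
    return total, by_cs, mean_cs


total, by_cs, mean_cs = summarize(CATEGORIES)
print(f"n={total}, mean CS={mean_cs:.2f}/2.0")
print(f"concordant={by_cs[2]}, uncertain={by_cs[1]}, discordant={by_cs[0]}")
```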

Results by GOA overlap category

Match category	n	Dominant assessments
EXACT	19	LSP:13, CNN:6 — typically parent terms of existing annotations
MORE_SPECIFIC	6	CNN:3, NPI:2, UNC:1 — some correctly refine GOA, others overreach
LESS_SPECIFIC	1	LSP:1
NO_OVERLAP	26	NPI:9, UNC:6, COR:5, PLI:2 — novel predictions often wrong
NOT_IN_GOA	23	COR:12, UNC:6, NPI:3 — best category for genuinely novel discoveries

Error analysis (16 incorrect predictions, 22 error annotations)

Error type	Count	Examples
FREQUENCY_BIAS	13	Generic "membrane", "transferase activity" for proteins with unusual functions
PARALOG_OVERANNOTATION	4	Phosphatase activity for catalytically dead MTMR9; kinase activity for RIC7 (no kinase domain)
PATHWAY_CONTEXT_IGNORED	3	Neuronal/immune terms for plant proteins
TRAINING_DATA_CONTAMINATION	2	Predictions that reproduce existing IEA annotations

Illustrative case studies

These 5 proteins, all included in ARGO-ProtNLM-50, were identified during exploratory analysis and represent the main error patterns. Each has a full AIGR review (*-ai-review.yaml) and a prediction assessment (*-protnlm-predictions-review.yaml).

Trivially correct: A0A3B6GK97 (wheat patatin)

ProtNLM2 predicts lipid catabolic process, a more specific term than GOA's lipid metabolic process. The protein already has IBA annotations for glycerophospholipase and monoacylglycerol lipase activity. In wheat GOA, 94% of proteins with glycerophospholipase activity already have lipid catabolic process annotated. ProtNLM2 is doing bookkeeping, not discovering biology.
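
A co-annotation rate like the 94% figure is a conditional frequency: among proteins carrying one term, the fraction that also carry another. The sketch below shows one way to compute it. The toy protein IDs and annotation sets are invented for illustration and do not reproduce the wheat GOA data.

```python
def co_annotation_rate(annotations, given_term, target_term):
    """Fraction of proteins annotated with given_term that also have target_term.

    `annotations` maps a protein ID to the set of GO term labels on that protein.
    Returns None when no protein carries given_term.
    """
    with_given = [terms for terms in annotations.values() if given_term in terms]
    if not with_given:
        return None
    return sum(target_term in terms for terms in with_given) / len(with_given)


# Toy data: IDs and annotations are hypothetical, not real wheat GOA entries.
toy_goa = {
    "P1": {"glycerophospholipase activity", "lipid catabolic process"},
    "P2": {"glycerophospholipase activity", "lipid catabolic process"},
    "P3": {"glycerophospholipase activity", "lipid metabolic process"},
    "P4": {"kinase activity"},
}

rate = co_annotation_rate(
    toy_goa, "glycerophospholipase activity", "lipid catabolic process"
)
print(f"co-annotation rate: {rate:.0%}")
```

A high rate means the predicted term is largely implied by annotations the protein already has, which is why such a prediction counts as trivially correct.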

Phmmer transfer: A0A3B6RKV1 (wheat JmjC)

ProtNLM2 predicts 5 specific plant biology terms (gibberellin signaling, photomorphogenesis, seed germination, epigenetic regulation, red light response). All 5 trace to one phmmer hit: Q67XX3 = Arabidopsis JMJ22 (score 689.2). This is ISS/ISO-style annotation transfer — the "added value" over IBA is that ProtNLM2 transfers BP annotations that PAINT's more conservative approach chose not to propagate.

False positive: F4JLB7 (Arabidopsis RIC7)

ProtNLM2 predicts kinase activity + phosphorylation (score 0.23). The phmmer hit is mouse LRRK2 (score 33, barely above noise) — a 2,527 aa multidomain protein with LRR + ROC + COR + kinase domains. RIC7 only has LRR repeats and is a ROP GTPase effector, not a kinase. Classic multidomain annotation leakage.

Cross-kingdom error: F6LAX4 (wheat PP2A scaffold)

ProtNLM2 predicts neuron projection, neuronal cell body, and protein antigen binding for a wheat protein. Plants have no neurons and no adaptive immune system. The predictions leak from mammalian PP2A orthologs that are annotated to neuronal compartments. All 6 ProtNLM2 predictions for this protein scored CS=0 (NPI).
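
A taxon constraint check of the kind that should catch this case can be sketched as a lookup of "only in taxon" restrictions against the organism's lineage. Everything in the tables below is an assumption for illustration: the restrictions are hand-written, not taken from GO's taxon constraint files, and the lineages are abbreviated.

```python
# Hypothetical "only in taxon" restrictions, written by hand for this example.
ONLY_IN_TAXON = {
    "neuron projection": "Metazoa",
    "neuronal cell body": "Metazoa",
}

# Abbreviated lineages, enough to tell plants from animals in this example.
LINEAGES = {
    "Triticum aestivum": ["Eukaryota", "Viridiplantae", "Streptophyta"],
    "Mus musculus": ["Eukaryota", "Metazoa", "Chordata"],
}


def violates_taxon_constraint(term, organism):
    """True when `term` is restricted to a taxon outside the organism's lineage."""
    required = ONLY_IN_TAXON.get(term)
    if required is None:
        return False  # no restriction recorded, so the term passes unchecked
    return required not in LINEAGES[organism]


for term in ("neuron projection", "neuronal cell body", "protein antigen binding"):
    print(term, violates_taxon_constraint(term, "Triticum aestivum"))
```

In this toy table, "protein antigen binding" has no recorded restriction and therefore passes. A term that lacks a constraint is one plausible way an implausible prediction could survive filtering, though the source does not say which gap applies here.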

Ontology gap: Q9KZ33 (S. coelicolor sigma factor)

The existing IBA annotation is sigma factor activity; ProtNLM2 predicts transcription initiation. These are biologically coupled but classified as NO_OVERLAP because there is no is_a/part_of path between "regulation of transcription initiation" (MF ancestry) and "DNA-templated transcription" (BP ancestry). This is a real limitation of closure-based evaluation, not a ProtNLM2 error.
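
Closure-based overlap classification can be sketched as ancestor lookups over is_a/part_of edges. The sketch below makes two assumptions. The category definitions are an interpretation of the labels used in this document, not the project's actual rules. The ontology is a toy graph with simplified edges and term labels in place of GO IDs.

```python
# Toy ontology: term -> direct parents (is_a/part_of edges, simplified).
# Labels stand in for GO IDs; the edges are illustrative, not a GO release.
PARENTS = {
    "clathrin-coated vesicle": {"cytoplasm"},
    "cytoplasm": {"cellular_component"},
    "lipid catabolic process": {"lipid metabolic process"},
    "lipid metabolic process": {"biological_process"},
    "sigma factor activity": {"molecular_function"},
    "transcription initiation": {"biological_process"},
}


def ancestors(term):
    """Return all terms reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def overlap_category(predicted, existing):
    """Classify one predicted term against a protein's existing annotations."""
    if not existing:
        return "NOT_IN_GOA"
    if predicted in existing:
        return "EXACT"
    if ancestors(predicted) & set(existing):
        return "MORE_SPECIFIC"  # prediction is a descendant of an existing term
    if any(predicted in ancestors(term) for term in existing):
        return "LESS_SPECIFIC"  # prediction is an ancestor of an existing term
    return "NO_OVERLAP"


print(overlap_category("lipid catabolic process", {"lipid metabolic process"}))
print(overlap_category("transcription initiation", {"sigma factor activity"}))
```

Because the two sigma factor terms sit in different branches with no connecting edge, the classifier reports NO_OVERLAP even though the functions are biologically coupled.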

What is ProtNLM2?

ProtNLM2 is a T5-based seq2seq model developed by Google DeepMind with UniProt, trained on 240M proteins (UniProt 2023_04). It predicts protein names, GO terms, subcellular locations, and function comments from amino acid sequence, organism TaxID, and AlphaFold secondary structure.

Predictions are post-processed by the Evidencer — a corroboration pipeline that checks each prediction against string matches, phmmer sequence similarity (bit score > 25), and TM-align structural similarity. Predictions failing GO taxon constraints or lacking corroboration are excluded. See the UniProt help page for details.
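
The filtering rule described above can be sketched as follows. This is not the Evidencer's implementation. The `Prediction` fields are hypothetical, the sketch assumes that a single corroborating source is sufficient, and TM-align support is a plain flag because the text gives no structural threshold. Only the phmmer bit score cutoff of 25 comes from the description.

```python
from dataclasses import dataclass
from typing import Optional

PHMMER_MIN_BIT_SCORE = 25.0  # cutoff stated in the text (bit score > 25)


@dataclass
class Prediction:
    """Hypothetical record for one predicted GO term and its evidence."""
    go_term: str
    string_match: bool = False
    phmmer_bit_score: Optional[float] = None
    tmalign_supported: bool = False  # no threshold given in the text, so a flag
    passes_taxon_constraints: bool = True


def is_corroborated(p):
    """Assumption: any single evidence source is enough to corroborate."""
    phmmer_ok = (
        p.phmmer_bit_score is not None
        and p.phmmer_bit_score > PHMMER_MIN_BIT_SCORE
    )
    return p.string_match or phmmer_ok or p.tmalign_supported


def keep(p):
    """Keep a prediction only if it passes taxon constraints and is corroborated."""
    return p.passes_taxon_constraints and is_corroborated(p)


weak_hit = Prediction("kinase activity", phmmer_bit_score=33.0)
print(keep(weak_hit))
```

Under these assumptions a phmmer hit with bit score 33, as in the RIC7 case study, clears the cutoff and the prediction is retained.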

ARGO-ProtNLM-50 benchmark design

50 proteins curated for systematic evaluation, stratified across:
- 14 taxonomic groups (mammals, plants, bacteria, fish, insects, fungi, etc.)
- 4 prediction categories (rich, partial, go_only, name_only)
- Multiple evidence methods (string match, phmmer, tmalign)
- 5 case studies from exploratory analysis (see above)

All 50 received full AIGR annotation reviews with falcon deep research. The 39 with GO predictions received biologically informed prediction assessments. The 11 name-only proteins have reviews but no prediction-review YAMLs. See argo_protnlm_50.csv.

Overlap with existing AIGR reviews

Only 8 of 1,334 previously reviewed genes appear in the ProtNLM2 dataset (all TrEMBL/unreviewed): C5AXM3, O94267, Q09490, Q21303, Q86WA8, Q9BZE2, Q9UNW9, Q9XUS3.

References

UniProt ProtNLM help page
ProtNLM2 accession list (FTP)
de Crécy-Lagard et al. 2025 (PMID:40703034) — assessment categories

Files and methods

File	Description
`argo_protnlm_50.csv`	Benchmark set of 50 proteins
`protnlm_summary.ipynb`	Exploratory analysis (full 28K dataset)
`protnlm_bench50_eval.ipynb`	ARGO-ProtNLM-50 benchmark evaluation
`fetch_protnlm_api.py`	REST API fetch pipeline
`protnlm_evaluation_slides.md`	Slide deck (Marp)
Data history	XML vs API data source history