Table of Contents
ProtNLM2 Prediction Summary Statistics¶
Google's ProtNLM2 (UniProt help) predicts protein names, GO terms, subcellular locations, keywords, and function descriptions for unreviewed/TrEMBL UniProtKB entries. This notebook analyzes a post-processed XML export containing 28,553 entries across 440 species.
Note on dataset version: The public ProtNLM2 pilot release (UniProt 2026_02) covers 26,856 accessions. Our XML export is an earlier pre-release containing 28,553 entries — 1,697 additional entries that were subsequently removed during quality filtering. Some predictions in our dataset (e.g. cross-kingdom GO terms) may have been caught by the production Evidencer pipeline's exclusion criteria before public release.
ProtNLM2 architecture and the Evidencer¶
ProtNLM2 is a transformer-based sequence-to-sequence model (T5 architecture) trained on 240 million proteins from UniProt release 2023_04, including both Swiss-Prot and TrEMBL entries. It takes amino acid sequence, organism TaxID, and (in some ensemble members) AlphaFold secondary structure predictions as input and generates natural-language annotations as output — analogous to image captioning but for protein sequences.
The original ProtNLM (v1) predicted only protein names using an ensemble of 7 models. ProtNLM2 expands this to also predict GO terms, subcellular locations, keywords, and function comments.
Model score (0–1) is the language model's per-prediction confidence — its probability of generating that specific text given the input sequence. The score threshold is 0.05 for ProtNLM2 (lowered from 0.2 for v1), since many low-score predictions were judged accurate by curators.
Post-processing: the Evidencer¶
Because the neural model is a black box, UniProt/Google apply a post-hoc corroboration system called the Evidencer that mimics expert biocurator evaluation. This is NOT part of the model — it runs after prediction to assess reliability:
- Exclusion filter: predictions matching GO taxon constraints, nomenclature violations, or malformed IDs are rejected (not displayed)
- String/substring match: the predicted text is checked against existing annotations in the entry and cross-referenced databases (InterPro, GO, EC)
- Sequence similarity (phmmer): if no string match, checks for homologs in UniProtKB with matching annotations (bit score threshold: 25)
- Structural similarity (TM-align): if no sequence match, checks AlphaFold structural alignments (normalized TM-score > 0.5, min 5% seq identity, min pLDDT 70)
Predictions that pass exclusion AND corroboration are Approved (displayed in UniProtKB). Those that pass exclusion but lack corroboration get Uncertain Quality (not displayed).
Key insight from our data: the model_score is independent of the evidence strength.
For the same phmmer hit (same accession, same bit score), different GO term predictions
get different model_scores (e.g. 0.70 for seed germination vs 0.61 for epigenetic regulation — same source protein Q67XX3, score 689.2). The correlation between
model_score and phmmer bit score is effectively zero (r = −0.01).
Evidence types in the XML¶
The XML evidence blocks show which corroboration method approved each prediction:
1. String match (56% of evidence) — the predicted text matched a known database entry:
<evidence type="ECO:0008006" key="1">
<source>
<dbReference type="Google" id="ProtNLM2">
<property type="model_score" value="0.96"/>
<property type="string_match_text" value="WRKY"/>
<property type="string_match_location" value="domain"/>
<property type="string_match_type" value="exact_sanitized"/>
</dbReference>
</source>
</evidence>
2. phmmer (40%) — sequence similarity to characterized proteins:
<evidence type="ECO:0008006" key="2">
<source>
<dbReference type="Google" id="ProtNLM2">
<property type="model_score" value="0.69"/>
<property type="phmmer_accession" value="P57750"/>
<property type="phmmer_score" value="397.7"/>
</dbReference>
</source>
</evidence>
3. TM-align (4%) — structure-based alignment (against AlphaFold models):
<evidence type="ECO:0008006" key="3">
<source>
<dbReference type="Google" id="ProtNLM2">
<property type="model_score" value="0.45"/>
<property type="tmalign_accession" value="Q9SM59"/>
<property type="tmalign_score_chain_1" value="0.56297"/>
<property type="tmalign_score_chain_2" value="0.31889"/>
</dbReference>
</source>
</evidence>
The data was parsed into three TSV files by parse_protnlm_xml.py:
entries.tsv— one row per protein (accession, predicted name, prediction counts)predictions.tsv— one row per GO / function / subcellular-location predictionevidence.tsv— one row per evidence block (model score + provenance)
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 120
DATA = Path(".")
entries = pd.read_csv(DATA / "entries.tsv", sep="\t")
predictions = pd.read_csv(DATA / "predictions.tsv", sep="\t")
evidence = pd.read_csv(DATA / "evidence.tsv", sep="\t")
# Taxonomy: maps accession -> organism name + taxon_id
taxonomy = pd.read_csv(DATA / "taxonomy.tsv", sep="\t")
entries = entries.merge(taxonomy, on="accession", how="left")
# Lineage: maps taxon_id -> broad taxonomic group
lineage_path = DATA / "lineage.tsv"
if lineage_path.exists():
lineage = pd.read_csv(lineage_path, sep="\t")
entries = entries.merge(lineage, on="taxon_id", how="left")
print(f"Lineage groups available for {entries['broad_group'].notna().sum():,} entries")
print(f"Entries: {len(entries):,} | Predictions: {len(predictions):,} | Evidence blocks: {len(evidence):,}")
print(f"Species: {entries['organism'].nunique()}")
What does ProtNLM2 predict?¶
Every entry gets a predicted protein name. Beyond that, the model may additionally predict GO terms, subcellular locations, and/or free-text function comments — but not all entries get all prediction types. About 30% of entries receive only a name with no additional functional annotation.
entries["has_go"] = entries["n_go"] > 0
entries["has_function"] = entries["n_function"] > 0
entries["has_subcell"] = entries["n_subcell"] > 0
entries["name_only"] = ~entries["has_go"] & ~entries["has_function"] & ~entries["has_subcell"]
summary = pd.DataFrame({
"count": [
len(entries),
entries["has_go"].sum(),
entries["has_function"].sum(),
entries["has_subcell"].sum(),
entries["name_only"].sum(),
],
}, index=["Total entries", "With GO terms", "With function comment",
"With subcellular location", "Name only"])
summary["pct"] = (summary["count"] / len(entries) * 100).round(1)
summary
fig, ax = plt.subplots(figsize=(8, 3.5))
cats = ["has_go", "has_function", "has_subcell", "name_only"]
labels = ["GO terms", "Function comment", "Subcellular location", "Name only"]
vals = [entries[c].sum() for c in cats]
ax.barh(labels, vals, color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
ax.set_xlabel("Number of entries")
ax.set_title("ProtNLM2 prediction types (n=28,553)")
for i, v in enumerate(vals):
ax.text(v + 200, i, f"{v:,}", va="center", fontsize=9)
plt.tight_layout()
plt.show()
Protein name types¶
ProtNLM2 assigns either a recommendedName (high-confidence, informative name) or a
submittedName (lower-confidence, often echoing the original submission). About 79%
receive recommended names.
entries["name_type"].value_counts().to_frame("count")
Swiss-Prot vs TrEMBL breakdown¶
The XML uses placeholder metadata, so we queried the UniProt REST API to determine the current reviewed status of each accession. This tells us how many of ProtNLM2's predictions target well-characterized (Swiss-Prot) vs uncharacterized (TrEMBL) proteins.
status_path = DATA / "reviewed_status.tsv"
if status_path.exists():
status = pd.read_csv(status_path, sep="\t")
entries = entries.merge(status, on="accession", how="left")
sp_counts = entries["reviewed"].value_counts()
sp_total = sp_counts.get(True, 0)
tr_total = sp_counts.get(False, 0)
print(f"Swiss-Prot (reviewed): {sp_total:,} ({sp_total/len(entries)*100:.1f}%)")
print(f"TrEMBL (unreviewed): {tr_total:,} ({tr_total/len(entries)*100:.1f}%)")
# Breakdown by prediction richness
review_summary = entries.groupby("reviewed").agg(
n_entries=("accession", "count"),
pct_go=("has_go", "mean"),
pct_function=("has_function", "mean"),
pct_subcell=("has_subcell", "mean"),
pct_name_only=("name_only", "mean"),
)
review_summary.index = review_summary.index.map({True: "Swiss-Prot", False: "TrEMBL"})
for col in ["pct_go", "pct_function", "pct_subcell", "pct_name_only"]:
review_summary[col] = (review_summary[col] * 100).round(1)
print("\nPrediction richness by review status:")
display(review_summary)
# Chart
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Pie
axes[0].pie([sp_total, tr_total], labels=["Swiss-Prot", "TrEMBL"],
autopct="%1.1f%%", colors=["#4c78a8", "#bab0ac"])
axes[0].set_title("Swiss-Prot vs TrEMBL")
# Stacked: prediction types by review status
rev_pred = entries.groupby("reviewed").agg(
has_go=("has_go", "sum"), has_function=("has_function", "sum"),
has_subcell=("has_subcell", "sum"), name_only=("name_only", "sum"),
)
rev_pred.index = rev_pred.index.map({True: "Swiss-Prot", False: "TrEMBL"})
rev_pred.plot.barh(stacked=True, ax=axes[1],
color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
axes[1].set_xlabel("Number of entries")
axes[1].set_title("Prediction types by review status")
axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8)
plt.tight_layout()
plt.show()
else:
print("reviewed_status.tsv not yet available — run fetch reviewed status")
if "reviewed" in entries.columns and "organism" in entries.columns:
# By species (top 30 by total entries)
top_species = entries["organism"].value_counts().head(30).index
sp_review = entries[entries["organism"].isin(top_species)].groupby("organism")["reviewed"].agg(
n_entries="count",
n_swissprot="sum",
)
sp_review["n_trembl"] = sp_review["n_entries"] - sp_review["n_swissprot"]
sp_review["pct_swissprot"] = (sp_review["n_swissprot"] / sp_review["n_entries"] * 100).round(1)
sp_review = sp_review.sort_values("pct_swissprot", ascending=True)
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(sp_review.index, sp_review["n_swissprot"], label="Swiss-Prot", color="#4c78a8")
ax.barh(sp_review.index, sp_review["n_trembl"], left=sp_review["n_swissprot"],
label="TrEMBL", color="#bab0ac")
ax.set_xlabel("Number of entries")
ax.set_title("Swiss-Prot vs TrEMBL by species (top 30)")
ax.legend(fontsize=9)
for i, (_, row) in enumerate(sp_review.iterrows()):
if row["n_swissprot"] > 0:
ax.text(row["n_entries"] + 5, i, f'{row["pct_swissprot"]:.0f}% SP', va="center", fontsize=7)
plt.tight_layout()
plt.show()
# By broad taxonomic group
group_review = entries.groupby("broad_group")["reviewed"].agg(
n_entries="count", n_swissprot="sum",
)
group_review["n_trembl"] = group_review["n_entries"] - group_review["n_swissprot"]
group_review["pct_swissprot"] = (group_review["n_swissprot"] / group_review["n_entries"] * 100).round(1)
group_review = group_review.sort_values("pct_swissprot", ascending=True)
print("Swiss-Prot fraction by taxonomic group:")
display(group_review.sort_values("pct_swissprot", ascending=False))
Species and taxonomic breakdown¶
Species were resolved by batch-querying the UniProt REST API for each accession. Broad taxonomic groups (Mammal, Land plant, Ray-finned fish, etc.) were assigned by fetching NCBI lineages via the UniProt taxonomy endpoint and matching against major clades.
species_counts = entries["organism"].value_counts()
print(f"Unique species: {species_counts.shape[0]}")
print(f"\nTop 30 species:")
display(species_counts.head(30).to_frame("count"))
Top 20 species, colored by taxonomic group¶
The dataset is taxonomically diverse: dominated by crop plants (wheat, peanut, sunflower), fish (goldfish, Atlantic cod, salmon), and primates/mammals (horse, human, marmoset).
GROUP_COLORS = {
"Land plant": "#54a24b",
"Mammal": "#e45756",
"Ray-finned fish": "#4c78a8",
"Cartilaginous fish": "#72b7b2",
"Bird": "#eeca3b",
"Amphibian": "#b279a2",
"Other vertebrate": "#9d755d",
"Insect": "#f58518",
"Mollusc": "#ff9da6",
"Nematode": "#d67195",
"Fungus": "#9467bd",
"Bacterium": "#bab0ac",
"Other animal": "#d4a6c8",
}
top20 = species_counts.head(20)
top20_df = top20.to_frame("count").reset_index()
top20_df.columns = ["organism", "count"]
# Merge broad_group
org_group = entries[["organism", "broad_group"]].drop_duplicates("organism")
top20_df = top20_df.merge(org_group, on="organism", how="left")
top20_df["color"] = top20_df["broad_group"].map(GROUP_COLORS).fillna("#bab0ac")
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(range(len(top20_df)), top20_df["count"], color=top20_df["color"])
ax.set_yticks(range(len(top20_df)))
ax.set_yticklabels([f"{row.organism} ({row.broad_group})" for row in top20_df.itertuples()], fontsize=9)
ax.invert_yaxis()
ax.set_xlabel("Number of entries")
ax.set_title("Top 20 species in ProtNLM2 dataset (colored by taxonomic group)")
for i, v in enumerate(top20_df["count"]):
ax.text(v + 8, i, str(v), va="center", fontsize=8)
# Legend
from matplotlib.patches import Patch
groups_in_chart = top20_df["broad_group"].unique()
handles = [Patch(facecolor=GROUP_COLORS.get(g, "#bab0ac"), label=g) for g in groups_in_chart if pd.notna(g)]
ax.legend(handles=handles, loc="lower right", fontsize=8)
plt.tight_layout()
plt.show()
Entries by broad taxonomic group¶
Aggregating across all 440 species into broad taxonomic groups shows the overall composition of the dataset.
group_counts = entries["broad_group"].value_counts()
display(group_counts.to_frame("count"))
fig, ax = plt.subplots(figsize=(8, 4))
colors = [GROUP_COLORS.get(g, "#bab0ac") for g in group_counts.index]
group_counts.plot.barh(ax=ax, color=colors)
ax.invert_yaxis()
ax.set_xlabel("Number of entries")
ax.set_title("Entries by broad taxonomic group")
for i, v in enumerate(group_counts):
ax.text(v + 50, i, f"{v:,}", va="center", fontsize=8)
plt.tight_layout()
plt.show()
Prediction types by taxonomic group (stacked)¶
Do certain taxonomic groups get richer predictions? This stacked bar chart shows how many entries in each group have GO terms, function comments, subcellular locations, or just a protein name. Note that categories overlap — an entry can have both GO terms and a subcellular location — so the stacked chart shows the fraction of entries in each group that have each prediction type.
pred_by_group = entries.groupby("broad_group").agg(
n_entries=("accession", "count"),
has_go=("has_go", "sum"),
has_function=("has_function", "sum"),
has_subcell=("has_subcell", "sum"),
name_only=("name_only", "sum"),
).sort_values("n_entries", ascending=True)
# As percentages
pct = pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].div(
pred_by_group["n_entries"], axis=0
) * 100
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: stacked counts
pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].plot.barh(
stacked=True, ax=axes[0],
color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
)
axes[0].set_xlabel("Number of entries")
axes[0].set_title("Prediction types by taxonomic group (counts)")
axes[0].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")
# Right: percentages
pct.plot.barh(
stacked=True, ax=axes[1],
color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
)
axes[1].set_xlabel("% of entries in group")
axes[1].set_title("Prediction types by taxonomic group (%)")
axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")
plt.tight_layout()
plt.show()
Scores by match location (post-processing validation)¶
The XML file is a post-processed export: after ProtNLM2 generates a free-text prediction,
a downstream pipeline cross-references that prediction against known databases to check whether
it matches something in UniProt's existing annotation vocabulary. The match_location field
records which database the predicted text was validated against:
match_location |
What it means | Example match_text |
|---|---|---|
domain |
Predicted name matched a known UniProt domain description | WRKY |
PANTHER |
Matched a PANTHER family/subfamily ID | PTHR22654 |
GO |
Matched a GO term ID | GO:0005737 |
keyword |
Matched a UniProt keyword ID | KW-0472 |
recommended_protein_name |
Matched an existing recommended protein name | Aldehyde dehydrogenase |
InterPro |
Matched an InterPro accession | IPR000225 |
The model_score is ProtNLM2's own confidence in the prediction. The box plot below
shows how these scores distribute across validation targets — predictions that can be
matched to domain descriptions or protein names tend to score higher than those matched
to opaque IDs like PANTHER subfamily codes.
Prediction richness by species¶
For species with at least 50 entries, what fraction of their proteins get each type of prediction? Sorted by % with GO terms.
sp_groups = entries.groupby("organism").filter(lambda g: len(g) >= 50).groupby("organism")
richness = sp_groups.agg(
n_entries=("accession", "count"),
pct_go=("has_go", "mean"),
pct_function=("has_function", "mean"),
pct_subcell=("has_subcell", "mean"),
mean_go=("n_go", "mean"),
).sort_values("pct_go", ascending=False)
richness[["pct_go", "pct_function", "pct_subcell"]] = (
richness[["pct_go", "pct_function", "pct_subcell"]] * 100
).round(1)
richness["mean_go"] = richness["mean_go"].round(2)
display(richness.head(30))
Model confidence scores¶
Each evidence block carries a model_score between 0 and 1. Higher scores indicate
greater model confidence. The distribution is right-skewed, with a notable spike
near 1.0 (high-confidence string matches to known domain/family databases).
scores = evidence["model_score"].dropna()
print(f"Evidence blocks with scores: {len(scores):,}")
print(scores.describe().to_frame().T)
fig, ax = plt.subplots(figsize=(8, 3.5))
scores.hist(bins=50, ax=ax, color="#4c78a8", edgecolor="white")
ax.set_xlabel("Model score")
ax.set_ylabel("Count")
ax.set_title("ProtNLM2 model score distribution")
ax.axvline(scores.median(), color="red", ls="--", lw=1, label=f"median={scores.median():.2f}")
ax.legend()
plt.tight_layout()
plt.show()
ev_with_loc = evidence.dropna(subset=["match_location", "model_score"])
top_locs = ev_with_loc["match_location"].value_counts().head(10).index
fig, ax = plt.subplots(figsize=(10, 5))
ev_with_loc[ev_with_loc["match_location"].isin(top_locs)].boxplot(
column="model_score", by="match_location", ax=ax, vert=False
)
ax.set_xlabel("Model score")
ax.set_title("Model score by match location (top 10)")
fig.suptitle("")
plt.tight_layout()
plt.show()
Match type breakdown¶
ProtNLM2 uses three match types:
- exact_sanitized: prediction exactly matches a database entry (after normalization)
- hydrated: GO/keyword IDs were resolved to their full labels
- substring: partial match to a longer database text
Evidence blocks from phmmer/tmalign sequence searches have no match_type (they use sequence similarity scores instead).
evidence["match_type"].value_counts(dropna=False).to_frame("count")
GO term predictions¶
About 24% of entries (6,833) receive GO term predictions. These span 2,006 unique GO terms. The most frequently predicted terms are very generic (nucleus, membrane, DNA binding) — a pattern worth investigating for frequency bias.
go_preds = predictions[predictions["pred_type"] == "GO"]
print(f"Total GO predictions: {len(go_preds):,}")
print(f"Unique GO terms: {go_preds['pred_id'].nunique()}")
top_go = go_preds["pred_label"].value_counts().head(20)
display(top_go.to_frame("count"))
# Aspect breakdown (GO term labels start with C:, F:, or P:)
go_preds = go_preds.copy()
go_preds["aspect"] = go_preds["pred_label"].str[0]
print("\nGO aspect distribution:")
display(go_preds["aspect"].value_counts().to_frame("count"))
Subcellular location predictions¶
Nearly half the entries (48%) get subcellular location predictions. Membrane and nucleus dominate.
subcell_preds = predictions[predictions["pred_type"] == "subcellular_location"]
print(f"Total subcellular location predictions: {len(subcell_preds):,}")
top_locs = subcell_preds["pred_label"].value_counts().head(15)
fig, ax = plt.subplots(figsize=(8, 4))
top_locs.plot.barh(ax=ax, color="#54a24b")
ax.invert_yaxis()
ax.set_xlabel("Count")
ax.set_title("Top 15 predicted subcellular locations")
plt.tight_layout()
plt.show()
Evidence methods¶
ProtNLM2 evidence blocks come from three methods:
- string_match (56%): The predicted text matches a known database entry (domain, GO, PANTHER, etc.)
- phmmer (40%): Sequence search against characterized proteins (reports accession + bit score)
- tmalign (4%): Structure-based alignment (reports TM-scores for both chains)
String matches tend to have higher confidence; tmalign scores are more uniformly distributed.
evidence["method"] = "string_match"
evidence.loc[evidence["phmmer_accession"].notna(), "method"] = "phmmer"
evidence.loc[evidence["tmalign_accession"].notna(), "method"] = "tmalign"
method_counts = evidence["method"].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Pie chart
method_counts.plot.pie(ax=axes[0], autopct="%1.1f%%",
colors=["#4c78a8", "#f58518", "#54a24b"])
axes[0].set_ylabel("")
axes[0].set_title("Evidence method breakdown")
# Score distributions
for method, color in [("string_match", "#4c78a8"), ("phmmer", "#f58518"), ("tmalign", "#54a24b")]:
subset = evidence[evidence["method"] == method]["model_score"].dropna()
axes[1].hist(subset, bins=40, alpha=0.6, color=color, label=f"{method} (n={len(subset):,})")
axes[1].set_xlabel("Model score")
axes[1].set_ylabel("Count")
axes[1].set_title("Score distribution by evidence method")
axes[1].legend(fontsize=8)
plt.tight_layout()
plt.show()
Overlap with AIGR curated reviews¶
How many of the 28K ProtNLM2 entries overlap with our existing expert-curated gene reviews?
import subprocess
repo_root = Path("../..")
review_files = list(repo_root.glob("genes/*/*-ai-review.yaml"))
result = subprocess.run(
["grep", "-rh", "^id:", *[str(f) for f in review_files]],
capture_output=True, text=True
)
reviewed_ids = {line.split(":", 1)[1].strip() for line in result.stdout.strip().split("\n") if line}
protnlm_ids = set(entries["accession"])
overlap = reviewed_ids & protnlm_ids
print(f"AIGR reviewed genes: {len(reviewed_ids):,}")
print(f"ProtNLM2 entries: {len(protnlm_ids):,}")
print(f"Overlap: {len(overlap)}")
if overlap:
print(f"\nOverlapping accessions:")
cols = ["accession", "protein_name", "n_go", "n_function", "n_subcell"]
if "organism" in entries.columns:
cols.insert(2, "organism")
if "broad_group" in entries.columns:
cols.insert(3, "broad_group")
display(entries[entries["accession"].isin(overlap)][cols])
Validation: ProtNLM2 GO predictions vs curated GOA annotations¶
The summary statistics above describe the dataset as a whole. But how good are the
predictions? To evaluate, we compare ProtNLM2's GO term predictions against curated
GO annotations from the GOA database, using the isa_partof_closure table in
DuckDB to capture hierarchical relationships in the ontology.
We select three species spanning different annotation depths and taxonomic groups:
| Species | DuckDB source | Why chosen |
|---|---|---|
| Human | goa_human.ddb |
Best-annotated proteome; baseline for comparison |
| X. laevis | goa_xenopus.ddb |
Allotetraploid frog; many duplicated genes (ohnologs) with potential for paralog divergence |
| X. tropicalis | goa_xenopus.ddb |
Diploid frog; simpler genome, comparison point for laevis paralogs |
| Arabidopsis thaliana | goa_uniprot_gcrp.ddb |
Well-studied plant; many proteins with only ND annotations |
| Mouse | goa_mouse.ddb |
Well-annotated mammalian model organism; 710K GOA annotations |
| Wheat (T. aestivum) | goa_wheat.ddb |
Top plant species by entry count (729); large GOA with 512K annotations |
| Streptomyces coelicolor | bacteria.ddb |
Model actinobacterium; small sample but good GOA coverage |
Match category definitions¶
For each ProtNLM2 GO prediction on a protein that exists in GOA, we classify the relationship using the ontology closure:
| Category | Code | Definition |
|---|---|---|
| Exact | EXACT |
ProtNLM2 predicted the same GO term that exists in GOA for this protein |
| Less specific | LESS_SPECIFIC |
ProtNLM2 predicted a strict ancestor of a GOA term — not wrong, but redundant (e.g. predicts "membrane" when GOA has "plasma membrane") |
| More specific | MORE_SPECIFIC |
ProtNLM2 predicted a strict descendant of a GOA term — potentially novel and more informative |
| No overlap | NO_OVERLAP |
Protein is in GOA but the predicted term has no ancestor/descendant relationship to any existing annotation — could be novel or incorrect |
| Not in GCRP | NOT_IN_GCRP |
Protein is absent from the GCRP (Gene Centric Reference Proteome) subset of GOA — may have annotations in full GOA |
Cross-species summary¶
import duckdb
COMPARE_SPECIES = [
("Human", 9606, Path.home() / "repos/go-db/db/goa_human.ddb", None),
("Mouse", 10090, Path.home() / "repos/go-db/db/goa_mouse.ddb", None),
("X. laevis", 8355, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8355"),
("X. tropicalis", 8364, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8364"),
("Arabidopsis", 3702, Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb", "taxon:3702"),
("Wheat", 4565, Path.home() / "repos/go-db/db/goa_wheat.ddb", None),
("S. coelicolor", 100226, Path.home() / "repos/go-db/db/bacteria.ddb", "taxon:100226"),
]
# Categories in display order
CAT_ORDER = ["EXACT", "LESS_SPECIFIC", "MORE_SPECIFIC", "NO_OVERLAP", "NOT_IN_GCRP"]
CAT_LABELS = ["Exact", "Less specific", "More specific", "No overlap", "Not in GCRP"]
CAT_COLORS = ["#2ca02c", "#f58518", "#4c78a8", "#e45756", "#bab0ac"]
COMPARISON_SQL = """
CREATE TEMP TABLE protnlm AS
SELECT * FROM read_csv_auto('{csv_path}');
WITH in_goa AS (
SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
),
-- EXACT: predicted term is literally in GOA for this protein
exact AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
{taxon_where}
),
-- LESS_SPECIFIC: predicted term is a strict ancestor of a GOA term
less_specific AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
{taxon_and}
),
-- MORE_SPECIFIC: predicted term is a strict descendant of a GOA term
more_specific AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
{taxon_and}
)
SELECT
CASE
WHEN e.pred_id IS NOT NULL THEN 'EXACT'
WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
ELSE 'NO_OVERLAP'
END AS match_category,
COUNT(*) AS n_predictions,
COUNT(DISTINCT p.accession) AS n_proteins
FROM protnlm p
LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
GROUP BY 1
ORDER BY 2 DESC;
"""
DETAIL_SQL = """
WITH in_goa AS (
SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
),
exact AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
{taxon_where}
),
less_specific AS (
SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
{taxon_and}
),
more_specific AS (
SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
FROM protnlm p
INNER JOIN gaf_association g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
{taxon_and}
)
SELECT
p.accession,
p.pred_id AS protnlm_term,
pt.label AS protnlm_label,
CASE
WHEN e.pred_id IS NOT NULL THEN 'EXACT'
WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
ELSE 'NO_OVERLAP'
END AS match_category,
CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
FROM protnlm p
LEFT JOIN term_label pt ON p.pred_id = pt.id
LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
ORDER BY match_category, p.accession;
"""
all_summaries = []
all_details = {}
for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
sp_accs = set(entries[entries["taxon_id"] == taxon_id]["accession"])
sp_go = predictions[
(predictions["pred_type"] == "GO") & (predictions["accession"].isin(sp_accs))
][["accession", "pred_id", "pred_label"]]
csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
sp_go.to_csv(csv_path, index=False)
if taxon_str:
taxon_filter = f"WHERE db_object_taxon = '{taxon_str}'"
taxon_where = f"WHERE g.db_object_taxon = '{taxon_str}'"
taxon_and = f"AND g.db_object_taxon = '{taxon_str}'"
else:
taxon_filter = taxon_where = taxon_and = ""
con = duckdb.connect(str(db_path), read_only=True)
summary = con.execute(COMPARISON_SQL.format(
csv_path=csv_path, taxon_filter=taxon_filter,
taxon_where=taxon_where, taxon_and=taxon_and
)).fetchdf()
summary["species"] = species_name
all_summaries.append(summary)
detail = con.execute(DETAIL_SQL.format(
taxon_filter=taxon_filter, taxon_where=taxon_where, taxon_and=taxon_and
)).fetchdf()
all_details[species_name] = detail
con.close()
print(f"{species_name}: {len(sp_go)} GO predictions, {sp_go['accession'].nunique()} proteins")
combined = pd.concat(all_summaries, ignore_index=True)
pivot = combined.pivot_table(
index="match_category", columns="species",
values="n_predictions", fill_value=0, aggfunc="sum"
).reindex(CAT_ORDER)
display(pivot)
# Stacked bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
pivot_t = pivot.T.reindex(columns=CAT_ORDER)
pivot_t.plot.barh(stacked=True, ax=axes[0], color=CAT_COLORS)
axes[0].set_xlabel("Number of GO predictions")
axes[0].set_title("ProtNLM2 vs GOA (counts)")
axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")
pct_pivot = pivot_t.div(pivot_t.sum(axis=1), axis=0) * 100
pct_pivot.plot.barh(stacked=True, ax=axes[1], color=CAT_COLORS)
axes[1].set_xlabel("% of predictions")
axes[1].set_title("ProtNLM2 vs GOA (%)")
axes[1].legend(CAT_LABELS, fontsize=7, loc="lower right")
plt.tight_layout()
plt.show()
Comparison by GOA annotation source¶
The analysis above compares ProtNLM2 against all GOA annotations. But GOA is a mix of high-quality experimental annotations, computational pipelines, and phylogenetic inference. How does ProtNLM2 compare against specific automated pipelines that it might be replacing?
We repeat the closure-based comparison against three GOA subsets:
| GOA subset | Filter | What it captures |
|---|---|---|
| InterPro2GO | assigned_by = 'InterPro' |
Domain-based GO annotation via InterPro matches — the most direct comparison since ProtNLM2 also uses domain/family information |
| IBA | evidence_type = 'IBA' |
Phylogenetic annotation from PAINT/GO_Central — orthology-based function transfer |
| InterPro2GO + IBA | Both combined | The union of the two major automated pipelines |
This tells us whether ProtNLM2 adds value beyond what existing automated methods already provide.
from IPython.display import display, Markdown
SOURCE_COMPARISON_SQL = """
CREATE OR REPLACE TEMP TABLE protnlm AS
SELECT * FROM read_csv_auto('{csv_path}');
WITH goa_filtered AS (
SELECT * FROM gaf_association
{taxon_where_base}
{goa_source_filter}
),
in_goa AS (
SELECT DISTINCT db_object_id FROM goa_filtered
),
exact AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
),
less_specific AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
),
more_specific AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
)
SELECT
CASE
WHEN e.pred_id IS NOT NULL THEN 'EXACT'
WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
ELSE 'NO_OVERLAP'
END AS match_category,
COUNT(*) AS n_predictions,
COUNT(DISTINCT p.accession) AS n_proteins
FROM protnlm p
LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
GROUP BY 1
ORDER BY 2 DESC;
"""
SOURCE_DETAIL_SQL = """
WITH goa_filtered AS (
SELECT * FROM gaf_association
{taxon_where_base}
{goa_source_filter}
),
in_goa AS (
SELECT DISTINCT db_object_id FROM goa_filtered
),
exact AS (
SELECT DISTINCT p.accession, p.pred_id
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
),
less_specific AS (
SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
),
more_specific AS (
SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
FROM protnlm p
INNER JOIN goa_filtered g ON p.accession = g.db_object_id
INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
)
SELECT
p.accession,
p.pred_id AS protnlm_term,
pt.label AS protnlm_label,
CASE
WHEN e.pred_id IS NOT NULL THEN 'EXACT'
WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
ELSE 'NO_OVERLAP'
END AS match_category,
CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
FROM protnlm p
LEFT JOIN term_label pt ON p.pred_id = pt.id
LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
WHERE ig.db_object_id IS NOT NULL
ORDER BY match_category, p.accession;
"""
GOA_SOURCES = [
("All GOA", ""),
("InterPro2GO", "AND assigned_by = 'InterPro'"),
("IBA", "AND evidence_type = 'IBA'"),
("InterPro2GO + IBA", "AND (assigned_by = 'InterPro' OR evidence_type = 'IBA')"),
]
source_results = []
source_details = {}
for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
taxon_where_base = f"WHERE db_object_taxon = '{taxon_str}'" if taxon_str else "WHERE 1=1"
con = duckdb.connect(str(db_path), read_only=True)
for source_name, source_filter in GOA_SOURCES:
sql = SOURCE_COMPARISON_SQL.format(
csv_path=csv_path,
taxon_where_base=taxon_where_base,
goa_source_filter=source_filter,
)
df = con.execute(sql).fetchdf()
df["species"] = species_name
df["goa_source"] = source_name
source_results.append(df)
# Fetch detail for InterPro2GO and IBA comparisons
if source_name in ("InterPro2GO", "IBA"):
detail = con.execute(SOURCE_DETAIL_SQL.format(
taxon_where_base=taxon_where_base,
goa_source_filter=source_filter,
)).fetchdf()
source_details[(species_name, source_name)] = detail
con.close()
source_combined = pd.concat(source_results, ignore_index=True)
# Summary chart
fig, axes = plt.subplots(1, 7, figsize=(32, 5), sharey=False)
source_order = ["All GOA", "InterPro2GO", "IBA", "InterPro2GO + IBA"]
for ax, species_name in zip(axes, ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]):
sp_data = source_combined[source_combined["species"] == species_name]
pivot = sp_data.pivot_table(
index="goa_source", columns="match_category",
values="n_predictions", fill_value=0, aggfunc="sum"
).reindex(columns=CAT_ORDER).reindex(source_order)
pct = pivot.div(pivot.sum(axis=1), axis=0) * 100
pct.plot.barh(stacked=True, ax=ax, color=CAT_COLORS, legend=False)
ax.set_xlabel("% of predictions")
ax.set_title(species_name)
axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")
plt.suptitle("ProtNLM2 vs GOA by annotation source (%)", fontsize=12)
plt.tight_layout()
plt.show()
# Per-species summary tables
for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
sp_data = source_combined[source_combined["species"] == species_name]
pivot = sp_data.pivot_table(
index="goa_source", columns="match_category",
values="n_predictions", fill_value=0, aggfunc="sum"
).reindex(columns=CAT_ORDER).reindex(source_order)
display(Markdown(f"#### {species_name}"))
display(pivot)
ProtNLM2 vs automated pipelines: where is ProtNLM2 more specific?¶
The most interesting comparisons are against InterPro2GO (domain-based GO annotation)
and IBA (phylogenetic annotation from PAINT/GO_Central). When ProtNLM2 predictions are
MORE_SPECIFIC or NO_OVERLAP compared to these pipelines, this represents genuine added
value: predictions that go beyond what existing automated methods provide.
Below we show the detailed comparison for each species against each pipeline, highlighting predictions where ProtNLM2 is more specific or in a different ontology branch.
for goa_source_name in ["InterPro2GO", "IBA"]:
display(Markdown(f"---"))
display(Markdown(f"### ProtNLM2 vs {goa_source_name}: where is ProtNLM2 more specific?"))
for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
key = (species_name, goa_source_name)
if key not in source_details:
continue
detail = source_details[key]
n_evaluable = len(detail)
n_proteins = detail["accession"].nunique()
cat_counts = detail["match_category"].value_counts()
lines = [
f"#### {species_name} vs {goa_source_name}",
f"",
f"**{n_evaluable} predictions** for **{n_proteins} proteins** that have {goa_source_name} annotations.",
f"",
f"| Category | Count |",
f"|---|---|",
]
for cat in CAT_ORDER:
if cat != "NOT_IN_GCRP":
lines.append(f"| {cat} | {cat_counts.get(cat, 0)} |")
lines.append("")
# Enrich with evidence data
detail_enriched = detail.merge(
predictions[["accession", "pred_id", "evidence_key"]],
left_on=["accession", "protnlm_term"],
right_on=["accession", "pred_id"],
how="left"
).merge(
evidence[["accession", "evidence_key", "model_score", "match_text",
"match_location", "match_type", "phmmer_accession", "phmmer_score",
"tmalign_accession", "tmalign_score1"]],
on=["accession", "evidence_key"],
how="left"
)
more = detail_enriched[detail_enriched["match_category"] == "MORE_SPECIFIC"]
if len(more) > 0:
lines.append(f"**{len(more)} predictions where ProtNLM2 is more specific than {goa_source_name}:**")
lines.append("")
lines.append(f"| Protein | ProtNLM2 predicts | Score | Evidence | {goa_source_name} has |")
lines.append("|---|---|---|---|---|")
for _, row in more.iterrows():
score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
# Build evidence summary
if pd.notna(row.get('phmmer_accession')):
ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
elif pd.notna(row.get('tmalign_accession')):
ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
elif pd.notna(row.get('match_text')):
ev_str = f"{row['match_location']}: {row['match_text']}"
else:
ev_str = ""
lines.append(
f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
f"| {row.protnlm_label} (`{row.protnlm_term}`) "
f"| {score} | {ev_str} "
f"| {row.closest_goa_label} (`{row.closest_goa_term}`) |"
)
lines.append("")
no_ov = detail_enriched[detail_enriched["match_category"] == "NO_OVERLAP"]
if len(no_ov) > 0:
lines.append(f"**{len(no_ov)} predictions with no ontological overlap to {goa_source_name}:**")
lines.append("")
lines.append("| Protein | ProtNLM2 predicts | Score | Evidence |")
lines.append("|---|---|---|---|")
for _, row in no_ov.iterrows():
score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
if pd.notna(row.get('phmmer_accession')):
ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
elif pd.notna(row.get('tmalign_accession')):
ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
elif pd.notna(row.get('match_text')):
ev_str = f"{row['match_location']}: {row['match_text']}"
else:
ev_str = ""
lines.append(
f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
f"| {row.protnlm_label} (`{row.protnlm_term}`) "
f"| {score} | {ev_str} |"
)
lines.append("")
display(Markdown("\n".join(lines)))
Note on IBA-specific comparisons¶
Some predictions classified as MORE_SPECIFIC vs IBA are already present in GOA via other evidence codes. For example:
Q9BYD2 (MRPL9, human mitochondrial ribosomal protein)
ProtNLM2 predicts ribosome (GO:0005840) and ribonucleoprotein complex (GO:1990904),
classified as MORE_SPECIFIC vs IBA (which only has mitochondrion). But GOA already has
both terms via IEA, plus structural constituent of ribosome via IEA and NAS. The protein
is a ribosomal protein — membership in the ribosome complex is literally what it is.
This highlights a gap in GO's logical inference: if a protein has structural constituent of ribosome (MF), it should by definition be annotated to ribosome (CC). GO lacks
the axiom that would make this automatic (something like: "structural constituent of X"
implies "part_of X"). ProtNLM2 captures this trivially, but so would a simple rule:
If annotated to
structural constituent of ribosome→ addribosome
This is another instance of the "bookkeeping" pattern — the prediction is correct but trivially derivable from existing annotations.
Commentary on NO_OVERLAP cases¶
Inspecting the NO_OVERLAP cases reveals two common patterns that inflate the apparent novelty of ProtNLM2 predictions:
Pattern 1: Uninformative cellular component predictions¶
O87844 has IBA annotation for ribosome disassembly — a process that obviously
occurs in the cytoplasm. ProtNLM2 predicts cytoplasm (GO:0005737), which is classified
as NO_OVERLAP because there is no is_a/part_of path between a BP term (ribosome disassembly)
and a CC term (cytoplasm). But this prediction adds zero information: it would be surprising
if ribosome disassembly happened outside the cytoplasm. Generic CC terms like cytoplasm,
membrane, and nucleus are low-value predictions that are almost always inferable from
the process or function annotations.
Pattern 2: Ontology gaps in cross-aspect relationships¶
Q9KZ33 has IBA annotation for sigma factor activity (GO:0016987). ProtNLM2 predicts
DNA-templated transcription initiation (GO:0006352). These are biologically tightly
coupled — sigma factors perform transcription initiation — but the ontology models this
indirectly:
sigma factor activityis an MF term whose ancestors includeregulation of DNA-templated transcription initiationandtranscription regulator activityDNA-templated transcription initiationis a BP term underDNA-templated transcription- There is no is_a/part_of path between "regulation of X" and "X" itself
So the closure-based comparison classifies this as NO_OVERLAP, when biologically the
prediction is entirely consistent with the IBA annotation. This is a limitation of using
is_a/part_of closure for cross-aspect comparisons — a more sophisticated evaluation would
use broader relationship types (e.g. regulates, enables) or semantic similarity metrics.
Case study: A0A3B6GK97 — trivially correct but reveals annotation gaps¶
ProtNLM2 predicts lipid catabolic process (GO:0016042) for wheat protein
A0A3B6GK97, which is classified as
MORE_SPECIFIC because GOA only has the parent term lipid metabolic process (GO:0006629).
Is this prediction likely correct? The protein has IBA annotations for glycerophospholipase
activity and monoacylglycerol lipase activity — both hydrolytic enzymes that degrade
lipids. Catabolism is literally what lipases do. We can quantify this by asking: among all
wheat proteins with glycerophospholipase activity that also have lipid metabolic process,
how many also have lipid catabolic process?
# Co-annotation analysis: glycerophospholipase + lipid metabolism -> lipid catabolism?
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_wheat.ddb"), read_only=True)
gba_df = con.execute("""
WITH has_lipase AS (
SELECT DISTINCT db_object_id
FROM gaf_association WHERE ontology_class_ref = 'GO:0004620'
),
has_lipid_met AS (
SELECT DISTINCT db_object_id
FROM gaf_association WHERE ontology_class_ref = 'GO:0006629'
),
has_lipid_cat AS (
SELECT DISTINCT db_object_id
FROM gaf_association WHERE ontology_class_ref = 'GO:0016042'
)
SELECT
(SELECT COUNT(*) FROM has_lipase) AS n_glycerophospholipase,
(SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_met USING (db_object_id)) AS n_with_lipid_metabolism,
(SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_cat USING (db_object_id)) AS n_with_lipid_catabolism;
""").fetchdf()
con.close()
n_lipase = gba_df["n_glycerophospholipase"].iloc[0]
n_met = gba_df["n_with_lipid_metabolism"].iloc[0]
n_cat = gba_df["n_with_lipid_catabolism"].iloc[0]
pct = n_cat / n_met * 100 if n_met > 0 else 0
display(Markdown(f"""
**Guilt-by-association analysis** (wheat proteins with glycerophospholipase activity):
| | Count |
|---|---|
| Proteins with glycerophospholipase activity | {n_lipase} |
| ...that also have `lipid metabolic process` | {n_met} |
| ...that also have `lipid catabolic process` | {n_cat} |
| **Co-annotation rate** | **{pct:.0f}%** |
**{n_cat} out of {n_met}** ({pct:.0f}%) proteins with glycerophospholipase activity and
`lipid metabolic process` also have `lipid catabolic process`. A0A3B6GK97 is one of just
{n_met - n_cat} that lack the more specific term — almost certainly a gap in the existing
annotations rather than a genuinely novel prediction by ProtNLM2.
"""))
Interpretation: This is a case where ProtNLM2 trivially outperforms the existing annotation — not because the model has deep biological insight, but because the existing annotations could easily be deepened by simple means:
- Guilt-by-association (GBA): if 94% of proteins with enzyme activity X also have process annotation Y, propagate Y to the remainder
- Curatorial rules: lipases are catabolic enzymes by definition; a curator reviewing
lipase activity+lipid metabolic processshould always addlipid catabolic process - Logical inference: GO could define axioms like "lipase activity enables lipid catabolic process" to make this deduction automatic
This pattern — where ML predictions look impressive but are really just filling obvious gaps that existing pipelines neglect — is important context for evaluating ProtNLM2. The model isn't discovering new biology here; it's doing bookkeeping that the annotation ecosystem hasn't gotten around to yet.
Case study: A0A3B6RKV1 — annotation transfer from top phmmer hit¶
ProtNLM2 makes 5 highly specific plant biology predictions for wheat protein A0A3B6RKV1 (JmjC domain-containing protein with F-box), all classified as MORE_SPECIFIC compared to IBA:
| Predicted term | GO ID |
|---|---|
| gibberellin mediated signaling pathway | GO:0010476 |
| regulation of photomorphogenesis | GO:0010099 |
| positive regulation of seed germination | GO:0010030 |
| epigenetic regulation of gene expression | GO:0040029 |
| response to red light | GO:0010114 |
These are very specific developmental processes — not generic terms you would associate with
any JmjC protein. Only epigenetic regulation of gene expression has any overlap with
histone demethylases in Arabidopsis (4 of 78 proteins). The others (gibberellin, photomorphogenesis,
seed germination, red light) have zero co-occurrence with JmjC proteins in GOA.
So where do these predictions come from?
display(Markdown("**ProtNLM2 evidence for A0A3B6RKV1:**"))
ev_subset = evidence[evidence["accession"] == "A0A3B6RKV1"]
display(ev_subset[["accession", "evidence_key", "model_score", "match_text",
"match_location", "phmmer_accession", "phmmer_score"]].fillna(""))
# The phmmer hit is Q67XX3 = JMJ22 (Arabidopsis). What does JMJ22 have?
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb"), read_only=True)
jmj22 = con.execute(
"SELECT ontology_class_ref, t.label, evidence_type, assigned_by "
"FROM gaf_association g "
"JOIN term_label t ON g.ontology_class_ref = t.id "
"WHERE db_object_id = 'Q67XX3' "
"AND ontology_class_ref IN "
"('GO:0010476', 'GO:0010099', 'GO:0010030', 'GO:0040029', 'GO:0010114') "
"ORDER BY ontology_class_ref"
).fetchdf()
con.close()
display(Markdown("**JMJ22 (Q67XX3) — the phmmer hit (score 689.2) — has all 5 terms with experimental evidence:**"))
display(jmj22)
Every single prediction is a direct copy of JMJ22 experimentally validated annotations.
The wheat protein hits Arabidopsis JMJ22 (arginine-specific demethylase) via phmmer with a strong score (689.2). JMJ22 has all 5 terms with experimental evidence (IMP, IGI, IDA, IEP). ProtNLM2 simply transfers the top hit annotations — this is the same principle as ISS/ISO evidence in GO, or what IBA does via PAINT phylogenetic trees.
The IBA annotations for this wheat protein already capture the MF (cis-regulatory region sequence-specific DNA binding) and CC (nucleus) from JMJ22. But the BP terms did not
propagate via IBA — likely because the PAINT tree for this family does not include BP
annotations at the relevant ancestral node, or the wheat protein falls outside the
annotated clade.
Interpretation: These predictions are probably correct if the wheat protein is truly a JMJ22 ortholog (the phmmer score suggests it is). But ProtNLM2 is not doing anything beyond annotation transfer from the closest characterized sequence. The "added value" over IBA is that ProtNLM2 transfers BP annotations that PAINT more conservative phylogenetic approach chose not to propagate — a difference in methodology (aggressive phmmer transfer vs. conservative clade-based inference), not in biological insight.
Systematic review of all MORE_SPECIFIC and NO_OVERLAP predictions¶
Having examined two cases in detail, we now survey all remaining MORE_SPECIFIC and NO_OVERLAP predictions across the four species. They fall into a small number of recurring patterns:
| Pattern | Description | Example |
|---|---|---|
| Trivial deepening | Prediction is an obvious child of the GOA term | zinc ion binding > metal ion binding |
| Phmmer transfer | All predictions trace to the top phmmer hit's annotations | JMJ22 case above |
| Cross-aspect gap | MF and BP terms are biologically coupled but have no closure path | kinase activity vs phosphorylation |
| Uninformative CC | Generic location term that adds nothing | cytoplasm for a cytoplasmic process |
| Cross-kingdom error | Annotations from wrong lineage transferred to the target | neuron terms on a plant protein |
Below we walk through each pattern with specific examples.
Pattern 1: Trivial deepening¶
These predictions simply replace a broad term with an obvious child. They are almost certainly correct but could be achieved by simple rules or GBA — no ML needed.
| Protein | Species | ProtNLM2 predicts | GOA has |
|---|---|---|---|
| A0A3B6CCN7 | Wheat | zinc ion binding | metal ion binding |
| A0A3B6FNX0 | Wheat | iron ion binding | metal ion binding |
| A0A3B6INB8 | Wheat | DNA binding | nucleic acid binding |
| A0A3B6S8Y8 | Wheat | positive regulation of transcription by RNA pol II | regulation of transcription by RNA pol II |
| Q9M0U6 | Arabidopsis | SCF-dependent proteasomal ubiquitin-dependent protein catabolic process | ubiquitin-dependent protein catabolic process |
Several more predictions are MORE_SPECIFIC against root terms (molecular_function,
biological_process, cellular_component) — these proteins effectively have no annotation
and any non-root prediction is trivially "more specific":
F4IRM4 (plastid vs CC root), F4K5X1 (metal ion binding vs MF root),
Q0WW53 (catalytic activity vs MF root), Q8LG27 (phosphorylation vs BP root),
Q8RU85 (membrane vs CC root), Q9FFI4 (transferase activity vs MF root),
Q9FFW3 (defense response vs BP root), Q9LIR0 (ubiquitin-protein transferase vs MF root),
Q9LKR4 (nucleus vs CC root), Q9SSF9 (DNA binding vs MF root).
Pattern 2: Phmmer-based annotation transfer¶
Like the JMJ22 case, these predictions are direct copies of the top phmmer hit's annotations. The predictions may be correct if orthology holds, but the method is essentially ISS/ISO — sequence-similarity-based annotation transfer.
| Protein | Species | Prediction | phmmer hit | Score |
|---|---|---|---|---|
| A0A3B6DJW8 | Wheat | defense response to fungus | P93194 (PGIP, Phaseolus) | 258 |
| A0A3B6TMW0 | Wheat | defense response to fungus | P93194 (PGIP, Phaseolus) | 264 |
| A0A3B6HYU7 | Wheat | defense response to fungus | P93194 (PGIP, Phaseolus) | 275 |
| A0A3B6RKQ5 | Wheat | copper homeostasis, cadmium response, protein domain binding | Q94BT9 (ATX1, Arabidopsis) | 114 |
| A0A3B6SNL4 | Wheat | recognition of pollen | P17801 (SRK, Brassica) | 893 |
| A0A3B6LH70 | Wheat | negative regulation of ABA signaling, nucleus | Q39266 (AZF2, Arabidopsis) | 77 |
| Q9LJ71 | Arabidopsis | phosphorylation | O48716 | 503 |
| Q8N7F7 | Human | ER membrane targeting | B7NZQ9 | 125 |
The three polygalacturonase inhibitors (PGIP) all hit the same Phaseolus PGIP and get
defense response to fungus — a well-established function for this protein family.
| Q93IY1 | S. coelicolor | L-phosphoserine phosphatase activity | P40399 (RsbU, B. subtilis) | 39 |
The pollen recognition prediction for A0A3B6SNL4 (phmmer score 893 to Brassica SRK) is interesting but may be wrong: SRK-mediated self-incompatibility is specific to Brassicaceae and wheat uses a different self-incompatibility system.
Pattern 3: Cross-aspect ontology gaps¶
These are classified as NO_OVERLAP because the protein has an MF annotation and ProtNLM2 predicts the corresponding BP term (or vice versa), but there is no is_a/part_of path between the two aspects. The predictions are biologically plausible — they describe the same biology from a different perspective.
| Protein | Species | ProtNLM2 predicts | GOA has |
|---|---|---|---|
| A0A3B6H133 | Wheat | phosphorylation, kinase activity | ATP binding, signaling receptor activity (IBA) |
| A0A3B6KN92 | Wheat | protein phosphorylation | calcium-dependent protein kinase (IEA) |
| A0A3B6NKR6 | Wheat | phosphorylation | GHMP kinase domain (IEA) |
| A0A3B6EAI8 | Wheat | protein ubiquitination, ub-dependent catabolism | ubiquitin-protein transferase activity (IEA) |
| A0A3B6EQ70 | Wheat | protein ubiquitination | zinc ion binding, calmodulin binding, regulation of transcription (IEA) |
| A0A3B6MX28 | Wheat | methylation | S-adenosylmethionine-dependent methyltransferase (IEA) |
| Q6P6B1 | Human | phosphorylation | protein binding only (IPI) — no evidence for kinase activity |
| Q9KZ33 | S. coelicolor | transcription initiation | sigma factor activity (IBA) |
This is the largest category of apparent "novelty" — and it is entirely an artifact of
evaluating with is_a/part_of closure only. Using broader relationship types (enables,
regulates, part_of across aspects) or semantic similarity would collapse most of these
into EXACT or LESS_SPECIFIC.
Pattern 5: Cross-kingdom annotation transfer errors¶
The most serious failure mode: ProtNLM2 transfers annotations from the wrong taxonomic lineage via phmmer, producing biologically impossible predictions.
F6LAX4 — wheat TOG domain protein
phmmer hit: A0A7S3G569 (Palpitomonas bilix, a protist), score 420.
ProtNLM2 predicts:
neuron projection (GO:0043005)— plants do not have neuronsneuronal cell body (GO:0043025)— plants do not have neuronsprotein antigen binding (GO:1990405)— adaptive immunity term, not applicable to plantschromosome, centromeric region (GO:0000775)— plausible but likely from wrong contextchromosome segregation (GO:0007059)— plausibleprotein heterodimerization activity (GO:0046982)— generic, plausible
TOG domains are conserved across eukaryotes (they bind tubulin and regulate microtubule polymerization), but the phmmer hit picks up a protist protein whose annotations include animal-specific neuron terms. ProtNLM2 has no organism-awareness filter and blindly transfers everything.
This is the same class of error we documented in the BioReason evaluation — failure to respect taxonomic context — and argues for organism-aware filtering in any annotation transfer pipeline.
Pattern 6: Multidomain protein false positives¶
A distinct failure mode from cross-kingdom transfer: the phmmer hit matches a shared accessory domain in a multidomain protein, and the catalytic activity of the unrelated domain gets transferred.
F4JLB7 — RIC7, Arabidopsis LRR protein
ProtNLM2 predicts kinase activity (GO:0016301) and phosphorylation (GO:0016310),
both with model_score 0.23 (low, but still reported).
The evidence:
- phmmer hit: Q5S006 = LRRK2 (mouse leucine-rich repeat serine/threonine-protein kinase 2), score 33.0 — barely above noise threshold
- PANTHER: PTHR48057:SF13 — but UniProt actually assigns F4JLB7 to PTHR48004 (a completely different family). ProtNLM2 matched the wrong PANTHER family.
What's happening: LRRK2 is a large (2,527 aa) multidomain protein with LRR + ROC + COR + kinase domains. F4JLB7 (RIC7) is a small LRR protein that functions as a ROP GTPase effector in pollen tube growth. The phmmer hit matches only the LRR repeat region (a common structural domain), but the kinase prediction comes from LRRK2's unrelated kinase domain.
RIC7 has:
- LRR repeats (structural, no catalytic activity)
- IBA for
signaling receptor activityandplasma membrane - IMP for
pollen tube growth - No kinase domain, no kinase activity
Verdict: false positive. The model_score (0.23) is appropriately low, suggesting ProtNLM2 has some calibration — it's not confident in this prediction. But it still reports it. This failure mode — annotation leakage from one domain of a multidomain hit to a protein that only shares a different domain — is a well-known problem in sequence-based annotation transfer (sometimes called the "multi-domain problem" or "domain architecture mismatch").
Pattern 4: Uninformative CC and miscellaneous noise¶
Some predictions are technically not wrong but add no useful information:
| Protein | Species | Prediction | Why uninformative |
|---|---|---|---|
| Q8GXB1 | Arabidopsis | membrane | DUF1990 protein — membrane is too generic to be useful |
| A0A3B6GUW2 | Wheat | endoplasmic reticulum | Might be correct but low-information without MF/BP context |
| Q9M0U6 | Arabidopsis | SCF ubiquitin ligase complex | NO_OVERLAP — but the protein is an F-box protein, so SCF complex membership is trivially implied |
The SCF complex case (Q9M0U6) is particularly instructive: the protein is annotated as an F-box protein, and F-box proteins are by definition components of SCF complexes. This is another "bookkeeping" prediction that any rule-based system could make.
Summary of error/novelty patterns¶
Across all four species, every MORE_SPECIFIC and NO_OVERLAP prediction falls into one of five patterns:
| Pattern | Count (approx) | Genuinely novel? |
|---|---|---|
| Trivial deepening | ~20 | No — simple rules or GBA |
| Phmmer transfer | ~15 | Maybe — depends on orthology |
| Cross-aspect gap | ~15 | No — evaluation artifact |
| Uninformative CC/noise | ~5 | No |
| Cross-kingdom error | 1 (6 predictions) | No — actively wrong |
| Multidomain false positive | ~2 | No — wrong domain matched |
None of the predictions we examined represent genuinely novel biological insight. The closest are the phmmer transfer cases (e.g. JMJ22, PGIP), where the predictions may be correct but the method (sequence similarity transfer) is decades old and already implemented in GO as ISS/ISO evidence. ProtNLM2 adds value primarily by being more aggressive than PAINT/IBA in transferring annotations — at the cost of occasional cross-kingdom errors like the neuron-in-wheat case.
Note on evaluation methodology: semantic similarity metrics (Resnik, Lin) as used in CAFA are not a good alternative to closure-based evaluation — they reward predictions that are ontologically close but biologically opposite (e.g. positive vs negative regulation). Our closure-based approach with manual inspection of edge cases is more reliable, though it requires supplementing the is_a/part_of closure with broader relationship types (enables, regulates) for cross-aspect comparisons.
Caveat: Swiss-Prot does not mean well-characterized¶
Some Swiss-Prot entries have no functional annotation beyond root-level ND (No Data)
terms like molecular_function or biological_process. These are reviewed for existence
and structure but remain functionally uncharacterized.
Example: Q9SSF9 — Pentatricopeptide repeat-containing protein At1g74750
This is Swiss-Prot since 2004 but GOA has only GO:0003674 molecular_function (ND).
ProtNLM2 predicts DNA binding, chloroplast localization, and organelle-nucleus signaling.
The localization predictions are probably correct (most PPR proteins are organellar), but
"DNA binding" is likely wrong — PPR proteins bind RNA, not DNA. This is a classic
domain-level error where the model confuses structurally similar PPR/TPR repeat architectures.
When ProtNLM2 predictions are classified as MORE_SPECIFIC, many are against ND-only baselines — "anything is more specific than nothing." The table below separates proteins with real functional annotations from those with only ND root terms.
from IPython.display import display, Markdown
# For each species, check how many proteins in GOA only have ND annotations
nd_stats = {}
for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
taxon_and = f"AND db_object_taxon = '{taxon_str}'" if taxon_str else ""
con = duckdb.connect(str(db_path), read_only=True)
nd_df = con.execute(f"""
WITH protnlm_accs AS (
SELECT DISTINCT accession FROM read_csv_auto('{csv_path}')
),
protein_annotations AS (
SELECT
db_object_id,
COUNT(*) FILTER (WHERE evidence_type != 'ND') as n_real
FROM gaf_association
WHERE db_object_id IN (SELECT accession FROM protnlm_accs)
{taxon_and}
GROUP BY db_object_id
)
SELECT
CASE WHEN n_real = 0 THEN 'ND only (no real annotations)'
ELSE 'Has functional annotations' END AS annotation_status,
COUNT(*) AS n_proteins
FROM protein_annotations
GROUP BY 1
ORDER BY 1;
""").fetchdf()
con.close()
nd_stats[species_name] = nd_df
all_nd = pd.concat({k: v.set_index("annotation_status")["n_proteins"] for k, v in nd_stats.items()}, axis=1).fillna(0).astype(int)
display(Markdown("**Proteins in GOA: ND-only vs functionally annotated**"))
display(all_nd)
What are the NOT_IN_GCRP proteins?¶
A large fraction of ProtNLM2 predictions cannot be evaluated because the protein is absent from the GCRP (Gene Centric Reference Proteome) subset of GOA used in our DuckDB databases. But are these proteins truly unannotated, or just absent from the GCRP subset?
We cross-checked against the full QuickGO API to distinguish:
- In full GOA but not GCRP — annotations exist, just not in the reference proteome subset
- Truly absent from GOA — no annotations anywhere
# Analyze human NOT_IN_GCRP proteins
import urllib.request as urlreq
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_human.ddb"), read_only=True)
human_accs = set(entries[entries["taxon_id"] == 9606]["accession"])
human_go = predictions[(predictions["pred_type"] == "GO") & (predictions["accession"].isin(human_accs))]
human_go[["accession", "pred_id", "pred_label"]].to_csv("/tmp/protnlm_human_go.csv", index=False)
not_in_gcrp = con.execute("""
CREATE OR REPLACE TEMP TABLE protnlm AS SELECT * FROM read_csv_auto('/tmp/protnlm_human_go.csv');
WITH in_goa AS (SELECT DISTINCT db_object_id FROM gaf_association)
SELECT DISTINCT p.accession
FROM protnlm p LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
WHERE ig.db_object_id IS NULL
""").fetchdf()
not_in_gcrp_accs = sorted(not_in_gcrp["accession"])
not_in_gcrp_preds = human_go[human_go["accession"].isin(not_in_gcrp_accs)]
# Cross-check against full QuickGO API
import time
has_full_goa = []
truly_absent = []
for acc in not_in_gcrp_accs:
url = f"https://www.ebi.ac.uk/QuickGO/services/annotation/search?geneProductId={acc}&limit=1"
try:
req = urlreq.Request(url)
req.add_header("Accept", "application/json")
with urlreq.urlopen(req, timeout=30) as resp:
data = json.loads(resp.read())
n = data.get("numberOfHits", 0)
if n > 0:
has_full_goa.append((acc, n))
else:
truly_absent.append(acc)
except:
truly_absent.append(acc)
time.sleep(0.3)
con.close()
display(Markdown(f"""
**Human NOT_IN_GCRP: {len(not_in_gcrp_accs)} proteins, {len(not_in_gcrp_preds)} GO predictions**
Cross-check against full QuickGO:
| | Count |
|---|---|
| Total NOT_IN_GCRP | {len(not_in_gcrp_accs)} |
| Have annotations in full GOA (not in GCRP subset) | {len(has_full_goa)} |
| **Truly absent from GOA** | **{len(truly_absent)}** |
Over half ({len(has_full_goa)}/{len(not_in_gcrp_accs)}) of the "missing" proteins actually
have GOA annotations — they are just absent from the Gene Centric Reference Proteome subset.
These are typically alternate TrEMBL accessions for proteins whose canonical accession is
in the reference proteome.
"""))
# Show the truly absent ones
display(Markdown("**Truly absent from GOA (no annotations in QuickGO):**"))
truly_absent_info = []
for acc in truly_absent:
entry = entries[entries["accession"] == acc]
pname = entry.iloc[0]["protein_name"] if len(entry) > 0 else "?"
ev = evidence[evidence["accession"] == acc]
phmmer = ev[ev["phmmer_accession"].notna()]
hit = f"{phmmer.iloc[0]['phmmer_accession']} ({phmmer.iloc[0]['phmmer_score']})" if len(phmmer) > 0 else "none"
truly_absent_info.append({"accession": acc, "protein_name": pname, "phmmer_hit": hit})
display(pd.DataFrame(truly_absent_info))
Interpretation: The NOT_IN_GCRP category splits into two distinct groups:
1. In full GOA but not GCRP — These are alternate TrEMBL accessions for proteins whose canonical accession has annotations in the reference proteome. ProtNLM2's predictions for these are likely correct but entirely redundant. Example: F5H290 (ZNF7) has a canonical accession P17097 with 11 GOA annotations.
2. Truly absent from GOA — These are genuinely interesting. The standout case is J3KSL8 (926 aa, PE=1 — evidence at protein level, UniProt's highest confidence) which hits Q9Y2L5 (TRAPPC8) at phmmer score 2102 yet has zero GOA annotations. It's unclear why the normal UniProt/GOA pipelines failed to produce any annotations for a protein with such strong homology to a well-characterized Swiss-Prot entry. This represents a genuine gap in the GOA pipeline rather than an achievement of ProtNLM2.
Other truly-absent proteins tend to be domain-containing fragments with moderate phmmer hits (100–350 range) or uncharacterized proteins with no phmmer support at all.
Per-species detailed comparison¶
Below we drill into each species individually. For every ProtNLM2 GO prediction where the protein exists in GOA, we show the predicted term, the match category, and the closest GOA term. This lets us inspect exactly what ProtNLM2 gets right, where it is redundant, and where it might be adding novel (or incorrect) information.
for species_name, detail in all_details.items():
n_evaluable = len(detail)
n_proteins = detail["accession"].nunique()
cat_counts = detail["match_category"].value_counts()
exact = cat_counts.get("EXACT", 0)
less = cat_counts.get("LESS_SPECIFIC", 0)
more = cat_counts.get("MORE_SPECIFIC", 0)
no_overlap = cat_counts.get("NO_OVERLAP", 0)
# Check if MORE_SPECIFIC cases are against root terms (ND annotations)
root_terms = {"molecular_function", "biological_process", "cellular_component"}
more_rows = detail[detail["match_category"] == "MORE_SPECIFIC"]
more_vs_root = more_rows["closest_goa_label"].isin(root_terms).sum() if len(more_rows) > 0 else 0
lines = [
f"#### {species_name}",
f"",
f"**{n_evaluable} predictions** across **{n_proteins} proteins** that are in GOA.",
f"",
f"| Category | Count | Interpretation |",
f"|---|---|---|",
f"| Exact | {exact} | Predicted term is the same GO term already in GOA |",
f"| Less specific | {less} | Predicted term is a strict ancestor of a GOA term (redundant) |",
]
if more > 0:
note = ""
if more_vs_root > 0:
note = f" (**{more_vs_root} of these are vs root-level ND annotations** — the protein effectively has no real annotation)"
lines.append(f"| More specific | {more} | Predicted term is a strict descendant of a GOA term{note} |")
if no_overlap > 0:
lines.append(f"| No overlap | {no_overlap} | Predicted term is in a different ontology branch from all GOA annotations |")
lines.append("")
display(Markdown("\n".join(lines)))
display(detail.style.set_properties(**{"text-align": "left"}).set_caption(f"{species_name} — detailed predictions"))