<evidence type="ECO:0008006" key="1">
  <source>
    <dbReference type="Google" id="ProtNLM2">
      <property type="model_score" value="0.96"/>
      <property type="string_match_text" value="WRKY"/>
      <property type="string_match_location" value="domain"/>
      <property type="string_match_type" value="exact_sanitized"/>
    </dbReference>
  </source>
</evidence>

import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt

plt.rcParams["figure.dpi"] = 120

DATA = Path(".")

entries = pd.read_csv(DATA / "entries.tsv", sep="\t")
predictions = pd.read_csv(DATA / "predictions.tsv", sep="\t")
evidence = pd.read_csv(DATA / "evidence.tsv", sep="\t")

# Taxonomy: maps accession -> organism name + taxon_id
taxonomy = pd.read_csv(DATA / "taxonomy.tsv", sep="\t")
entries = entries.merge(taxonomy, on="accession", how="left")

# Lineage: maps taxon_id -> broad taxonomic group
lineage_path = DATA / "lineage.tsv"
if lineage_path.exists():
    lineage = pd.read_csv(lineage_path, sep="\t")
    entries = entries.merge(lineage, on="taxon_id", how="left")
    print(f"Lineage groups available for {entries['broad_group'].notna().sum():,} entries")

print(f"Entries: {len(entries):,}  |  Predictions: {len(predictions):,}  |  Evidence blocks: {len(evidence):,}")
print(f"Species: {entries['organism'].nunique()}")

entries["has_go"] = entries["n_go"] > 0
entries["has_function"] = entries["n_function"] > 0
entries["has_subcell"] = entries["n_subcell"] > 0
entries["name_only"] = ~entries["has_go"] & ~entries["has_function"] & ~entries["has_subcell"]

summary = pd.DataFrame({
    "count": [
        len(entries),
        entries["has_go"].sum(),
        entries["has_function"].sum(),
        entries["has_subcell"].sum(),
        entries["name_only"].sum(),
    ],
}, index=["Total entries", "With GO terms", "With function comment",
         "With subcellular location", "Name only"])
summary["pct"] = (summary["count"] / len(entries) * 100).round(1)
summary

fig, ax = plt.subplots(figsize=(8, 3.5))
cats = ["has_go", "has_function", "has_subcell", "name_only"]
labels = ["GO terms", "Function comment", "Subcellular location", "Name only"]
vals = [entries[c].sum() for c in cats]
ax.barh(labels, vals, color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
ax.set_xlabel("Number of entries")
ax.set_title("ProtNLM2 prediction types (n=28,553)")
for i, v in enumerate(vals):
    ax.text(v + 200, i, f"{v:,}", va="center", fontsize=9)
plt.tight_layout()
plt.show()

entries["name_type"].value_counts().to_frame("count")

status_path = DATA / "reviewed_status.tsv"
if status_path.exists():
    status = pd.read_csv(status_path, sep="\t")
    entries = entries.merge(status, on="accession", how="left")

    sp_counts = entries["reviewed"].value_counts()
    sp_total = sp_counts.get(True, 0)
    tr_total = sp_counts.get(False, 0)
    print(f"Swiss-Prot (reviewed): {sp_total:,} ({sp_total/len(entries)*100:.1f}%)")
    print(f"TrEMBL (unreviewed):   {tr_total:,} ({tr_total/len(entries)*100:.1f}%)")

    # Breakdown by prediction richness
    review_summary = entries.groupby("reviewed").agg(
        n_entries=("accession", "count"),
        pct_go=("has_go", "mean"),
        pct_function=("has_function", "mean"),
        pct_subcell=("has_subcell", "mean"),
        pct_name_only=("name_only", "mean"),
    )
    review_summary.index = review_summary.index.map({True: "Swiss-Prot", False: "TrEMBL"})
    for col in ["pct_go", "pct_function", "pct_subcell", "pct_name_only"]:
        review_summary[col] = (review_summary[col] * 100).round(1)
    print("\nPrediction richness by review status:")
    display(review_summary)

    # Chart
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Pie
    axes[0].pie([sp_total, tr_total], labels=["Swiss-Prot", "TrEMBL"],
                autopct="%1.1f%%", colors=["#4c78a8", "#bab0ac"])
    axes[0].set_title("Swiss-Prot vs TrEMBL")

    # Stacked: prediction types by review status
    rev_pred = entries.groupby("reviewed").agg(
        has_go=("has_go", "sum"), has_function=("has_function", "sum"),
        has_subcell=("has_subcell", "sum"), name_only=("name_only", "sum"),
    )
    rev_pred.index = rev_pred.index.map({True: "Swiss-Prot", False: "TrEMBL"})
    rev_pred.plot.barh(stacked=True, ax=axes[1],
                       color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
    axes[1].set_xlabel("Number of entries")
    axes[1].set_title("Prediction types by review status")
    axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8)

    plt.tight_layout()
    plt.show()
else:
    print("reviewed_status.tsv not yet available — run fetch reviewed status")

if "reviewed" in entries.columns and "organism" in entries.columns:
    # By species (top 30 by total entries)
    top_species = entries["organism"].value_counts().head(30).index
    sp_review = entries[entries["organism"].isin(top_species)].groupby("organism")["reviewed"].agg(
        n_entries="count",
        n_swissprot="sum",
    )
    sp_review["n_trembl"] = sp_review["n_entries"] - sp_review["n_swissprot"]
    sp_review["pct_swissprot"] = (sp_review["n_swissprot"] / sp_review["n_entries"] * 100).round(1)
    sp_review = sp_review.sort_values("pct_swissprot", ascending=True)

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.barh(sp_review.index, sp_review["n_swissprot"], label="Swiss-Prot", color="#4c78a8")
    ax.barh(sp_review.index, sp_review["n_trembl"], left=sp_review["n_swissprot"],
            label="TrEMBL", color="#bab0ac")
    ax.set_xlabel("Number of entries")
    ax.set_title("Swiss-Prot vs TrEMBL by species (top 30)")
    ax.legend(fontsize=9)
    for i, (_, row) in enumerate(sp_review.iterrows()):
        if row["n_swissprot"] > 0:
            ax.text(row["n_entries"] + 5, i, f'{row["pct_swissprot"]:.0f}% SP', va="center", fontsize=7)
    plt.tight_layout()
    plt.show()

    # By broad taxonomic group
    group_review = entries.groupby("broad_group")["reviewed"].agg(
        n_entries="count", n_swissprot="sum",
    )
    group_review["n_trembl"] = group_review["n_entries"] - group_review["n_swissprot"]
    group_review["pct_swissprot"] = (group_review["n_swissprot"] / group_review["n_entries"] * 100).round(1)
    group_review = group_review.sort_values("pct_swissprot", ascending=True)

    print("Swiss-Prot fraction by taxonomic group:")
    display(group_review.sort_values("pct_swissprot", ascending=False))

species_counts = entries["organism"].value_counts()
print(f"Unique species: {species_counts.shape[0]}")
print(f"\nTop 30 species:")
display(species_counts.head(30).to_frame("count"))

GROUP_COLORS = {
    "Land plant": "#54a24b",
    "Mammal": "#e45756",
    "Ray-finned fish": "#4c78a8",
    "Cartilaginous fish": "#72b7b2",
    "Bird": "#eeca3b",
    "Amphibian": "#b279a2",
    "Other vertebrate": "#9d755d",
    "Insect": "#f58518",
    "Mollusc": "#ff9da6",
    "Nematode": "#d67195",
    "Fungus": "#9467bd",
    "Bacterium": "#bab0ac",
    "Other animal": "#d4a6c8",
}

top20 = species_counts.head(20)
top20_df = top20.to_frame("count").reset_index()
top20_df.columns = ["organism", "count"]

# Merge broad_group
org_group = entries[["organism", "broad_group"]].drop_duplicates("organism")
top20_df = top20_df.merge(org_group, on="organism", how="left")
top20_df["color"] = top20_df["broad_group"].map(GROUP_COLORS).fillna("#bab0ac")

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(range(len(top20_df)), top20_df["count"], color=top20_df["color"])
ax.set_yticks(range(len(top20_df)))
ax.set_yticklabels([f"{row.organism}  ({row.broad_group})" for row in top20_df.itertuples()], fontsize=9)
ax.invert_yaxis()
ax.set_xlabel("Number of entries")
ax.set_title("Top 20 species in ProtNLM2 dataset (colored by taxonomic group)")
for i, v in enumerate(top20_df["count"]):
    ax.text(v + 8, i, str(v), va="center", fontsize=8)

# Legend
from matplotlib.patches import Patch
groups_in_chart = top20_df["broad_group"].unique()
handles = [Patch(facecolor=GROUP_COLORS.get(g, "#bab0ac"), label=g) for g in groups_in_chart if pd.notna(g)]
ax.legend(handles=handles, loc="lower right", fontsize=8)

plt.tight_layout()
plt.show()

group_counts = entries["broad_group"].value_counts()
display(group_counts.to_frame("count"))

fig, ax = plt.subplots(figsize=(8, 4))
colors = [GROUP_COLORS.get(g, "#bab0ac") for g in group_counts.index]
group_counts.plot.barh(ax=ax, color=colors)
ax.invert_yaxis()
ax.set_xlabel("Number of entries")
ax.set_title("Entries by broad taxonomic group")
for i, v in enumerate(group_counts):
    ax.text(v + 50, i, f"{v:,}", va="center", fontsize=8)
plt.tight_layout()
plt.show()

pred_by_group = entries.groupby("broad_group").agg(
    n_entries=("accession", "count"),
    has_go=("has_go", "sum"),
    has_function=("has_function", "sum"),
    has_subcell=("has_subcell", "sum"),
    name_only=("name_only", "sum"),
).sort_values("n_entries", ascending=True)

# As percentages
pct = pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].div(
    pred_by_group["n_entries"], axis=0
) * 100

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: stacked counts
pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].plot.barh(
    stacked=True, ax=axes[0],
    color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
)
axes[0].set_xlabel("Number of entries")
axes[0].set_title("Prediction types by taxonomic group (counts)")
axes[0].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")

# Right: percentages
pct.plot.barh(
    stacked=True, ax=axes[1],
    color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
)
axes[1].set_xlabel("% of entries in group")
axes[1].set_title("Prediction types by taxonomic group (%)")
axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")

plt.tight_layout()
plt.show()

sp_groups = entries.groupby("organism").filter(lambda g: len(g) >= 50).groupby("organism")
richness = sp_groups.agg(
    n_entries=("accession", "count"),
    pct_go=("has_go", "mean"),
    pct_function=("has_function", "mean"),
    pct_subcell=("has_subcell", "mean"),
    mean_go=("n_go", "mean"),
).sort_values("pct_go", ascending=False)
richness[["pct_go", "pct_function", "pct_subcell"]] = (
    richness[["pct_go", "pct_function", "pct_subcell"]] * 100
).round(1)
richness["mean_go"] = richness["mean_go"].round(2)
display(richness.head(30))

scores = evidence["model_score"].dropna()
print(f"Evidence blocks with scores: {len(scores):,}")
print(scores.describe().to_frame().T)

fig, ax = plt.subplots(figsize=(8, 3.5))
scores.hist(bins=50, ax=ax, color="#4c78a8", edgecolor="white")
ax.set_xlabel("Model score")
ax.set_ylabel("Count")
ax.set_title("ProtNLM2 model score distribution")
ax.axvline(scores.median(), color="red", ls="--", lw=1, label=f"median={scores.median():.2f}")
ax.legend()
plt.tight_layout()
plt.show()

ev_with_loc = evidence.dropna(subset=["match_location", "model_score"])
top_locs = ev_with_loc["match_location"].value_counts().head(10).index

fig, ax = plt.subplots(figsize=(10, 5))
ev_with_loc[ev_with_loc["match_location"].isin(top_locs)].boxplot(
    column="model_score", by="match_location", ax=ax, vert=False
)
ax.set_xlabel("Model score")
ax.set_title("Model score by match location (top 10)")
fig.suptitle("")
plt.tight_layout()
plt.show()

evidence["match_type"].value_counts(dropna=False).to_frame("count")

go_preds = predictions[predictions["pred_type"] == "GO"]
print(f"Total GO predictions: {len(go_preds):,}")
print(f"Unique GO terms: {go_preds['pred_id'].nunique()}")

top_go = go_preds["pred_label"].value_counts().head(20)
display(top_go.to_frame("count"))

# Aspect breakdown (GO term labels start with C:, F:, or P:)
go_preds = go_preds.copy()
go_preds["aspect"] = go_preds["pred_label"].str[0]
print("\nGO aspect distribution:")
display(go_preds["aspect"].value_counts().to_frame("count"))

subcell_preds = predictions[predictions["pred_type"] == "subcellular_location"]
print(f"Total subcellular location predictions: {len(subcell_preds):,}")

top_locs = subcell_preds["pred_label"].value_counts().head(15)
fig, ax = plt.subplots(figsize=(8, 4))
top_locs.plot.barh(ax=ax, color="#54a24b")
ax.invert_yaxis()
ax.set_xlabel("Count")
ax.set_title("Top 15 predicted subcellular locations")
plt.tight_layout()
plt.show()

evidence["method"] = "string_match"
evidence.loc[evidence["phmmer_accession"].notna(), "method"] = "phmmer"
evidence.loc[evidence["tmalign_accession"].notna(), "method"] = "tmalign"

method_counts = evidence["method"].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pie chart
method_counts.plot.pie(ax=axes[0], autopct="%1.1f%%",
                       colors=["#4c78a8", "#f58518", "#54a24b"])
axes[0].set_ylabel("")
axes[0].set_title("Evidence method breakdown")

# Score distributions
for method, color in [("string_match", "#4c78a8"), ("phmmer", "#f58518"), ("tmalign", "#54a24b")]:
    subset = evidence[evidence["method"] == method]["model_score"].dropna()
    axes[1].hist(subset, bins=40, alpha=0.6, color=color, label=f"{method} (n={len(subset):,})")
axes[1].set_xlabel("Model score")
axes[1].set_ylabel("Count")
axes[1].set_title("Score distribution by evidence method")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.show()

import subprocess

repo_root = Path("../..")
review_files = list(repo_root.glob("genes/*/*-ai-review.yaml"))
result = subprocess.run(
    ["grep", "-rh", "^id:", *[str(f) for f in review_files]],
    capture_output=True, text=True
)
reviewed_ids = {line.split(":", 1)[1].strip() for line in result.stdout.strip().split("\n") if line}
protnlm_ids = set(entries["accession"])

overlap = reviewed_ids & protnlm_ids
print(f"AIGR reviewed genes: {len(reviewed_ids):,}")
print(f"ProtNLM2 entries: {len(protnlm_ids):,}")
print(f"Overlap: {len(overlap)}")
if overlap:
    print(f"\nOverlapping accessions:")
    cols = ["accession", "protein_name", "n_go", "n_function", "n_subcell"]
    if "organism" in entries.columns:
        cols.insert(2, "organism")
    if "broad_group" in entries.columns:
        cols.insert(3, "broad_group")
    display(entries[entries["accession"].isin(overlap)][cols])

import duckdb

COMPARE_SPECIES = [
    ("Human", 9606, Path.home() / "repos/go-db/db/goa_human.ddb", None),
    ("Mouse", 10090, Path.home() / "repos/go-db/db/goa_mouse.ddb", None),
    ("X. laevis", 8355, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8355"),
    ("X. tropicalis", 8364, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8364"),
    ("Arabidopsis", 3702, Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb", "taxon:3702"),
    ("Wheat", 4565, Path.home() / "repos/go-db/db/goa_wheat.ddb", None),
    ("S. coelicolor", 100226, Path.home() / "repos/go-db/db/bacteria.ddb", "taxon:100226"),
]

# Categories in display order
CAT_ORDER = ["EXACT", "LESS_SPECIFIC", "MORE_SPECIFIC", "NO_OVERLAP", "NOT_IN_GCRP"]
CAT_LABELS = ["Exact", "Less specific", "More specific", "No overlap", "Not in GCRP"]
CAT_COLORS = ["#2ca02c", "#f58518", "#4c78a8", "#e45756", "#bab0ac"]

COMPARISON_SQL = """
    CREATE TEMP TABLE protnlm AS
    SELECT * FROM read_csv_auto('{csv_path}');

    WITH in_goa AS (
        SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
    ),
    -- EXACT: predicted term is literally in GOA for this protein
    exact AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
        {taxon_where}
    ),
    -- LESS_SPECIFIC: predicted term is a strict ancestor of a GOA term
    less_specific AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
        WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
        {taxon_and}
    ),
    -- MORE_SPECIFIC: predicted term is a strict descendant of a GOA term
    more_specific AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
        WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
        {taxon_and}
    )
    SELECT
        CASE
            WHEN e.pred_id IS NOT NULL THEN 'EXACT'
            WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
            WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
            WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
            ELSE 'NO_OVERLAP'
        END AS match_category,
        COUNT(*) AS n_predictions,
        COUNT(DISTINCT p.accession) AS n_proteins
    FROM protnlm p
    LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
    LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
    LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
    LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
    GROUP BY 1
    ORDER BY 2 DESC;
"""

DETAIL_SQL = """
    WITH in_goa AS (
        SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
    ),
    exact AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
        {taxon_where}
    ),
    less_specific AS (
        SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
               g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
        INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
        WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
        {taxon_and}
    ),
    more_specific AS (
        SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
               g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
        FROM protnlm p
        INNER JOIN gaf_association g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
        INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
        WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
        {taxon_and}
    )
    SELECT
        p.accession,
        p.pred_id AS protnlm_term,
        pt.label AS protnlm_label,
        CASE
            WHEN e.pred_id IS NOT NULL THEN 'EXACT'
            WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
            WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
            WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
            ELSE 'NO_OVERLAP'
        END AS match_category,
        CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
        CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
    FROM protnlm p
    LEFT JOIN term_label pt ON p.pred_id = pt.id
    LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
    LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
    LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
    LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
    ORDER BY match_category, p.accession;
"""

all_summaries = []
all_details = {}

for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
    sp_accs = set(entries[entries["taxon_id"] == taxon_id]["accession"])
    sp_go = predictions[
        (predictions["pred_type"] == "GO") & (predictions["accession"].isin(sp_accs))
    ][["accession", "pred_id", "pred_label"]]
    csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
    sp_go.to_csv(csv_path, index=False)

    if taxon_str:
        taxon_filter = f"WHERE db_object_taxon = '{taxon_str}'"
        taxon_where = f"WHERE g.db_object_taxon = '{taxon_str}'"
        taxon_and = f"AND g.db_object_taxon = '{taxon_str}'"
    else:
        taxon_filter = taxon_where = taxon_and = ""

    con = duckdb.connect(str(db_path), read_only=True)
    summary = con.execute(COMPARISON_SQL.format(
        csv_path=csv_path, taxon_filter=taxon_filter,
        taxon_where=taxon_where, taxon_and=taxon_and
    )).fetchdf()
    summary["species"] = species_name
    all_summaries.append(summary)

    detail = con.execute(DETAIL_SQL.format(
        taxon_filter=taxon_filter, taxon_where=taxon_where, taxon_and=taxon_and
    )).fetchdf()
    all_details[species_name] = detail
    con.close()

    print(f"{species_name}: {len(sp_go)} GO predictions, {sp_go['accession'].nunique()} proteins")

combined = pd.concat(all_summaries, ignore_index=True)
pivot = combined.pivot_table(
    index="match_category", columns="species",
    values="n_predictions", fill_value=0, aggfunc="sum"
).reindex(CAT_ORDER)
display(pivot)

# Stacked bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
pivot_t = pivot.T.reindex(columns=CAT_ORDER)
pivot_t.plot.barh(stacked=True, ax=axes[0], color=CAT_COLORS)
axes[0].set_xlabel("Number of GO predictions")
axes[0].set_title("ProtNLM2 vs GOA (counts)")
axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")

pct_pivot = pivot_t.div(pivot_t.sum(axis=1), axis=0) * 100
pct_pivot.plot.barh(stacked=True, ax=axes[1], color=CAT_COLORS)
axes[1].set_xlabel("% of predictions")
axes[1].set_title("ProtNLM2 vs GOA (%)")
axes[1].legend(CAT_LABELS, fontsize=7, loc="lower right")
plt.tight_layout()
plt.show()

from IPython.display import display, Markdown

SOURCE_COMPARISON_SQL = """
    CREATE OR REPLACE TEMP TABLE protnlm AS
    SELECT * FROM read_csv_auto('{csv_path}');

    WITH goa_filtered AS (
        SELECT * FROM gaf_association
        {taxon_where_base}
        {goa_source_filter}
    ),
    in_goa AS (
        SELECT DISTINCT db_object_id FROM goa_filtered
    ),
    exact AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
    ),
    less_specific AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
        WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
    ),
    more_specific AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
        WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
    )
    SELECT
        CASE
            WHEN e.pred_id IS NOT NULL THEN 'EXACT'
            WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
            WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
            WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
            ELSE 'NO_OVERLAP'
        END AS match_category,
        COUNT(*) AS n_predictions,
        COUNT(DISTINCT p.accession) AS n_proteins
    FROM protnlm p
    LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
    LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
    LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
    LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
    GROUP BY 1
    ORDER BY 2 DESC;
"""

SOURCE_DETAIL_SQL = """
    WITH goa_filtered AS (
        SELECT * FROM gaf_association
        {taxon_where_base}
        {goa_source_filter}
    ),
    in_goa AS (
        SELECT DISTINCT db_object_id FROM goa_filtered
    ),
    exact AS (
        SELECT DISTINCT p.accession, p.pred_id
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
    ),
    less_specific AS (
        SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
               g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
        INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
        WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
    ),
    more_specific AS (
        SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
               g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
        FROM protnlm p
        INNER JOIN goa_filtered g ON p.accession = g.db_object_id
        INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
        INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
        WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
    )
    SELECT
        p.accession,
        p.pred_id AS protnlm_term,
        pt.label AS protnlm_label,
        CASE
            WHEN e.pred_id IS NOT NULL THEN 'EXACT'
            WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
            WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
            WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
            ELSE 'NO_OVERLAP'
        END AS match_category,
        CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
        CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
    FROM protnlm p
    LEFT JOIN term_label pt ON p.pred_id = pt.id
    LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
    LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
    LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
    LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
    WHERE ig.db_object_id IS NOT NULL
    ORDER BY match_category, p.accession;
"""

GOA_SOURCES = [
    ("All GOA", ""),
    ("InterPro2GO", "AND assigned_by = 'InterPro'"),
    ("IBA", "AND evidence_type = 'IBA'"),
    ("InterPro2GO + IBA", "AND (assigned_by = 'InterPro' OR evidence_type = 'IBA')"),
]

source_results = []
source_details = {}

for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
    csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
    taxon_where_base = f"WHERE db_object_taxon = '{taxon_str}'" if taxon_str else "WHERE 1=1"

    con = duckdb.connect(str(db_path), read_only=True)
    for source_name, source_filter in GOA_SOURCES:
        sql = SOURCE_COMPARISON_SQL.format(
            csv_path=csv_path,
            taxon_where_base=taxon_where_base,
            goa_source_filter=source_filter,
        )
        df = con.execute(sql).fetchdf()
        df["species"] = species_name
        df["goa_source"] = source_name
        source_results.append(df)

        # Fetch detail for InterPro2GO and IBA comparisons
        if source_name in ("InterPro2GO", "IBA"):
            detail = con.execute(SOURCE_DETAIL_SQL.format(
                taxon_where_base=taxon_where_base,
                goa_source_filter=source_filter,
            )).fetchdf()
            source_details[(species_name, source_name)] = detail

    con.close()

source_combined = pd.concat(source_results, ignore_index=True)

# Summary chart
fig, axes = plt.subplots(1, 7, figsize=(32, 5), sharey=False)
source_order = ["All GOA", "InterPro2GO", "IBA", "InterPro2GO + IBA"]
for ax, species_name in zip(axes, ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]):
    sp_data = source_combined[source_combined["species"] == species_name]
    pivot = sp_data.pivot_table(
        index="goa_source", columns="match_category",
        values="n_predictions", fill_value=0, aggfunc="sum"
    ).reindex(columns=CAT_ORDER).reindex(source_order)
    pct = pivot.div(pivot.sum(axis=1), axis=0) * 100
    pct.plot.barh(stacked=True, ax=ax, color=CAT_COLORS, legend=False)
    ax.set_xlabel("% of predictions")
    ax.set_title(species_name)
axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")
plt.suptitle("ProtNLM2 vs GOA by annotation source (%)", fontsize=12)
plt.tight_layout()
plt.show()

# Per-species summary tables
for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
    sp_data = source_combined[source_combined["species"] == species_name]
    pivot = sp_data.pivot_table(
        index="goa_source", columns="match_category",
        values="n_predictions", fill_value=0, aggfunc="sum"
    ).reindex(columns=CAT_ORDER).reindex(source_order)
    display(Markdown(f"#### {species_name}"))
    display(pivot)

for goa_source_name in ["InterPro2GO", "IBA"]:
    display(Markdown(f"---"))
    display(Markdown(f"### ProtNLM2 vs {goa_source_name}: where is ProtNLM2 more specific?"))

    for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
        key = (species_name, goa_source_name)
        if key not in source_details:
            continue
        detail = source_details[key]
        n_evaluable = len(detail)
        n_proteins = detail["accession"].nunique()
        cat_counts = detail["match_category"].value_counts()

        lines = [
            f"#### {species_name} vs {goa_source_name}",
            f"",
            f"**{n_evaluable} predictions** for **{n_proteins} proteins** that have {goa_source_name} annotations.",
            f"",
            f"| Category | Count |",
            f"|---|---|",
        ]
        for cat in CAT_ORDER:
            if cat != "NOT_IN_GCRP":
                lines.append(f"| {cat} | {cat_counts.get(cat, 0)} |")
        lines.append("")

        # Enrich with evidence data
        detail_enriched = detail.merge(
            predictions[["accession", "pred_id", "evidence_key"]],
            left_on=["accession", "protnlm_term"],
            right_on=["accession", "pred_id"],
            how="left"
        ).merge(
            evidence[["accession", "evidence_key", "model_score", "match_text",
                       "match_location", "match_type", "phmmer_accession", "phmmer_score",
                       "tmalign_accession", "tmalign_score1"]],
            on=["accession", "evidence_key"],
            how="left"
        )

        more = detail_enriched[detail_enriched["match_category"] == "MORE_SPECIFIC"]
        if len(more) > 0:
            lines.append(f"**{len(more)} predictions where ProtNLM2 is more specific than {goa_source_name}:**")
            lines.append("")
            lines.append(f"| Protein | ProtNLM2 predicts | Score | Evidence | {goa_source_name} has |")
            lines.append("|---|---|---|---|---|")
            for _, row in more.iterrows():
                score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
                # Build evidence summary
                if pd.notna(row.get('phmmer_accession')):
                    ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
                elif pd.notna(row.get('tmalign_accession')):
                    ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
                elif pd.notna(row.get('match_text')):
                    ev_str = f"{row['match_location']}: {row['match_text']}"
                else:
                    ev_str = ""
                lines.append(
                    f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
                    f"| {row.protnlm_label} (`{row.protnlm_term}`) "
                    f"| {score} | {ev_str} "
                    f"| {row.closest_goa_label} (`{row.closest_goa_term}`) |"
                )
            lines.append("")

        no_ov = detail_enriched[detail_enriched["match_category"] == "NO_OVERLAP"]
        if len(no_ov) > 0:
            lines.append(f"**{len(no_ov)} predictions with no ontological overlap to {goa_source_name}:**")
            lines.append("")
            lines.append("| Protein | ProtNLM2 predicts | Score | Evidence |")
            lines.append("|---|---|---|---|")
            for _, row in no_ov.iterrows():
                score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
                if pd.notna(row.get('phmmer_accession')):
                    ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
                elif pd.notna(row.get('tmalign_accession')):
                    ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
                elif pd.notna(row.get('match_text')):
                    ev_str = f"{row['match_location']}: {row['match_text']}"
                else:
                    ev_str = ""
                lines.append(
                    f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
                    f"| {row.protnlm_label} (`{row.protnlm_term}`) "
                    f"| {score} | {ev_str} |"
                )
            lines.append("")

        display(Markdown("\n".join(lines)))

# Co-annotation analysis: glycerophospholipase + lipid metabolism -> lipid catabolism?
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_wheat.ddb"), read_only=True)

gba_df = con.execute("""
    WITH has_lipase AS (
        SELECT DISTINCT db_object_id
        FROM gaf_association WHERE ontology_class_ref = 'GO:0004620'
    ),
    has_lipid_met AS (
        SELECT DISTINCT db_object_id
        FROM gaf_association WHERE ontology_class_ref = 'GO:0006629'
    ),
    has_lipid_cat AS (
        SELECT DISTINCT db_object_id
        FROM gaf_association WHERE ontology_class_ref = 'GO:0016042'
    )
    SELECT
        (SELECT COUNT(*) FROM has_lipase) AS n_glycerophospholipase,
        (SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_met USING (db_object_id)) AS n_with_lipid_metabolism,
        (SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_cat USING (db_object_id)) AS n_with_lipid_catabolism;
""").fetchdf()

con.close()

n_lipase = gba_df["n_glycerophospholipase"].iloc[0]
n_met = gba_df["n_with_lipid_metabolism"].iloc[0]
n_cat = gba_df["n_with_lipid_catabolism"].iloc[0]
pct = n_cat / n_met * 100 if n_met > 0 else 0

display(Markdown(f"""
**Guilt-by-association analysis** (wheat proteins with glycerophospholipase activity):

| | Count |
|---|---|
| Proteins with glycerophospholipase activity | {n_lipase} |
| ...that also have `lipid metabolic process` | {n_met} |
| ...that also have `lipid catabolic process` | {n_cat} |
| **Co-annotation rate** | **{pct:.0f}%** |

**{n_cat} out of {n_met}** ({pct:.0f}%) proteins with glycerophospholipase activity and
`lipid metabolic process` also have `lipid catabolic process`. A0A3B6GK97 is one of just
{n_met - n_cat} that lack the more specific term — almost certainly a gap in the existing
annotations rather than a genuinely novel prediction by ProtNLM2.
"""))

display(Markdown("**ProtNLM2 evidence for A0A3B6RKV1:**"))
ev_subset = evidence[evidence["accession"] == "A0A3B6RKV1"]
display(ev_subset[["accession", "evidence_key", "model_score", "match_text",
                    "match_location", "phmmer_accession", "phmmer_score"]].fillna(""))

# The phmmer hit is Q67XX3 = JMJ22 (Arabidopsis). What does JMJ22 have?
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb"), read_only=True)
jmj22 = con.execute(
    "SELECT ontology_class_ref, t.label, evidence_type, assigned_by "
    "FROM gaf_association g "
    "JOIN term_label t ON g.ontology_class_ref = t.id "
    "WHERE db_object_id = 'Q67XX3' "
    "AND ontology_class_ref IN "
    "('GO:0010476', 'GO:0010099', 'GO:0010030', 'GO:0040029', 'GO:0010114') "
    "ORDER BY ontology_class_ref"
).fetchdf()
con.close()

display(Markdown("**JMJ22 (Q67XX3) — the phmmer hit (score 689.2) — has all 5 terms with experimental evidence:**"))
display(jmj22)

from IPython.display import display, Markdown

# For each species, check how many proteins in GOA only have ND annotations
nd_stats = {}
for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
    csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
    taxon_and = f"AND db_object_taxon = '{taxon_str}'" if taxon_str else ""

    con = duckdb.connect(str(db_path), read_only=True)
    nd_df = con.execute(f"""
        WITH protnlm_accs AS (
            SELECT DISTINCT accession FROM read_csv_auto('{csv_path}')
        ),
        protein_annotations AS (
            SELECT
                db_object_id,
                COUNT(*) FILTER (WHERE evidence_type != 'ND') as n_real
            FROM gaf_association
            WHERE db_object_id IN (SELECT accession FROM protnlm_accs)
            {taxon_and}
            GROUP BY db_object_id
        )
        SELECT
            CASE WHEN n_real = 0 THEN 'ND only (no real annotations)'
                 ELSE 'Has functional annotations' END AS annotation_status,
            COUNT(*) AS n_proteins
        FROM protein_annotations
        GROUP BY 1
        ORDER BY 1;
    """).fetchdf()
    con.close()
    nd_stats[species_name] = nd_df

all_nd = pd.concat({k: v.set_index("annotation_status")["n_proteins"] for k, v in nd_stats.items()}, axis=1).fillna(0).astype(int)
display(Markdown("**Proteins in GOA: ND-only vs functionally annotated**"))
display(all_nd)

# Analyze human NOT_IN_GCRP proteins
import urllib.request as urlreq
con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_human.ddb"), read_only=True)

human_accs = set(entries[entries["taxon_id"] == 9606]["accession"])
human_go = predictions[(predictions["pred_type"] == "GO") & (predictions["accession"].isin(human_accs))]
human_go[["accession", "pred_id", "pred_label"]].to_csv("/tmp/protnlm_human_go.csv", index=False)

not_in_gcrp = con.execute("""
    CREATE OR REPLACE TEMP TABLE protnlm AS SELECT * FROM read_csv_auto('/tmp/protnlm_human_go.csv');
    WITH in_goa AS (SELECT DISTINCT db_object_id FROM gaf_association)
    SELECT DISTINCT p.accession
    FROM protnlm p LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
    WHERE ig.db_object_id IS NULL
""").fetchdf()

not_in_gcrp_accs = sorted(not_in_gcrp["accession"])
not_in_gcrp_preds = human_go[human_go["accession"].isin(not_in_gcrp_accs)]

# Cross-check against full QuickGO API
import time
has_full_goa = []
truly_absent = []
for acc in not_in_gcrp_accs:
    url = f"https://www.ebi.ac.uk/QuickGO/services/annotation/search?geneProductId={acc}&limit=1"
    try:
        req = urlreq.Request(url)
        req.add_header("Accept", "application/json")
        with urlreq.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read())
        n = data.get("numberOfHits", 0)
        if n > 0:
            has_full_goa.append((acc, n))
        else:
            truly_absent.append(acc)
    except:
        truly_absent.append(acc)
    time.sleep(0.3)

con.close()

display(Markdown(f"""
**Human NOT_IN_GCRP: {len(not_in_gcrp_accs)} proteins, {len(not_in_gcrp_preds)} GO predictions**

Cross-check against full QuickGO:

| | Count |
|---|---|
| Total NOT_IN_GCRP | {len(not_in_gcrp_accs)} |
| Have annotations in full GOA (not in GCRP subset) | {len(has_full_goa)} |
| **Truly absent from GOA** | **{len(truly_absent)}** |

Over half ({len(has_full_goa)}/{len(not_in_gcrp_accs)}) of the "missing" proteins actually
have GOA annotations — they are just absent from the Gene Centric Reference Proteome subset.
These are typically alternate TrEMBL accessions for proteins whose canonical accession is
in the reference proteome.
"""))

# Show the truly absent ones
display(Markdown("**Truly absent from GOA (no annotations in QuickGO):**"))
truly_absent_info = []
for acc in truly_absent:
    entry = entries[entries["accession"] == acc]
    pname = entry.iloc[0]["protein_name"] if len(entry) > 0 else "?"
    ev = evidence[evidence["accession"] == acc]
    phmmer = ev[ev["phmmer_accession"].notna()]
    hit = f"{phmmer.iloc[0]['phmmer_accession']} ({phmmer.iloc[0]['phmmer_score']})" if len(phmmer) > 0 else "none"
    truly_absent_info.append({"accession": acc, "protein_name": pname, "phmmer_hit": hit})
display(pd.DataFrame(truly_absent_info))

for species_name, detail in all_details.items():
    n_evaluable = len(detail)
    n_proteins = detail["accession"].nunique()

    cat_counts = detail["match_category"].value_counts()
    exact = cat_counts.get("EXACT", 0)
    less = cat_counts.get("LESS_SPECIFIC", 0)
    more = cat_counts.get("MORE_SPECIFIC", 0)
    no_overlap = cat_counts.get("NO_OVERLAP", 0)

    # Check if MORE_SPECIFIC cases are against root terms (ND annotations)
    root_terms = {"molecular_function", "biological_process", "cellular_component"}
    more_rows = detail[detail["match_category"] == "MORE_SPECIFIC"]
    more_vs_root = more_rows["closest_goa_label"].isin(root_terms).sum() if len(more_rows) > 0 else 0

    lines = [
        f"#### {species_name}",
        f"",
        f"**{n_evaluable} predictions** across **{n_proteins} proteins** that are in GOA.",
        f"",
        f"| Category | Count | Interpretation |",
        f"|---|---|---|",
        f"| Exact | {exact} | Predicted term is the same GO term already in GOA |",
        f"| Less specific | {less} | Predicted term is a strict ancestor of a GOA term (redundant) |",
    ]
    if more > 0:
        note = ""
        if more_vs_root > 0:
            note = f" (**{more_vs_root} of these are vs root-level ND annotations** — the protein effectively has no real annotation)"
        lines.append(f"| More specific | {more} | Predicted term is a strict descendant of a GOA term{note} |")
    if no_overlap > 0:
        lines.append(f"| No overlap | {no_overlap} | Predicted term is in a different ontology branch from all GOA annotations |")
    lines.append("")

    display(Markdown("\n".join(lines)))
    display(detail.style.set_properties(**{"text-align": "left"}).set_caption(f"{species_name} — detailed predictions"))

`match_location`	What it means	Example `match_text`
`domain`	Predicted name matched a known UniProt domain description	`WRKY`
`PANTHER`	Matched a PANTHER family/subfamily ID	`PTHR22654`
`GO`	Matched a GO term ID	`GO:0005737`
`keyword`	Matched a UniProt keyword ID	`KW-0472`
`recommended_protein_name`	Matched an existing recommended protein name	`Aldehyde dehydrogenase`
`InterPro`	Matched an InterPro accession	`IPR000225`

Species	DuckDB source	Why chosen
Human	`goa_human.ddb`	Best-annotated proteome; baseline for comparison
X. laevis	`goa_xenopus.ddb`	Allotetraploid frog; many duplicated genes (ohnologs) with potential for paralog divergence
X. tropicalis	`goa_xenopus.ddb`	Diploid frog; simpler genome, comparison point for laevis paralogs
Arabidopsis thaliana	`goa_uniprot_gcrp.ddb`	Well-studied plant; many proteins with only ND annotations
Mouse	`goa_mouse.ddb`	Well-annotated mammalian model organism; 710K GOA annotations
Wheat (T. aestivum)	`goa_wheat.ddb`	Top plant species by entry count (729); large GOA with 512K annotations
Streptomyces coelicolor	`bacteria.ddb`	Model actinobacterium; small sample but good GOA coverage

Category	Code	Definition
Exact	`EXACT`	ProtNLM2 predicted the same GO term that exists in GOA for this protein
Less specific	`LESS_SPECIFIC`	ProtNLM2 predicted a strict ancestor of a GOA term — not wrong, but redundant (e.g. predicts "membrane" when GOA has "plasma membrane")
More specific	`MORE_SPECIFIC`	ProtNLM2 predicted a strict descendant of a GOA term — potentially novel and more informative
No overlap	`NO_OVERLAP`	Protein is in GOA but the predicted term has no ancestor/descendant relationship to any existing annotation — could be novel or incorrect
Not in GCRP	`NOT_IN_GCRP`	Protein is absent from the GCRP (Gene Centric Reference Proteome) subset of GOA — may have annotations in full GOA

GOA subset	Filter	What it captures
InterPro2GO	`assigned_by = 'InterPro'`	Domain-based GO annotation via InterPro matches — the most direct comparison since ProtNLM2 also uses domain/family information
IBA	`evidence_type = 'IBA'`	Phylogenetic annotation from PAINT/GO_Central — orthology-based function transfer
InterPro2GO + IBA	Both combined	The union of the two major automated pipelines

Pattern	Description	Example
Trivial deepening	Prediction is an obvious child of the GOA term	`zinc ion binding` > `metal ion binding`
Phmmer transfer	All predictions trace to the top phmmer hit's annotations	JMJ22 case above
Cross-aspect gap	MF and BP terms are biologically coupled but have no closure path	kinase activity vs phosphorylation
Uninformative CC	Generic location term that adds nothing	`cytoplasm` for a cytoplasmic process
Cross-kingdom error	Annotations from wrong lineage transferred to the target	neuron terms on a plant protein

Predicted term	GO ID
gibberellin mediated signaling pathway	GO:0010476
regulation of photomorphogenesis	GO:0010099
positive regulation of seed germination	GO:0010030
epigenetic regulation of gene expression	GO:0040029
response to red light	GO:0010114

Protein	Species	ProtNLM2 predicts	GOA has
A0A3B6CCN7	Wheat	zinc ion binding	metal ion binding
A0A3B6FNX0	Wheat	iron ion binding	metal ion binding
A0A3B6INB8	Wheat	DNA binding	nucleic acid binding
A0A3B6S8Y8	Wheat	positive regulation of transcription by RNA pol II	regulation of transcription by RNA pol II
Q9M0U6	Arabidopsis	SCF-dependent proteasomal ubiquitin-dependent protein catabolic process	ubiquitin-dependent protein catabolic process

Protein	Species	Prediction	phmmer hit	Score
A0A3B6DJW8	Wheat	defense response to fungus	P93194 (PGIP, Phaseolus)	258
A0A3B6TMW0	Wheat	defense response to fungus	P93194 (PGIP, Phaseolus)	264
A0A3B6HYU7	Wheat	defense response to fungus	P93194 (PGIP, Phaseolus)	275
A0A3B6RKQ5	Wheat	copper homeostasis, cadmium response, protein domain binding	Q94BT9 (ATX1, Arabidopsis)	114
A0A3B6SNL4	Wheat	recognition of pollen	P17801 (SRK, Brassica)	893
A0A3B6LH70	Wheat	negative regulation of ABA signaling, nucleus	Q39266 (AZF2, Arabidopsis)	77
Q9LJ71	Arabidopsis	phosphorylation	O48716	503
Q8N7F7	Human	ER membrane targeting	B7NZQ9	125

Protein	Species	ProtNLM2 predicts	GOA has
A0A3B6H133	Wheat	phosphorylation, kinase activity	ATP binding, signaling receptor activity (IBA)
A0A3B6KN92	Wheat	protein phosphorylation	calcium-dependent protein kinase (IEA)
A0A3B6NKR6	Wheat	phosphorylation	GHMP kinase domain (IEA)
A0A3B6EAI8	Wheat	protein ubiquitination, ub-dependent catabolism	ubiquitin-protein transferase activity (IEA)
A0A3B6EQ70	Wheat	protein ubiquitination	zinc ion binding, calmodulin binding, regulation of transcription (IEA)
A0A3B6MX28	Wheat	methylation	S-adenosylmethionine-dependent methyltransferase (IEA)
Q6P6B1	Human	phosphorylation	protein binding only (IPI) — no evidence for kinase activity
Q9KZ33	S. coelicolor	transcription initiation	sigma factor activity (IBA)

Protein	Species	Prediction	Why uninformative
Q8GXB1	Arabidopsis	membrane	DUF1990 protein — `membrane` is too generic to be useful
A0A3B6GUW2	Wheat	endoplasmic reticulum	Might be correct but low-information without MF/BP context
Q9M0U6	Arabidopsis	SCF ubiquitin ligase complex	NO_OVERLAP — but the protein is an F-box protein, so SCF complex membership is trivially implied

Pattern	Count (approx)	Genuinely novel?
Trivial deepening	~20	No — simple rules or GBA
Phmmer transfer	~15	Maybe — depends on orthology
Cross-aspect gap	~15	No — evaluation artifact
Uninformative CC/noise	~5	No
Cross-kingdom error	1 (6 predictions)	No — actively wrong
Multidomain false positive	~2	No — wrong domain matched

ProtNLM2 Prediction Summary Statistics¶

ProtNLM2 architecture and the Evidencer¶

Post-processing: the Evidencer¶

Evidence types in the XML¶

What does ProtNLM2 predict?¶

Protein name types¶

Swiss-Prot vs TrEMBL breakdown¶

Species and taxonomic breakdown¶

Top 20 species, colored by taxonomic group¶

Entries by broad taxonomic group¶

Prediction types by taxonomic group (stacked)¶

Scores by match location (post-processing validation)¶

Prediction richness by species¶

Model confidence scores¶

Match type breakdown¶

GO term predictions¶

Subcellular location predictions¶

Evidence methods¶

Overlap with AIGR curated reviews¶

Validation: ProtNLM2 GO predictions vs curated GOA annotations¶

Match category definitions¶

Cross-species summary¶

Comparison by GOA annotation source¶

ProtNLM2 vs automated pipelines: where is ProtNLM2 more specific?¶

Note on IBA-specific comparisons¶

Commentary on NO_OVERLAP cases¶

Pattern 1: Uninformative cellular component predictions¶

Pattern 2: Ontology gaps in cross-aspect relationships¶

Case study: A0A3B6GK97 — trivially correct but reveals annotation gaps¶

Case study: A0A3B6RKV1 — annotation transfer from top phmmer hit¶

Systematic review of all MORE_SPECIFIC and NO_OVERLAP predictions¶

Pattern 1: Trivial deepening¶

Pattern 2: Phmmer-based annotation transfer¶

Pattern 3: Cross-aspect ontology gaps¶

Pattern 5: Cross-kingdom annotation transfer errors¶

Pattern 6: Multidomain protein false positives¶

Pattern 4: Uninformative CC and miscellaneous noise¶

Summary of error/novelty patterns¶

Caveat: Swiss-Prot does not mean well-characterized¶

What are the NOT_IN_GCRP proteins?¶

Per-species detailed comparison¶