Table of Contents

    ProtNLM2 Prediction Summary Statistics¶

    Google's ProtNLM2 (UniProt help) predicts protein names, GO terms, subcellular locations, keywords, and function descriptions for unreviewed/TrEMBL UniProtKB entries. This notebook analyzes a post-processed XML export containing 28,553 entries across 440 species.

    Note on dataset version: The public ProtNLM2 pilot release (UniProt 2026_02) covers 26,856 accessions. Our XML export is an earlier pre-release containing 28,553 entries — 1,697 additional entries that were subsequently removed during quality filtering. Some predictions in our dataset (e.g. cross-kingdom GO terms) may have been caught by the production Evidencer pipeline's exclusion criteria before public release.

    ProtNLM2 architecture and the Evidencer¶

    ProtNLM2 is a transformer-based sequence-to-sequence model (T5 architecture) trained on 240 million proteins from UniProt release 2023_04, including both Swiss-Prot and TrEMBL entries. It takes amino acid sequence, organism TaxID, and (in some ensemble members) AlphaFold secondary structure predictions as input and generates natural-language annotations as output — analogous to image captioning but for protein sequences.

    The original ProtNLM (v1) predicted only protein names using an ensemble of 7 models. ProtNLM2 expands this to also predict GO terms, subcellular locations, keywords, and function comments.

    Model score (0–1) is the language model's per-prediction confidence — its probability of generating that specific text given the input sequence. The score threshold is 0.05 for ProtNLM2 (lowered from 0.2 for v1), since many low-score predictions were judged accurate by curators.

    Post-processing: the Evidencer¶

    Because the neural model is a black box, UniProt/Google apply a post-hoc corroboration system called the Evidencer that mimics expert biocurator evaluation. This is NOT part of the model — it runs after prediction to assess reliability:

    1. Exclusion filter: predictions matching GO taxon constraints, nomenclature violations, or malformed IDs are rejected (not displayed)
    2. String/substring match: the predicted text is checked against existing annotations in the entry and cross-referenced databases (InterPro, GO, EC)
    3. Sequence similarity (phmmer): if no string match, checks for homologs in UniProtKB with matching annotations (bit score threshold: 25)
    4. Structural similarity (TM-align): if no sequence match, checks AlphaFold structural alignments (normalized TM-score > 0.5, min 5% seq identity, min pLDDT 70)

    Predictions that pass exclusion AND corroboration are Approved (displayed in UniProtKB). Those that pass exclusion but lack corroboration get Uncertain Quality (not displayed).

    Key insight from our data: the model_score is independent of the evidence strength. For the same phmmer hit (same accession, same bit score), different GO term predictions get different model_scores (e.g. 0.70 for seed germination vs 0.61 for epigenetic regulation — same source protein Q67XX3, score 689.2). The correlation between model_score and phmmer bit score is effectively zero (r = −0.01).

    Evidence types in the XML¶

    The XML evidence blocks show which corroboration method approved each prediction:

    1. String match (56% of evidence) — the predicted text matched a known database entry:

    <evidence type="ECO:0008006" key="1">
      <source>
        <dbReference type="Google" id="ProtNLM2">
          <property type="model_score" value="0.96"/>
          <property type="string_match_text" value="WRKY"/>
          <property type="string_match_location" value="domain"/>
          <property type="string_match_type" value="exact_sanitized"/>
        </dbReference>
      </source>
    </evidence>
    

    2. phmmer (40%) — sequence similarity to characterized proteins:

    <evidence type="ECO:0008006" key="2">
      <source>
        <dbReference type="Google" id="ProtNLM2">
          <property type="model_score" value="0.69"/>
          <property type="phmmer_accession" value="P57750"/>
          <property type="phmmer_score" value="397.7"/>
        </dbReference>
      </source>
    </evidence>
    

    3. TM-align (4%) — structure-based alignment (against AlphaFold models):

    <evidence type="ECO:0008006" key="3">
      <source>
        <dbReference type="Google" id="ProtNLM2">
          <property type="model_score" value="0.45"/>
          <property type="tmalign_accession" value="Q9SM59"/>
          <property type="tmalign_score_chain_1" value="0.56297"/>
          <property type="tmalign_score_chain_2" value="0.31889"/>
        </dbReference>
      </source>
    </evidence>
    

    The data was parsed into three TSV files by parse_protnlm_xml.py:

    • entries.tsv — one row per protein (accession, predicted name, prediction counts)
    • predictions.tsv — one row per GO / function / subcellular-location prediction
    • evidence.tsv — one row per evidence block (model score + provenance)
    In [ ]:
    import pandas as pd
    from pathlib import Path
    import matplotlib.pyplot as plt
    
    plt.rcParams["figure.dpi"] = 120
    
    DATA = Path(".")
    
    entries = pd.read_csv(DATA / "entries.tsv", sep="\t")
    predictions = pd.read_csv(DATA / "predictions.tsv", sep="\t")
    evidence = pd.read_csv(DATA / "evidence.tsv", sep="\t")
    
    # Taxonomy: maps accession -> organism name + taxon_id
    taxonomy = pd.read_csv(DATA / "taxonomy.tsv", sep="\t")
    entries = entries.merge(taxonomy, on="accession", how="left")
    
    # Lineage: maps taxon_id -> broad taxonomic group
    lineage_path = DATA / "lineage.tsv"
    if lineage_path.exists():
        lineage = pd.read_csv(lineage_path, sep="\t")
        entries = entries.merge(lineage, on="taxon_id", how="left")
        print(f"Lineage groups available for {entries['broad_group'].notna().sum():,} entries")
    
    print(f"Entries: {len(entries):,}  |  Predictions: {len(predictions):,}  |  Evidence blocks: {len(evidence):,}")
    print(f"Species: {entries['organism'].nunique()}")
    

    What does ProtNLM2 predict?¶

    Every entry gets a predicted protein name. Beyond that, the model may additionally predict GO terms, subcellular locations, and/or free-text function comments — but not all entries get all prediction types. About 30% of entries receive only a name with no additional functional annotation.

    In [ ]:
    entries["has_go"] = entries["n_go"] > 0
    entries["has_function"] = entries["n_function"] > 0
    entries["has_subcell"] = entries["n_subcell"] > 0
    entries["name_only"] = ~entries["has_go"] & ~entries["has_function"] & ~entries["has_subcell"]
    
    summary = pd.DataFrame({
        "count": [
            len(entries),
            entries["has_go"].sum(),
            entries["has_function"].sum(),
            entries["has_subcell"].sum(),
            entries["name_only"].sum(),
        ],
    }, index=["Total entries", "With GO terms", "With function comment",
             "With subcellular location", "Name only"])
    summary["pct"] = (summary["count"] / len(entries) * 100).round(1)
    summary
    
    In [ ]:
    fig, ax = plt.subplots(figsize=(8, 3.5))
    cats = ["has_go", "has_function", "has_subcell", "name_only"]
    labels = ["GO terms", "Function comment", "Subcellular location", "Name only"]
    vals = [entries[c].sum() for c in cats]
    ax.barh(labels, vals, color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
    ax.set_xlabel("Number of entries")
    ax.set_title("ProtNLM2 prediction types (n=28,553)")
    for i, v in enumerate(vals):
        ax.text(v + 200, i, f"{v:,}", va="center", fontsize=9)
    plt.tight_layout()
    plt.show()
    

    Protein name types¶

    ProtNLM2 assigns either a recommendedName (high-confidence, informative name) or a submittedName (lower-confidence, often echoing the original submission). About 79% receive recommended names.

    In [ ]:
    entries["name_type"].value_counts().to_frame("count")
    

    Swiss-Prot vs TrEMBL breakdown¶

    The XML uses placeholder metadata, so we queried the UniProt REST API to determine the current reviewed status of each accession. This tells us how many of ProtNLM2's predictions target well-characterized (Swiss-Prot) vs uncharacterized (TrEMBL) proteins.

    In [ ]:
    status_path = DATA / "reviewed_status.tsv"
    if status_path.exists():
        status = pd.read_csv(status_path, sep="\t")
        entries = entries.merge(status, on="accession", how="left")
    
        sp_counts = entries["reviewed"].value_counts()
        sp_total = sp_counts.get(True, 0)
        tr_total = sp_counts.get(False, 0)
        print(f"Swiss-Prot (reviewed): {sp_total:,} ({sp_total/len(entries)*100:.1f}%)")
        print(f"TrEMBL (unreviewed):   {tr_total:,} ({tr_total/len(entries)*100:.1f}%)")
    
        # Breakdown by prediction richness
        review_summary = entries.groupby("reviewed").agg(
            n_entries=("accession", "count"),
            pct_go=("has_go", "mean"),
            pct_function=("has_function", "mean"),
            pct_subcell=("has_subcell", "mean"),
            pct_name_only=("name_only", "mean"),
        )
        review_summary.index = review_summary.index.map({True: "Swiss-Prot", False: "TrEMBL"})
        for col in ["pct_go", "pct_function", "pct_subcell", "pct_name_only"]:
            review_summary[col] = (review_summary[col] * 100).round(1)
        print("\nPrediction richness by review status:")
        display(review_summary)
    
        # Chart
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
        # Pie
        axes[0].pie([sp_total, tr_total], labels=["Swiss-Prot", "TrEMBL"],
                    autopct="%1.1f%%", colors=["#4c78a8", "#bab0ac"])
        axes[0].set_title("Swiss-Prot vs TrEMBL")
    
        # Stacked: prediction types by review status
        rev_pred = entries.groupby("reviewed").agg(
            has_go=("has_go", "sum"), has_function=("has_function", "sum"),
            has_subcell=("has_subcell", "sum"), name_only=("name_only", "sum"),
        )
        rev_pred.index = rev_pred.index.map({True: "Swiss-Prot", False: "TrEMBL"})
        rev_pred.plot.barh(stacked=True, ax=axes[1],
                           color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"])
        axes[1].set_xlabel("Number of entries")
        axes[1].set_title("Prediction types by review status")
        axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8)
    
        plt.tight_layout()
        plt.show()
    else:
        print("reviewed_status.tsv not yet available — run fetch reviewed status")
    
    In [ ]:
    if "reviewed" in entries.columns and "organism" in entries.columns:
        # By species (top 30 by total entries)
        top_species = entries["organism"].value_counts().head(30).index
        sp_review = entries[entries["organism"].isin(top_species)].groupby("organism")["reviewed"].agg(
            n_entries="count",
            n_swissprot="sum",
        )
        sp_review["n_trembl"] = sp_review["n_entries"] - sp_review["n_swissprot"]
        sp_review["pct_swissprot"] = (sp_review["n_swissprot"] / sp_review["n_entries"] * 100).round(1)
        sp_review = sp_review.sort_values("pct_swissprot", ascending=True)
    
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.barh(sp_review.index, sp_review["n_swissprot"], label="Swiss-Prot", color="#4c78a8")
        ax.barh(sp_review.index, sp_review["n_trembl"], left=sp_review["n_swissprot"],
                label="TrEMBL", color="#bab0ac")
        ax.set_xlabel("Number of entries")
        ax.set_title("Swiss-Prot vs TrEMBL by species (top 30)")
        ax.legend(fontsize=9)
        for i, (_, row) in enumerate(sp_review.iterrows()):
            if row["n_swissprot"] > 0:
                ax.text(row["n_entries"] + 5, i, f'{row["pct_swissprot"]:.0f}% SP', va="center", fontsize=7)
        plt.tight_layout()
        plt.show()
    
        # By broad taxonomic group
        group_review = entries.groupby("broad_group")["reviewed"].agg(
            n_entries="count", n_swissprot="sum",
        )
        group_review["n_trembl"] = group_review["n_entries"] - group_review["n_swissprot"]
        group_review["pct_swissprot"] = (group_review["n_swissprot"] / group_review["n_entries"] * 100).round(1)
        group_review = group_review.sort_values("pct_swissprot", ascending=True)
    
        print("Swiss-Prot fraction by taxonomic group:")
        display(group_review.sort_values("pct_swissprot", ascending=False))
    

    Species and taxonomic breakdown¶

    Species were resolved by batch-querying the UniProt REST API for each accession. Broad taxonomic groups (Mammal, Land plant, Ray-finned fish, etc.) were assigned by fetching NCBI lineages via the UniProt taxonomy endpoint and matching against major clades.

    In [ ]:
    species_counts = entries["organism"].value_counts()
    print(f"Unique species: {species_counts.shape[0]}")
    print(f"\nTop 30 species:")
    display(species_counts.head(30).to_frame("count"))
    

    Top 20 species, colored by taxonomic group¶

    The dataset is taxonomically diverse: dominated by crop plants (wheat, peanut, sunflower), fish (goldfish, Atlantic cod, salmon), and primates/mammals (horse, human, marmoset).

    In [ ]:
    GROUP_COLORS = {
        "Land plant": "#54a24b",
        "Mammal": "#e45756",
        "Ray-finned fish": "#4c78a8",
        "Cartilaginous fish": "#72b7b2",
        "Bird": "#eeca3b",
        "Amphibian": "#b279a2",
        "Other vertebrate": "#9d755d",
        "Insect": "#f58518",
        "Mollusc": "#ff9da6",
        "Nematode": "#d67195",
        "Fungus": "#9467bd",
        "Bacterium": "#bab0ac",
        "Other animal": "#d4a6c8",
    }
    
    top20 = species_counts.head(20)
    top20_df = top20.to_frame("count").reset_index()
    top20_df.columns = ["organism", "count"]
    
    # Merge broad_group
    org_group = entries[["organism", "broad_group"]].drop_duplicates("organism")
    top20_df = top20_df.merge(org_group, on="organism", how="left")
    top20_df["color"] = top20_df["broad_group"].map(GROUP_COLORS).fillna("#bab0ac")
    
    fig, ax = plt.subplots(figsize=(10, 6))
    bars = ax.barh(range(len(top20_df)), top20_df["count"], color=top20_df["color"])
    ax.set_yticks(range(len(top20_df)))
    ax.set_yticklabels([f"{row.organism}  ({row.broad_group})" for row in top20_df.itertuples()], fontsize=9)
    ax.invert_yaxis()
    ax.set_xlabel("Number of entries")
    ax.set_title("Top 20 species in ProtNLM2 dataset (colored by taxonomic group)")
    for i, v in enumerate(top20_df["count"]):
        ax.text(v + 8, i, str(v), va="center", fontsize=8)
    
    # Legend
    from matplotlib.patches import Patch
    groups_in_chart = top20_df["broad_group"].unique()
    handles = [Patch(facecolor=GROUP_COLORS.get(g, "#bab0ac"), label=g) for g in groups_in_chart if pd.notna(g)]
    ax.legend(handles=handles, loc="lower right", fontsize=8)
    
    plt.tight_layout()
    plt.show()
    

    Entries by broad taxonomic group¶

    Aggregating across all 440 species into broad taxonomic groups shows the overall composition of the dataset.

    In [ ]:
    group_counts = entries["broad_group"].value_counts()
    display(group_counts.to_frame("count"))
    
    fig, ax = plt.subplots(figsize=(8, 4))
    colors = [GROUP_COLORS.get(g, "#bab0ac") for g in group_counts.index]
    group_counts.plot.barh(ax=ax, color=colors)
    ax.invert_yaxis()
    ax.set_xlabel("Number of entries")
    ax.set_title("Entries by broad taxonomic group")
    for i, v in enumerate(group_counts):
        ax.text(v + 50, i, f"{v:,}", va="center", fontsize=8)
    plt.tight_layout()
    plt.show()
    

    Prediction types by taxonomic group (stacked)¶

    Do certain taxonomic groups get richer predictions? This stacked bar chart shows how many entries in each group have GO terms, function comments, subcellular locations, or just a protein name. Note that categories overlap — an entry can have both GO terms and a subcellular location — so the stacked chart shows the fraction of entries in each group that have each prediction type.

    In [ ]:
    pred_by_group = entries.groupby("broad_group").agg(
        n_entries=("accession", "count"),
        has_go=("has_go", "sum"),
        has_function=("has_function", "sum"),
        has_subcell=("has_subcell", "sum"),
        name_only=("name_only", "sum"),
    ).sort_values("n_entries", ascending=True)
    
    # As percentages
    pct = pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].div(
        pred_by_group["n_entries"], axis=0
    ) * 100
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left: stacked counts
    pred_by_group[["has_go", "has_function", "has_subcell", "name_only"]].plot.barh(
        stacked=True, ax=axes[0],
        color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
    )
    axes[0].set_xlabel("Number of entries")
    axes[0].set_title("Prediction types by taxonomic group (counts)")
    axes[0].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")
    
    # Right: percentages
    pct.plot.barh(
        stacked=True, ax=axes[1],
        color=["#4c78a8", "#f58518", "#54a24b", "#bab0ac"]
    )
    axes[1].set_xlabel("% of entries in group")
    axes[1].set_title("Prediction types by taxonomic group (%)")
    axes[1].legend(["GO terms", "Function", "Subcellular", "Name only"], fontsize=8, loc="lower right")
    
    plt.tight_layout()
    plt.show()
    

    Scores by match location (post-processing validation)¶

    The XML file is a post-processed export: after ProtNLM2 generates a free-text prediction, a downstream pipeline cross-references that prediction against known databases to check whether it matches something in UniProt's existing annotation vocabulary. The match_location field records which database the predicted text was validated against:

    match_location What it means Example match_text
    domain Predicted name matched a known UniProt domain description WRKY
    PANTHER Matched a PANTHER family/subfamily ID PTHR22654
    GO Matched a GO term ID GO:0005737
    keyword Matched a UniProt keyword ID KW-0472
    recommended_protein_name Matched an existing recommended protein name Aldehyde dehydrogenase
    InterPro Matched an InterPro accession IPR000225

    The model_score is ProtNLM2's own confidence in the prediction. The box plot below shows how these scores distribute across validation targets — predictions that can be matched to domain descriptions or protein names tend to score higher than those matched to opaque IDs like PANTHER subfamily codes.

    Prediction richness by species¶

    For species with at least 50 entries, what fraction of their proteins get each type of prediction? Sorted by % with GO terms.

    In [ ]:
    sp_groups = entries.groupby("organism").filter(lambda g: len(g) >= 50).groupby("organism")
    richness = sp_groups.agg(
        n_entries=("accession", "count"),
        pct_go=("has_go", "mean"),
        pct_function=("has_function", "mean"),
        pct_subcell=("has_subcell", "mean"),
        mean_go=("n_go", "mean"),
    ).sort_values("pct_go", ascending=False)
    richness[["pct_go", "pct_function", "pct_subcell"]] = (
        richness[["pct_go", "pct_function", "pct_subcell"]] * 100
    ).round(1)
    richness["mean_go"] = richness["mean_go"].round(2)
    display(richness.head(30))
    

    Model confidence scores¶

    Each evidence block carries a model_score between 0 and 1. Higher scores indicate greater model confidence. The distribution is right-skewed, with a notable spike near 1.0 (high-confidence string matches to known domain/family databases).

    In [ ]:
    scores = evidence["model_score"].dropna()
    print(f"Evidence blocks with scores: {len(scores):,}")
    print(scores.describe().to_frame().T)
    
    fig, ax = plt.subplots(figsize=(8, 3.5))
    scores.hist(bins=50, ax=ax, color="#4c78a8", edgecolor="white")
    ax.set_xlabel("Model score")
    ax.set_ylabel("Count")
    ax.set_title("ProtNLM2 model score distribution")
    ax.axvline(scores.median(), color="red", ls="--", lw=1, label=f"median={scores.median():.2f}")
    ax.legend()
    plt.tight_layout()
    plt.show()
    
    In [ ]:
    ev_with_loc = evidence.dropna(subset=["match_location", "model_score"])
    top_locs = ev_with_loc["match_location"].value_counts().head(10).index
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ev_with_loc[ev_with_loc["match_location"].isin(top_locs)].boxplot(
        column="model_score", by="match_location", ax=ax, vert=False
    )
    ax.set_xlabel("Model score")
    ax.set_title("Model score by match location (top 10)")
    fig.suptitle("")
    plt.tight_layout()
    plt.show()
    

    Match type breakdown¶

    ProtNLM2 uses three match types:

    • exact_sanitized: prediction exactly matches a database entry (after normalization)
    • hydrated: GO/keyword IDs were resolved to their full labels
    • substring: partial match to a longer database text

    Evidence blocks from phmmer/tmalign sequence searches have no match_type (they use sequence similarity scores instead).

    In [ ]:
    evidence["match_type"].value_counts(dropna=False).to_frame("count")
    

    GO term predictions¶

    About 24% of entries (6,833) receive GO term predictions. These span 2,006 unique GO terms. The most frequently predicted terms are very generic (nucleus, membrane, DNA binding) — a pattern worth investigating for frequency bias.

    In [ ]:
    go_preds = predictions[predictions["pred_type"] == "GO"]
    print(f"Total GO predictions: {len(go_preds):,}")
    print(f"Unique GO terms: {go_preds['pred_id'].nunique()}")
    
    top_go = go_preds["pred_label"].value_counts().head(20)
    display(top_go.to_frame("count"))
    
    # Aspect breakdown (GO term labels start with C:, F:, or P:)
    go_preds = go_preds.copy()
    go_preds["aspect"] = go_preds["pred_label"].str[0]
    print("\nGO aspect distribution:")
    display(go_preds["aspect"].value_counts().to_frame("count"))
    

    Subcellular location predictions¶

    Nearly half the entries (48%) get subcellular location predictions. Membrane and nucleus dominate.

    In [ ]:
    subcell_preds = predictions[predictions["pred_type"] == "subcellular_location"]
    print(f"Total subcellular location predictions: {len(subcell_preds):,}")
    
    top_locs = subcell_preds["pred_label"].value_counts().head(15)
    fig, ax = plt.subplots(figsize=(8, 4))
    top_locs.plot.barh(ax=ax, color="#54a24b")
    ax.invert_yaxis()
    ax.set_xlabel("Count")
    ax.set_title("Top 15 predicted subcellular locations")
    plt.tight_layout()
    plt.show()
    

    Evidence methods¶

    ProtNLM2 evidence blocks come from three methods:

    • string_match (56%): The predicted text matches a known database entry (domain, GO, PANTHER, etc.)
    • phmmer (40%): Sequence search against characterized proteins (reports accession + bit score)
    • tmalign (4%): Structure-based alignment (reports TM-scores for both chains)

    String matches tend to have higher confidence; tmalign scores are more uniformly distributed.

    In [ ]:
    evidence["method"] = "string_match"
    evidence.loc[evidence["phmmer_accession"].notna(), "method"] = "phmmer"
    evidence.loc[evidence["tmalign_accession"].notna(), "method"] = "tmalign"
    
    method_counts = evidence["method"].value_counts()
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Pie chart
    method_counts.plot.pie(ax=axes[0], autopct="%1.1f%%",
                           colors=["#4c78a8", "#f58518", "#54a24b"])
    axes[0].set_ylabel("")
    axes[0].set_title("Evidence method breakdown")
    
    # Score distributions
    for method, color in [("string_match", "#4c78a8"), ("phmmer", "#f58518"), ("tmalign", "#54a24b")]:
        subset = evidence[evidence["method"] == method]["model_score"].dropna()
        axes[1].hist(subset, bins=40, alpha=0.6, color=color, label=f"{method} (n={len(subset):,})")
    axes[1].set_xlabel("Model score")
    axes[1].set_ylabel("Count")
    axes[1].set_title("Score distribution by evidence method")
    axes[1].legend(fontsize=8)
    
    plt.tight_layout()
    plt.show()
    

    Overlap with AIGR curated reviews¶

    How many of the 28K ProtNLM2 entries overlap with our existing expert-curated gene reviews?

    In [ ]:
    import subprocess
    
    repo_root = Path("../..")
    review_files = list(repo_root.glob("genes/*/*-ai-review.yaml"))
    result = subprocess.run(
        ["grep", "-rh", "^id:", *[str(f) for f in review_files]],
        capture_output=True, text=True
    )
    reviewed_ids = {line.split(":", 1)[1].strip() for line in result.stdout.strip().split("\n") if line}
    protnlm_ids = set(entries["accession"])
    
    overlap = reviewed_ids & protnlm_ids
    print(f"AIGR reviewed genes: {len(reviewed_ids):,}")
    print(f"ProtNLM2 entries: {len(protnlm_ids):,}")
    print(f"Overlap: {len(overlap)}")
    if overlap:
        print(f"\nOverlapping accessions:")
        cols = ["accession", "protein_name", "n_go", "n_function", "n_subcell"]
        if "organism" in entries.columns:
            cols.insert(2, "organism")
        if "broad_group" in entries.columns:
            cols.insert(3, "broad_group")
        display(entries[entries["accession"].isin(overlap)][cols])
    

    Validation: ProtNLM2 GO predictions vs curated GOA annotations¶

    The summary statistics above describe the dataset as a whole. But how good are the predictions? To evaluate, we compare ProtNLM2's GO term predictions against curated GO annotations from the GOA database, using the isa_partof_closure table in DuckDB to capture hierarchical relationships in the ontology.

    We select three species spanning different annotation depths and taxonomic groups:

    Species DuckDB source Why chosen
    Human goa_human.ddb Best-annotated proteome; baseline for comparison
    X. laevis goa_xenopus.ddb Allotetraploid frog; many duplicated genes (ohnologs) with potential for paralog divergence
    X. tropicalis goa_xenopus.ddb Diploid frog; simpler genome, comparison point for laevis paralogs
    Arabidopsis thaliana goa_uniprot_gcrp.ddb Well-studied plant; many proteins with only ND annotations
    Mouse goa_mouse.ddb Well-annotated mammalian model organism; 710K GOA annotations
    Wheat (T. aestivum) goa_wheat.ddb Top plant species by entry count (729); large GOA with 512K annotations
    Streptomyces coelicolor bacteria.ddb Model actinobacterium; small sample but good GOA coverage

    Match category definitions¶

    For each ProtNLM2 GO prediction on a protein that exists in GOA, we classify the relationship using the ontology closure:

    Category Code Definition
    Exact EXACT ProtNLM2 predicted the same GO term that exists in GOA for this protein
    Less specific LESS_SPECIFIC ProtNLM2 predicted a strict ancestor of a GOA term — not wrong, but redundant (e.g. predicts "membrane" when GOA has "plasma membrane")
    More specific MORE_SPECIFIC ProtNLM2 predicted a strict descendant of a GOA term — potentially novel and more informative
    No overlap NO_OVERLAP Protein is in GOA but the predicted term has no ancestor/descendant relationship to any existing annotation — could be novel or incorrect
    Not in GCRP NOT_IN_GCRP Protein is absent from the GCRP (Gene Centric Reference Proteome) subset of GOA — may have annotations in full GOA

    Cross-species summary¶

    In [ ]:
    import duckdb
    
    COMPARE_SPECIES = [
        ("Human", 9606, Path.home() / "repos/go-db/db/goa_human.ddb", None),
        ("Mouse", 10090, Path.home() / "repos/go-db/db/goa_mouse.ddb", None),
        ("X. laevis", 8355, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8355"),
        ("X. tropicalis", 8364, Path.home() / "repos/go-db/db/goa_xenopus.ddb", "taxon:8364"),
        ("Arabidopsis", 3702, Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb", "taxon:3702"),
        ("Wheat", 4565, Path.home() / "repos/go-db/db/goa_wheat.ddb", None),
        ("S. coelicolor", 100226, Path.home() / "repos/go-db/db/bacteria.ddb", "taxon:100226"),
    ]
    
    # Categories in display order
    CAT_ORDER = ["EXACT", "LESS_SPECIFIC", "MORE_SPECIFIC", "NO_OVERLAP", "NOT_IN_GCRP"]
    CAT_LABELS = ["Exact", "Less specific", "More specific", "No overlap", "Not in GCRP"]
    CAT_COLORS = ["#2ca02c", "#f58518", "#4c78a8", "#e45756", "#bab0ac"]
    
    COMPARISON_SQL = """
        CREATE TEMP TABLE protnlm AS
        SELECT * FROM read_csv_auto('{csv_path}');
    
        WITH in_goa AS (
            SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
        ),
        -- EXACT: predicted term is literally in GOA for this protein
        exact AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
            {taxon_where}
        ),
        -- LESS_SPECIFIC: predicted term is a strict ancestor of a GOA term
        less_specific AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
            WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
            {taxon_and}
        ),
        -- MORE_SPECIFIC: predicted term is a strict descendant of a GOA term
        more_specific AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
            WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
            {taxon_and}
        )
        SELECT
            CASE
                WHEN e.pred_id IS NOT NULL THEN 'EXACT'
                WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
                WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
                WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
                ELSE 'NO_OVERLAP'
            END AS match_category,
            COUNT(*) AS n_predictions,
            COUNT(DISTINCT p.accession) AS n_proteins
        FROM protnlm p
        LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
        LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
        LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
        LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
        GROUP BY 1
        ORDER BY 2 DESC;
    """
    
    DETAIL_SQL = """
        WITH in_goa AS (
            SELECT DISTINCT db_object_id FROM gaf_association {taxon_filter}
        ),
        exact AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
            {taxon_where}
        ),
        less_specific AS (
            SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
                   g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
            INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
            WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
            {taxon_and}
        ),
        more_specific AS (
            SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
                   g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
            FROM protnlm p
            INNER JOIN gaf_association g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
            INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
            WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
            {taxon_and}
        )
        SELECT
            p.accession,
            p.pred_id AS protnlm_term,
            pt.label AS protnlm_label,
            CASE
                WHEN e.pred_id IS NOT NULL THEN 'EXACT'
                WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
                WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
                WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
                ELSE 'NO_OVERLAP'
            END AS match_category,
            CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
            CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
        FROM protnlm p
        LEFT JOIN term_label pt ON p.pred_id = pt.id
        LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
        LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
        LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
        LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
        ORDER BY match_category, p.accession;
    """
    
    all_summaries = []
    all_details = {}
    
    for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
        sp_accs = set(entries[entries["taxon_id"] == taxon_id]["accession"])
        sp_go = predictions[
            (predictions["pred_type"] == "GO") & (predictions["accession"].isin(sp_accs))
        ][["accession", "pred_id", "pred_label"]]
        csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
        sp_go.to_csv(csv_path, index=False)
    
        if taxon_str:
            taxon_filter = f"WHERE db_object_taxon = '{taxon_str}'"
            taxon_where = f"WHERE g.db_object_taxon = '{taxon_str}'"
            taxon_and = f"AND g.db_object_taxon = '{taxon_str}'"
        else:
            taxon_filter = taxon_where = taxon_and = ""
    
        con = duckdb.connect(str(db_path), read_only=True)
        summary = con.execute(COMPARISON_SQL.format(
            csv_path=csv_path, taxon_filter=taxon_filter,
            taxon_where=taxon_where, taxon_and=taxon_and
        )).fetchdf()
        summary["species"] = species_name
        all_summaries.append(summary)
    
        detail = con.execute(DETAIL_SQL.format(
            taxon_filter=taxon_filter, taxon_where=taxon_where, taxon_and=taxon_and
        )).fetchdf()
        all_details[species_name] = detail
        con.close()
    
        print(f"{species_name}: {len(sp_go)} GO predictions, {sp_go['accession'].nunique()} proteins")
    
    combined = pd.concat(all_summaries, ignore_index=True)
    pivot = combined.pivot_table(
        index="match_category", columns="species",
        values="n_predictions", fill_value=0, aggfunc="sum"
    ).reindex(CAT_ORDER)
    display(pivot)
    
    # Stacked bar chart
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    pivot_t = pivot.T.reindex(columns=CAT_ORDER)
    pivot_t.plot.barh(stacked=True, ax=axes[0], color=CAT_COLORS)
    axes[0].set_xlabel("Number of GO predictions")
    axes[0].set_title("ProtNLM2 vs GOA (counts)")
    axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")
    
    pct_pivot = pivot_t.div(pivot_t.sum(axis=1), axis=0) * 100
    pct_pivot.plot.barh(stacked=True, ax=axes[1], color=CAT_COLORS)
    axes[1].set_xlabel("% of predictions")
    axes[1].set_title("ProtNLM2 vs GOA (%)")
    axes[1].legend(CAT_LABELS, fontsize=7, loc="lower right")
    plt.tight_layout()
    plt.show()
    

    Comparison by GOA annotation source¶

    The analysis above compares ProtNLM2 against all GOA annotations. But GOA is a mix of high-quality experimental annotations, computational pipelines, and phylogenetic inference. How does ProtNLM2 compare against specific automated pipelines that it might be replacing?

    We repeat the closure-based comparison against three GOA subsets:

    GOA subset Filter What it captures
    InterPro2GO assigned_by = 'InterPro' Domain-based GO annotation via InterPro matches — the most direct comparison since ProtNLM2 also uses domain/family information
    IBA evidence_type = 'IBA' Phylogenetic annotation from PAINT/GO_Central — orthology-based function transfer
    InterPro2GO + IBA Both combined The union of the two major automated pipelines

    This tells us whether ProtNLM2 adds value beyond what existing automated methods already provide.

    In [ ]:
    from IPython.display import display, Markdown
    
    SOURCE_COMPARISON_SQL = """
        CREATE OR REPLACE TEMP TABLE protnlm AS
        SELECT * FROM read_csv_auto('{csv_path}');
    
        WITH goa_filtered AS (
            SELECT * FROM gaf_association
            {taxon_where_base}
            {goa_source_filter}
        ),
        in_goa AS (
            SELECT DISTINCT db_object_id FROM goa_filtered
        ),
        exact AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
        ),
        less_specific AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
            WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
        ),
        more_specific AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
            WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
        )
        SELECT
            CASE
                WHEN e.pred_id IS NOT NULL THEN 'EXACT'
                WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
                WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
                WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
                ELSE 'NO_OVERLAP'
            END AS match_category,
            COUNT(*) AS n_predictions,
            COUNT(DISTINCT p.accession) AS n_proteins
        FROM protnlm p
        LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
        LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
        LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
        LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
        GROUP BY 1
        ORDER BY 2 DESC;
    """
    
    SOURCE_DETAIL_SQL = """
        WITH goa_filtered AS (
            SELECT * FROM gaf_association
            {taxon_where_base}
            {goa_source_filter}
        ),
        in_goa AS (
            SELECT DISTINCT db_object_id FROM goa_filtered
        ),
        exact AS (
            SELECT DISTINCT p.accession, p.pred_id
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id AND p.pred_id = g.ontology_class_ref
        ),
        less_specific AS (
            SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
                   g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON g.ontology_class_ref = ipc.subject
            INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
            WHERE ipc.object = p.pred_id AND g.ontology_class_ref != p.pred_id
        ),
        more_specific AS (
            SELECT DISTINCT ON (p.accession, p.pred_id) p.accession, p.pred_id,
                   g.ontology_class_ref AS goa_match, gt.label AS goa_match_label
            FROM protnlm p
            INNER JOIN goa_filtered g ON p.accession = g.db_object_id
            INNER JOIN isa_partof_closure ipc ON p.pred_id = ipc.subject
            INNER JOIN term_label gt ON g.ontology_class_ref = gt.id
            WHERE ipc.object = g.ontology_class_ref AND g.ontology_class_ref != p.pred_id
        )
        SELECT
            p.accession,
            p.pred_id AS protnlm_term,
            pt.label AS protnlm_label,
            CASE
                WHEN e.pred_id IS NOT NULL THEN 'EXACT'
                WHEN ls.pred_id IS NOT NULL THEN 'LESS_SPECIFIC'
                WHEN ms.pred_id IS NOT NULL THEN 'MORE_SPECIFIC'
                WHEN ig.db_object_id IS NULL THEN 'NOT_IN_GCRP'
                ELSE 'NO_OVERLAP'
            END AS match_category,
            CASE WHEN e.pred_id IS NOT NULL THEN p.pred_id ELSE COALESCE(ls.goa_match, ms.goa_match, '') END AS closest_goa_term,
            CASE WHEN e.pred_id IS NOT NULL THEN pt.label ELSE COALESCE(ls.goa_match_label, ms.goa_match_label, '') END AS closest_goa_label
        FROM protnlm p
        LEFT JOIN term_label pt ON p.pred_id = pt.id
        LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
        LEFT JOIN exact e ON p.accession = e.accession AND p.pred_id = e.pred_id
        LEFT JOIN less_specific ls ON p.accession = ls.accession AND p.pred_id = ls.pred_id
        LEFT JOIN more_specific ms ON p.accession = ms.accession AND p.pred_id = ms.pred_id
        WHERE ig.db_object_id IS NOT NULL
        ORDER BY match_category, p.accession;
    """
    
    GOA_SOURCES = [
        ("All GOA", ""),
        ("InterPro2GO", "AND assigned_by = 'InterPro'"),
        ("IBA", "AND evidence_type = 'IBA'"),
        ("InterPro2GO + IBA", "AND (assigned_by = 'InterPro' OR evidence_type = 'IBA')"),
    ]
    
    source_results = []
    source_details = {}
    
    for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
        csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
        taxon_where_base = f"WHERE db_object_taxon = '{taxon_str}'" if taxon_str else "WHERE 1=1"
    
        con = duckdb.connect(str(db_path), read_only=True)
        for source_name, source_filter in GOA_SOURCES:
            sql = SOURCE_COMPARISON_SQL.format(
                csv_path=csv_path,
                taxon_where_base=taxon_where_base,
                goa_source_filter=source_filter,
            )
            df = con.execute(sql).fetchdf()
            df["species"] = species_name
            df["goa_source"] = source_name
            source_results.append(df)
    
            # Fetch detail for InterPro2GO and IBA comparisons
            if source_name in ("InterPro2GO", "IBA"):
                detail = con.execute(SOURCE_DETAIL_SQL.format(
                    taxon_where_base=taxon_where_base,
                    goa_source_filter=source_filter,
                )).fetchdf()
                source_details[(species_name, source_name)] = detail
    
        con.close()
    
    source_combined = pd.concat(source_results, ignore_index=True)
    
    # Summary chart
    fig, axes = plt.subplots(1, 7, figsize=(32, 5), sharey=False)
    source_order = ["All GOA", "InterPro2GO", "IBA", "InterPro2GO + IBA"]
    for ax, species_name in zip(axes, ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]):
        sp_data = source_combined[source_combined["species"] == species_name]
        pivot = sp_data.pivot_table(
            index="goa_source", columns="match_category",
            values="n_predictions", fill_value=0, aggfunc="sum"
        ).reindex(columns=CAT_ORDER).reindex(source_order)
        pct = pivot.div(pivot.sum(axis=1), axis=0) * 100
        pct.plot.barh(stacked=True, ax=ax, color=CAT_COLORS, legend=False)
        ax.set_xlabel("% of predictions")
        ax.set_title(species_name)
    axes[0].legend(CAT_LABELS, fontsize=7, loc="lower right")
    plt.suptitle("ProtNLM2 vs GOA by annotation source (%)", fontsize=12)
    plt.tight_layout()
    plt.show()
    
    # Per-species summary tables
    for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
        sp_data = source_combined[source_combined["species"] == species_name]
        pivot = sp_data.pivot_table(
            index="goa_source", columns="match_category",
            values="n_predictions", fill_value=0, aggfunc="sum"
        ).reindex(columns=CAT_ORDER).reindex(source_order)
        display(Markdown(f"#### {species_name}"))
        display(pivot)
    

    ProtNLM2 vs automated pipelines: where is ProtNLM2 more specific?¶

    The most interesting comparisons are against InterPro2GO (domain-based GO annotation) and IBA (phylogenetic annotation from PAINT/GO_Central). When ProtNLM2 predictions are MORE_SPECIFIC or NO_OVERLAP compared to these pipelines, this represents genuine added value: predictions that go beyond what existing automated methods provide.

    Below we show the detailed comparison for each species against each pipeline, highlighting predictions where ProtNLM2 is more specific or in a different ontology branch.

    In [ ]:
    for goa_source_name in ["InterPro2GO", "IBA"]:
        display(Markdown(f"---"))
        display(Markdown(f"### ProtNLM2 vs {goa_source_name}: where is ProtNLM2 more specific?"))
    
        for species_name in ["Human", "Mouse", "X. laevis", "X. tropicalis", "Arabidopsis", "Wheat", "S. coelicolor"]:
            key = (species_name, goa_source_name)
            if key not in source_details:
                continue
            detail = source_details[key]
            n_evaluable = len(detail)
            n_proteins = detail["accession"].nunique()
            cat_counts = detail["match_category"].value_counts()
    
            lines = [
                f"#### {species_name} vs {goa_source_name}",
                f"",
                f"**{n_evaluable} predictions** for **{n_proteins} proteins** that have {goa_source_name} annotations.",
                f"",
                f"| Category | Count |",
                f"|---|---|",
            ]
            for cat in CAT_ORDER:
                if cat != "NOT_IN_GCRP":
                    lines.append(f"| {cat} | {cat_counts.get(cat, 0)} |")
            lines.append("")
    
            # Enrich with evidence data
            detail_enriched = detail.merge(
                predictions[["accession", "pred_id", "evidence_key"]],
                left_on=["accession", "protnlm_term"],
                right_on=["accession", "pred_id"],
                how="left"
            ).merge(
                evidence[["accession", "evidence_key", "model_score", "match_text",
                           "match_location", "match_type", "phmmer_accession", "phmmer_score",
                           "tmalign_accession", "tmalign_score1"]],
                on=["accession", "evidence_key"],
                how="left"
            )
    
            more = detail_enriched[detail_enriched["match_category"] == "MORE_SPECIFIC"]
            if len(more) > 0:
                lines.append(f"**{len(more)} predictions where ProtNLM2 is more specific than {goa_source_name}:**")
                lines.append("")
                lines.append(f"| Protein | ProtNLM2 predicts | Score | Evidence | {goa_source_name} has |")
                lines.append("|---|---|---|---|---|")
                for _, row in more.iterrows():
                    score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
                    # Build evidence summary
                    if pd.notna(row.get('phmmer_accession')):
                        ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
                    elif pd.notna(row.get('tmalign_accession')):
                        ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
                    elif pd.notna(row.get('match_text')):
                        ev_str = f"{row['match_location']}: {row['match_text']}"
                    else:
                        ev_str = ""
                    lines.append(
                        f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
                        f"| {row.protnlm_label} (`{row.protnlm_term}`) "
                        f"| {score} | {ev_str} "
                        f"| {row.closest_goa_label} (`{row.closest_goa_term}`) |"
                    )
                lines.append("")
    
            no_ov = detail_enriched[detail_enriched["match_category"] == "NO_OVERLAP"]
            if len(no_ov) > 0:
                lines.append(f"**{len(no_ov)} predictions with no ontological overlap to {goa_source_name}:**")
                lines.append("")
                lines.append("| Protein | ProtNLM2 predicts | Score | Evidence |")
                lines.append("|---|---|---|---|")
                for _, row in no_ov.iterrows():
                    score = f"{row['model_score']:.2f}" if pd.notna(row.get('model_score')) else "?"
                    if pd.notna(row.get('phmmer_accession')):
                        ev_str = f"phmmer: {row['phmmer_accession']} ({row['phmmer_score']})"
                    elif pd.notna(row.get('tmalign_accession')):
                        ev_str = f"tmalign: {row['tmalign_accession']} ({row['tmalign_score1']})"
                    elif pd.notna(row.get('match_text')):
                        ev_str = f"{row['match_location']}: {row['match_text']}"
                    else:
                        ev_str = ""
                    lines.append(
                        f"| [{row.accession}](https://www.uniprot.org/uniprotkb/{row.accession}) "
                        f"| {row.protnlm_label} (`{row.protnlm_term}`) "
                        f"| {score} | {ev_str} |"
                    )
                lines.append("")
    
            display(Markdown("\n".join(lines)))
    

    Note on IBA-specific comparisons¶

    Some predictions classified as MORE_SPECIFIC vs IBA are already present in GOA via other evidence codes. For example:

    Q9BYD2 (MRPL9, human mitochondrial ribosomal protein)

    ProtNLM2 predicts ribosome (GO:0005840) and ribonucleoprotein complex (GO:1990904), classified as MORE_SPECIFIC vs IBA (which only has mitochondrion). But GOA already has both terms via IEA, plus structural constituent of ribosome via IEA and NAS. The protein is a ribosomal protein — membership in the ribosome complex is literally what it is.

    This highlights a gap in GO's logical inference: if a protein has structural constituent of ribosome (MF), it should by definition be annotated to ribosome (CC). GO lacks the axiom that would make this automatic (something like: "structural constituent of X" implies "part_of X"). ProtNLM2 captures this trivially, but so would a simple rule:

    If annotated to structural constituent of ribosome → add ribosome

    This is another instance of the "bookkeeping" pattern — the prediction is correct but trivially derivable from existing annotations.

    Commentary on NO_OVERLAP cases¶

    Inspecting the NO_OVERLAP cases reveals two common patterns that inflate the apparent novelty of ProtNLM2 predictions:

    Pattern 1: Uninformative cellular component predictions¶

    O87844 has IBA annotation for ribosome disassembly — a process that obviously occurs in the cytoplasm. ProtNLM2 predicts cytoplasm (GO:0005737), which is classified as NO_OVERLAP because there is no is_a/part_of path between a BP term (ribosome disassembly) and a CC term (cytoplasm). But this prediction adds zero information: it would be surprising if ribosome disassembly happened outside the cytoplasm. Generic CC terms like cytoplasm, membrane, and nucleus are low-value predictions that are almost always inferable from the process or function annotations.

    Pattern 2: Ontology gaps in cross-aspect relationships¶

    Q9KZ33 has IBA annotation for sigma factor activity (GO:0016987). ProtNLM2 predicts DNA-templated transcription initiation (GO:0006352). These are biologically tightly coupled — sigma factors perform transcription initiation — but the ontology models this indirectly:

    • sigma factor activity is an MF term whose ancestors include regulation of DNA-templated transcription initiation and transcription regulator activity
    • DNA-templated transcription initiation is a BP term under DNA-templated transcription
    • There is no is_a/part_of path between "regulation of X" and "X" itself

    So the closure-based comparison classifies this as NO_OVERLAP, when biologically the prediction is entirely consistent with the IBA annotation. This is a limitation of using is_a/part_of closure for cross-aspect comparisons — a more sophisticated evaluation would use broader relationship types (e.g. regulates, enables) or semantic similarity metrics.

    Case study: A0A3B6GK97 — trivially correct but reveals annotation gaps¶

    ProtNLM2 predicts lipid catabolic process (GO:0016042) for wheat protein A0A3B6GK97, which is classified as MORE_SPECIFIC because GOA only has the parent term lipid metabolic process (GO:0006629).

    Is this prediction likely correct? The protein has IBA annotations for glycerophospholipase activity and monoacylglycerol lipase activity — both hydrolytic enzymes that degrade lipids. Catabolism is literally what lipases do. We can quantify this by asking: among all wheat proteins with glycerophospholipase activity that also have lipid metabolic process, how many also have lipid catabolic process?

    In [ ]:
    # Co-annotation analysis: glycerophospholipase + lipid metabolism -> lipid catabolism?
    con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_wheat.ddb"), read_only=True)
    
    gba_df = con.execute("""
        WITH has_lipase AS (
            SELECT DISTINCT db_object_id
            FROM gaf_association WHERE ontology_class_ref = 'GO:0004620'
        ),
        has_lipid_met AS (
            SELECT DISTINCT db_object_id
            FROM gaf_association WHERE ontology_class_ref = 'GO:0006629'
        ),
        has_lipid_cat AS (
            SELECT DISTINCT db_object_id
            FROM gaf_association WHERE ontology_class_ref = 'GO:0016042'
        )
        SELECT
            (SELECT COUNT(*) FROM has_lipase) AS n_glycerophospholipase,
            (SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_met USING (db_object_id)) AS n_with_lipid_metabolism,
            (SELECT COUNT(*) FROM has_lipase INNER JOIN has_lipid_cat USING (db_object_id)) AS n_with_lipid_catabolism;
    """).fetchdf()
    
    con.close()
    
    n_lipase = gba_df["n_glycerophospholipase"].iloc[0]
    n_met = gba_df["n_with_lipid_metabolism"].iloc[0]
    n_cat = gba_df["n_with_lipid_catabolism"].iloc[0]
    pct = n_cat / n_met * 100 if n_met > 0 else 0
    
    display(Markdown(f"""
    **Guilt-by-association analysis** (wheat proteins with glycerophospholipase activity):
    
    | | Count |
    |---|---|
    | Proteins with glycerophospholipase activity | {n_lipase} |
    | ...that also have `lipid metabolic process` | {n_met} |
    | ...that also have `lipid catabolic process` | {n_cat} |
    | **Co-annotation rate** | **{pct:.0f}%** |
    
    **{n_cat} out of {n_met}** ({pct:.0f}%) proteins with glycerophospholipase activity and
    `lipid metabolic process` also have `lipid catabolic process`. A0A3B6GK97 is one of just
    {n_met - n_cat} that lack the more specific term — almost certainly a gap in the existing
    annotations rather than a genuinely novel prediction by ProtNLM2.
    """))
    

    Interpretation: This is a case where ProtNLM2 trivially outperforms the existing annotation — not because the model has deep biological insight, but because the existing annotations could easily be deepened by simple means:

    • Guilt-by-association (GBA): if 94% of proteins with enzyme activity X also have process annotation Y, propagate Y to the remainder
    • Curatorial rules: lipases are catabolic enzymes by definition; a curator reviewing lipase activity + lipid metabolic process should always add lipid catabolic process
    • Logical inference: GO could define axioms like "lipase activity enables lipid catabolic process" to make this deduction automatic

    This pattern — where ML predictions look impressive but are really just filling obvious gaps that existing pipelines neglect — is important context for evaluating ProtNLM2. The model isn't discovering new biology here; it's doing bookkeeping that the annotation ecosystem hasn't gotten around to yet.

    Case study: A0A3B6RKV1 — annotation transfer from top phmmer hit¶

    ProtNLM2 makes 5 highly specific plant biology predictions for wheat protein A0A3B6RKV1 (JmjC domain-containing protein with F-box), all classified as MORE_SPECIFIC compared to IBA:

    Predicted term GO ID
    gibberellin mediated signaling pathway GO:0010476
    regulation of photomorphogenesis GO:0010099
    positive regulation of seed germination GO:0010030
    epigenetic regulation of gene expression GO:0040029
    response to red light GO:0010114

    These are very specific developmental processes — not generic terms you would associate with any JmjC protein. Only epigenetic regulation of gene expression has any overlap with histone demethylases in Arabidopsis (4 of 78 proteins). The others (gibberellin, photomorphogenesis, seed germination, red light) have zero co-occurrence with JmjC proteins in GOA.

    So where do these predictions come from?

    In [ ]:
    display(Markdown("**ProtNLM2 evidence for A0A3B6RKV1:**"))
    ev_subset = evidence[evidence["accession"] == "A0A3B6RKV1"]
    display(ev_subset[["accession", "evidence_key", "model_score", "match_text",
                        "match_location", "phmmer_accession", "phmmer_score"]].fillna(""))
    
    # The phmmer hit is Q67XX3 = JMJ22 (Arabidopsis). What does JMJ22 have?
    con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_uniprot_gcrp.ddb"), read_only=True)
    jmj22 = con.execute(
        "SELECT ontology_class_ref, t.label, evidence_type, assigned_by "
        "FROM gaf_association g "
        "JOIN term_label t ON g.ontology_class_ref = t.id "
        "WHERE db_object_id = 'Q67XX3' "
        "AND ontology_class_ref IN "
        "('GO:0010476', 'GO:0010099', 'GO:0010030', 'GO:0040029', 'GO:0010114') "
        "ORDER BY ontology_class_ref"
    ).fetchdf()
    con.close()
    
    display(Markdown("**JMJ22 (Q67XX3) — the phmmer hit (score 689.2) — has all 5 terms with experimental evidence:**"))
    display(jmj22)
    

    Every single prediction is a direct copy of JMJ22 experimentally validated annotations.

    The wheat protein hits Arabidopsis JMJ22 (arginine-specific demethylase) via phmmer with a strong score (689.2). JMJ22 has all 5 terms with experimental evidence (IMP, IGI, IDA, IEP). ProtNLM2 simply transfers the top hit annotations — this is the same principle as ISS/ISO evidence in GO, or what IBA does via PAINT phylogenetic trees.

    The IBA annotations for this wheat protein already capture the MF (cis-regulatory region sequence-specific DNA binding) and CC (nucleus) from JMJ22. But the BP terms did not propagate via IBA — likely because the PAINT tree for this family does not include BP annotations at the relevant ancestral node, or the wheat protein falls outside the annotated clade.

    Interpretation: These predictions are probably correct if the wheat protein is truly a JMJ22 ortholog (the phmmer score suggests it is). But ProtNLM2 is not doing anything beyond annotation transfer from the closest characterized sequence. The "added value" over IBA is that ProtNLM2 transfers BP annotations that PAINT more conservative phylogenetic approach chose not to propagate — a difference in methodology (aggressive phmmer transfer vs. conservative clade-based inference), not in biological insight.

    Systematic review of all MORE_SPECIFIC and NO_OVERLAP predictions¶

    Having examined two cases in detail, we now survey all remaining MORE_SPECIFIC and NO_OVERLAP predictions across the four species. They fall into a small number of recurring patterns:

    Pattern Description Example
    Trivial deepening Prediction is an obvious child of the GOA term zinc ion binding > metal ion binding
    Phmmer transfer All predictions trace to the top phmmer hit's annotations JMJ22 case above
    Cross-aspect gap MF and BP terms are biologically coupled but have no closure path kinase activity vs phosphorylation
    Uninformative CC Generic location term that adds nothing cytoplasm for a cytoplasmic process
    Cross-kingdom error Annotations from wrong lineage transferred to the target neuron terms on a plant protein

    Below we walk through each pattern with specific examples.

    Pattern 1: Trivial deepening¶

    These predictions simply replace a broad term with an obvious child. They are almost certainly correct but could be achieved by simple rules or GBA — no ML needed.

    Protein Species ProtNLM2 predicts GOA has
    A0A3B6CCN7 Wheat zinc ion binding metal ion binding
    A0A3B6FNX0 Wheat iron ion binding metal ion binding
    A0A3B6INB8 Wheat DNA binding nucleic acid binding
    A0A3B6S8Y8 Wheat positive regulation of transcription by RNA pol II regulation of transcription by RNA pol II
    Q9M0U6 Arabidopsis SCF-dependent proteasomal ubiquitin-dependent protein catabolic process ubiquitin-dependent protein catabolic process

    Several more predictions are MORE_SPECIFIC against root terms (molecular_function, biological_process, cellular_component) — these proteins effectively have no annotation and any non-root prediction is trivially "more specific": F4IRM4 (plastid vs CC root), F4K5X1 (metal ion binding vs MF root), Q0WW53 (catalytic activity vs MF root), Q8LG27 (phosphorylation vs BP root), Q8RU85 (membrane vs CC root), Q9FFI4 (transferase activity vs MF root), Q9FFW3 (defense response vs BP root), Q9LIR0 (ubiquitin-protein transferase vs MF root), Q9LKR4 (nucleus vs CC root), Q9SSF9 (DNA binding vs MF root).

    Pattern 2: Phmmer-based annotation transfer¶

    Like the JMJ22 case, these predictions are direct copies of the top phmmer hit's annotations. The predictions may be correct if orthology holds, but the method is essentially ISS/ISO — sequence-similarity-based annotation transfer.

    Protein Species Prediction phmmer hit Score
    A0A3B6DJW8 Wheat defense response to fungus P93194 (PGIP, Phaseolus) 258
    A0A3B6TMW0 Wheat defense response to fungus P93194 (PGIP, Phaseolus) 264
    A0A3B6HYU7 Wheat defense response to fungus P93194 (PGIP, Phaseolus) 275
    A0A3B6RKQ5 Wheat copper homeostasis, cadmium response, protein domain binding Q94BT9 (ATX1, Arabidopsis) 114
    A0A3B6SNL4 Wheat recognition of pollen P17801 (SRK, Brassica) 893
    A0A3B6LH70 Wheat negative regulation of ABA signaling, nucleus Q39266 (AZF2, Arabidopsis) 77
    Q9LJ71 Arabidopsis phosphorylation O48716 503
    Q8N7F7 Human ER membrane targeting B7NZQ9 125

    The three polygalacturonase inhibitors (PGIP) all hit the same Phaseolus PGIP and get defense response to fungus — a well-established function for this protein family. | Q93IY1 | S. coelicolor | L-phosphoserine phosphatase activity | P40399 (RsbU, B. subtilis) | 39 |

    The pollen recognition prediction for A0A3B6SNL4 (phmmer score 893 to Brassica SRK) is interesting but may be wrong: SRK-mediated self-incompatibility is specific to Brassicaceae and wheat uses a different self-incompatibility system.

    Pattern 3: Cross-aspect ontology gaps¶

    These are classified as NO_OVERLAP because the protein has an MF annotation and ProtNLM2 predicts the corresponding BP term (or vice versa), but there is no is_a/part_of path between the two aspects. The predictions are biologically plausible — they describe the same biology from a different perspective.

    Protein Species ProtNLM2 predicts GOA has
    A0A3B6H133 Wheat phosphorylation, kinase activity ATP binding, signaling receptor activity (IBA)
    A0A3B6KN92 Wheat protein phosphorylation calcium-dependent protein kinase (IEA)
    A0A3B6NKR6 Wheat phosphorylation GHMP kinase domain (IEA)
    A0A3B6EAI8 Wheat protein ubiquitination, ub-dependent catabolism ubiquitin-protein transferase activity (IEA)
    A0A3B6EQ70 Wheat protein ubiquitination zinc ion binding, calmodulin binding, regulation of transcription (IEA)
    A0A3B6MX28 Wheat methylation S-adenosylmethionine-dependent methyltransferase (IEA)
    Q6P6B1 Human phosphorylation protein binding only (IPI) — no evidence for kinase activity
    Q9KZ33 S. coelicolor transcription initiation sigma factor activity (IBA)

    This is the largest category of apparent "novelty" — and it is entirely an artifact of evaluating with is_a/part_of closure only. Using broader relationship types (enables, regulates, part_of across aspects) or semantic similarity would collapse most of these into EXACT or LESS_SPECIFIC.

    Pattern 5: Cross-kingdom annotation transfer errors¶

    The most serious failure mode: ProtNLM2 transfers annotations from the wrong taxonomic lineage via phmmer, producing biologically impossible predictions.

    F6LAX4 — wheat TOG domain protein

    phmmer hit: A0A7S3G569 (Palpitomonas bilix, a protist), score 420.

    ProtNLM2 predicts:

    • neuron projection (GO:0043005) — plants do not have neurons
    • neuronal cell body (GO:0043025) — plants do not have neurons
    • protein antigen binding (GO:1990405) — adaptive immunity term, not applicable to plants
    • chromosome, centromeric region (GO:0000775) — plausible but likely from wrong context
    • chromosome segregation (GO:0007059) — plausible
    • protein heterodimerization activity (GO:0046982) — generic, plausible

    TOG domains are conserved across eukaryotes (they bind tubulin and regulate microtubule polymerization), but the phmmer hit picks up a protist protein whose annotations include animal-specific neuron terms. ProtNLM2 has no organism-awareness filter and blindly transfers everything.

    This is the same class of error we documented in the BioReason evaluation — failure to respect taxonomic context — and argues for organism-aware filtering in any annotation transfer pipeline.

    Pattern 6: Multidomain protein false positives¶

    A distinct failure mode from cross-kingdom transfer: the phmmer hit matches a shared accessory domain in a multidomain protein, and the catalytic activity of the unrelated domain gets transferred.

    F4JLB7 — RIC7, Arabidopsis LRR protein

    ProtNLM2 predicts kinase activity (GO:0016301) and phosphorylation (GO:0016310), both with model_score 0.23 (low, but still reported).

    The evidence:

    • phmmer hit: Q5S006 = LRRK2 (mouse leucine-rich repeat serine/threonine-protein kinase 2), score 33.0 — barely above noise threshold
    • PANTHER: PTHR48057:SF13 — but UniProt actually assigns F4JLB7 to PTHR48004 (a completely different family). ProtNLM2 matched the wrong PANTHER family.

    What's happening: LRRK2 is a large (2,527 aa) multidomain protein with LRR + ROC + COR + kinase domains. F4JLB7 (RIC7) is a small LRR protein that functions as a ROP GTPase effector in pollen tube growth. The phmmer hit matches only the LRR repeat region (a common structural domain), but the kinase prediction comes from LRRK2's unrelated kinase domain.

    RIC7 has:

    • LRR repeats (structural, no catalytic activity)
    • IBA for signaling receptor activity and plasma membrane
    • IMP for pollen tube growth
    • No kinase domain, no kinase activity

    Verdict: false positive. The model_score (0.23) is appropriately low, suggesting ProtNLM2 has some calibration — it's not confident in this prediction. But it still reports it. This failure mode — annotation leakage from one domain of a multidomain hit to a protein that only shares a different domain — is a well-known problem in sequence-based annotation transfer (sometimes called the "multi-domain problem" or "domain architecture mismatch").

    Pattern 4: Uninformative CC and miscellaneous noise¶

    Some predictions are technically not wrong but add no useful information:

    Protein Species Prediction Why uninformative
    Q8GXB1 Arabidopsis membrane DUF1990 protein — membrane is too generic to be useful
    A0A3B6GUW2 Wheat endoplasmic reticulum Might be correct but low-information without MF/BP context
    Q9M0U6 Arabidopsis SCF ubiquitin ligase complex NO_OVERLAP — but the protein is an F-box protein, so SCF complex membership is trivially implied

    The SCF complex case (Q9M0U6) is particularly instructive: the protein is annotated as an F-box protein, and F-box proteins are by definition components of SCF complexes. This is another "bookkeeping" prediction that any rule-based system could make.

    Summary of error/novelty patterns¶

    Across all four species, every MORE_SPECIFIC and NO_OVERLAP prediction falls into one of five patterns:

    Pattern Count (approx) Genuinely novel?
    Trivial deepening ~20 No — simple rules or GBA
    Phmmer transfer ~15 Maybe — depends on orthology
    Cross-aspect gap ~15 No — evaluation artifact
    Uninformative CC/noise ~5 No
    Cross-kingdom error 1 (6 predictions) No — actively wrong
    Multidomain false positive ~2 No — wrong domain matched

    None of the predictions we examined represent genuinely novel biological insight. The closest are the phmmer transfer cases (e.g. JMJ22, PGIP), where the predictions may be correct but the method (sequence similarity transfer) is decades old and already implemented in GO as ISS/ISO evidence. ProtNLM2 adds value primarily by being more aggressive than PAINT/IBA in transferring annotations — at the cost of occasional cross-kingdom errors like the neuron-in-wheat case.

    Note on evaluation methodology: semantic similarity metrics (Resnik, Lin) as used in CAFA are not a good alternative to closure-based evaluation — they reward predictions that are ontologically close but biologically opposite (e.g. positive vs negative regulation). Our closure-based approach with manual inspection of edge cases is more reliable, though it requires supplementing the is_a/part_of closure with broader relationship types (enables, regulates) for cross-aspect comparisons.

    Caveat: Swiss-Prot does not mean well-characterized¶

    Some Swiss-Prot entries have no functional annotation beyond root-level ND (No Data) terms like molecular_function or biological_process. These are reviewed for existence and structure but remain functionally uncharacterized.

    Example: Q9SSF9 — Pentatricopeptide repeat-containing protein At1g74750

    This is Swiss-Prot since 2004 but GOA has only GO:0003674 molecular_function (ND). ProtNLM2 predicts DNA binding, chloroplast localization, and organelle-nucleus signaling. The localization predictions are probably correct (most PPR proteins are organellar), but "DNA binding" is likely wrong — PPR proteins bind RNA, not DNA. This is a classic domain-level error where the model confuses structurally similar PPR/TPR repeat architectures.

    When ProtNLM2 predictions are classified as MORE_SPECIFIC, many are against ND-only baselines — "anything is more specific than nothing." The table below separates proteins with real functional annotations from those with only ND root terms.

    In [ ]:
    from IPython.display import display, Markdown
    
    # For each species, check how many proteins in GOA only have ND annotations
    nd_stats = {}
    for species_name, taxon_id, db_path, taxon_str in COMPARE_SPECIES:
        csv_path = f"/tmp/protnlm_{species_name.lower().replace(' ', '_').replace('.', '')}_go.csv"
        taxon_and = f"AND db_object_taxon = '{taxon_str}'" if taxon_str else ""
    
        con = duckdb.connect(str(db_path), read_only=True)
        nd_df = con.execute(f"""
            WITH protnlm_accs AS (
                SELECT DISTINCT accession FROM read_csv_auto('{csv_path}')
            ),
            protein_annotations AS (
                SELECT
                    db_object_id,
                    COUNT(*) FILTER (WHERE evidence_type != 'ND') as n_real
                FROM gaf_association
                WHERE db_object_id IN (SELECT accession FROM protnlm_accs)
                {taxon_and}
                GROUP BY db_object_id
            )
            SELECT
                CASE WHEN n_real = 0 THEN 'ND only (no real annotations)'
                     ELSE 'Has functional annotations' END AS annotation_status,
                COUNT(*) AS n_proteins
            FROM protein_annotations
            GROUP BY 1
            ORDER BY 1;
        """).fetchdf()
        con.close()
        nd_stats[species_name] = nd_df
    
    all_nd = pd.concat({k: v.set_index("annotation_status")["n_proteins"] for k, v in nd_stats.items()}, axis=1).fillna(0).astype(int)
    display(Markdown("**Proteins in GOA: ND-only vs functionally annotated**"))
    display(all_nd)
    

    What are the NOT_IN_GCRP proteins?¶

    A large fraction of ProtNLM2 predictions cannot be evaluated because the protein is absent from the GCRP (Gene Centric Reference Proteome) subset of GOA used in our DuckDB databases. But are these proteins truly unannotated, or just absent from the GCRP subset?

    We cross-checked against the full QuickGO API to distinguish:

    • In full GOA but not GCRP — annotations exist, just not in the reference proteome subset
    • Truly absent from GOA — no annotations anywhere
    In [ ]:
    # Analyze human NOT_IN_GCRP proteins
    import urllib.request as urlreq
    con = duckdb.connect(str(Path.home() / "repos/go-db/db/goa_human.ddb"), read_only=True)
    
    human_accs = set(entries[entries["taxon_id"] == 9606]["accession"])
    human_go = predictions[(predictions["pred_type"] == "GO") & (predictions["accession"].isin(human_accs))]
    human_go[["accession", "pred_id", "pred_label"]].to_csv("/tmp/protnlm_human_go.csv", index=False)
    
    not_in_gcrp = con.execute("""
        CREATE OR REPLACE TEMP TABLE protnlm AS SELECT * FROM read_csv_auto('/tmp/protnlm_human_go.csv');
        WITH in_goa AS (SELECT DISTINCT db_object_id FROM gaf_association)
        SELECT DISTINCT p.accession
        FROM protnlm p LEFT JOIN in_goa ig ON p.accession = ig.db_object_id
        WHERE ig.db_object_id IS NULL
    """).fetchdf()
    
    not_in_gcrp_accs = sorted(not_in_gcrp["accession"])
    not_in_gcrp_preds = human_go[human_go["accession"].isin(not_in_gcrp_accs)]
    
    # Cross-check against full QuickGO API
    import time
    has_full_goa = []
    truly_absent = []
    for acc in not_in_gcrp_accs:
        url = f"https://www.ebi.ac.uk/QuickGO/services/annotation/search?geneProductId={acc}&limit=1"
        try:
            req = urlreq.Request(url)
            req.add_header("Accept", "application/json")
            with urlreq.urlopen(req, timeout=30) as resp:
                data = json.loads(resp.read())
            n = data.get("numberOfHits", 0)
            if n > 0:
                has_full_goa.append((acc, n))
            else:
                truly_absent.append(acc)
        except:
            truly_absent.append(acc)
        time.sleep(0.3)
    
    con.close()
    
    display(Markdown(f"""
    **Human NOT_IN_GCRP: {len(not_in_gcrp_accs)} proteins, {len(not_in_gcrp_preds)} GO predictions**
    
    Cross-check against full QuickGO:
    
    | | Count |
    |---|---|
    | Total NOT_IN_GCRP | {len(not_in_gcrp_accs)} |
    | Have annotations in full GOA (not in GCRP subset) | {len(has_full_goa)} |
    | **Truly absent from GOA** | **{len(truly_absent)}** |
    
    Over half ({len(has_full_goa)}/{len(not_in_gcrp_accs)}) of the "missing" proteins actually
    have GOA annotations — they are just absent from the Gene Centric Reference Proteome subset.
    These are typically alternate TrEMBL accessions for proteins whose canonical accession is
    in the reference proteome.
    """))
    
    # Show the truly absent ones
    display(Markdown("**Truly absent from GOA (no annotations in QuickGO):**"))
    truly_absent_info = []
    for acc in truly_absent:
        entry = entries[entries["accession"] == acc]
        pname = entry.iloc[0]["protein_name"] if len(entry) > 0 else "?"
        ev = evidence[evidence["accession"] == acc]
        phmmer = ev[ev["phmmer_accession"].notna()]
        hit = f"{phmmer.iloc[0]['phmmer_accession']} ({phmmer.iloc[0]['phmmer_score']})" if len(phmmer) > 0 else "none"
        truly_absent_info.append({"accession": acc, "protein_name": pname, "phmmer_hit": hit})
    display(pd.DataFrame(truly_absent_info))
    

    Interpretation: The NOT_IN_GCRP category splits into two distinct groups:

    1. In full GOA but not GCRP — These are alternate TrEMBL accessions for proteins whose canonical accession has annotations in the reference proteome. ProtNLM2's predictions for these are likely correct but entirely redundant. Example: F5H290 (ZNF7) has a canonical accession P17097 with 11 GOA annotations.

    2. Truly absent from GOA — These are genuinely interesting. The standout case is J3KSL8 (926 aa, PE=1 — evidence at protein level, UniProt's highest confidence) which hits Q9Y2L5 (TRAPPC8) at phmmer score 2102 yet has zero GOA annotations. It's unclear why the normal UniProt/GOA pipelines failed to produce any annotations for a protein with such strong homology to a well-characterized Swiss-Prot entry. This represents a genuine gap in the GOA pipeline rather than an achievement of ProtNLM2.

    Other truly-absent proteins tend to be domain-containing fragments with moderate phmmer hits (100–350 range) or uncharacterized proteins with no phmmer support at all.

    Per-species detailed comparison¶

    Below we drill into each species individually. For every ProtNLM2 GO prediction where the protein exists in GOA, we show the predicted term, the match category, and the closest GOA term. This lets us inspect exactly what ProtNLM2 gets right, where it is redundant, and where it might be adding novel (or incorrect) information.

    In [ ]:
    for species_name, detail in all_details.items():
        n_evaluable = len(detail)
        n_proteins = detail["accession"].nunique()
    
        cat_counts = detail["match_category"].value_counts()
        exact = cat_counts.get("EXACT", 0)
        less = cat_counts.get("LESS_SPECIFIC", 0)
        more = cat_counts.get("MORE_SPECIFIC", 0)
        no_overlap = cat_counts.get("NO_OVERLAP", 0)
    
        # Check if MORE_SPECIFIC cases are against root terms (ND annotations)
        root_terms = {"molecular_function", "biological_process", "cellular_component"}
        more_rows = detail[detail["match_category"] == "MORE_SPECIFIC"]
        more_vs_root = more_rows["closest_goa_label"].isin(root_terms).sum() if len(more_rows) > 0 else 0
    
        lines = [
            f"#### {species_name}",
            f"",
            f"**{n_evaluable} predictions** across **{n_proteins} proteins** that are in GOA.",
            f"",
            f"| Category | Count | Interpretation |",
            f"|---|---|---|",
            f"| Exact | {exact} | Predicted term is the same GO term already in GOA |",
            f"| Less specific | {less} | Predicted term is a strict ancestor of a GOA term (redundant) |",
        ]
        if more > 0:
            note = ""
            if more_vs_root > 0:
                note = f" (**{more_vs_root} of these are vs root-level ND annotations** — the protein effectively has no real annotation)"
            lines.append(f"| More specific | {more} | Predicted term is a strict descendant of a GOA term{note} |")
        if no_overlap > 0:
            lines.append(f"| No overlap | {no_overlap} | Predicted term is in a different ontology branch from all GOA annotations |")
        lines.append("")
    
        display(Markdown("\n".join(lines)))
        display(detail.style.set_properties(**{"text-align": "left"}).set_caption(f"{species_name} — detailed predictions"))