PDB: Deposited Structures as Functional-Insight Evidence

MATURE PIPELINE

Goal

Many genes in this review pipeline have experimentally solved 3D structures deposited in
the Protein Data Bank. A structure is not just a picture: when it
captures a bound ligand, cofactor, catalytic metal, or partner macromolecule, it is
direct, experimental evidence for molecular function, binding, and complex membership — the
very things GO annotation review hinges on.

This project (1) inventories what is deposited across the whole pipeline, and (2) prioritizes
the structures most likely to change or strengthen a gene's functional annotation, so that
structure-informed review effort goes where it pays off.

What "functional insight" means here

A deposited structure is most valuable for review when what is bound in it tells us
something the current annotations do not. The prioritization therefore combines two axes:

  1. Structure richness: what the structure actually contains:
     - bound cofactor / catalytic metal (FAD, NAD(P), PQQ, F430, Fe–S clusters, heme, Zn, …)
       → strongest single clue to catalytic mechanism / molecular function
     - bound substrate / product / inhibitor ligand → active-site and specificity evidence
     - complex (≥2 protein entities, or bound nucleic acid) → interaction & CC evidence
     - high sequence coverage (full-length vs a short peptide fragment in a partner's structure)

  2. Annotation sparsity: a structure adds the most where we currently know the least.
     Genes whose GO annotations are electronic-only (IEA/IBA) with no experimental
     molecular-function term are upweighted: here a ligand-bound structure (plus literature)
     can ground a specific, experiment-grade MF annotation that today rests only on inference.

The two are multiplied: richness × sparsity. An apo structure of an already
well-annotated protein scores low; a cofactor-bound, full-length structure of an IEA-only
enzyme scores high.
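
The richness × sparsity product can be sketched as below. This is a minimal illustration, not the project's scoring code: the weights, the full-length cutoff, the sparsity formula, and every name (`GeneEvidence`, `richness`, `sparsity`, `priority`) are assumptions, since the text gives only the ordering of the signals.

```python
from dataclasses import dataclass

# Hypothetical weights: the text ranks the signals but gives no numeric values.
RICHNESS_WEIGHTS = {"cofactor": 3.0, "ligand": 2.0, "complex": 1.5, "full_length": 1.0}


@dataclass
class GeneEvidence:
    has_cofactor: bool      # bound cofactor / catalytic metal
    has_ligand: bool        # bound substrate / product / inhibitor
    is_complex: bool        # >=2 protein entities, or bound nucleic acid
    coverage: float         # fraction of the sequence resolved, 0..1
    n_experimental_mf: int  # experimental molecular-function GO terms


def richness(g: GeneEvidence, full_length_cutoff: float = 0.8) -> float:
    """Sum the weights of the richness signals present (cutoff is an assumption)."""
    score = 0.0
    if g.has_cofactor:
        score += RICHNESS_WEIGHTS["cofactor"]
    if g.has_ligand:
        score += RICHNESS_WEIGHTS["ligand"]
    if g.is_complex:
        score += RICHNESS_WEIGHTS["complex"]
    if g.coverage >= full_length_cutoff:
        score += RICHNESS_WEIGHTS["full_length"]
    return score


def sparsity(g: GeneEvidence) -> float:
    """Assumed form: 1.0 for IEA-only genes, shrinking as experimental MF terms accumulate."""
    return 1.0 / (1 + g.n_experimental_mf)


def priority(g: GeneEvidence) -> float:
    """The two axes are multiplied, as described in the text."""
    return richness(g) * sparsity(g)


apo_well_annotated = GeneEvidence(False, False, False, 0.95, 5)
cofactor_bound_iea_only = GeneEvidence(True, False, False, 0.95, 0)
print(priority(apo_well_annotated), priority(cofactor_bound_iea_only))
```

Under these assumed weights the apo, well-annotated protein lands well below the cofactor-bound, IEA-only enzyme, which is the ordering the text describes.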

What is deposited (pipeline-wide inventory)

Computed offline from the cached UniProt records (*-uniprot.txt, DR PDB cross-references):

Metric                                                                  Count
Genes with ≥1 deposited PDB structure                                   949 / 2529
Total deposited PDB entries                                             13,415
Eukaryotic genes with a structure                                       815
Genes with a structure but no experimental GO at all                    73
Genes with a structure but no experimental molecular-function GO        114
Genes where a review disputed a catalytic MF (REMOVE/over-annotated)    130

Top organisms by genes-with-structure: human (588), yeast (65), ARATH (53), BPT4 (32),
mouse (30), ECOLI (23), PSEPK (19), SCHPO (17), worm (17), rat (15).
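
As an illustration of the offline step, the sketch below pulls PDB cross-references out of a UniProt flat-file record. It is not the code in inventory_pdb.py; the function name `pdb_xrefs` and the sample record (with placeholder PDB IDs) are invented for this example. It relies only on the standard flat-file layout, in which each cross-reference line starts with `DR`, three spaces, and the database name followed by semicolon-separated fields.

```python
from typing import List, Tuple


def pdb_xrefs(uniprot_text: str) -> List[Tuple[str, str]]:
    """Return (pdb_id, method) pairs from the 'DR   PDB;' lines of a flat-file record."""
    xrefs = []
    for line in uniprot_text.splitlines():
        # Requiring the ';' right after 'PDB' skips 'DR   PDBsum;' and other databases.
        if not line.startswith("DR   PDB;"):
            continue
        fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
        if len(fields) >= 3:
            xrefs.append((fields[1], fields[2]))
    return xrefs


# Made-up record with placeholder identifiers, for illustration only.
SAMPLE = """\
ID   EXAMPLE_PROT            Reviewed;         100 AA.
DR   PDB; 1ABC; X-ray; 2.10 A; A/B=1-100.
DR   PDB; 2XYZ; NMR; -; A=5-60.
DR   PDBsum; 1ABC; -.
"""

print(pdb_xrefs(SAMPLE))
```

Counting the genes for which this list is non-empty, and summing its length, would give the kind of per-gene and per-entry totals reported in the inventory.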

Three prioritization cuts

enrich_rcsb.py enriches the union of three candidate cuts (278 genes; each tagged in the
candidate_reason column), so the priority list is no longer prokaryote-dominated:

Cut        Definition                                          Genes  What structure adds
dark_mf    no experimental molecular-function GO               114    grounds a first experiment-grade MF
euk        eukaryotic and ≤2 experimental MF terms             100    sharpens an IEA term / pins complex membership
contested  review marked a catalytic MF REMOVE/over-annotated  130    adjudicates the disputed activity (pseudo-enzyme / over-general / wrong-specific)

Full inventory: PDB/data/pdb_inventory.tsv (per entry) and
pdb_gene_summary.tsv (per gene).
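
The three cut definitions translate directly into a tagging rule. The sketch below is one way to write it, assuming the tags are joined with commas in alphabetical order (as they appear in the reason column further down); the function name and its arguments are hypothetical, not taken from enrich_rcsb.py.

```python
def candidate_reason(n_experimental_mf: int, is_eukaryote: bool, contested: bool) -> str:
    """Return the comma-joined cut tags for one gene ('' if no cut applies)."""
    reasons = []
    if contested:                               # a review marked a catalytic MF REMOVE/over-annotated
        reasons.append("contested")
    if n_experimental_mf == 0:                  # no experimental molecular-function GO
        reasons.append("dark_mf")
    if is_eukaryote and n_experimental_mf <= 2:  # eukaryotic and <=2 experimental MF terms
        reasons.append("euk")
    return ",".join(reasons)


print(candidate_reason(0, True, True))
print(candidate_reason(5, True, True))
print(candidate_reason(0, False, False))
```

Because a gene can satisfy more than one definition, the union of the cuts is smaller than the sum of their sizes.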

Prioritized candidates (first pass)

The 114 genes with a structure but no experimental MF GO were enriched with RCSB metadata
(bound ligands, entity counts, titles); 105 of them resolved. These are IEA-only
enzymes/proteins whose function is structurally (and often biochemically) characterized but
whose GO annotations have not been promoted beyond electronic inference, which makes them
the sweet spot for structure-grounded review. Top-ranked candidates (rank in the # column):

#   Gene     Organism  UniProt  n PDB  cofactor  ligand  complex  cofactors/ligands
1   rpsD     PSEAE     O52759   6                                 GDP,ZN
2   psaC     CHLRE     Q00914   17                                FES,SF4
3   secA     BACSU     P28366   18                                ADP
4   (DsrAB)  DESVH     P07598   12                                FE2,HEC,SF4,ZN
5   mcrA     METAC     Q8THH1   4                                 COB,F430,FE,SAM,SF4
6   wac      BPT4      P10104   117                               ZN
7   rbcL     9POAL     P0C512   3                                 NDP (RuBisCO)
8   algK     PSEPK     Q88NC7   1                                 NI
9   mxaI     METEA     P14775   3                                 PQQ
10  fae      METEA     Q9FA38   11                                H4MPT,DCP,CA,MG
12  cbh1     HYPJE     P62694   48                                (cellobiohydrolase)
13  pqqB     PSEPK     Q88QV5   8                                 CU,MN,ZN
16  merA     PSEAI     P00392   6                                 FAD,NADP
17  mtdA     METEA     P55818   5                                 NADP
18  pcaF     PSEPK     Q88N39   4                                 COA
23  mdh      METEA     Q84FY8   2                                 NAD
25  xoxF1    METEA     C5B120   2                                 PQQ

Full ranked list: PDB/data/pdb_gene_enriched.tsv (now includes the
RCSB per-entry structure-paper PMIDs and an is_eukaryote flag).

Eukaryotic candidates (so they aren't drowned out)

Broadening beyond the strict "no experimental MF" cut to euk + contested surfaces a
much richer eukaryotic slice (815 eukaryotic genes have a structure). Top eukaryotic
candidates by score, with the cut(s) that flagged them:

#   Gene   Org    UniProt  reason                 nPDB  cof  lig  cplx  cofactors/ligands               paper
1   psaC   CHLRE  Q00914   dark_mf,euk            17                    FES,SF4 (Photosystem I)         PMID:36979472
2   rbcL   9POAL  P0C512   dark_mf,euk            3                     NDP (RuBisCO)                   PMID:22609438
3   PNO1   yeast  Q99216   euk                    28                    GTP,ZN (ribosome assembly)      PMID:33326748
4   RPS3   human  P23396   contested              133                   ZN (ribosomal / endonuclease?)  PMID:29875412
5   NAA15  human  Q9BXJ9   contested              10                    AcCoA (NatA auxiliary)          PMID:40639378
6   HEN1   ARATH  Q9C5Q8   contested,euk          1                     SAH (RNA methyltransferase)     PMID:19812675
7   TERT   human  O14746   contested              17                    (telomerase RT)                 PMID:27903649
8   cbh1   HYPJE  P62694   contested,dark_mf,euk  48                    cellobiohydrolase Cel7A         PMID:26307003
9   XYL1   PICST  P31867   contested,dark_mf,euk  2                     NADP (xylose reductase)         PMID:30487522
10  UPF1   human  Q92900   contested              11                    ATP,Zn (NMD helicase)           PMID:38709891
11  BRCA2  human  P51587   contested              14                    ATP (HR mediator)               PMID:40441151
12  DOT1   yeast  Q04089   contested              5                     SAM/SAH (H3K79 MTase)           PMID:33479126
13  SIRT2  human  Q8IXJ6   contested              60                    NAD,Zn (deacetylase)            PMID:28286128

The verified flagships (IDH3B, ATAD1, XYL1, psaC, COX6B1, SPR, COI1) are written up in
PDB/STRUCTURE_PAPERS.md. For human genes the structure typically sharpens an
existing IEA term or pins complex membership rather than revealing function from scratch.

Contested catalytic functions (structure adjudicates)

130 genes with a structure have a review that marked a catalytic molecular function as
REMOVE or over-annotated. A deposited structure is decisive here: it shows whether the
cofactor/active-site pocket is actually present. The action column separates two distinct
cases, REMOVE and over-annotated (don't conflate them):

Gene   Org    disputed catalytic MF                 action          cofactor present?          paper
HEN1   ARATH  peptidyl-prolyl cis-trans isomerase   REMOVE          SAH (→ methyltransferase)  PMID:19812675
CASP3  human  aspartic-type endopeptidase           REMOVE          (cysteine protease)
mcrA   METAC  transferase activity (generic)        REMOVE          F430,SAM,Fe-S              PMID:39772843
HAP1   human  deoxyribonuclease (pyrimidine dimer)  REMOVE          Mn (AP endonuclease)       PMID:25251148
BRCA2  human  histone acetyltransferase             REMOVE          ATP                        PMID:40441151
DOT1   yeast  methyltransferase activity (generic)  over-annotated  SAM/SAH                    PMID:33479126
SIRT2  human  transferase activity (generic)        over-annotated  NAD,Zn                     PMID:28286128
pcaF   PSEPK  acyltransferase activity (generic)    over-annotated  CoA                        PMID:32647822
XYL1   PICST  oxidoreductase activity (generic)     REMOVE          NADP                       PMID:30487522

Full list with all disputed terms per gene: pdb_gene_enriched.tsv (candidate_reason
contains contested; contested_cat_mf lists the term/label/action). As always, the
disputed-term mapping and the structure PMID must both be verified before use.
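
To pull the contested subset out of pdb_gene_enriched.tsv, a sketch along these lines should be enough. It uses the two column names given above (`candidate_reason`, `contested_cat_mf`) and assumes `candidate_reason` is comma-separated; the `gene` column name and the sample rows, including the encoding of `contested_cat_mf`, are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def contested_rows(tsv_text: str) -> List[Dict[str, str]]:
    """Return the rows whose candidate_reason includes the 'contested' tag."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if "contested" in row["candidate_reason"].split(",")]


# Illustrative rows only; the real file's columns and value encoding may differ.
SAMPLE_TSV = "\n".join([
    "gene\tcandidate_reason\tcontested_cat_mf",
    "HEN1\tcontested,euk\tpeptidyl-prolyl cis-trans isomerase (REMOVE)",
    "psaC\tdark_mf,euk\t",
])

for row in contested_rows(SAMPLE_TSV):
    print(row["gene"], "|", row["contested_cat_mf"])
```

Splitting on the comma rather than using a plain substring test keeps a future tag that merely contains the word "contested" from matching by accident.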

Grounding in the structure papers

A bound ligand is the clue; the primary structure paper carries the functional
interpretation. PDB/STRUCTURE_PAPERS.md records verified, PubMed-sourced
notes for the shortlist above (and the prokaryotic flagships mcrA, merA, pcaF), with the
GO-annotation implication for each.

Caveat surfaced by doing this: the structure_papers PMIDs are the RCSB per-entry
primary citation, that is, the paper that deposited that coordinate set. This is often a
downstream ligand/inhibitor or methods study rather than the definitive structure/function
paper. Verified drift cases: SPR's PMIDs are inhibitor-screening papers; cbh1's are
glycosylation/propranolol NMR; merA's is the N-terminal NmerA-domain NMR. Others (XYL1, psaC,
IDH3B, ATAD1, pcaF, COX6B1) are the definitive paper. Each PMID must be verified against the
gene before it is cited in a review (the "verify, don't trust" rule), exactly as
STRUCTURE_PAPERS.md does.

Patterns worth noting: a cluster of methylotrophy / PQQ-dependent dehydrogenases
(METEA mxaI, xoxF1, mdh, mtdA, fae, PSEPK pedH, pqqB) and redox cofactor enzymes
(merA FAD/NADP, psaC/mcrA/pqqE Fe–S, DsrAB heme/siroheme). These have diagnostic
cofactors visible in their structures, making them low-effort, high-yield review targets.

Reproduce

# 1. offline inventory from cached UniProt + GOA (no network)
python3 projects/PDB/inventory_pdb.py
# 2. RCSB enrichment of the prioritized candidate genes (network)
python3 projects/PDB/enrich_rcsb.py
# 3. structure-paper -> GOA citation gap (offline); writes CURATION_GAP.md
python3 projects/PDB/curation_gap.py
# 4. ranked GAP_OPPORTUNITY review worklist (offline); writes GAP_WORKLIST.md
python3 projects/PDB/gap_worklist.py
# 5. H1 frontier test set: GAP_NO_EXP_CURATION genes (offline); writes data/h1_testset.tsv
python3 projects/PDB/h1_testset.py

See PDB/RESULTS.md for method detail, caveats, and the full output schema.

Does structural evidence fill annotation gaps? (PDB/H1_LEDGER.md)

H1_LEDGER.md tests whether structures fill GO gaps that traditional publications would
not. It records the GAP_OPPORTUNITY→GAP_NO_EXP_CURATION methodology correction, the
three-evidence-layer model of a structure paper (coordinates → low-information binding
terms; the paper's integrative hypothesis → informative function; sequence/EC → catalytic
identity), and the Layer-2 scoring pass. Bottom line: structures reliably supply first
experimental-grade evidence for under-curated proteins, but genuinely new informative
function is rare, throttled by subunit mismatch and GO expressivity.

Caveats

Are structure papers overlooked by curation? (PDB/CURATION_GAP.md)

PDB/curation_gap.py measures, for every deposited structure with a linked primary
publication, whether that PMID is cited in the gene's GOA REFERENCE column. Across
737 structure-paper × gene pairs (247 genes), only 15% are cited by GOA; 65%
are GAP_OPPORTUNITY (the paper predates the gene's last experimental annotation yet is
never referenced), and 174/247 genes cite zero of their structure papers. "Not cited"
means the structural study is absent from the evidence trail, not that the function is
unannotated — but it quantifies how under-used the structural literature is as a GO evidence
source. The lag boundary uses the latest experimental annotation year, since overall GOA
dates are inflated by IEA/IBA pipeline refreshes.

The reusable GOA-citation helper lives in core
(ai_gene_review.validation.goa_validator.referenced_pmids); the analysis is
project-specific.
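
The per-pair classification described above can be sketched as follows. Only the GAP_OPPORTUNITY label and its definition come from this document; the "CITED" and "OTHER" labels, the function name, and the strict "earlier year" reading of "predates" are assumptions. The sketch takes a plain set of PMIDs and does not call the `referenced_pmids` helper, whose signature is not shown here.

```python
from typing import Optional, Set


def classify_pair(pmid: str, paper_year: int, goa_pmids: Set[str],
                  last_experimental_year: Optional[int]) -> str:
    """Classify one structure-paper x gene pair against the gene's GOA references."""
    if pmid in goa_pmids:
        return "CITED"  # hypothetical label: the paper is already in the evidence trail
    # Lag boundary: latest *experimental* annotation year, not the overall GOA date.
    if last_experimental_year is not None and paper_year < last_experimental_year:
        return "GAP_OPPORTUNITY"
    return "OTHER"      # hypothetical catch-all for the remaining pairs


# Placeholder PMIDs and years, for illustration only.
goa = {"111"}
print(classify_pair("111", 2010, goa, 2018))
print(classify_pair("222", 2010, goa, 2018))
print(classify_pair("333", 2021, goa, 2018))
```

Passing `None` for genes with no experimental annotation keeps them out of GAP_OPPORTUNITY, since there is no experimental annotation year for the paper to predate.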

Prioritized worklist (PDB/GAP_WORKLIST.md)

PDB/gap_worklist.py ranks the GAP_OPPORTUNITY papers by gene priority
(dark-MF / eukaryote / contested) plus the cofactor / ligand / complex richness of the
uncited structures, collapsed to one row per gene (the review unit). Top targets:
yeast PNO1, human RPS3, human BIRC5, human GCH1, human SIRT2, human MAPK1,
ARATH CRY2. Per-paper detail in PDB/data/gap_worklist.tsv.
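
Collapsing per-paper rows to one ranked row per gene might look like the sketch below. How gap_worklist.py actually combines the two terms is not stated here, so this assumes a simple sum of a gene-priority value and the richest uncited structure for that gene; the field names and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def collapse_per_gene(papers: List[Dict]) -> List[Tuple[str, float]]:
    """Keep each gene's best (gene_priority + richness) score and rank genes by it."""
    best: Dict[str, float] = {}
    for paper in papers:
        score = paper["gene_priority"] + paper["richness"]
        gene = paper["gene"]
        if gene not in best or score > best[gene]:
            best[gene] = score
    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Invented scores, for illustration only.
papers = [
    {"gene": "PNO1", "gene_priority": 2.0, "richness": 3.0},
    {"gene": "PNO1", "gene_priority": 2.0, "richness": 1.0},
    {"gene": "RPS3", "gene_priority": 1.0, "richness": 3.0},
]
print(collapse_per_gene(papers))
```

Taking the maximum rather than the sum over a gene's papers avoids rewarding genes simply for having many deposited entries; that choice is also an assumption.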

Next steps