ProtNLM2 Data Source History

Predictions are fetched from the UniProt REST API endpoint https://rest.uniprot.org/uniprotkb/protnlm/{accession}. The canonical accession list (26,856 entries) is published at the FTP site.
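A single request against this endpoint might look like the sketch below. It uses only the standard library; the helper names (protnlm_url, fetch_prediction), the Accept header, and the timeout are illustrative assumptions and are not taken from fetch_protnlm_api.py.

```python
import json
import urllib.request

# Endpoint template as given in this document.
BASE_URL = "https://rest.uniprot.org/uniprotkb/protnlm/{accession}"


def protnlm_url(accession: str) -> str:
    """Build the request URL for one accession (hypothetical helper)."""
    return BASE_URL.format(accession=accession.strip())


def fetch_prediction(accession: str, timeout: float = 30.0) -> dict:
    """Fetch and decode the JSON document for one accession.

    Assumption: the endpoint answers a plain GET with a JSON body.
    """
    request = urllib.request.Request(
        protnlm_url(accession), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_prediction with an accession from the canonical list would return the decoded JSON as a dict; error handling and retries are deliberately left out of this sketch.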

python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py                 # fetch all + convert to TSVs
python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py --resume        # resume interrupted fetch
python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py --convert-only  # re-convert existing JSONL

The API returns JSON in which the evidence for each prediction is embedded inline. The fetch script saves the raw responses as JSONL (protnlm_api.jsonl) for reproducibility, then converts them to three TSV files (entries, predictions, evidence).
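The JSONL-to-TSV step can be pictured roughly as follows. This is a sketch only: the record layout (accession, name, predictions, and a nested evidence list with score and source) and all output column names are assumptions made for illustration, and the real API schema and the real converter may differ.

```python
import csv
import json


def convert_jsonl(jsonl_lines, entries_out, predictions_out, evidence_out):
    """Split JSONL records into entries, predictions, and evidence rows.

    All field names below are hypothetical placeholders for the real schema.
    """
    entries = csv.writer(entries_out, delimiter="\t", lineterminator="\n")
    predictions = csv.writer(predictions_out, delimiter="\t", lineterminator="\n")
    evidence = csv.writer(evidence_out, delimiter="\t", lineterminator="\n")

    entries.writerow(["accession", "name", "n_predictions"])
    predictions.writerow(["accession", "prediction_index", "type", "value"])
    evidence.writerow(["accession", "prediction_index", "score", "source"])

    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        accession = record["accession"]
        preds = record.get("predictions", [])
        entries.writerow([accession, record.get("name", ""), len(preds)])
        for index, pred in enumerate(preds):
            predictions.writerow(
                [accession, index, pred.get("type", ""), pred.get("value", "")]
            )
            # Evidence is inline per prediction, so it is flattened here.
            for block in pred.get("evidence", []):
                evidence.writerow(
                    [accession, index, block.get("score", ""), block.get("source", "")]
                )
```

Keeping the raw JSONL and deriving the TSVs from it means the conversion can be rerun (as with --convert-only) without contacting the API again.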

Pre-release XML (historical)

The original exploratory analysis used a pre-release XML export (post-processed-2026_02_28k.xml, 28,553 entries) parsed by parse_protnlm_xml.py. That export contained 1,697 entries (28,553 − 26,856) that were subsequently removed during quality filtering: 64.5% of them were Swiss-Prot entries, excluded because the public pilot is TrEMBL-only, and the rest were TrEMBL entries that failed QC. The API serves the same predictions in a cleaner format.

Dataset statistics

Type                  Count           Description
Protein name          28,553          Every entry gets a predicted name (22,467 recommended, 6,086 submitted)
GO terms              6,833 entries   GO annotations derived from model predictions
Subcellular location  13,690 entries  Predicted localization
Function comment      5,438 entries   Free-text functional description
Name only             8,690 entries   Only protein name predicted, no GO/location/function
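The counts quoted in this document can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated here; the variable names are illustrative.

```python
# Figures quoted in this document.
xml_entries = 28_553      # pre-release XML export
api_entries = 26_856      # canonical accession list
recommended = 22_467
submitted = 6_086
name_only = 8_690

# Recommended and submitted names together cover every entry.
assert recommended + submitted == xml_entries

# Entries dropped between the XML export and the public list.
removed = xml_entries - api_entries

# Entries with at least one GO, location, or function prediction.
with_annotations = xml_entries - name_only

print(removed, with_annotations)
```

Note that the GO, location, and function rows overlap (one entry can carry several prediction types), so those three counts are not expected to sum to the number of annotated entries.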

Evidencer corroboration provenance

Each prediction has a model score (0–1, threshold 0.05) and post-hoc corroboration from the Evidencer.
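As an illustration of how the score threshold could be applied downstream, the sketch below keeps predictions whose score is at or above 0.05. Treating the threshold as an inclusive lower cutoff, and storing the score under a "score" key, are both assumptions; the helper names are hypothetical.

```python
SCORE_THRESHOLD = 0.05  # threshold quoted in this document


def passes_threshold(score: float, threshold: float = SCORE_THRESHOLD) -> bool:
    """Return True if a model score meets the cutoff (assumed inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range [0, 1]: {score}")
    return score >= threshold


def keep_confident(predictions, threshold: float = SCORE_THRESHOLD):
    """Filter prediction dicts on their 'score' key (hypothetical field name)."""
    return [p for p in predictions if passes_threshold(float(p["score"]), threshold)]
```

A stricter cutoff can be passed explicitly, for example keep_confident(rows, threshold=0.5), when only high-scoring predictions are wanted.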

TSV files

File             Description
entries.tsv      One row per entry (accession, name, prediction counts)
predictions.tsv  One row per GO/function/location prediction
evidence.tsv     One row per evidence block (scores, provenance)
taxonomy.tsv     Accession -> species mapping from UniProt
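The files are linked by accession, so they can be joined with the standard csv module. The sketch below attaches a species to each entry; the column names "accession" and "species" are assumptions about the TSV headers, and the helper names are hypothetical.

```python
import csv


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def attach_species(entries, taxonomy):
    """Add a 'species' field to each entry row, joined on 'accession'.

    Entries without a taxonomy match get an empty string.
    """
    species_by_accession = {row["accession"]: row["species"] for row in taxonomy}
    return [
        {**row, "species": species_by_accession.get(row["accession"], "")}
        for row in entries
    ]
```

In practice the handles would come from open("entries.tsv", newline="") and open("taxonomy.tsv", newline=""); the same join pattern applies to predictions.tsv and evidence.tsv if they also carry an accession column.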