ProtNLM2 Data Source History
REST API (current, recommended)
Predictions are fetched from the UniProt REST API endpoint https://rest.uniprot.org/uniprotkb/protnlm/{accession}. The canonical accession list (26,856 entries) is published at the FTP site.
python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py # fetch all + convert to TSVs
python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py --resume # resume interrupted fetch
python projects/PROTNLM_EVALUATION/fetch_protnlm_api.py --convert-only # re-convert existing JSONL
The API returns JSON with the evidence inline for each prediction. The fetch script saves the raw responses as JSONL (protnlm_api.jsonl) for reproducibility and converts them to three TSV files (entries, predictions, evidence).
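For orientation, the fetch-and-resume pattern described above can be sketched in a few lines. This is a minimal illustration, not the contents of fetch_protnlm_api.py: the function names, the `done` argument, and the `{"accession": ..., "response": ...}` wrapper written to each JSONL line are assumptions made for the example. Only the endpoint URL comes from this document.

```python
import json
import urllib.request

# Endpoint template quoted in this document.
BASE_URL = "https://rest.uniprot.org/uniprotkb/protnlm/{accession}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_to_jsonl(accessions, out_path, fetch=fetch_json, done=()):
    """Append one JSON line per accession, skipping accessions in `done`.

    `fetch` is injectable so the loop can be exercised without network access.
    The per-line wrapper is a hypothetical layout, not the real script's format.
    """
    done = set(done)
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for acc in accessions:
            if acc in done:
                continue  # already fetched in an earlier, interrupted run
            record = fetch(BASE_URL.format(accession=acc))
            out.write(json.dumps({"accession": acc, "response": record}) + "\n")
            written += 1
    return written
```

Opening the output in append mode and passing the already-fetched accessions as `done` is one simple way to get resume behaviour. The real script may track progress differently.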
Pre-release XML (historical)
The original exploratory analysis used a pre-release XML export (post-processed-2026_02_28k.xml, 28,553 entries) parsed by parse_protnlm_xml.py. This included 1,697 entries that were subsequently removed during quality filtering (64.5% Swiss-Prot entries excluded from the TrEMBL-only public pilot, plus QC-filtered TrEMBL entries), which accounts for the difference from the 26,856 entries in the current accession list. The API serves the same predictions in a cleaner format.
Dataset statistics
| Type | Count | Description |
|---|---|---|
| Protein name | 28,553 entries | Every entry gets a predicted name (22,467 recommended, 6,086 submitted) |
| GO terms | 6,833 entries | GO annotations derived from model predictions |
| Subcellular location | 13,690 entries | Predicted localization |
| Function comment | 5,438 entries | Free-text functional description |
| Name only | 8,690 entries | Only protein name predicted, no GO/location/function |
Evidencer corroboration provenance
Each prediction has a model score (0–1, threshold 0.05) and post-hoc corroboration from the Evidencer:
- domain (9,950): Domain architecture match
- GO (8,727): Direct GO term prediction
- PANTHER (4,190): PANTHER family/subfamily match
- keyword (4,111): UniProt keyword match
- InterPro (1,022): InterPro family match
- recommended_protein_name (543): Name transferred from characterized homolog
- Plus many smaller categories (Pfam, SUPFAM, Gene3D, CDD, etc.)
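A tally like the one above can be reproduced from evidence rows with a small counter. The sketch below assumes each row is a dict with hypothetical `score` and `source` keys, and that the 0.05 threshold is an inclusive lower bound. Neither the column names nor the inclusive comparison is confirmed by this document.

```python
from collections import Counter

SCORE_THRESHOLD = 0.05  # threshold quoted in this document


def summarize_evidence(rows, threshold=SCORE_THRESHOLD):
    """Count corroboration sources among rows whose score meets the threshold.

    `rows` is an iterable of dicts with hypothetical keys "score" and "source".
    Returns (source, count) pairs sorted by descending count.
    """
    counts = Counter()
    for row in rows:
        # Scores read from TSV arrive as strings, so convert before comparing.
        if float(row["score"]) >= threshold:
            counts[row["source"]] += 1
    return counts.most_common()
```

Rows produced by `csv.DictReader` can be passed in directly once the real column names are substituted.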
TSV files
| File | Description |
|---|---|
| entries.tsv | One row per entry (accession, name, prediction counts) |
| predictions.tsv | One row per GO/function/location prediction |
| evidence.tsv | One row per evidence block (scores, provenance) |
| taxonomy.tsv | Accession -> species mapping from UniProt |
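As a rough sketch of how these files fit together, the helpers below read a TSV with a header row, group prediction rows by accession, and list the entries that have no GO, function, or location prediction. The `accession` column name is an assumption. Check the actual headers before relying on it.

```python
import csv
from collections import defaultdict


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_accession(rows, key="accession"):
    """Group rows by accession (the `key` column name is hypothetical)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def name_only(entries, predictions_by_accession, key="accession"):
    """Return accessions from entries that have no rows in predictions."""
    return [e[key] for e in entries if e[key] not in predictions_by_accession]
```

In use, each file would be opened with `open(path, newline="", encoding="utf-8")` and passed to `read_tsv`. The result of `name_only` corresponds to the "Name only" row in the statistics table.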