ARO ↔ GO mappings & the UniProt→ARO→GO pipeline
Part of the Antimicrobial Resistance project.
Files
| File | What it is |
|---|---|
| aro2go.sssom.yaml | Curated ARO → GO mapping set in SSSOM format (source of truth). CURIE+label tuples throughout; validates with linkml-validate. |
| aro2go.terms.yaml | Generated nested term-tuple form of the mapping, validated by linkml-term-validator (do not edit by hand). |
| sssom_to_terms.py | Converter: reshapes the flat SSSOM rows into the nested {id,label} form the term-validator checks. |
| uniprot2aro2go.py | Pipeline that chains UniProt → ARO → GO by applying the SSSOM mapping. |
| aro2go.html | Curator-facing HTML view of the mappings + gaps (CARD/AmiGO links). Generated by render_sssom_html.py (just render-mappings). |
| ANNOTATION_GAIN.md | Report of candidate GO annotations UniProt would gain. Generated by annotation_gain_report.py (just annotation-gain). |
| data/uniprot_card_xrefs.tsv | Snapshot of UniProtKB entries with a CARD cross-reference (ARO id + existing GO), from the UniProt REST API. |
| data/candidate_new_annotations.tsv | One row per (entry, candidate new GO term) produced by the gain report. |
| examples/rgi_example_mphA.txt | Tiny example of CARD RGI tab-delimited output, used to demonstrate the sequence-based route. |
| examples/rgi_example_betalactamases.txt | RGI example (CTX-M-15/KPC-2/NDM-1) demonstrating exact-or-narrower propagation from the beta-lactamase family node. |
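To make the flat-versus-nested distinction in the table concrete, here is a minimal sketch of the reshaping step. It is not the code in sssom_to_terms.py: the function name `sssom_row_to_term_tuple` and the exact output keys are assumptions for illustration, and only the standard SSSOM slot names (`subject_id`, `subject_label`, `predicate_id`, `object_id`, `object_label`) are taken as given.

```python
def sssom_row_to_term_tuple(row):
    """Reshape one flat SSSOM row into nested {id, label} tuples.

    Hypothetical helper: the real converter may emit additional slots.
    """
    return {
        "subject": {"id": row["subject_id"], "label": row["subject_label"]},
        "predicate_id": row["predicate_id"],
        "object": {"id": row["object_id"], "label": row["object_label"]},
    }


# Illustrative flat row; the predicate CURIE is an assumed spelling of "enables".
flat_row = {
    "subject_id": "ARO:3000001",
    "subject_label": "beta-lactamase",
    "predicate_id": "RO:0002327",
    "object_id": "GO:0008800",
    "object_label": "beta-lactamase activity",
}

nested_row = sssom_row_to_term_tuple(flat_row)
print(nested_row["subject"], nested_row["object"])
```

Keeping the id and label together in one tuple is what lets a term validator check both against the ontology in a single pass.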
Curator view & annotation-gain report
- just render-mappings → aro2go.html: open in a browser to review every mapping (ARO and GO ids
  link out to CARD and AmiGO) and the recorded GO gaps, with the per-mapping UniProt gain count.
- just annotation-gain → ANNOTATION_GAIN.md + data/candidate_new_annotations.tsv: applies the
  mappings (exact-or-narrower) to all 4,182 UniProtKB entries that carry a CARD cross-reference and
  reports the GO terms they would gain that are not already annotated. The filter is
  subsumption-aware: a candidate is suppressed when the entry already has a more specific (is_a
  descendant) GO term, so over-general parents are not proposed. Current snapshot: 630 candidate new
  annotations, with a further 104 suppressed as redundant (almost all aminoglycoside families, where
  GO has specific child terms). For example, all 79 colistin/MCR entries lack GO:0043838, and 448
  beta-lactamases lack GO:0008800. (The mechanism → GO:0046677 mappings add nothing new, because
  every CARD entry already has "response to antibiotic"; this is a good sanity check.)
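The subsumption-aware filter described above can be sketched as follows. This is a toy illustration under stated assumptions, not the logic in annotation_gain_report.py: the is_a graph is a hand-written dict instead of an ontology lookup, the GO ids are placeholders, and the names `strict_ancestors` and `filter_candidates` are hypothetical.

```python
def strict_ancestors(term, parents):
    """Return every is_a ancestor of term (excluding term itself)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def filter_candidates(candidates, existing, parents):
    """Split candidate GO ids into (kept, suppressed) for one entry."""
    kept, suppressed = [], []
    for go_id in candidates:
        if go_id in existing:
            continue  # already annotated, nothing to gain
        # Suppress when an existing annotation is a descendant of the candidate.
        if any(go_id in strict_ancestors(e, parents) for e in existing):
            suppressed.append(go_id)
        else:
            kept.append(go_id)
    return kept, suppressed


# Toy child -> parents map with placeholder ids (not real GO terms).
toy_parents = {"GO:child": ["GO:parent"], "GO:parent": ["GO:root"]}
kept, suppressed = filter_candidates(
    ["GO:parent", "GO:other", "GO:child"], {"GO:child"}, toy_parents
)
print(kept, suppressed)
```

Here the entry already carries the more specific `GO:child`, so the over-general `GO:parent` is suppressed while the unrelated `GO:other` is kept.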
Validation: just validate-mappings
Two layers, both run by the recipe (and by tests/test_aro2go_mapping.py):
- Structural: linkml-validate against the SSSOM LinkML schema (required slots, shape).
- Ontology terms: linkml-term-validator checks that every ARO and GO CURIE in the mapping
  resolves (ARO via sqlite:obo:aro, GO via sqlite:obo:go) and that the supplied label
  matches the ontology label. This is the "curie+label tuple" guarantee. It uses the sidecar schema
  src/ai_gene_review/schema/aro_go_mapping.yaml, whose subject/object are {id,label} tuples
  with bindings to dynamic enums (AROTermEnum reachable from ARO:1000001; GOTermEnum from the
  three GO roots). Prefix→adapter resolution is configured in conf/oak_config.yaml.
A label typo or a wrong id fails the build, e.g.:
Label mismatch for 'GO:0050073': expected 'macrolide 2'-kinase activity', got '...'.
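The "curie+label tuple" check amounts to two tests per term: does the id resolve, and does the supplied label equal the ontology's label. The sketch below shows that idea against a plain dict standing in for the ontology. It is an assumption-laden illustration, not how linkml-term-validator is implemented, and `check_term` is a hypothetical name.

```python
def check_term(curie, label, ontology_labels):
    """Return a list of problems for one {id, label} tuple (empty if valid)."""
    if curie not in ontology_labels:
        return [f"Unresolved CURIE '{curie}'"]
    expected = ontology_labels[curie]
    if label != expected:
        return [f"Label mismatch for '{curie}': expected '{expected}', got '{label}'"]
    return []


# Stand-in for an ontology lookup; the real check resolves ids through OAK adapters.
toy_labels = {"GO:0050073": "macrolide 2'-kinase activity"}

ok = check_term("GO:0050073", "macrolide 2'-kinase activity", toy_labels)
typo = check_term("GO:0050073", "macrolide kinase activity", toy_labels)
missing = check_term("GO:0000000", "anything", toy_labels)
print(ok, typo, missing)
```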
Why a pipeline?
CARD/ARO has no native ARO→GO mapping (verified against aro.owl). We supply the
ARO→GO bridge as curated SSSOM, then need a UniProt→ARO step to apply it to real
protein entries. There are two routes:
- DR CARD cross-reference (deterministic, sparse). UniProt flat-files sometimes carry
  a CARD cross-reference, e.g.

      DR CARD; ARO:3000318; mphB; ARO:0001004; antibiotic inactivation.

  This single line gives both the determinant ARO id and the resistance-mechanism ARO
  id for free. But UniProt only populates it for a minority of entries: in this repo,
  1 of 2334 cached UniProt records (MphB). MphA (Q47396) has none.
- Sequence-based assignment with RGI (scalable). CARD's
  Resistance Gene Identifier assigns ARO ids to protein
  sequences. This is the route for entries lacking a DR CARD line. We do not run RGI
  in this script (it needs RGI + the local CARD database installed); instead we parse its
  output if you run it. Related: argNorm
  normalises the output of RGI/AMRFinderPlus/ABRicate/etc. to ARO accessions.
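For the first route, pulling both ARO ids out of a DR CARD line is a matter of splitting on semicolons. The following is a minimal sketch that assumes the five-field layout shown in the example line above; `parse_dr_card` and its returned keys are hypothetical and are not the parser in uniprot2aro2go.py.

```python
def parse_dr_card(line):
    """Parse one UniProt flat-file line; return a dict for DR CARD lines, else None."""
    if not line.startswith("DR"):
        return None
    # Drop the "DR" line code, surrounding whitespace, and the trailing period.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    # Assumed layout: CARD; gene ARO id; gene name; mechanism ARO id; mechanism label
    if len(fields) != 5 or fields[0] != "CARD":
        return None
    _, gene_aro_id, gene_name, mechanism_aro_id, mechanism_label = fields
    return {
        "gene_aro_id": gene_aro_id,
        "gene_name": gene_name,
        "mechanism_aro_id": mechanism_aro_id,
        "mechanism_label": mechanism_label,
    }


example = "DR   CARD; ARO:3000318; mphB; ARO:0001004; antibiotic inactivation."
parsed = parse_dr_card(example)
print(parsed)
```

Lines for other cross-referenced databases, or CARD lines with a different field count, return `None` here, so a caller can simply skip them.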
Propagation: exact or narrower (the high-value logic)
A GO term mapped at an ARO term applies to a gene whose ARO assignment is that term or any
narrower (is_a descendant) ARO term — i.e. propagate down the ARO hierarchy, never up. This is
why family-level ARO nodes are the high-value mapping targets: one mapping covers an entire
subtree. For example the single beta-lactamase (ARO:3000001) → GO:0008800 mapping reaches
its 5,317 descendant ARO gene terms (CTX-M, KPC, NDM, …), none of which need their own row.
The pipeline implements this by walking each gene's ARO is_a ancestors (via OAK
sqlite:obo:aro) and firing any mapping whose subject is an ancestor-or-self. The output
aro_relation column records exact (mapping is on the gene's own ARO term) vs narrower (the
gene's ARO term is narrower than the mapped family/mechanism node). Use --no-propagate for
exact-only. Demonstration:
```
$ uv run python uniprot2aro2go.py --sssom aro2go.sssom.yaml --rgi-output examples/rgi_example_betalactamases.txt
gene_aro_id  gene_aro_label  mapped_aro_id  mapped_aro_label  aro_relation  go_id       go_label
ARO:3001878  CTX-M-15        ARO:3000001    beta-lactamase    narrower      GO:0008800  beta-lactamase activity
ARO:3002312  KPC-2           ARO:3000001    beta-lactamase    narrower      GO:0008800  beta-lactamase activity
ARO:3000589  NDM-1           ARO:3000001    beta-lactamase    narrower      GO:0008800  beta-lactamase activity
```
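The ancestor-or-self walk behind this output can be sketched with a toy is_a graph. This is an illustration under assumptions, not the pipeline's code: the real pipeline reads the hierarchy through OAK, whereas here it is a hand-written dict, the intermediate node `ARO:9999999` is a made-up placeholder, and the function names are hypothetical.

```python
def ancestors_or_self(term, parents):
    """Return term followed by all of its is_a ancestors (breadth-first)."""
    ordered = [term]
    queue = [term]
    while queue:
        current = queue.pop(0)
        for parent in parents.get(current, ()):
            if parent not in ordered:
                ordered.append(parent)
                queue.append(parent)
    return ordered


def propagate(gene_aro_id, mappings, parents, propagate_down=True):
    """Fire every mapping whose subject is the gene's ARO term or an ancestor of it."""
    subjects = ancestors_or_self(gene_aro_id, parents) if propagate_down else [gene_aro_id]
    rows = []
    for mapped_aro_id in subjects:
        for go_id in mappings.get(mapped_aro_id, ()):
            relation = "exact" if mapped_aro_id == gene_aro_id else "narrower"
            rows.append((gene_aro_id, mapped_aro_id, relation, go_id))
    return rows


# Toy hierarchy: CTX-M-15 -> placeholder family node -> beta-lactamase.
toy_is_a = {"ARO:3001878": ["ARO:9999999"], "ARO:9999999": ["ARO:3000001"]}
toy_mappings = {"ARO:3000001": ["GO:0008800"]}

print(propagate("ARO:3001878", toy_mappings, toy_is_a))
print(propagate("ARO:3001878", toy_mappings, toy_is_a, propagate_down=False))
```

Setting `propagate_down=False` corresponds to the exact-only behaviour of `--no-propagate`: a gene term with no mapping of its own yields no rows.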
Usage
Run via the just recipe (validates the SSSOM first, then runs the chain on the AMR genes):
```bash
just aro2go-pipeline
```
Or directly:
```bash
# Route 1: DR CARD line(s) from cached UniProt record(s)
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
  --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
  genes/ECO8N/mphB/mphB-uniprot.txt

# Sweep every cached UniProt record (recursive glob; quote it)
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
  --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
  'genes/**/*-uniprot.txt' -o /tmp/aro2go_all.tsv

# Route 2: parse RGI output (for entries without a DR CARD line, e.g. MphA)
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
  --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
  --rgi-output projects/ANTIMICROBIAL_RESISTANCE/examples/rgi_example_mphA.txt
```
To generate real RGI output for the sequence-based route:
```bash
rgi load --card_json card.json   # one-time, from https://card.mcmaster.ca/download
rgi main --input_sequence protein.faa --input_type protein --output_file rgi_out
# then pass rgi_out.txt to --rgi-output
```
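If you want to inspect the ARO assignments in an RGI result before handing it to the pipeline, a small reader like the one below may help. It is a sketch only: the column names `ARO` and `Best_Hit_ARO` are assumed from RGI's tab-delimited output and should be checked against the header of your own file, the sample row is a toy input, and `read_rgi_aro` is a hypothetical helper, not the parser in uniprot2aro2go.py.

```python
import csv
import io


def read_rgi_aro(handle):
    """Yield-free reader: return (ARO CURIE, best-hit name) pairs from RGI TSV text."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = (row.get("ARO") or "").strip()
        if not accession:
            continue  # skip rows without an ARO assignment
        # Assumption: the column may hold a bare number, so normalise to a CURIE.
        if not accession.startswith("ARO:"):
            accession = f"ARO:{accession}"
        hits.append((accession, (row.get("Best_Hit_ARO") or "").strip()))
    return hits


# Toy two-line input with only the columns this sketch reads.
sample_tsv = "ORF_ID\tBest_Hit_ARO\tARO\norf1\tmphB\t3000318\norf2\tunknown\t\n"
rgi_hits = read_rgi_aro(io.StringIO(sample_tsv))
print(rgi_hits)
```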
Output
A TSV of candidate GO annotations with full provenance — uniprot_acc,
gene_aro_id/gene_aro_label (the gene's ARO assignment), mapped_aro_id/mapped_aro_label
(the ARO term the mapping is on), aro_relation (exact or narrower),
predicate_id/predicate_label, go_id/go_label, mapping_justification, and route
(DR_CARD or RGI).
These are leads for a curator, not final annotations: the enables/relatedMatch predicates
indicate the relationship type, and a family-level mapping is only a prior for its members. As a
sanity check, the pipeline reproduces exactly the GO terms curators assigned by hand:
- MphB (DR CARD route) → GO:0050073 macrolide 2'-kinase activity + GO:0046677 response to antibiotic
- MphA (RGI route) → GO:0050073 macrolide 2'-kinase activity
Coverage (23 mappings)
- Mechanism → GO:0046677 response to antibiotic (skos:relatedMatch): antibiotic inactivation,
  efflux, target alteration, target protection, target replacement, reduced permeability.
- AMR gene family → GO MF: MPH (GO:0050073), beta-lactamase (GO:0008800), CAT (GO:0008811),
  APH/AAC/ANT (GO:0034071/0034069/0034068), Erm (GO:0008988; the family-safe N6-methyltransferase
  parent, not the di-methyltransferase, to avoid over-annotating mono-methylating variants),
  dfr (GO:0004146), sul (GO:0004156),
  colistin/MCR phosphoethanolamine transferase (GO:0043838) via enables;
  FosA/FosA2/FosA3/fosA5 (GO:0004364; the glutathione-specific nodes only, excluding FosB/FosX) and
  16S rRNA m7G1405 methyltransferase / ArmA-Rmt (GO:0070043) via relatedMatch (the ARO family
  is narrower than a general GO MF, still propagatable).
- Determinant → GO MF (enables): mphA, mphB (GO:0050073).
Gaps: ARO families with no suitable GO term (recorded in the mapping file)
GO genuinely lacks a specific MF term for several resistance enzymes. Rather than mapping them to a
near-but-wrong term (a paralog or the wrong chemistry), these are recorded in aro2go.sssom.yaml
itself using the SSSOM no-match convention — object_id: sssom:NoTermFound, object_source:
obo:go.owl, predicate_id: skos:exactMatch (the relation we would use), and a comment explaining
the gap. They are GO new-term-request candidates. The converter and pipeline skip these rows (no GO
object), so they never produce a candidate annotation; test_gap_rows_present keeps them from being
silently dropped.
| ARO family | Why no mapping (GO gap) |
|---|---|
| lincosamide nucleotidyltransferase / Lnu (ARO:3000221) | No lincosamide O-nucleotidyltransferase GO MF term. |
| streptogramin Vat acetyltransferase (ARO:3000453) | No streptogramin A O-acetyltransferase GO MF term. |
| streptogramin Vgb lyase (ARO:3000376) | No streptogramin B lyase GO MF term. |
| macrolide esterase / Ere (ARO:3000320) | No macrolide/erythromycin esterase GO MF term. |
| rifampin ADP-ribosyltransferase / Arr (ARO:3000390) | GO:0003950 is poly-ADP-ribosyltransferase; Arr is mono. |
| rifampin monooxygenase (ARO:3000445) | Only the general GO:0004497 monooxygenase activity exists. |
| tetracycline inactivation / Tet(X) (ARO:3000036) | No tetracycline-destructase monooxygenase GO MF term. |
| Cfr 23S rRNA methyltransferase (ARO:3000202) | GO:0070040 is the C2 (RlmN) activity; Cfr methylates C8. |
| D-Ala-D-Lac ligase / VanA (ARO:3002978) | GO:0008716 is D-Ala-D-Ala ligase; VanA makes D-Ala-D-Lac. |
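The skip rule for gap rows can be sketched as a simple partition on `object_id`. This is an illustrative sketch rather than the converter's or pipeline's actual code: `split_gap_rows` is a hypothetical name, and the two rows are trimmed-down examples that carry only the slots needed here.

```python
NO_TERM_FOUND = "sssom:NoTermFound"


def split_gap_rows(rows):
    """Partition SSSOM rows into (usable mappings, recorded gaps)."""
    mapped = [row for row in rows if row.get("object_id") != NO_TERM_FOUND]
    gaps = [row for row in rows if row.get("object_id") == NO_TERM_FOUND]
    return mapped, gaps


example_rows = [
    {"subject_id": "ARO:3000001", "object_id": "GO:0008800"},
    {
        "subject_id": "ARO:3000221",
        "object_id": NO_TERM_FOUND,
        "predicate_id": "skos:exactMatch",
        "comment": "No lincosamide O-nucleotidyltransferase GO MF term.",
    },
]

mapped_rows, gap_rows = split_gap_rows(example_rows)
print(len(mapped_rows), len(gap_rows))
```

Only `mapped_rows` would feed annotation candidates; `gap_rows` stay available for the curator view and for a test that asserts they have not been dropped from the file.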
Validation & tests
- just validate-mappings: structural (linkml-validate) and ontology-term
  (linkml-term-validator, checks every ARO/GO id resolves + label matches).
- tests/test_aro2go_mapping.py: SSSOM structure, terms-file sync, and (integration) term-validator.
- Pipeline parser: doctest
  (uv run python -m doctest projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py).