ARO ↔ GO mappings & the UniProt→ARO→GO pipeline

Part of the Antimicrobial Resistance project.

Files

| File | What it is |
| --- | --- |
| aro2go.sssom.yaml | Curated ARO → GO mapping set in SSSOM format (source of truth). CURIE+label tuples throughout; validates with linkml-validate. |
| aro2go.terms.yaml | Generated nested term-tuple form of the mapping, validated by linkml-term-validator (do not edit by hand). |
| sssom_to_terms.py | Converter: reshapes the flat SSSOM rows into the nested {id,label} form the term-validator checks. |
| uniprot2aro2go.py | Pipeline that chains UniProt → ARO → GO by applying the SSSOM mapping. |
| aro2go.html | Curator-facing HTML view of the mappings + gaps (CARD/AmiGO links). Generated by render_sssom_html.py (just render-mappings). |
| ANNOTATION_GAIN.md | Report of candidate GO annotations UniProt would gain. Generated by annotation_gain_report.py (just annotation-gain). |
| data/uniprot_card_xrefs.tsv | Snapshot of UniProtKB entries with a CARD cross-reference (ARO id + existing GO), from the UniProt REST API. |
| data/candidate_new_annotations.tsv | One row per (entry, candidate new GO term) produced by the gain report. |
| examples/rgi_example_mphA.txt | Tiny example of CARD RGI tab-delimited output, used to demonstrate the sequence-based route. |
| examples/rgi_example_betalactamases.txt | RGI example (CTX-M-15/KPC-2/NDM-1) demonstrating exact-or-narrower propagation from the beta-lactamase family node. |

Curator view & annotation-gain report

Validation: just validate-mappings

Two layers, both run by the recipe (and by tests/test_aro2go_mapping.py):

  1. Structural: linkml-validate checks the file against the SSSOM LinkML schema (required slots, shape).
  2. Ontology terms: linkml-term-validator checks that every ARO and GO CURIE in the mapping
    resolves (ARO via sqlite:obo:aro, GO via sqlite:obo:go) and that the supplied label
    matches the ontology label. This is the "curie+label tuple" guarantee. It uses the sidecar
    schema src/ai_gene_review/schema/aro_go_mapping.yaml, whose subject/object are {id,label} tuples
    with bindings to dynamic enums (AROTermEnum reachable from ARO:1000001; GOTermEnum from the
    three GO roots). Prefix→adapter resolution is configured in conf/oak_config.yaml.

A label typo or a wrong id fails the build, e.g.:
Label mismatch for 'GO:0050073': expected 'macrolide 2'-kinase activity', got '...'.
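
The two conditions behind the "curie+label tuple" guarantee can be sketched in a few lines. This is only an illustration, not how linkml-term-validator is implemented: the names check_term, check_mapping and ONTOLOGY_LABELS are hypothetical, and a plain dict stands in for the OAK-backed ontology lookup.

```python
# Stand-in for an ontology label lookup. Assumption: the real validator
# resolves labels through OAK adapters (e.g. sqlite:obo:go), not a dict.
ONTOLOGY_LABELS = {
    "ARO:3000001": "beta-lactamase",
    "GO:0008800": "beta-lactamase activity",
}


def check_term(term: dict) -> list:
    """Return the problems found for one {id, label} tuple (empty list if valid)."""
    curie, label = term["id"], term["label"]
    expected = ONTOLOGY_LABELS.get(curie)
    if expected is None:
        # Condition 1: the CURIE must resolve in its ontology.
        return [f"Unresolvable CURIE: {curie}"]
    if label != expected:
        # Condition 2: the supplied label must equal the ontology label.
        return [f"Label mismatch for '{curie}': expected '{expected}', got '{label}'"]
    return []


def check_mapping(mapping: dict) -> list:
    """Check both ends of a nested mapping row; any problem should fail the build."""
    return check_term(mapping["subject"]) + check_term(mapping["object"])
```

An empty list from check_mapping means both the subject and the object passed; anything else corresponds to a build failure.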

Why a pipeline?

CARD/ARO has no native ARO→GO mapping (verified against aro.owl). We supply the
ARO→GO bridge as curated SSSOM, then need a UniProt→ARO step to apply it to real
protein entries. There are two routes:

  1. DR CARD cross-reference (deterministic, sparse). UniProt flat-files sometimes carry
    a CARD cross-reference, e.g.
    DR CARD; ARO:3000318; mphB; ARO:0001004; antibiotic inactivation.
    This single line gives both the determinant ARO id and the resistance-mechanism ARO
    id for free. But UniProt only populates it for a minority of entries — in this repo,
    1 of 2334 cached UniProt records (MphB). MphA (Q47396) has none.

  2. Sequence-based assignment with RGI (scalable). CARD's
    Resistance Gene Identifier assigns ARO ids to protein
    sequences. This is the route for entries lacking a DR CARD line. We do not run RGI
    in this script (it needs RGI + the local CARD database installed); instead we parse its
    output if you run it. Related: argNorm
    normalises the output of RGI/AMRFinderPlus/ABRicate/etc. to ARO accessions.
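
Route 1 amounts to pulling four fields out of a single flat-file line. The sketch below shows one way to do that. It is not the parser in uniprot2aro2go.py: the names CardXref and parse_dr_card are hypothetical, and the field order (determinant ARO id, gene name, mechanism ARO id, mechanism label) is inferred from the one mphB example above.

```python
import re
from typing import NamedTuple, Optional


class CardXref(NamedTuple):
    gene_aro_id: str        # determinant ARO id, e.g. ARO:3000318
    gene_name: str          # e.g. mphB
    mechanism_aro_id: str   # resistance-mechanism ARO id, e.g. ARO:0001004
    mechanism_label: str    # e.g. antibiotic inactivation


# Assumed layout: "DR   CARD; <ARO id>; <name>; <ARO id>; <label>."
_DR_CARD = re.compile(
    r"^DR\s+CARD;\s*(ARO:\d+);\s*([^;]+);\s*(ARO:\d+);\s*(.+?)\.?\s*$"
)


def parse_dr_card(line: str) -> Optional[CardXref]:
    """Return the CARD cross-reference on a DR line, or None for any other line."""
    match = _DR_CARD.match(line)
    if match is None:
        return None
    return CardXref(*(group.strip() for group in match.groups()))


xref = parse_dr_card(
    "DR   CARD; ARO:3000318; mphB; ARO:0001004; antibiotic inactivation."
)
```

Records without a DR CARD line simply yield no CardXref, which is the case that route 2 covers.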

Propagation: exact or narrower (the high-value logic)

A GO term mapped at an ARO term applies to a gene whose ARO assignment is that term or any
narrower (is_a descendant) ARO term, i.e. mappings propagate down the ARO hierarchy, never up.
This is why family-level ARO nodes are the high-value mapping targets: one mapping covers an entire
subtree. For example, the single beta-lactamase (ARO:3000001) → GO:0008800 mapping reaches
its 5,317 descendant ARO gene terms (CTX-M, KPC, NDM, …), none of which need their own row.

The pipeline implements this by walking each gene's ARO is_a ancestors (via OAK
sqlite:obo:aro) and firing any mapping whose subject is an ancestor-or-self. The output
aro_relation column records exact (mapping is on the gene's own ARO term) vs narrower (the
gene's ARO term is narrower than the mapped family/mechanism node). Use --no-propagate for
exact-only. Demonstration:

$ uv run python uniprot2aro2go.py --sssom aro2go.sssom.yaml --rgi-output examples/rgi_example_betalactamases.txt
gene_aro_id   gene_aro_label   mapped_aro_id   mapped_aro_label   aro_relation   go_id        go_label
ARO:3001878   CTX-M-15         ARO:3000001     beta-lactamase     narrower       GO:0008800   beta-lactamase activity
ARO:3002312   KPC-2            ARO:3000001     beta-lactamase     narrower       GO:0008800   beta-lactamase activity
ARO:3000589   NDM-1            ARO:3000001     beta-lactamase     narrower       GO:0008800   beta-lactamase activity
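
The ancestor-or-self rule can be sketched as follows. It is a toy model, not the pipeline code: the function names are hypothetical, a hand-written dict replaces the OAK sqlite:obo:aro lookup, the intermediate family nodes between each gene and beta-lactamase are collapsed, and each ARO term is assumed to carry at most one GO mapping.

```python
# Toy is_a parent links. Assumption: the real hierarchy has intermediate
# family nodes between each gene and ARO:3000001; they are collapsed here.
IS_A = {
    "ARO:3001878": ["ARO:3000001"],  # CTX-M-15 -> beta-lactamase
    "ARO:3002312": ["ARO:3000001"],  # KPC-2    -> beta-lactamase
    "ARO:3000589": ["ARO:3000001"],  # NDM-1    -> beta-lactamase
}

# Simplified mapping table: mapped ARO id -> GO id.
MAPPINGS = {"ARO:3000001": "GO:0008800"}


def ancestors_or_self(term, parents):
    """Return `term` plus every term reachable by following is_a parent links."""
    seen, order, stack = set(), [], [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        stack.extend(parents.get(current, []))
    return order


def propagate(gene_aro_id, mappings, parents, propagate_down=True):
    """Return (mapped_aro_id, aro_relation, go_id) for every mapping that fires."""
    if propagate_down:
        candidates = ancestors_or_self(gene_aro_id, parents)
    else:
        candidates = [gene_aro_id]  # exact-only, like --no-propagate
    hits = []
    for aro_id in candidates:
        if aro_id in mappings:
            relation = "exact" if aro_id == gene_aro_id else "narrower"
            hits.append((aro_id, relation, mappings[aro_id]))
    return hits
```

Because the walk only ever moves from a gene towards its ancestors, a mapping placed on a gene-level term can never fire for the family node above it.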

Usage

Run via the just recipe (validates the SSSOM first, then runs the chain on the AMR genes):

just aro2go-pipeline

Or directly:

# Route 1 — DR CARD line(s) from cached UniProt record(s):
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
    --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
    genes/ECO8N/mphB/mphB-uniprot.txt

# Sweep every cached UniProt record (recursive glob; quote it):
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
    --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
    'genes/**/*-uniprot.txt' -o /tmp/aro2go_all.tsv

# Route 2 — parse RGI output (for entries without a DR CARD line, e.g. MphA):
uv run python projects/ANTIMICROBIAL_RESISTANCE/uniprot2aro2go.py \
    --sssom projects/ANTIMICROBIAL_RESISTANCE/aro2go.sssom.yaml \
    --rgi-output projects/ANTIMICROBIAL_RESISTANCE/examples/rgi_example_mphA.txt

To generate real RGI output for the sequence-based route:

rgi load --card_json card.json            # one-time, from https://card.mcmaster.ca/download
rgi main --input_sequence protein.faa --input_type protein --output_file rgi_out
# then pass rgi_out.txt to --rgi-output
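
Reading the resulting file is mostly a matter of picking the right columns. The sketch below is not the parser in uniprot2aro2go.py. It assumes the tab-delimited output has a header row with ORF_ID, Best_Hit_ARO (the hit name) and ARO (the accession, possibly without the ARO: prefix), so check those names against the output of your RGI version. The function name parse_rgi and the sample row are illustrative.

```python
import csv
import io


def parse_rgi(handle):
    """Yield (orf_id, aro_curie, best_hit_name) for each row of RGI tabular output."""
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = row["ARO"].strip()
        # Normalise a bare accession such as "3000589" to the CURIE "ARO:3000589".
        curie = accession if accession.startswith("ARO:") else f"ARO:{accession}"
        yield row["ORF_ID"].strip(), curie, row["Best_Hit_ARO"].strip()


# Illustrative two-line input; real RGI output has many more columns.
sample = io.StringIO("ORF_ID\tBest_Hit_ARO\tARO\nseq1\tNDM-1\t3000589\n")
hits = list(parse_rgi(sample))
```

Using csv.DictReader keyed on column names rather than positions keeps the sketch tolerant of extra or reordered columns.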

Output

A TSV of candidate GO annotations with full provenance — uniprot_acc,
gene_aro_id/gene_aro_label (the gene's ARO assignment), mapped_aro_id/mapped_aro_label
(the ARO term the mapping is on), aro_relation (exact or narrower),
predicate_id/predicate_label, go_id/go_label, mapping_justification, and route
(DR_CARD or RGI).

These are leads for a curator, not final annotations: the enables/relatedMatch predicates
indicate the relationship type, and a family-level mapping is only a prior for its members. As a
sanity check, the pipeline reproduces exactly the GO terms curators assigned by hand:

Coverage (23 mappings)

Gaps: ARO families with no suitable GO term (recorded in the mapping file)

GO genuinely lacks a specific MF term for several resistance enzymes. Rather than mapping them to a
near-but-wrong term (a paralog or the wrong chemistry), these are recorded in aro2go.sssom.yaml
itself using the SSSOM no-match convention: object_id: sssom:NoTermFound, object_source: obo:go.owl,
predicate_id: skos:exactMatch (the relation we would use), and a comment explaining the gap. They
are GO new-term-request candidates. The converter and pipeline skip these rows (no GO object), so
they never produce a candidate annotation; test_gap_rows_present keeps them from being silently
dropped.

| ARO family | Why no mapping (GO gap) |
| --- | --- |
| lincosamide nucleotidyltransferase / Lnu (ARO:3000221) | No lincosamide O-nucleotidyltransferase GO MF term. |
| streptogramin Vat acetyltransferase (ARO:3000453) | No streptogramin A O-acetyltransferase GO MF term. |
| streptogramin Vgb lyase (ARO:3000376) | No streptogramin B lyase GO MF term. |
| macrolide esterase / Ere (ARO:3000320) | No macrolide/erythromycin esterase GO MF term. |
| rifampin ADP-ribosyltransferase / Arr (ARO:3000390) | GO:0003950 is poly-ADP-ribosyltransferase; Arr is mono. |
| rifampin monooxygenase (ARO:3000445) | Only the general GO:0004497 monooxygenase activity exists. |
| tetracycline inactivation / Tet(X) (ARO:3000036) | No tetracycline-destructase monooxygenase GO MF term. |
| Cfr 23S rRNA methyltransferase (ARO:3000202) | GO:0070040 is the C2 (RlmN) activity; Cfr methylates C8. |
| D-Ala-D-Lac ligase / VanA (ARO:3002978) | GO:0008716 is D-Ala-D-Ala ligase; VanA makes D-Ala-D-Lac. |
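
The skip rule for gap rows is simple enough to show directly. This is a minimal sketch, assuming each SSSOM row has already been loaded as a dict keyed by slot name. The function name split_gaps is hypothetical and the two rows are trimmed to the slots that matter here.

```python
NO_TERM_FOUND = "sssom:NoTermFound"


def split_gaps(rows):
    """Separate usable mappings from gap rows that have no GO object."""
    mapped = [row for row in rows if row["object_id"] != NO_TERM_FOUND]
    gaps = [row for row in rows if row["object_id"] == NO_TERM_FOUND]
    return mapped, gaps


rows = [
    # A normal mapping row (other slots omitted for brevity).
    {"subject_id": "ARO:3000001", "object_id": "GO:0008800"},
    # A gap row following the no-match convention.
    {
        "subject_id": "ARO:3000320",
        "predicate_id": "skos:exactMatch",
        "object_id": NO_TERM_FOUND,
        "object_source": "obo:go.owl",
    },
]
mapped, gaps = split_gaps(rows)
```

Only the rows in mapped would feed the pipeline. Keeping gaps as a separate list, rather than discarding them, is what lets a test assert that they are still present.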

Validation & tests