# NCBIFAM -> GO mapping (SSSOM) -- curated seed for a missing "ncbifam2go" pipeline
#
# These are hand-verified NCBIFAM (protein family) -> GO mappings. Unlike for RHEA, GO publishes
# NO ncbifam2go external2go file (confirmed:
# https://current.geneontology.org/ontology/external2go/ncbifam2go -> HTTP 403; of the two
# routes, only interpro2go is served). NCBIFAM therefore reaches GOA only where its signature is
# integrated into an InterPro entry that itself has an interpro2go row. Yet NCBIFAM already
# assigns its own GO terms in the PGAP HMM metadata (hmm_PGAP.tsv `go_terms` column), which GO
# does not ingest. This file adopts a sample of those NCBI assignments after independent
# verification, so that they can seed an ncbifam2go mapping (see ../NCBIFam.md and
# NCBIFAM-METHODOLOGY.md, gap classes G4/G5).
#
# 28 mappings: 27 skos:exactMatch + 1 skos:broadMatch (as in the RHEA project's rhea2go.sssom.yaml).
# The object is always the mapping WE propose for ncbifam2go (not a transcription of NCBI's go_terms):
#   * skos:exactMatch  -- the GO term IS the family's function; a ready-to-add ncbifam2go row.
#                         Enzyme rows are EC-bridge supported: ec2go maps the family's EC to exactly
#                         this GO term (verified live). Where NCBI's go_terms was too broad (even the
#                         GO:0003824 near-root) but a precise child EXISTS and is EC-bridged, we
#                         propose the specific child here and record NCBI's broad value in the comment.
#                         Conversely, FtsX (TIGR00439) shows the opposite judgement: we choose the
#                         curator-consensus GO:0051301 (cell division) over the more specific
#                         cytokinesis terms because propagating those would be over-annotation.
#   * skos:broadMatch  -- reserved for the case where NO more-specific GO term exists to adopt: the
#                         best available term is a whole-complex CC term a subunit is part_of (VirB5).
#                         Only 1 such row remains.
#
# Verification. Do NOT trust the NCBI assignment blindly: the scoping sample contained an OBSOLETE
# id (GO:0009448 on a GABA transaminase) AND an outright WRONG one (GO:0003951 "NAD+ kinase
# activity" on the diacylglycerol-kinase model NF009874, EC 2.7.1.107; correct term GO:0004143).
# Both were excluded. For the rows kept here:
#   * every GO id/label was checked non-obsolete against QuickGO on 2026-06-20;
#   * every EC->GO bridge was confirmed against the live ec2go file;
#   * every family accession + product name + family_type + EC + go_terms is from hmm_PGAP.tsv;
#   * `gain_*` in each comment is the live UniProtKB propagation gain from ncbifam_go_gain.py: the
#     number of entries carrying the NCBIFAM signature that lack the GO term and every descendant
#     (rev = the Swiss-Prot/reviewed subset, the curation-relevant number; all = every entry,
#     mostly TrEMBL). "gain_all=N of M" means N of the M signature-carrying entries lack the term.
#
# Validate with:  just validate-ncbifam-mappings
#   (linkml-validate against the SSSOM 'mapping set' shape, then GO term/label validation against
#    the full GO graph -- MF, BP, and CC, since NCBIFAM families span all three aspects).

mapping_set_id: https://w3id.org/ai4curation/ai-gene-review/mappings/ncbifam2go
mapping_set_title: NCBIFAM to GO mapping (curated seed for a missing ncbifam2go pipeline)
mapping_set_description: >-
  Curated NCBIFAM protein-family -> GO mappings adopting NCBI's own hmm_PGAP.tsv go_terms
  assignments after independent QuickGO/ec2go verification. GO publishes no ncbifam2go external2go
  file, so these NCBI-assigned functions reach GOA only via InterPro integration (interpro2go,
  GO_REF:0000002) and are largely absent from the unreviewed (TrEMBL) entries that carry the family
  signature. exactMatch rows are ready-to-add ncbifam2go entries (enzyme rows EC-bridge supported);
  the single broadMatch row attaches a family to the best available broader GO term because no
  more-specific term exists. Each mapping records the live UniProtKB propagation gain (reviewed
  and all). The set seeds the NCBIFam project's curation of an ncbifam2go mapping and
  quality-checks GO annotation of the corresponding families. Aspects span MF, BP, and CC (NCBIFAM
  is a whole-protein family resource, not an enzyme-only one).
license: https://creativecommons.org/licenses/by/4.0/
creator_label:
- AI Gene Review project
mapping_date: "2026-06-20"
subject_source: ncbifam
object_source: GO
curie_map:
  NCBIFAM: https://www.ncbi.nlm.nih.gov/protfam/
  GO: http://purl.obolibrary.org/obo/GO_
  skos: http://www.w3.org/2004/02/skos/core#
  semapv: https://w3id.org/semapv/vocab/

mappings:
# ===== exactMatch: ready-to-add ncbifam2go rows (GO term == the family's function) =====

- subject_id: NCBIFAM:NF009803
  subject_label: "formamidase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004328
  object_label: formamidase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge supported: hmm_PGAP gives EC 3.5.1.49 and ec2go maps EC 3.5.1.49 -> exactly GO:0004328.
    equivalog family. Propagation gain: gain_rev=0 (27/27 reviewed already have it) but gain_all=54
    -- the gap is entirely in unreviewed TrEMBL entries carrying NF009803.

- subject_id: NCBIFAM:NF005824
  subject_label: "acetolactate synthase large subunit (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0003984
  object_label: acetolactate synthase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    The large (catalytic) subunit carries the activity. equivalog. gain_rev=0 (1/1), gain_all=6.
    Clean propagation case -- nearly fully covered already, included as an exactMatch exemplar.

- subject_id: NCBIFAM:NF045700
  subject_label: "AttM family quorum-quenching N-acyl homoserine lactonase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0102007
  object_label: acyl-L-homoserine-lactone lactonohydrolase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge supported: EC 3.1.1.81 -> GO:0102007 in ec2go. equivalog, PMID:11930013. Quorum
    quenching. gain_rev=0 (4/4), gain_all=115 -- large TrEMBL gap.

- subject_id: NCBIFAM:TIGR03230
  subject_label: "lipoprotein lipase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004465
  object_label: lipoprotein lipase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge supported: EC 3.1.1.34 -> GO:0004465 in ec2go. equivalog, PMID:8308035.
    gain_rev=0 (11/11), gain_all=9.

- subject_id: NCBIFAM:NF033545
  subject_label: "IS630 family transposase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004803
  object_label: transposase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    equivalog. The single largest propagation gap in the scoping sample: 18874 of the 18881
    entries carrying NF033545 lack GO:0004803 and every descendant (gain_all=18874), and gain_rev=2.
    Mobile-element families are precisely where InterPro integration lags, so NCBIFAM's GO
    assignment goes unused.

- subject_id: NCBIFAM:NF041162
  subject_label: "family 2A encapsulin nanocompartment shell protein (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0140737
  object_label: encapsulin nanocompartment
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Cellular-component mapping: the shell protein localises to / constitutes the encapsulin
    nanocompartment. equivalog, PMID:25024436,34362927,35146412. gain_all=897 of 901, gain_rev=0.

- subject_id: NCBIFAM:NF042963
  subject_label: "anti-phage-associated DUF1156 domain-containing protein (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0051607
  object_label: defense response to virus
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Biological-process mapping for an anti-phage defense family. equivalog, PMID:32855333.
    gain_all=151 of 151 (0 carry the term), gain_rev=0 -- a brand-new defense family with no GO
    propagation at all; a clear gap-fill / proposed-annotation candidate.

# ----- exactMatch, EC-bridge supported (ec2go(EC) == this GO term, verified live 2026-06-20) -----

- subject_id: NCBIFAM:NF000320
  subject_label: "PEN family class A beta-lactamase, Bcc-type (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0008800
  object_label: beta-lactamase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.5.2.6 -> GO:0008800 in ec2go. AMR-relevant. equivalog, PMID:19075063,9371340.
    gain_all=0 of 45 (already fully covered in this reviewed-heavy family), included as a clean AMR
    exemplar; cf. NF033105 (a different beta-lactamase family) which maps to the same GO term.

- subject_id: NCBIFAM:NF033105
  subject_label: "subclass B3 metallo-beta-lactamase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0008800
  object_label: beta-lactamase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.5.2.6 -> GO:0008800. AMR-relevant (carbapenem-hydrolysing metallo class).
    equivalog. gain_all=165 of 1170 -- a substantial TrEMBL gap. Two distinct NCBIFAM families
    (NF000320 serine class A, NF033105 metallo B3) legitimately share one GO term: family -> GO is
    many-to-one, the structural analog of RHEA's many-reactions-to-one-activity finding.

- subject_id: NCBIFAM:NF002525
  subject_label: "D-alanine--D-alanine ligase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0008716
  object_label: D-alanine-D-alanine ligase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 6.3.2.4 -> GO:0008716. Peptidoglycan biosynthesis (Ddl, a vancomycin-resistance
    locus). equivalog. gain_all=3 of 777, gain_rev=0.

- subject_id: NCBIFAM:NF003009
  subject_label: "5'-deoxynucleotidase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0002953
  object_label: 5'-deoxynucleotidase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.1.3.89 -> GO:0002953. equivalog. gain_all=170 of 1714, gain_rev=0.

- subject_id: NCBIFAM:NF004018
  subject_label: "uridine kinase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004849
  object_label: uridine kinase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 2.7.1.48 -> GO:0004849. Pyrimidine salvage. equivalog. gain_all=329 of 15080 --
    the largest TrEMBL gap among the enzyme rows in this batch; gain_rev=0.

- subject_id: NCBIFAM:NF006707
  subject_label: "class I fructose-bisphosphate aldolase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004332
  object_label: fructose-bisphosphate aldolase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 4.1.2.13 -> GO:0004332. Glycolysis. equivalog. gain_all=4 of 1476, gain_rev=0.

- subject_id: NCBIFAM:NF007054
  subject_label: "alpha-amylase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004556
  object_label: alpha-amylase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.2.1.1 -> GO:0004556. equivalog. gain_all=1 of 37, gain_rev=0.

- subject_id: NCBIFAM:NF011000
  subject_label: "acylphosphatase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0003998
  object_label: acylphosphatase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.6.1.7 -> GO:0003998. equivalog. gain_all=4 of 1135, gain_rev=0.

- subject_id: NCBIFAM:NF040791
  subject_label: "glycerate 2-kinase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0043798
  object_label: glycerate 2-kinase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 2.7.1.165 -> GO:0043798. equivalog, PMID:16684110. gain_all=6 of 7 -- small but
    nearly total gap for this family.

- subject_id: NCBIFAM:NF045654
  subject_label: "acid phosphatase PhoC (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0003993
  object_label: acid phosphatase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.1.3.2 -> GO:0003993. equivalog, PMID:10877772,8081499. gain_all=1 of 80.

- subject_id: NCBIFAM:TIGR02321
  subject_label: "phosphonopyruvate hydrolase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0033978
  object_label: phosphonopyruvate hydrolase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 3.11.1.3 -> GO:0033978. Phosphonate catabolism. equivalog, PMID:12697754.
    gain_all=71 of 137, gain_rev=0.

- subject_id: NCBIFAM:TIGR02694
  subject_label: "arsenate reductase (azurin) small subunit (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0050611
  object_label: arsenate reductase (azurin) activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 1.20.9.1 -> GO:0050611. Arsenic detoxification/respiration. equivalog,
    PMID:12679550. gain_all=144 of 355, and gain_rev=2 -- reviewed entries are missing it too.

- subject_id: NCBIFAM:TIGR03828
  subject_label: "1-phosphofructokinase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0008662
  object_label: 1-phosphofructokinase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 2.7.1.56 -> GO:0008662. Fructose catabolism. equivalog. gain_all=16 of 5974,
    gain_rev=0.

- subject_id: NCBIFAM:NF001277
  subject_label: "adenosylcobinamide-GDP ribazoletransferase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0051073
  object_label: adenosylcobinamide-GDP ribazoletransferase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    EC-bridge: EC 2.7.8.26 -> GO:0051073. Cobalamin (B12) biosynthesis (CobS). equivalog.
    gain_all=4 of 1293, gain_rev=0.

# ===== exactMatch (our proposed mapping): specific term we curate to REPLACE NCBI's broad one =====
# NCBI's hmm_PGAP go_terms gives only a broad/near-root term for these families, but a precise GO
# term already exists and is EC-bridge confirmed (ec2go(EC) = the specific term). The mapping we
# propose for ncbifam2go is therefore the SPECIFIC term as an exactMatch -- not NCBI's broad value
# (which is recorded in the comment as the thing being corrected). Reclassifying broadMatch->exactMatch
# here also UNMASKS the real propagation gain: the broad parent was near-universal so its gain looked
# ~0, whereas the specific term reveals large gaps, including reviewed (Swiss-Prot) ones.

- subject_id: NCBIFAM:NF002326
  subject_label: "deoxyguanosinetriphosphate triphosphohydrolase / dGTPase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0008832
  object_label: dGTPase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term. NCBI's go_terms gave only the broad parent GO:0016793 (triphosphoric monoester
    hydrolase activity); the specific GO:0008832 (dGTPase activity) is correct (product name = dGTPase,
    EC 3.1.5.1; ec2go EC 3.1.5.1 -> GO:0008832). equivalog. Against the specific term gain_all=456 of
    3719 and gain_rev=13 -- a substantial reviewed gap the broad parent (gain~0) hid entirely.

- subject_id: NCBIFAM:NF005804
  subject_label: "enoyl-CoA hydratase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004300
  object_label: enoyl-CoA hydratase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term. NCBI assigned the ontology near-root GO:0003824 (catalytic activity) to this
    EC 4.2.1.17 enzyme; ec2go maps EC 4.2.1.17 -> GO:0004300 directly. equivalog. Against the specific
    term gain_all=184 of 485, gain_rev=1 (the GO:0003824 parent gain was ~0 -- altitude masking).

- subject_id: NCBIFAM:NF006559
  subject_label: "dihydroorotase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004151
  object_label: dihydroorotase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term. NCBI assigned the broad parent GO:0016810; the specific GO:0004151 is the ec2go
    target of EC 3.5.2.3. Pyrimidine biosynthesis (PyrC). equivalog. Against the specific term
    gain_all=491 of 1354 (vs 4 for the broad parent), gain_rev=0.

- subject_id: NCBIFAM:TIGR00417
  subject_label: "spermidine synthase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0004766
  object_label: spermidine synthase activity
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term. NCBI assigned the ontology near-root GO:0003824 (catalytic activity) to this
    EC 2.5.1.16 enzyme; ec2go maps EC 2.5.1.16 -> GO:0004766 directly. Polyamine biosynthesis.
    equivalog. Against the specific term gain_all=575 of 9260 and gain_rev=1 (the GO:0003824 parent
    gain was ~0).

- subject_id: NCBIFAM:TIGR03542
  subject_label: "LL-diaminopimelate aminotransferase (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0010285
  object_label: "L,L-diaminopimelate:2-oxoglutarate transaminase activity"
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term. NCBI assigned the broad class GO:0008483 (transaminase activity); the specific
    GO:0010285 is the ec2go target of EC 2.6.1.83. Lysine/peptidoglycan biosynthesis. equivalog,
    PMID:17093042,17583737. Against the specific term gain_all=1185 of 2649 (vs 77 for the broad
    class), gain_rev=2.

# ===== broadMatch: only a broader/whole-complex term exists; no specific child to adopt =====
# One row remains broadMatch: VirB5 is a subunit part_of a complex (the term is the whole complex,
# no subunit-specific CC term exists).

- subject_id: NCBIFAM:TIGR02791
  subject_label: "P-type DNA transfer protein VirB5 (equivalog)"
  predicate_id: skos:broadMatch
  predicate_label: broad match
  object_id: GO:0043684
  object_label: type IV secretion system complex
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    VirB5 is a minor pilus subunit, part_of the T4SS; the GO term is the whole complex, so the
    relation is part_of (broadMatch), not exact, and there is no VirB5-specific CC term to propose.
    equivalog, PMID:12855161,14673074,15901731. gain_all=298 of 298 (0 carry it), gain_rev=4.

# ===== exactMatch (our proposed mapping): curator-consensus term, declining the over-annotation =====

- subject_id: NCBIFAM:TIGR00439
  subject_label: "permease-like cell division protein FtsX (equivalog)"
  predicate_id: skos:exactMatch
  predicate_label: exact match
  object_id: GO:0051301
  object_label: cell division
  mapping_justification: semapv:ManualMappingCuration
  comment: >-
    Our proposed term, chosen against NCBI's GO:0000910 (cytokinesis) on empirical + ontology grounds.
    Ontology: GO:0000910 cytokinesis is part_of GO:0051301 cell division, so NCBI's term is actually
    NARROWER, not broader. Empirically all 7 reviewed FtsX proteins carry GO:0051301 (cell division)
    but only 2/7 carry cytokinesis or GO:0043093 (FtsZ-dependent cytokinesis). FtsX/FtsEX regulates
    septal peptidoglycan hydrolysis and divisome assembly, so the conservative participation term
    (cell division) is the curator consensus. gain_all=22 of 3306 (already near-universal -- a
    confirmatory, low-gap mapping). NOTE: mapping to the specific GO:0043093 instead would show a huge
    apparent gain (3304) but curators apply it to only 2/7 entries, so blanket propagation would be
    OVER-ANNOTATION -- we deliberately keep the safe consensus term. This is the opposite call from
    the five altitude rows above, where the specific term WAS the correct, universally applicable one.
