Function Knowledge Gaps
What does biology genuinely not know about how a gene works — and how do we state it
rigorously enough that someone could close it?
This project builds a curated, literature-grounded register of function knowledge gaps:
specific, defensible statements of what is unknown about a gene product's molecular function,
mechanism, partners, or biological role. It is the inverse of the rest of this repository.
Everywhere else we adjudicate what is known; here we adjudicate, with the same evidentiary
discipline, what is not.
The framing is deliberately ambitious. If a foundation set out to systematically eliminate the
conserved "unknome," the first deliverable would not be more experiments — it would be an honest
map of where the real gaps are, sharp enough to fund. That map is the goal.
Why this matters: the unknome
Decades into the genomic era, a large fraction of even well-studied proteomes remains
functionally dark:
- ~20% of proteins in well-studied model organisms still lack any informative description of
their biological role, and many of these are conserved from yeast to human, implying
fundamental, not peripheral, functions (Wood et al. 2019, Open Biology, "Hidden in plain
sight", PMID:30938578, DOI).
- The Unknome database, which ranks proteins by an evidence-weighted "knownness" score
(manual annotation = 0.9, electronic = 0.0; "unknown" = score ≤ 1.0), found 23% of human
protein clusters still below that threshold, down from 43% a decade earlier, i.e. slow
progress. RNAi screening of 260 conserved-unknown Drosophila genes found 62 essential for
viability and 59 with measurable phenotypes: the dark set is dense with real biology
(Rocha, Freeman et al. 2023, PLOS Biology, PMID:37552676, DOI).
Genes stay dark largely for sociological reasons — research concentrates on what is already
famous (the streetlight effect / Matthew effect; Stoeger et al. 2018, PLOS Biology,
PMID:30226837; Edwards et al. 2011, Nature, "Too many roads not taken", PMID:21307913). The
same neglected set is over-represented among unsolved Mendelian disease, unannotated GWAS
peaks, and the druggable-but-unexplored proteome (the NIH IDG program; Oprea et al. 2018,
Nat Rev Drug Discov, PMID:29472638; Kustatscher et al. 2022, Nat Methods, "Understudied
proteins", PMID:35534633). Closing function gaps is the denominator problem under all of these.
Core principle: a knowledge gap is a curated judgment, not a metric
The central methodological commitment of this project: a knowledge gap is determined by
reading the primary literature and exercising judgment, not by a pattern in the annotations.
We tested the alternative and it failed instructively. A score that flags genes with no
specific molecular function (à la Unknome, but using our adjudicated core_functions) marks
~14% of our reviewed genes as "MF-dark" — reassuringly close to Wood's ~20% and Unknome's 23%.
But when the conserved subset of those was inspected, "no molecular function" decomposed into
four completely different things:
| What the score called a "gap" | Share | What it actually is | Owner |
|---|---|---|---|
| Structural/accessory subunits (TOM, TRAPP, ESCRT, V-ATPase…) | ~64% | Function is "be part of the machine"; the GO MF aspect can't express it | Ontology |
| Function-kind known, MF term simply not filled (e.g. SOX9) | ~14% | Curation incompleteness | Curation |
| Stub / incomplete reviews (TODO descriptions) | ~3% | Data hygiene | n/a |
| Genuinely unknown mechanism (CFAP300, tam10, P3R3URF…) | ~18% | The true unknome | Experiment |
(Snapshot over ~2,100 reviewed genes; proportions are triage-grade, not curated.) A heavily
annotated gene can hide a gaping mechanistic hole; a sparsely annotated one can be perfectly
understood and merely under-curated. Only reading tells them apart. The metric is therefore
demoted to a back-of-house triage aid that produces a read-list, and never appears in the
product.
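As a rough illustration of the kind of back-of-house triage described above, the sketch below flags reviews whose core functions name no specific molecular-function term. Only `core_functions` comes from this page; the `molecular_function` key, its `id` field, and the toy records are assumptions made for the example, not the repository's actual data layout or script.

```python
# Terms that carry no specific molecular function (root MF term, protein binding).
UNINFORMATIVE_MF = {"GO:0003674", "GO:0005515"}


def is_mf_dark(review: dict) -> bool:
    """Return True when no core function names a specific MF term.

    Assumes each core function may hold a 'molecular_function' dict with an
    'id' key; that layout is hypothetical.
    """
    for core in review.get("core_functions", []):
        mf = core.get("molecular_function") or {}
        term_id = mf.get("id")
        if term_id and term_id not in UNINFORMATIVE_MF:
            return False
    return True


def read_list(reviews: dict) -> list:
    """Genes to read next: a queue for curators, not a verdict."""
    return sorted(gene for gene, review in reviews.items() if is_mf_dark(review))


# Toy records for illustration only.
reviews = {
    "CFAP300": {"core_functions": [{"description": "dynein arm preassembly"}]},
    "SOX9": {"core_functions": []},
    "RAB9A": {"core_functions": [{"molecular_function": {"id": "GO:0003924"}}]},
}
print(read_list(reviews))  # ['CFAP300', 'SOX9']
```

The flag is identical for the two toy genes it returns, even though the table above puts CFAP300 under genuine unknowns and SOX9 under curation incompleteness. That is the reason the output is treated as a read-list rather than a result.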
A taxonomy of function knowledge gaps
Two axes. First, what kind of ignorance (this determines who can fix it):
- Biology gap: nobody knows. Resolved by experiments. This is the unknome and the
project's primary target.
- Curation gap: the knowledge exists in the literature but is not yet annotated, or is
annotated too generically. Resolved by curation (much of it in-house).
- Ontology gap: the knowledge exists but no GO term can express it (e.g. "structural
subunit of complex X", or a novel activity). Resolved by ontology development;
tracked via `proposed_new_terms`. Worked exemplar: POLE4 below.
Most real entries are a blend (CFAP300 below is biology-dominant with an ontology shadow),
and naming the blend is the actionable part.
A useful third framing emerged from curation, cutting across the above: the residual sub-gap —
a gene whose core function is textbook-solid but which still hides one sharp, load-bearing
mechanistic hole (e.g. RAB9A's unidentified GEF/GAP, RASA1's catalysis-independent scaffolding,
atg101's WF-finger recruit). These are easy to miss precisely because the gene looks finished;
flagging them is high-value because the hole is often the rate-limiting unknown for the pathway.
Second, which GO aspect is dark — most "dark" genes are not uniformly dark:
- MF-dark: process/location known, molecular mechanism unknown (the most common and most
insidious: rich BP/CC make the gene look known). The `protein binding` smell lives here.
- BP-dark: an activity is known but not what it is for (common in microbial/plant
metabolism).
- CC-dark: function known, but where/when unknown.
- Wholly dark: only root terms / IEA / `protein binding` survive review (the deep unknome).
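The second axis can be sketched as a small labelling function over the annotations that survive review. This is a minimal sketch under stated assumptions: the record keys (`aspect`, `term`, `evidence`) are hypothetical, and "informative" is reduced here to "not a root term, not protein binding, not IEA", which is cruder than a curator's reading.

```python
ROOT_OR_GENERIC = {
    "GO:0003674",  # molecular_function (root)
    "GO:0008150",  # biological_process (root)
    "GO:0005575",  # cellular_component (root)
    "GO:0005515",  # protein binding
}


def dark_aspects(annotations: list) -> list:
    """Label which GO aspects have no informative annotation left after review."""
    lit = {"MF": False, "BP": False, "CC": False}
    for ann in annotations:
        informative = ann["term"] not in ROOT_OR_GENERIC and ann["evidence"] != "IEA"
        if informative:
            lit[ann["aspect"]] = True
    dark = [aspect for aspect, known in lit.items() if not known]
    if len(dark) == 3:
        return ["WHOLLY_DARK"]
    return [f"{aspect}_DARK" for aspect in dark]


# A record carrying only one electronic extracellular-region term.
only_iea_cc = [{"aspect": "CC", "term": "GO:0005576", "evidence": "IEA"}]
print(dark_aspects(only_iea_cc))  # ['WHOLLY_DARK']

# Toy record: process and location known, mechanism captured only as protein binding.
rich_bp_cc = [
    {"aspect": "BP", "term": "GO:0070286", "evidence": "IMP"},
    {"aspect": "CC", "term": "GO:0005737", "evidence": "IDA"},
    {"aspect": "MF", "term": "GO:0005515", "evidence": "IPI"},
]
print(dark_aspects(rich_bp_cc))  # ['MF_DARK']
```

A label produced this way only says where to look. Residual sub-gaps, by definition, pass this kind of check and are found only by reading.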
The unit of work: anatomy of a gap entry
Each gap is a small, defensible scholarly object. Required elements:
- Gap statement: the specific unknown, stated precisely. Not "role unclear" but
"the direct substrate / the catalytic activity / the essential partner is undetermined."
- Boundary of knowledge: what is established, so the gap is sharply delimited.
- Provenance for the gap itself: evidence that the unknown is real and not merely
uncurated. The gold standard is the field's own admissions: primary-literature hedges
("remains to be determined", "the precise role is unknown"), "domain of unknown function"
names, and a literature trajectory that confirms-but-never-mechanizes. Cited with the same
`[PMID:xxxx "supporting text"]` discipline this repo uses for positive claims.
- Type judgment: biology / curation / ontology (or blend), per the taxonomy.
- Significance: why closing it matters.
- What would resolve it: the experiment, ontology term, or curation action.
This is strictly more than the existing suggested_questions field (which asks the right things
but cites nothing): the added value is the adjudicated boundary plus provenance for the
unknown.
Structured curation: the KnowledgeGap schema class
The anatomy above is now a first-class schema object, KnowledgeGap, so a gap can be curated
in the gene/module YAML itself rather than only as prose on this page. This mirrors the way
the Monarch dismech knowledge base promotes
unknowns to a first-class Discussion object (with kind / status / proposed_experiment),
and goes further by demanding provenance for the unknown and a biology/curation/ontology
typing.
A KnowledgeGap (see src/ai_gene_review/schema/gene_review.yaml) carries:
- `gap_statement` (required) and `boundary`: the precise unknown and the edge of knowledge;
- `gap_kind` (`BIOLOGY` / `CURATION` / `ONTOLOGY`, multivalued for blends), `dark_aspect`
(`MF_DARK` / `BP_DARK` / `CC_DARK` / `WHOLLY_DARK` / `RESIDUAL_SUBGAP`), and a dismech-style
lifecycle `status` (`OPEN` / `NARROWING` / `CLOSING` / `RESOLVED`);
- `significance` and `resolution`;
- `provenance`: a list of `SupportingTextInReference`, so each "the field's own admission of
ignorance" quote is a verbatim substring checked by the reference validator, exactly like
`supported_by`. (When the only source is a DOI-only paper or a local analysis, anchor to a
`file:` reference, as elsewhere in this repo.) Caveat: only PMID/PMC-style references are
verbatim-checked; `file:` (and the other `skip_prefixes` in
`conf/reference_validator_config.yaml`) are not quote-checked, so a `file:`-anchored quote
such as RAB9A's must be verified by hand. Prefer a checkable PMID quote where one exists.
- `proposed_terms`: inline `ProposedOntologyTerm`s for `ONTOLOGY` gaps.
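To make the shape of the object and the provenance caveat concrete, here is a minimal Python sketch. It is not the schema itself (that lives in the YAML file named above) and not the repository's validator. The field names on `SupportingText`, the whitespace and case normalisation, and the single-entry `SKIP_PREFIXES` tuple are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GapKind(Enum):
    BIOLOGY = "BIOLOGY"
    CURATION = "CURATION"
    ONTOLOGY = "ONTOLOGY"


@dataclass
class SupportingText:
    reference_id: str     # e.g. "PMID:29727693" or "file:..." (field names assumed)
    supporting_text: str  # quote expected to appear verbatim in the reference


@dataclass
class KnowledgeGap:
    gap_statement: str
    gap_kind: list = field(default_factory=list)  # multivalued for blends
    status: str = "OPEN"
    provenance: list = field(default_factory=list)


SKIP_PREFIXES = ("file:",)  # assumed subset of the configured skip_prefixes


def _norm(text: str) -> str:
    """Collapse whitespace and case so line wraps do not break a match (assumed policy)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quote(item: SupportingText, reference_text: Optional[str]) -> str:
    """Classify one provenance quote as OK, NOT_FOUND, UNCHECKED, or MISSING_REFERENCE."""
    if item.reference_id.startswith(SKIP_PREFIXES):
        return "UNCHECKED"  # must be verified by hand
    if reference_text is None:
        return "MISSING_REFERENCE"
    return "OK" if _norm(item.supporting_text) in _norm(reference_text) else "NOT_FOUND"


gap = KnowledgeGap(
    gap_statement="The biochemical activity of CFAP300 is unknown.",
    gap_kind=[GapKind.BIOLOGY, GapKind.ONTOLOGY],
    provenance=[
        SupportingText(
            "PMID:29727693",
            "supporting our hypothesis that C11orf70 is a preassembly factor",
        ),
        SupportingText(
            "file:genes/human/RAB9A/RAB9A-deep-research-cyberian.md",
            "have not been definitively identified",
        ),
    ],
)

# Stand-in text wrapped around the quoted phrase; NOT the real paper text.
stand_in = "(stand-in context) supporting our hypothesis that C11orf70\nis a   preassembly factor."
print([check_quote(item, stand_in) for item in gap.provenance])  # ['OK', 'UNCHECKED']
```

Returning `UNCHECKED` as its own status, rather than silently passing, keeps the hand-verification burden for `file:`-anchored quotes visible in any report built on top of the check.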
It can be attached at five levels — the whole gene (GeneReview.knowledge_gaps), a single
annotation (ExistingAnnotation.review.knowledge_gaps, ideal for residual sub-gaps), a
core function (CoreFunction.knowledge_gaps), a whole module (ModuleReview.knowledge_gaps),
or a single module step (ModuleNode.knowledge_gaps).
Worked YAML exemplars in the KB: genes/human/CFAP300/ (gene-level biology+ontology gap,
two PMID provenance quotes) and genes/human/RAB9A/ (annotation-level residual sub-gap on the
GTPase-activity review, the missing GEF/GAP).
Rendered from data: running just aggregate-knowledge-gaps (or
python scripts/aggregate_knowledge_gaps.py) walks every gene/module YAML and regenerates the
Structured Knowledge-Gap Register plus
reports/knowledge_gaps.tsv. The prose worked-entries below remain the curated narrative; the
register is their queryable, schema-backed counterpart and will grow as gaps are recorded in the
YAML.
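The aggregation step can be pictured as a recursive walk over each parsed review that collects every `knowledge_gaps` list, whatever level it is attached at. The sketch below is an illustration of that idea and not the contents of `scripts/aggregate_knowledge_gaps.py`; the function names, the TSV columns, and the toy review (including its `existing_annotations` key) are assumptions. It works on an already-parsed dictionary so that it needs only the standard library.

```python
import csv
import io


def collect_gaps(node, path=""):
    """Yield (attachment_path, gap) for every knowledge_gaps list at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "knowledge_gaps" and isinstance(value, list):
                for gap in value:
                    yield path or "<root>", gap
            else:
                yield from collect_gaps(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from collect_gaps(item, f"{path}[{index}]")


def to_tsv(gene: str, review: dict) -> str:
    """Render one review's gaps as tab-separated rows (column choice is illustrative)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "attached_at", "status", "gap_statement"])
    for where, gap in collect_gaps(review):
        writer.writerow([gene, where, gap.get("status", ""), gap.get("gap_statement", "")])
    return buffer.getvalue()


# Toy review with one gene-level gap and one annotation-level residual sub-gap.
review = {
    "knowledge_gaps": [{"gap_statement": "gene-level gap", "status": "OPEN"}],
    "existing_annotations": [
        {
            "term": {"id": "GO:0003924"},
            "review": {
                "knowledge_gaps": [
                    {"gap_statement": "cognate GEF and GAP unidentified", "status": "OPEN"}
                ]
            },
        }
    ],
}
print(to_tsv("RAB9A", review))
```

Searching by key rather than by a fixed list of paths means a new attachment level would be picked up without changing the walker, at the cost of not validating where a gap is allowed to sit.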
Worked example: the CFAP300 molecular-function gap
CFAP300 (formerly C11orf70) is a dynein axonemal assembly factor; loss-of-function causes
primary ciliary dyskinesia (PCD). It is in the KB (genes/human/CFAP300/) and is an ideal
exemplar: clear biomedical importance, a clearly established role, and a sharp, durable
mechanistic silence.
Boundary (what is firmly known):
- LOF causes PCD with combined outer + inner dynein arm (ODA+IDA) loss; CFAP300 is required
for cytoplasmic preassembly of axonemal dyneins and their IFT-dependent delivery into the
axoneme (Höben et al. 2018, AJHG, PMID:29727693, DOI;
Fassad et al. 2018, AJHG, PMID:29727692, DOI).
- Localizes mainly to cytoplasm, moves into cilia by IFT, and interacts with the preassembly
factor DNAAF2 (Höben 2018).
- Re-confirmed in 2025: LOF → no CFAP300 protein → total ODA+IDA loss (Demchenko et al. 2025,
IJMS, PMID:40806783, DOI).
Gap statement: The biochemical activity of CFAP300 is unknown. It is unresolved whether
it acts as a chaperone/co-chaperone, a scaffold, or an adaptor; what its client/substrate is;
and whether it acts at one common step upstream of both arm types or in parallel ODA- and
IDA-specific steps.
Provenance that the gap is real (the judgment):
1. Both founding papers describe the gene as "uncharacterized" and frame the mechanism as a
hypothesis ("supporting our hypothesis that C11orf70 is a preassembly factor") — they
establish requirement and localization, never a biochemical activity (PMID:29727692,
PMID:29727693).
2. Its sole domain is DUF4498 — a "domain of unknown function"
(genes/human/CFAP300/CFAP300-deep-research-falcon.md).
3. The trajectory is the strongest evidence: seven years and many cohorts on (Slavic 2019
PMID:30916986; Cypriot 2021 PMID:33715250; Russian 2024 PMID:39180133; ALI-culture 2025
PMID:40806783), every follow-up is diagnostic — confirming loss — not mechanistic. A
durable mechanistic silence despite clear motivation is the signature of a genuine gap, and
reading the arc of the literature is the only way to see it.
Type judgment: biology gap (dominant) with an ontology shadow — even "a dynein-preassembly
factor" has no adequate GO MF term, which is why the gene reads as MF-dark.
Significance: this is the assembly step whose failure removes all axonemal dynein motors —
mechanistically central to motile ciliopathy.
What would resolve it: proximity/affinity proteomics of the CFAP300–DNAAF2 module;
in vitro reconstitution of dynein assembly intermediates; structural characterization of
DUF4498.
Worked gap entries (from the seed read-list)
The following entries were curated by reading each gene's -deep-research-*.md and notes/GOA
files for the field's own statements of ignorance, in the CFAP300 format but condensed. Every
PMID cited below was independently verified against PubMed; ignorance quotes are reproduced
verbatim and attributed to the source file. Where a gene's deep-research file cited sources only
by DOI, or by PMIDs that failed verification, the gap is provenance-anchored to the file path
rather than to an unverified PMID (see the KCTD14 caution).
swrD (BACSU) — ~71-aa swarming-motility factor
- Boundary: Required for swarming on soft agar; increases per-flagellum motor torque/power at
the level of MotAB stator activity (not stator abundance); deletion cuts single-flagellum
torque ~6-fold and is rescued by motAB overexpression (Hall et al. 2018, J Bacteriol,
PMID:29061663, DOI). GOA carries only one BP term
(GO:0071978, swarming motility), with no MF and no CC.
- Gap statement: SwrD's direct binding partner(s), biochemical activity, membrane/motor
localization, and structural domain are undefined.
- Provenance (verbatim): "they proposed that SwrD may reduce stator dynamism by facilitating
stator retention at the motor, but were unable to directly visualize stator dynamics due to
nonfunctional fluorescent fusions—thus mechanistic details remain hypothetical"; "Experimental
localization of SwrD ... remains undetermined"; "no resolved structural domain showing
stomatin/SPFH" (genes/BACSU/swrD/swrD-deep-research-falcon.md).
- Type: biology gap (primary) with an ontology/curation shadow: genetics support an MF in
stator/torque modulation, but no MF term is assignable.
- Resolve: functional tagged-SwrD co-IP + super-resolution to test direct MotAB binding/
retention; in vitro reconstitution or cryo-EM of the SwrD–stator interface.
mxaC (METEA) — methanol-oxidation auxiliary VWA protein
- Boundary: In the mxa methanol-oxidation cluster and genetically required for methanol
oxidation; the founding paper states mxaC is required while "the function of the other two
genes is still unknown" (Morris et al. 1995, J Bacteriol, PMID:7592474,
DOI). Encodes a von Willebrand factor A
(VWA) domain protein; recent work implicates it in Ca²⁺ loading during methanol-dehydrogenase
(MDH) maturation alongside a MoxR AAA+ ATPase module (Zhou et al. 2025, Nat Commun; cited by
DOI in the file, no PMID present).
- Gap statement: Whether MxaC's VWA/MIDAS domain itself binds Ca²⁺, and how it delivers Ca²⁺
into the MxaF active site (adaptor vs scaffold vs transporter), is undemonstrated.
- Provenance (verbatim): "Direct physical interaction data (e.g., co-purification of MxaC with
MxaR) are not shown in the retrieved excerpts."; "There is no direct localization assay for
MxaC in the retrieved materials..." (genes/METEA/mxaC/mxaC-deep-research-falcon.md); "Does
MxaC directly bind Ca²⁺ through its vWFA domain?" (genes/METEA/mxaC/mxaC-notes.md).
- Type: biology gap (primary) + ontology gap (no GO term for "metal incorporation into
metalloenzyme" / MDH-complex assembly).
- Resolve: MIDAS-mutant Ca²⁺-binding assays on purified MxaC; cryo-EM / co-purification of the
MxaR/S/C/L module with MDH maturation intermediates.
TRAPPC12 (human) — moonlighting TRAPP subunit
- Boundary: Bona fide TRAPP subunit acting at an early stage of ER-to-Golgi trafficking
(Scrivens et al. 2011, Mol Biol Cell, PMID:21525244,
DOI); member of TRAPPIII/autophagy machinery (Kim et
al. 2016 review, PMID:27066478, DOI). Separately, a
mitotic "moonlighting" role: depletion gives noncongressed chromosomes and mitotic arrest, it
regulates CENP-E recruitment to kinetochores, and is phospho-cycled through mitosis (Milev et al.
2015, J Cell Biol, PMID:25918224, DOI).
- Gap statement: The molecular mechanism of the mitotic kinetochore function, and how it is
biochemically partitioned from the TRAPP trafficking role, is undefined; the CENP-E interaction
is captured only as uninformative `protein binding`.
- Provenance (verbatim): "TRAMM cycles between its role in TRAPP in interphase cells, and its
newly identified roles during mitosis"; "Small amounts of TRAMM associated with chromosomes"
(genes/human/TRAPPC12/TRAPPC12-notes.md, re PMID:25918224).
- Type: blend; biology gap (mechanism of the second, non-TRAPP pool) + curation/ontology gap
(CENP-E recruitment under-captured as `protein binding`).
- Resolve: separation-of-function/domain-mapping mutants that uncouple kinetochore binding from
TRAPP assembly; mitotic-phase proximity proteomics of the non-TRAPP TRAMM pool.
AGR3 (human) — noncanonical PDI-family protein
- Boundary: ER-resident thioredoxin/PDI-family protein restricted to ciliated airway
epithelium; required for Ca²⁺-mediated regulation of ciliary beat frequency and mucociliary
clearance but not for ciliogenesis (Bonser et al. 2015, Am J Respir Cell Mol Biol,
PMID:25751668, DOI); binds α-dystroglycan and C4.4a
by yeast two-hybrid in a cancer context (Fletcher et al. 2003, Br J Cancer, PMID:12592373,
DOI). Has a noncanonical thioredoxin motif (DCYQS,
lacking the second catalytic cysteine of CXXC PDIs).
- Gap statement: Whether AGR3's noncanonical motif retains any thiol-disulfide catalytic
activity, what its physiological substrates/folding clients are, and the receptor that mediates
extracellular-AGR3 signaling, are all unknown.
- Provenance (verbatim): "no definitive AGR3-specific enzymatic substrate, direct folding
client, or biochemical turnover measurement is established"; "much of AGR3 biochemistry remains
inferential rather than directly measured" (genes/human/AGR3/AGR3-deep-research-falcon.md).
- Type: biology gap (primary) + curation gap (GOA carries only generic `protein binding` plus a
single experimental `dystroglycan binding`; no MF for the core ciliary/Ca²⁺ role).
- Resolve: thiol-disulfide / reductase assays on WT vs catalytic-Cys-mutant AGR3 plus
proximity labeling for ER clients; identify the surface receptor for extracellular AGR3.
tam10 (SCHPO) — meiotic sequence orphan
- Boundary: Bona fide protein-coding gene (RT-PCR / RNA-seq / RACE confirmed), named for
"transcripts altered in meiosis"; deletion is viable / non-essential under standard conditions;
the only direct functional GO terms (RNA binding, nucleolus) are ISO/ISS transfers, never
validated in S. pombe (genes/SCHPO/tam10/tam10-deep-research.md; tam10-goa.tsv).
- Gap statement: No experimentally determined molecular activity, RNA/protein substrate,
partner, or in vivo role exists; the RNA-binding/nucleolus annotations are unverified
computational transfers.
- Provenance (verbatim): "no specific molecular function has been ascribed to Tam10";
"classifying it as a 'sequence orphan' with no obvious homology to characterized proteins"
(tam10-deep-research.md); "No evidence-supported enzymatic activity, substrate specificity,
transport substrate, or pathway membership can be asserted." (tam10-deep-research-falcon.md).
(The two providers even disagree on whether it has any annotatable domain.)
- Type: blend; biology gap (primary) + curation gap (unverified ISO/ISS; provider
disagreement on domain content).
- Resolve: GFP localization + AP-MS / CLIP to test the predicted nucleolar/RNA-binding role;
meiosis-specific phenotyping of tam10Δ (sporulation, spore viability, meiotic timing).
P3R3URF (human) — uORF microprotein (non-canonical-ORF dark matter)
- Boundary: Protein-coding locus encoding a ~95-aa microprotein from an ORF upstream of
PIK3R3; UniProt evidence at protein level (PE1); testis / late-spermatid enriched; the sole
GOA term is IBA (GO:0019221, cytokine-mediated signaling; phylogenetic inference, not
experiment) (genes/human/P3R3URF/P3R3URF-deep-research-openai.md; P3R3URF-goa.tsv).
- Gap statement: Whether the microprotein is stably expressed as an independent protein, where
it localizes, and what its molecular function/partner is are all unresolved; every current
functional claim is computational inference.
- Provenance (verbatim): "direct, P3R3URF-specific functional evidence is very limited in the
accessible corpus"; "Primary function ... Unknown in the accessible corpus."
(P3R3URF-deep-research-falcon.md); "While direct experimental evidence is still lacking..."
(P3R3URF-deep-research-openai.md). In the one dataset that tested for it, the transcript was
not detected.
- Type: blend; biology gap (primary) + curation gap (IBA-only term risks conflation with
canonical PIK3R3 / p55γ).
- Resolve: mass-spec detection of a P3R3URF-unique peptide + Ribo-seq initiation evidence in
testis; ORF-specific perturbation (sparing the PIK3R3 CDS) with interactome/localization readout.
- This is the deliberately-included "we did not even know there was a player" frontier.
KCTD14 (human) — least-studied KCTD-family BTB protein
- Boundary (PubMed-verified): member of the potassium-channel-tetramerization-domain (KCTD)
family, which "consists of 26 members with mostly unknown functions"; many KCTDs act as BTB-domain
CUL3 substrate adaptors, but most members remain functionally uncharacterized (Liu et al. 2013,
Cell Biosci, PMID:24268103, DOI).
- Gap statement: KCTD14's biochemical activity, bona fide substrate/partner, and biological
process are unknown, including whether it functions as a CUL3 substrate adaptor like several
paralogs, or via another mechanism.
- Provenance the gap is real (verifiable facts, not summary prose): the gene's GOA record
(genes/human/KCTD14/KCTD14-goa.tsv) contains no experimental function annotation: its
entire molecular-function content is five `protein binding` (GO:0005515, IPI) rows plus one
electronic protein-homooligomerization term (GO:0051260, IEA), with no IDA/IMP/IGI evidence for
any specific activity, substrate, or process. Combined with the family review's explicit
"mostly unknown functions" (PMID:24268103), this establishes an unfilled molecular-function gap
rather than mere under-curation.
- Type: biology gap (primary; no experiment defines its activity) + curation gap (the
high-throughput interactors behind the `protein binding` rows are uninformative and no specific MF
is captured).
- Resolve: endogenous AP-MS to define the partner/substrate and a CUL3 co-IP to test the
adaptor hypothesis directly; knockout/knockdown phenotyping with proteomics to assign a process.
- Note: an earlier draft leaned on the `-deep-research-cyberian.md` summary, which cited PMIDs
that fail verification (its "PMID:30929316" is an SGLT paper; "PMID:36362138" is a xylanase
paper). Those claims (including a reported CUL3-non-binding result attributed to an
unlocatable 2024 paper) are excluded here; the entry now rests only on the verified review and
the repo's own GOA data.
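The GOA-based part of the KCTD14 provenance is a mechanical check, and can be sketched as a filter over (GO ID, evidence code) pairs. The row composition below is exactly the one stated in the entry; the tuple representation and the function name are assumptions, since the column layout of the `-goa.tsv` files is not described here.

```python
# GO experimental evidence codes.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
PROTEIN_BINDING = "GO:0005515"


def informative_experimental(rows):
    """Keep rows with experimental evidence for something more specific than protein binding."""
    return [
        (go_id, code)
        for go_id, code in rows
        if code in EXPERIMENTAL and go_id != PROTEIN_BINDING
    ]


# Five protein-binding IPI rows plus one electronic homooligomerization term.
kctd14 = [("GO:0005515", "IPI")] * 5 + [("GO:0051260", "IEA")]
print(informative_experimental(kctd14))  # []
```

An empty result supports "no experiment defines an activity" only for the annotation record. The family review's statement is what ties the absence to the literature rather than to under-curation.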
C18orf21 (human) — ORF-named gene, a closing gap
- Boundary: 220-aa UPF0711-family protein with a single DUF4674 domain and no enzymatic motifs;
ubiquitously expressed, nucleolar-enriched; a recurrent CRISPR fitness-screen hit. 2025 preprints
propose it is a metazoan RNase MRP-specific subunit (alias RMRPP1) required for pre-rRNA
processing (genes/human/C18orf21/C18orf21-deep-research.md;
C18orf21-deep-research-falcon.md).
- Gap statement: Whether C18orf21 is genuinely a constitutive RNase MRP RNP subunit (and its
mechanistic role in RMRP stabilization / pre-rRNA cleavage) versus an uncharacterized DUF4674
protein is not yet established in peer-reviewed, GO-curated form.
- Provenance (verbatim): "evolutionarily conserved protein of 220 amino acids with no
well-characterized biochemical function"; "current databases report no defined Gene Ontology
molecular function for this gene"; "To date, there are no published studies focusing
exclusively on C18orf21's function." (C18orf21-deep-research.md).
- Type: curation / recency gap (leaning): function-defining evidence currently exists only as
2025 bioRxiv preprints (no PMIDs) and has not propagated to GOA (empty) or UniProt (still
"UPF0711").
- Resolve: peer-reviewed validation of RNase MRP-specific membership (reciprocal AP-MS vs
RPP21; RIP-seq showing RMRP, not RPPH1, enrichment) plus a pre-rRNA-processing readout.
- A clean illustration of a closing gap, and of why preprint-only evidence must not be
auto-annotated.
Worked gap entries — second batch (read-list deepening)
Curating the un-vetted leads surfaced a useful third category alongside the wholly-dark genes:
the residual sub-gap — a gene whose core function is textbook-solid, but which still hides a
sharp, specific mechanistic hole (RAB9A's missing GEF/GAP, RASA1's catalysis-independent
scaffolding, atg101's WF-finger recruit). These matter because a heavily annotated gene can look
finished while a load-bearing mechanism is undetermined — exactly the failure mode the project's
core principle warns about. All eight leads adjudicated to real gaps (none were spurious);
every PMID below was PubMed-verified, and one mis-attributed citation was caught and dropped.
MTC7 (yeast / S. cerevisiae) — telomere-capping sequence orphan
- Boundary: small ~139-aa basic protein; mtc7Δ clusters genetically with short-telomere /
telomere-maintenance deletions and is synthetically sick with cdc13-1 in a genome-wide screen
(Addinall et al. 2008, Genetics; cited in the files only by DOI/PMC, no verified PMID;
DOI). GOA marks every functional aspect ND
(no data) plus one IEA membrane keyword.
- Gap statement: Mtc7's biochemical activity, substrate/partner, and the mechanism by which it
influences telomere capping/length are entirely unknown.
- Provenance (verbatim): "MTC7 (YEL033W) encodes a protein of unknown molecular function. No
enzymatic activity or specific biochemical function has been demonstrated to date"
(genes/yeast/MTC7/MTC7-deep-research.md); "No study in the retrieved corpus provides a direct
molecular function for Mtc7 ... MTC7 remains functionally unannotated mechanistically"
(genes/yeast/MTC7/MTC7-deep-research-falcon.md).
- Type: biology gap (primary) + curation gap (a telomere-maintenance BP is arguably capturable
from the genetic evidence as IGI, yet GOA still carries only ND).
- Resolve: AP-MS / Y2H for physical partners; GFP localization + telomere-length / TPE assays
in mtc7Δ.
RAB9A (human) — known Rab, unknown switch (residual sub-gap)
- Boundary: endosome-to-TGN retrograde Rab GTPase that recycles mannose-6-phosphate receptors
(Lombardi et al. 1993, EMBO J, PMID:8440258,
DOI); has well-defined effectors including
the p40 effector (Díaz et al. 1997, J Cell Biol, PMID:9230071,
DOI) and TIP47/GCC185 (deep-research files).
- Gap statement: The specific GEF that activates RAB9A on late endosomes and the GAP that
inactivates it have not been definitively identified.
- Provenance (verbatim): "the specific guanine nucleotide exchange factor (GEF) that activates
RAB9A and the GTPase-activating protein (GAP) that inactivates it have not been definitively
identified" (genes/human/RAB9A/RAB9A-deep-research-cyberian.md); "Regulators specific to
RAB9A (cognate GEFs/GAPs) ... remain less well defined"
(genes/human/RAB9A/RAB9A-deep-research-falcon.md). (A claim in the openai file that DENND2 is
the GEF rests on a general DENN-domain paper covering other Rabs, not RAB9A, an
over-extrapolation that reinforces that no validated RAB9A GEF exists.)
- Type: biology gap; the regulators are genuinely undiscovered.
- Resolve: in vitro GEF assays across candidate DENN-domain GEFs; a TBC-domain GAP screen with
a CI-MPR mis-sorting readout on knockdown.
RASA1 (human) — catalysis solved, scaffolding unsolved (residual sub-gap)
- Boundary: p120 RasGAP; accelerates Ras GTP hydrolysis ~10⁵-fold via the arginine finger
Arg789, structurally defined with the transition-state (AlF) mimic (Scheffzek et al. 1997,
Science, PMID:9219684, DOI); multidomain
(SH2-SH3-SH2, PH, C2, GAP) with tandem-SH2 phosphotyrosine engagement
(genes/human/RASA1/RASA1-deep-research-falcon.md).
- Gap statement: The molecular mechanism of RASA1's GAP-activity-independent (scaffolding)
functions (e.g. how p190RhoGAP recruitment drives directed cell movement and contributes to
blood-vessel formation independently of its own Ras-GAP activity) is unresolved.
- Provenance (verbatim): "Experimental evidence indicates RASA1 is necessary for directed cell
movement in vitro, and this role depends on its ability to recruit p190^RhoGAP (independent of
RASA1's own Ras-GAP activity)"; "the embryonic blood vessel defects in RASA1-null embryos are
partly due to Ras-independent actions of RASA1" (RASA1-deep-research-openai.md); "How these
two functions are coordinated, and whether they can be separated therapeutically, warrants
further study." (RASA1-deep-research-cyberian.md).
- Type: blend; biology gap (the scaffolding mechanism is unresolved) + curation gap (GO
captures the catalytic GAP branch; the p190RhoGAP-recruitment role in migration is
under-annotated).
- Resolve: GAP-dead (Arg789) vs scaffold-dead (SH2/SH3) separation-of-function knock-ins
scoring migration / vascular tube formation; BioID of GAP-dead RASA1 to map the
catalysis-independent interactome.
BAIAP2L2 (human) — Pinkbar, dark in its native tissue
- Boundary: epithelial I-BAR/IMD protein ("Pinkbar") that binds phosphoinositide membranes and
generates planar membrane sheets, localizing to Rab13 vesicles and intercellular junctions in
intestine/kidney (Pykäläinen et al. 2011, Nat Struct Mol Biol, PMID:21743456,
DOI). In cochlear hair cells it is a row-2 stereocilia-tip
component (deep-research files).
- Gap statement: The molecular function of BAIAP2L2/Pinkbar in its name-defining native
intestinal/renal epithelium (what membrane/junctional structure it builds, and through which
partner) is unknown, because knockout mice show no overt epithelial phenotype.
- Provenance (verbatim): "mice lacking BAIAP2L2 display normal kidney and colon tissue
morphology and maintain normal electrolyte homeostasis and tissue architecture under
physiological conditions"; "BAIAP2L2's relationship to microvillar formation and maintenance in
intestinal brush borders remains incompletely understood."
(genes/human/BAIAP2L2/BAIAP2L2-deep-research-perplexity.md).
- Type: biology gap (primary; the epithelial MF was never measured) + curation gap (the
epithelial GO terms are IEA/ISS/IBA inferences).
- Resolve: challenge-condition / conditional-KO phenotyping of intestinal & renal epithelium
(barrier integrity, brush-border architecture under stress); Pinkbar proximity-labeling
interactome in polarized enterocytes.
SCGB1C1 (human) — orphan secretoglobin
- Boundary: small secreted secretoglobin-fold protein localized to Bowman's glands of the
olfactory mucosa, with a hydrophobic cavity capable of binding small ligands; in a mouse
OVA-asthma model, recombinant SCGB1C1 suppressed Th2 inflammation and expanded Tregs (Kim et al.
2024, Int J Mol Sci, PMID:38892470, DOI). GOA carries
only a single IEA extracellular-region term.
- Gap statement: The endogenous physiological ligand(s) SCGB1C1 binds in vivo, and the
cell-surface receptor / signaling mechanism behind its immunomodulatory (Treg-expanding) effect,
are unidentified.
- Provenance (verbatim): "The specific hydrophobic ligands that SCGB1C1 binds in vivo remain
incompletely characterized."; "The exact receptors and signaling cascades through which SCGB1C1
mediates these immunomodulatory effects remain to be definitively identified, representing an
important area for future investigation." (SCGB1C1-deep-research-perplexity.md).
- Type: biology gap (ligand + receptor genuinely undiscovered) + curation gap (only one IEA CC
term; the established secretoglobin fold and the mouse phenotype are uncaptured).
- Resolve: a biochemical ligand-binding screen (lipidomic / odorant affinity) on recombinant
human SCGB1C1; receptor identification by pulldown/proximity-labeling on Tregs + LOF validation.
FGFRL1 (human) — a receptor that signals without a kinase
- Boundary: atypical FGFR with three Ig-like ectodomains but no tyrosine-kinase domain;
binds FGF ligands and heparin and acts as a decoy receptor (Trueb et al. 2003, J Biol Chem,
PMID:12813049, DOI); forms constitutive dimers and
mediates HSPG-dependent cell adhesion (Rieckmann et al. 2007, Exp Cell Res, PMID:18061161,
DOI); essential for kidney, diaphragm and skull
development, with the cytoplasmic tail dispensable (deep-research files).
- Gap statement: The precise molecular mechanism by which kinase-dead FGFRL1 modulates
FGF/FGFR signaling in vivo (pure ligand sink vs inhibitory FGFR complexes vs intracellular
Sprouty/Spred recruitment), and the identity of its Ig3 cell-fusion partner, are unknown.
- Provenance (verbatim): "manipulating FGFRL1 levels in vitro did not measurably change cell
proliferation or ERK phosphorylation"; "How exactly does FGFRL1 regulate FGF signaling in vivo?
Is it purely by sequestering FGFs (acting as a sink), or does it form inhibitory complexes with
the signaling receptors?"; "The 'target protein' involved in FGFRL1-mediated cell fusion is
currently unknown." (genes/human/FGFRL1/FGFRL1-deep-research.md).
- Type: biology gap (primary; transduction mechanism undetermined) + curation gap (the GOA term
GO:0005007 "fibroblast growth factor receptor activity" overstates canonical signaling for a
kinase-dead receptor — a candidate MODIFY).
- Resolve: FGFRL1–FGFR1/2/3/4 co-IP + live-cell co-imaging during ligand stimulation to test
sink-vs-complex models; a fusion-defective Ig3 point-mutant knock-in plus a screen for the Ig3
partner.
atg101 (SCHPO) — known subunit, unknown recruit (residual sub-gap)
- Boundary: core subunit of the S. pombe Atg1/ULK autophagy-initiation complex (Yu et al.
2021, J Cell Sci, PMID:34499173, DOI); forms an obligate
HORMA heterodimer that stabilizes the Atg13 HORMA domain, and its conserved "WF finger" recruits
downstream factors (Suzuki et al. 2015, Nat Struct Mol Biol, PMID:26030876,
DOI).
- Gap statement: The molecular identity of the downstream factor(s) recruited by the Atg101
WF-finger surface in S. pombe is unknown, despite that surface being genetically required for
autophagy independent of Atg13 binding.
- Provenance (verbatim): "A WF-finger triple mutant (W110A, P111A, F112A) retained Atg13
binding but impaired autophagy, indicating Atg101 has functional roles beyond stabilizing/binding
Atg13." (atg101-deep-research-falcon.md; the underlying WF-finger result is from a thesis, no
PMID); "The precise protein targets of the WF finger motif remain incompletely characterized,
though WIPI family proteins represent likely candidates." (atg101-deep-research-perplexity.md).
- Type: biology gap (the binding partner is unidentified) + minor curation gap (no MF term for
the WF-finger recruitment; the key evidence sits in an uncited thesis, not a curatable PMID).
- Resolve: IP-MS comparing WT vs WF-finger AAA mutant in starved S. pombe; targeted binding
tests against candidate WIPI / Atg18 orthologs.
irg-1 (worm) — famous reporter, mystery protein
- Boundary: intestinal infection-response gene transcriptionally induced by Pseudomonas
aeruginosa via the bZIP factor ZIP-2 (Estes et al. 2010, PNAS, PMID:20133860,
DOI); annotated to antibacterial innate immune
response by IEP (expression) evidence; the protein carries predicted NADAR/YbiA-like domains (a
domain prediction only).
- Gap statement: The actual biochemical/enzymatic activity of the IRG-1 protein — whether its
predicted NADAR/YbiA-like domain confers a real catalytic function, and on what substrate — is
undefined; everything known concerns its transcriptional induction, not what the protein
does. GOA records molecular_function as ND.
- Provenance (verbatim): "irg-1 enables GO:0003674 molecular_function ... ND"
(genes/worm/irg-1/irg-1-goa.tsv); "Despite domain predictions (NADAR/YbiA-like), no direct
enzymatic reaction, substrate specificity, or transport function has been experimentally defined
for the IRG-1 protein in C. elegans." (genes/worm/irg-1/irg-1-deep-research-falcon.md).
- Type: biology gap (MF genuinely unknown), honestly reflected as an MF ND annotation — not a
curation/ontology defect. A clean BP-known / MF-dark exemplar.
- Resolve: assay recombinant IRG-1 for NADAR-family activity (NAD / ADP-ribose-related
hydrolase) against candidate substrates; structure-guided catalytic-residue mutagenesis + rescue.
Worked gap entries — third batch (cluster + subunit cases)
This batch closes out the taxonomy: it adds the project's first worked ontology-dominant entry
(POLE4 — the structural-subunit pattern that the core-principle table estimates at ~64% of
apparent MF-darkness), a consolidated cluster gap (the M. extorquens lanthanophore accessory
genes), and three more residual sub-gaps. With these, all three gap kinds (biology, curation,
ontology) and the residual-sub-gap framing have curated exemplars.
M. extorquens methanol/lanthanide cluster — uncharacterized accessory genes (consolidated)
- Scope & boundary: the mxa (Ca²⁺-dependent MDH), xox (lanthanide-dependent MDH), and
mll (lanthanophore biosynthesis) clusters drive methanol oxidation in METEA. Several accessory
components are now firmly defined and are not gaps: XoxG (cytochrome c_L) and XoxJ are
biochemically/structurally characterized (Featherston et al. 2019, ChemBioChem, PMID:31017712,
DOI); the lanthanide-switch regulatory cascade
MxcQE→MxbDM→mxa/xox is mapped (Skovran et al. 2011, J Bacteriol, PMID:21873495,
DOI); and the lanthanophore methylolanthanin plus its
biosynthetic machinery were identified in 2024 (Zytnick et al. 2024, PNAS, PMID:39078674,
DOI).
- Shared gap statement: for the mll accessory genes (mllG, mllH, mllJ and the NIS-synthetase
components mllA/mllF), no enzyme of the methylolanthanin pathway has been biochemically
reconstituted — the specific reaction, substrate, and direct molecular role of each protein are
inferred from homology and cluster context, not demonstrated.
- Per-gene provenance (verbatim): mllG — "No enzymatic activities have been directly
demonstrated for mllG through standard biochemical assays in the current literature."
(genes/METEA/mllG/mllG-deep-research-perplexity.md); mllA — "No direct biochemical studies
have been conducted on the mllA enzyme, and its precise substrate specificity, catalytic rate
constants, and reaction mechanism await experimental determination..."
(genes/METEA/mllA/mllA-deep-research-perplexity.md); mllH — "Direct enzymatic characterization
of recombinant META1p4137 protein has not yet been reported in the peer-reviewed literature..."
(genes/METEA/mllH/mllH-deep-research-perplexity.md); mllJ — "The primary molecular function of
MexAM1_META1p4138 (mllJ) remains partially characterized due to its recent discovery and the
presence of a domain of unknown function classification."
(genes/METEA/mllJ/mllJ-deep-research-perplexity.md).
- Type: biology gap (dominant — no reconstituted enzymology for the mll accessory proteins) +
ontology shadow (GO lacks precise terms for lanthanide / metallophore handling).
- Resolve: in vitro reconstitution of the methylolanthanin pathway with purified MllA/F/H plus
carriers (as done for aerobactin/petrobactin); single-gene (not whole-cluster) deletions of
mllG/H/J with lanthanide-bioaccumulation phenotyping and localization to assign individual roles.
MAP7D1 (human) — paralog-specific mechanism unknown (residual sub-gap)
- Boundary: MAP7-family microtubule-associated protein; binds MTs via its N-terminal half and
recruits/activates kinesin-1 via the MAP7 domain; all four paralogs bind kinesin-1 and act
redundantly in HeLa (Hooikaas et al. 2019, J Cell Biol, PMID:30770434,
DOI); MAP7D1 is specifically required to maintain
acetylated/stable microtubules, mechanistically distinct from MAP7D2 (Kikuchi et al. 2022, Life
Sci Alliance, PMID:35470240, DOI).
- Gap statement: The precise molecular mechanism by which MAP7D1 specifically maintains
acetylated/stable microtubules (vs its paralogs), and how it scaffolds nuclear DNA-damage-response
factors from the cytoplasm, are unknown.
- Provenance (verbatim): "How exactly does MAP7D1 maintain acetylated tubulin levels? Does it
regulate acetyltransferases, inhibit deacetylases, or protect acetylated microtubules from
depolymerization?"; "What is the precise molecular mechanism by which MAP7D1 participates in DNA
damage response? How do cytoplasmic MAP proteins interact with nuclear DDR machinery...?"
(genes/human/MAP7D1/MAP7D1-deep-research-cyberian.md).
- Type: blend — biology gap (the acetylation-maintenance and DDR-scaffolding mechanisms are
unresolved) + curation gap (GOA carries only family-level/IBA terms plus uninformative
protein binding; the experimentally supported kinesin-1-activation and acetylated-MT-maintenance
roles are not captured as specific MF/BP terms).
- Resolve: test whether MAP7D1 depletion alters ATAT1/HDAC6 activity or protects the
K40-acetylated lattice (in-cell + in-vitro); map the MAP7D1 region binding DDR factors and test
whether it is required for damage-focus formation.
POLE4 (human) — a structural subunit GO can't describe (ontology-dominant)
- Verdict & framing: this is the project's worked exemplar of the ontology gap — the
apparent MF-darkness is GO's inability to express "be part of / stabilize the machine," not true
biological ignorance. POLE4's biology is largely solved.
- Boundary: accessory subunit of DNA polymerase ε with no catalytic activity of its own; a
structural scaffold required for holoenzyme stability (in Pole4-KO mice the whole Polε complex
is destabilized) (Bellelli et al. 2018, Mol Cell, PMID:29754823,
DOI); with POLE3 it is a bona fide histone
H3–H4 chaperone coupling replication to nucleosome assembly (Bellelli et al. 2018, Mol Cell,
PMID:30217558, DOI).
- Gap statement: Whether POLE4 has any molecular function beyond a histone-fold
structural/scaffolding subunit and (with POLE3) an H3–H4 chaperone — and how that scaffolding
role should be expressed in GO — is the open question, not its biology.
- Provenance (verbatim): "Whether POLE4 plays a functional role in ATAC-mediated transcription
or whether this association reflects promiscuous histone fold interactions remains to be
determined." (genes/human/POLE4/POLE4-deep-research-cyberian.md); "The 2024 human Pol ε–PCNA
structures do not resolve the non-catalytic lobe (POLE2–POLE3–POLE4)..."
(genes/human/POLE4/POLE4-deep-research-falcon.md).
- Type: ontology-dominant + curation cleanup. GOA captures the biology adequately at the
complex level (part_of epsilon DNA polymerase complex GO:0008622; involved in DNA replication),
but the MF rows fall back to uninformative protein binding (×7) and an inferred
DNA-directed DNA polymerase activity (GO:0003887, TAS) that POLE4 does not itself perform.
- Resolve (ontology + curation, not experiment): annotate the scaffolding role with a
structural-constituent MF term (structural molecule activity, GO:0005198, or a replisome child)
and a histone-chaperone-activity MF term for the H3–H4 function; MODIFY/demote the misleading
polymerase-activity row and the generic protein binding. (Biology residue: a
dimerization-vs-chaperone separation-of-function allele to test whether processivity and
histone-recycling phenotypes dissociate.)
AP3B2 (human) — neuronal adaptor, unknown cargo (residual sub-gap)
- Boundary (PubMed-verified): AP3B2 encodes the neuron-specific β subunit of the AP-3
adaptor complex; the ubiquitous paralog AP3B1 instead causes Hermansky-Pudlak syndrome type 2, and
biallelic AP3B2 variants cause an early-onset epileptic encephalopathy with optic atrophy (Assoum
et al. 2016, Am J Hum Genet, PMID:27889060, DOI;
further cases in Dilber et al. 2022, Clin Neurol Neurosurg, PMID:36356440,
DOI). In mouse, the neuronal isoform knockout
(Ap3b2⁻/⁻) reduces selected synaptic-vesicle proteins — the opposite of the ubiquitous
Ap3b1⁻/⁻ — establishing isoform-divergent SV biogenesis (Newell-Litwa et al. 2009, Mol Biol
Cell, PMID:19144828, DOI); AP-3 also regulates SV size
in a brain-region-specific manner (Newell-Litwa et al. 2010, J Neurosci, PMID:20089890,
DOI).
- Gap statement: The complete repertoire of neuron-specific cargoes selected by the β3B (AP3B2)
AP-3 isoform — and the molecular basis by which that selection differs from the ubiquitous β3A
(AP3B1) isoform — is not defined.
- Provenance the gap is real (verifiable facts): the Ap3b2 vs Ap3b1 knockouts have opposite
effects on SV protein content (PMID:19144828) yet no study enumerates the β3B-specific cargo set;
correspondingly, the AP3B2 GOA record (genes/human/AP3B2/AP3B2-goa.tsv) carries no
experimental (IDA/IMP/IGI) cargo or function annotation — every term is IBA/IEA/NAS/ISS/TAS — so
the isoform-specific cargo specificity is genuinely undetermined, not merely uncurated.
- Type: blend — biology gap (the β3B-specific cargo set / isoform divergence is
uncharacterized) + ontology dimension (GO annotates the subunit gene rather than the
β3B-containing complex).
- Resolve: comparative quantitative proteomics of AP-3 vesicles immunoisolated from neuronal
(β3B) vs non-neuronal (β3A) cells; β3B-specific-KO neuron SV proteomics vs β3A rescue.
- Note: an earlier draft leaned on the -deep-research-cyberian.md summary, whose PMIDs fail
verification ("PMID:7545544" → nNOS/dystrophin; "PMID:19116307" → ADAMTS13); those are excluded
and the entry now rests on PubMed-verified primary papers plus the repo's GOA data.
atg2 (SCHPO) — known lipid bridge, unresolved directionality (residual sub-gap)
- Boundary: chorein-N-family lipid-transfer protein that bridges membranes through an extended
hydrophobic channel, carrying tens of lipids at once (Lees & Reinisch 2020, Curr Opin Cell
Biol, PMID:32213462, DOI); acts at the phagophore
rim / ER–phagophore contact and is required for autophagosome formation in S. pombe (Wang et
al. 2023, Nat Commun, PMID:37553386, DOI).
- Gap statement: How directionality, timing, and regulation of lipid flow are enforced in the
coupled Atg2–Atg9(scramblase)–Atg18/WIPI system at ER–phagophore contact sites is
mechanistically unresolved.
- Provenance (verbatim): "precise molecular coordination among ATG2, ATG9, and Atg18/WIPI
(timing, directionality, and regulation of lipid flow) remains incompletely resolved."
(genes/SCHPO/atg2/atg2-deep-research-falcon.md); "the calculated transfer rates appeared
insufficient to account for the tens of millions of lipids required to build an entire
autophagosome" (genes/SCHPO/atg2/atg2-deep-research-perplexity.md).
- Type: biology gap (residual mechanistic). The core MF/BP/CC are correctly captured by
experimental GO annotations, so this is not a curation or ontology gap — only the mechanism of
directional lipid coupling is open.
- Resolve: reconstitute Atg2–Atg9–Atg18 on asymmetric proteoliposomes with leaflet-specific
reporters to test whether scramblase coupling sets net directionality; cryo-EM of the Atg2–Atg9
junction at the phagophore rim.
EryCIV / EryCV (S. erythraea) — desosamine-pathway enzymes with unresolved catalytic identity (curation+experiment gap)
From the BGC erythromycin curation (terms/erythromycin_biosynthesis/). These are not unknown
genes — both are clearly TDP-D-desosamine biosynthesis enzymes of the ery cluster — but the
precise catalytic reaction of each is undetermined, and the existing annotations actively
conflict.
- Boundary (what is established): both contribute to TDP-D-desosamine biosynthesis (a
monosaccharide donor for erythromycin); EryCIV (A4F7N3) is a PLP-dependent DegT/DnrJ/EryC1-family
enzyme; EryCV (A4F7N2) is a radical-SAM-type enzyme (binds a [4Fe-4S] cluster and SAM). The
cluster's dedicated PLP sugar aminotransferase is the separate gene EryCI.
- Gap statement: EryCIV's reaction — PLP-dependent 3,4-dehydratase/isomerase (UniProt/EMBL
name) vs transaminase (GOA term) — is unresolved; EryCV's reaction — 3,4-enoyl reductase
(UniProt/EMBL name) vs ammonia-lyase (GOA term) — is unresolved.
- Provenance for the gap (checkable repo facts): the GOA files carry only electronic (IEA)
assignments with no experimental evidence code (genes/SACEN/eryCIV/eryCIV-goa.tsv,
genes/SACEN/eryCV/eryCV-goa.tsv), and those IEA terms contradict the UniProt/EMBL
ECO:0000313 names recorded in each gene's *-uniprot.txt. No primary biochemical
characterization was located (the erythromycin module deep-research run, modules/…-falcon.md,
explicitly states it "deliberately avoids asserting … the full enzymatic sequence of deoxysugar
biosynthesis … because direct primary textual support … was not successfully retrieved"). Both
reviews carry action: UNDECIDED for the catalytic MF.
- Type: curation+experiment gap — the function-kind is known; the exact MF needs a primary
source (likely Salah-Bey et al. 1998 deoxysugar gene-inactivation work, not yet read) or new
reconstitution.
- Significance: correct MF terms for these would complete the desosamine arm and let the
pathway module assert the donor-biosynthesis steps rather than leaving them UNDECIDED.
- Resolve: read the primary desosamine-pathway literature (gene-inactivation / heterologous
reconstitution) to adjudicate; or reconstitute EryCIV (PLP) and EryCV ([4Fe-4S]/SAM) in vitro on
the TDP-keto-deoxysugar intermediates. Tracked per-gene in suggested_questions/suggested_experiments.
Methodology
For each candidate gene:
- Deep literature search — use cached publications, the gene's -deep-research-*.md,
PubMed, and full text where available. Read primarily for the boundary and for the field's
own statements of ignorance.
- Harvest ignorance signals — author hedges ("remains unknown / unclear / to be
determined"), DUF / "uncharacterized" / hypothetical-protein naming, orphan-activity vs
orphan-gene framing, and the diagnostic-not-mechanistic trajectory.
- Judge the type — separate biology gaps from curation gaps from ontology gaps. This is
the irreducible human/curatorial step.
- Write the gap entry with full provenance.
- Route it — experiment (suggested_experiments), ontology (proposed_new_terms), or
curation (a normal review action).
Selection of which genes to read is curatorial, in the spirit of PomBase's "priority
unstudied genes" determination (Wood 2019) — informed, but not decided, by the triage
read-list.
Reviewing deep-research output
Deep-research -deep-research-{provider}.md files are LLM-generated and are leads, not sources.
Curating this project surfaced a consistent set of failure modes; the table below is grounded in
actual errors caught during this work, because the cost of not catching them is a curated artifact
that cites a real-looking paper that says something else.
| Failure mode | Why it's dangerous | Examples caught here |
|---|---|---|
| Mismatched-but-real PMID (well-formed, resolves, wrong paper) | Passes format validation; only a content check catches it | RASA1 "36323259" (→ILC paper); KCTD14 "30929316/36362138" (→SGLT/xylanase); AP3B2 "7545544/19116307" (→nNOS/ADAMTS13); Edwards/Oprea/Kustatscher in the intro |
| Citation by DOI / PMC / citekey only (no PMID) | Not verifiable by PMID tooling; easy to copy a wrong one | falcon files broadly; mxaC (Zhou 2025), tam10, C18orf21, MTC7 (Addinall) |
| Over-extrapolation (real PMID, but it doesn't support the claim) | The citation "checks out" yet the claim is unsupported | RAB9A: DENND2-as-GEF cited to a general DENN-domain paper on other Rabs |
| Unlocatable citation (claimed paper not findable at all) | Can't be confirmed or refuted | KCTD14's reported CUL3-non-binding "2024 paper" / PMC10856315 |
| Cross-provider disagreement | One file's confident claim contradicts another's | tam10 domain content; C18orf21 "characterized" vs "uncharacterized" |
Mechanisms, layered from automated to manual:
1. Automated PMID resolution + title-match (the missing layer). Today
src/ai_gene_review/tools/validate_pmid_references.py checks only PMID format and that cited
PMIDs appear in the review YAML — it cannot catch a real PMID pointing at the wrong paper. The
high-leverage addition: for each PMID in a deep-research file, fetch metadata (reusing the
refresh_pmid_titles / publication-cache machinery) and compare the fetched title/first-author/
year against the text adjacent to the citation; flag low-similarity matches and emit a per-file
citation-reliability score. This directly catches the dominant failure mode.
2. Ground claims against authoritative repo data, not prose. Each gene folder already ships
non-LLM ground truth — -goa.tsv (GO terms + evidence codes) and -uniprot.txt. Any claim about
annotation/function state should be checked against these, and a gap's provenance should prefer a
checkable repo fact (e.g. "GOA molecular_function is ND / IEA-only") over an LLM hedge. (This is
exactly how KCTD14 and AP3B2 were re-sourced above.)
3. Cross-provider corroboration. Treat a claim asserted independently by ≥2 providers as
stronger; treat a single-provider mechanistic claim as a lead; route disagreements to a human.
4. Provider-reliability tracking. Log citation-accuracy per provider over time (with a
small-sample caveat) so curators can weight accordingly — e.g. in this small sample the
cyberian files produced mismatched PMIDs twice (KCTD14, AP3B2), and falcon tends to cite by
DOI rather than PMID.
5. Promotion rule (human-in-the-loop). Nothing enters a curated artifact (review YAML, project
page) unless its citation is independently PubMed-verified or anchored to a checkable repo
fact. Verbatim LLM hedges are starting points, never the provenance of record.
6. Quote-and-source discipline. Require a verbatim supporting quote plus a locatable source for
each promoted claim; a claim with no findable quote+source is, by construction, unverified.
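The grounding check against GOA evidence codes can be made mechanical. The following is a minimal sketch, not the project's tooling: it classifies the evidence codes attached to one gene (or one GO aspect) so that a statement such as "GOA molecular_function is ND / IEA-only" can be asserted from data rather than from prose. The function names, the returned labels, and the `evidence_code` column name are assumptions made for illustration; the real column layout of the repo's -goa.tsv files should be checked first. The experimental set follows the standard GO evidence-code groupings.

```python
import csv
from typing import Iterable, List

# GO evidence codes in the "experimental" and "high-throughput experimental"
# groups. A curator may prefer a stricter set (for example, dropping IPI or IEP).
EXPERIMENTAL = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "HTP", "HDA", "HMP", "HGI", "HEP"}
)


def annotation_state(codes: Iterable[str]) -> str:
    """Summarise a set of evidence codes as a checkable annotation state."""
    seen = {c.strip().upper() for c in codes if c and c.strip()}
    if not seen:
        return "no annotations"
    if seen & EXPERIMENTAL:
        return "experimental"
    if seen == {"ND"}:
        return "ND"
    if seen == {"IEA"}:
        return "IEA-only"
    # Anything else (IBA, ISS, TAS, NAS, IC, ...) is inferred or author-stated.
    return "inferred-only"


def evidence_codes(lines: Iterable[str], column: str = "evidence_code") -> List[str]:
    """Read one column from tab-separated lines that start with a header row.

    The default column name is hypothetical; a missing column raises KeyError.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row[column] for row in reader]
```

With an open file handle `fh` on a GOA table, `annotation_state(evidence_codes(fh))` would give the state for the whole gene; filtering rows by aspect before the call would give a per-aspect answer, which is what the MF-dark entries need.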
Recommendation: implement mechanism 1 as a validate-deep-research check (or an opt-in,
network-gated mode of the existing tool), reporting a reliability score and a list of
mismatched/unresolvable PMIDs per file. Mechanisms 2, 5 and 6 are process rules already adopted in
this project; 3 and 4 fall out for free once 1 produces structured per-file output.
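The title-match idea behind mechanism 1 can be sketched as follows. This is an illustrative outline under stated assumptions, not the planned implementation: `fetch_title` is a hypothetical callable standing in for whatever the publication cache exposes, the word-overlap measure is one simple choice of similarity, and the 200-character window and 0.3 threshold are arbitrary starting values that would need tuning on real files.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

PMID_RE = re.compile(r"PMID:\s*(\d{1,8})")
WORD_RE = re.compile(r"[a-z0-9]+")


def content_words(text: str) -> set:
    """Lower-cased words longer than three characters (drops most stop words)."""
    return {w for w in WORD_RE.findall(text.lower()) if len(w) > 3}


def title_overlap(title: str, context: str) -> float:
    """Fraction of the fetched title's content words found near the citation."""
    words = content_words(title)
    if not words:
        return 0.0
    return len(words & content_words(context)) / len(words)


@dataclass
class CitationCheck:
    pmid: str
    overlap: float
    flagged: bool


def check_citations(
    text: str,
    fetch_title: Callable[[str], Optional[str]],  # hypothetical metadata lookup
    window: int = 200,
    threshold: float = 0.3,
) -> List[CitationCheck]:
    """Compare each cited PMID's fetched title with the text around the citation."""
    checks = []
    for match in PMID_RE.finditer(text):
        pmid = match.group(1)
        context = text[max(0, match.start() - window): match.end() + window]
        title = fetch_title(pmid)
        # An unresolvable PMID (no title returned) scores 0.0 and is flagged.
        overlap = title_overlap(title, context) if title else 0.0
        checks.append(CitationCheck(pmid, overlap, overlap < threshold))
    return checks


def reliability(checks: List[CitationCheck]) -> float:
    """Per-file score: share of citations that were not flagged."""
    if not checks:
        return 1.0
    return sum(not c.flagged for c in checks) / len(checks)
```

A flagged citation is a prompt for a human content check, not a verdict: a short or generic title can score low against a correct citation, and a wrong paper on a related topic can score high.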
Relation to existing KB machinery
The building blocks already exist and are reused, not replaced:
| Signal | Existing field | Gap role |
|---|---|---|
| Open questions | suggested_questions | Seed for biology-gap statements (to be elevated with provenance) |
| Proposed experiments | suggested_experiments | The "what would resolve it" |
| Missing GO terms | proposed_new_terms | Ontology-gap entries |
| Unresolvable annotation | action: UNDECIDED | Often a curation gap (esp. literature inaccessible) |
| protein binding avoidance rule | curation guideline | Flags MF-mechanism gaps |
Schema direction (implemented): this has now been elevated into a provenance-bearing
knowledge_gaps structure (gap statement + boundary + verbatim provenance quotes + kind/aspect/
status typing) so gaps are first-class and queryable — see the Structured curation section above
for the KnowledgeGap class. The project first proved the unit and method on the curated examples
above; the schema element now makes them machine-tractable.
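To make the shape of such a record concrete, here is a small sketch of the fields listed above (gap statement, boundary, verbatim provenance, kind/aspect/status). It is an illustration only and is not the repository's KnowledgeGap class: the class name `GapSketch`, every field name, the `status` values, and the `is_promotable` rule are assumptions. The example values are taken from the irg-1 entry.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class GapKind(Enum):
    BIOLOGY = "biology"
    CURATION = "curation"
    ONTOLOGY = "ontology"


class Aspect(Enum):
    MF = "molecular_function"
    BP = "biological_process"
    CC = "cellular_component"


@dataclass(frozen=True)
class Provenance:
    quote: str   # verbatim supporting text
    source: str  # locatable source: a verified PMID or a repo file path


@dataclass
class GapSketch:
    """Hypothetical stand-in for a provenance-bearing gap record."""
    gene: str
    statement: str
    boundary: str
    kind: GapKind
    aspect: Aspect
    status: str = "open"  # assumed vocabulary, e.g. "open" / "closing"
    provenance: List[Provenance] = field(default_factory=list)

    def is_promotable(self) -> bool:
        """Quote-and-source discipline: need a statement, a boundary, and at
        least one provenance item carrying both a quote and a source."""
        has_provenance = any(
            p.quote.strip() and p.source.strip() for p in self.provenance
        )
        return bool(self.statement.strip() and self.boundary.strip() and has_provenance)


irg1 = GapSketch(
    gene="irg-1",
    statement="The biochemical/enzymatic activity of the IRG-1 protein is undefined.",
    boundary="Transcriptionally induced by P. aeruginosa via ZIP-2 (PMID:20133860).",
    kind=GapKind.BIOLOGY,
    aspect=Aspect.MF,
    provenance=[
        Provenance(
            quote="irg-1 enables GO:0003674 molecular_function ... ND",
            source="genes/worm/irg-1/irg-1-goa.tsv",
        )
    ],
)
```

Keeping provenance as structured quote-plus-source pairs is what would let a validator reject an entry whose only support is an unsourced hedge.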
Prioritization (curatorial, not computed)
When choosing what to read next, weight toward gaps that are both high-value and tractable:
- Conservation depth — LECA-deep conservation implies fundamental function (Wood filter).
- Disease / GWAS / IDG overlap — for human genes; the funder hook. (Disease-mechanism
prioritization can be handed to / shared with Monarch's dismech.)
- A coevolution handle — phylogenetic-profile or co-expression clustering with a known
module gives instant guilt-by-association and a ready hypothesis.
- Microproteins / smORFs / non-canonical ORFs — a deliberately included frontier where the
gap is "we did not even know there was a player" (e.g. P3R3URF).
Seed read-list (triage candidates — to be read, not yet curated)
From the conserved-MF-dark triage, after stripping structural-subunit and curation-completeness
false positives. These are candidates for deep reading; all 22 rows marked worked now have
gap entries above.
| Gene | Org | Status | Why interesting |
|---|---|---|---|
| CFAP300 | human | worked | Dynein preassembly; mechanism unknown; DUF4498 (full exemplar) |
| P3R3URF | human | worked | 95-aa microprotein from a uORF; non-canonical-ORF dark matter |
| tam10 | SCHPO | worked | Meiotic sequence orphan; only ISO/ISS transfers |
| AGR3 | human | worked | Noncanonical PDI-family; catalytic activity / clients unknown |
| swrD | BACSU | worked | ~71-aa swarming-motility enhancer; mechanism unknown |
| mxaC | METEA | worked | VWA auxiliary protein for methanol oxidation; mechanism unknown |
| TRAPPC12 | human | worked | Moonlighting TRAPP factor; mitotic function ill-defined |
| KCTD14 | human | worked | Least-studied KCTD BTB protein; no experimental function in GOA |
| C18orf21 | human | worked | ORF-named; a closing gap (2025 RNase MRP preprints) |
| MTC7 | yeast | worked | Telomere-capping sequence orphan; all GO aspects ND |
| RAB9A | human | worked | Residual sub-gap: known Rab, cognate GEF/GAP unidentified |
| RASA1 | human | worked | Residual sub-gap: catalysis solved, scaffolding mechanism unsolved |
| BAIAP2L2 | human | worked | Pinkbar; native epithelial function unmeasured (KO normal) |
| SCGB1C1 | human | worked | Orphan secretoglobin; ligand + receptor unidentified |
| FGFRL1 | human | worked | Kinase-dead FGFR; signaling mechanism unresolved |
| atg101 | SCHPO | worked | Residual sub-gap: WF-finger recruitment partner unknown |
| irg-1 | worm | worked | BP-known / MF-dark; IRG-1 protein activity undefined (MF ND) |
| METEA mll cluster | METEA | worked | Consolidated: lanthanophore accessory enzymes unreconstituted |
| MAP7D1 | human | worked | Residual sub-gap: paralog-specific MT-acetylation mechanism |
| POLE4 | human | worked | Ontology gap exemplar: structural Polε subunit GO can't express |
| AP3B2 | human | worked | Residual sub-gap: β3B vs β3A neuron-specific cargo undefined |
| atg2 | SCHPO | worked | Residual sub-gap: directionality of Atg2–Atg9 lipid flow |
Reproducible read-list (ignorance-signal scan)
A grep of every genes/**/*-deep-research-*.md for author hedges (remains unknown / unclear /
undetermined / elusive / uncharacterized / to be determined) and for precise (function|role|
mechanism) ... (unknown|unclear|not) surfaced ~60 files carrying explicit ignorance
statements — a reproducible candidate pool beyond the triage table. The strongest leads from that
scan have now been curated (the second- and third-batch entries above). Remaining un-vetted leads
worth reading next, spotted in the same scan: human/PUS3, human/FGFRL1-adjacent CFAP418,
human/SOCS4/SOCS5, human/RFT1, worm/pef-1, worm/fshr-1, SCHPO/alo1, and DESVH/Q72DT1.
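A scan of this kind can be reproduced with a few lines of Python. The sketch below is an approximation rather than the command that was actually run: the two regular expressions are a reading of the hedge phrases listed above, and the exact patterns behind the ~60-file figure are not recorded here, so hit counts may differ. The function names are illustrative.

```python
import re
from pathlib import Path
from typing import Dict, List

# Author hedges such as "remains unknown" or "remain to be determined".
HEDGE_RE = re.compile(
    r"remains? (?:unknown|unclear|undetermined|elusive|uncharacterized|to be determined)",
    re.IGNORECASE,
)
# "precise function/role/mechanism" followed closely by unknown/unclear/not.
PRECISE_RE = re.compile(
    r"precise (?:function|role|mechanism)\b.{0,80}?\b(?:unknown|unclear|not)\b",
    re.IGNORECASE,
)


def ignorance_hits(text: str) -> List[str]:
    """Return the lines of a document that carry an explicit ignorance signal."""
    return [
        line.strip()
        for line in text.splitlines()
        if HEDGE_RE.search(line) or PRECISE_RE.search(line)
    ]


def scan(root: str) -> Dict[str, List[str]]:
    """Map each deep-research file under root/genes to its matching lines."""
    results = {}
    for path in sorted(Path(root).glob("genes/**/*-deep-research-*.md")):
        hits = ignorance_hits(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            results[str(path)] = hits
    return results
```

The output is a candidate pool only. Each hit still needs the curatorial read that separates a genuine biology gap from a curation or ontology gap.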
Curation caution (learned here): deep-research files frequently cite sources only by DOI,
and some cite PMIDs that do not resolve to the right paper (e.g. the KCTD14 cyberian file). Every
PMID promoted into a gap entry must be verified against PubMed before use; otherwise anchor the
gap to the deep-research file path and verbatim quote.
Prior art and references
- Wood et al. 2019, Open Biology — "Hidden in plain sight" (PMID:30938578)
- Rocha, Freeman et al. 2023, PLOS Biology — Functional unknomics / Unknome database (PMID:37552676)
- Stoeger et al. 2018, PLOS Biology — why important genes are ignored (PMID:30226837)
- Edwards et al. 2011, Nature — "Too many roads not taken" (PMID:21307913)
- Oprea et al. 2018, Nat Rev Drug Discov — unexplored therapeutic genome / IDG (PMID:29472638)
- Kustatscher et al. 2022, Nat Methods — understudied proteins initiative (PMID:35534633)
- de Crécy-Lagard et al. 2025 — enzymes/proteins of unknown function, prediction error types (PMID:40703034)
- Related internal projects: STRUCTURE_FUNCTION.md (dark proteome via structure),
IBA_REVIEW.md (how orthology propagation manufactures false knowledge),
OVER_ANNOTATION_PATTERNS.md (annotation that masks gaps).
- External model: monarch-initiative/dismech — Claude-Code-curated disease-mechanism KB with
the same per-record-YAML + verbatim-quote evidentiary discipline; derives gap/priority
dashboards over its corpus.
Status
- [x] Literature scan (unknome / Wood; Unknome database; dismech approach)
- [x] Survey of how gaps are captured in the KB today
- [x] Core principle established: gaps are curated judgments, not metrics
- [x] Gap taxonomy (biology / curation / ontology; MF/BP/CC darkness)
- [x] Unit defined (anatomy of a gap entry)
- [x] One worked exemplar (CFAP300)
- [x] Curate the seed read-list into worked gap entries (8 added: swrD, mxaC, TRAPPC12, AGR3, tam10, P3R3URF, KCTD14, C18orf21; all cited PMIDs PubMed-verified)
- [x] Reproducible ignorance-signal read-list established (~60 deep-research files)
- [x] Read-list deepening, batch 2 (8 added: MTC7, RAB9A, RASA1, BAIAP2L2, SCGB1C1, FGFRL1, atg101, irg-1; all adjudicated to real gaps; PMIDs PubMed-verified)
- [x] Third gap framing identified: the residual sub-gap of otherwise well-characterized genes
- [x] Read-list deepening, batch 3 (5 added: METEA mll cluster, MAP7D1, POLE4, AP3B2, atg2; PMIDs PubMed-verified)
- [x] All three gap kinds now have worked exemplars (biology = most; ontology = POLE4; curation = woven through MAP7D1/AP3B2)
- [x] Re-sourced summary-only entries (KCTD14, AP3B2) onto verified primary literature + repo GOA data
- [x] Documented deep-research review mechanisms (failure-mode table + layered checks)
- [ ] Implement validate-deep-research: PMID resolution + title-match + per-file reliability score (extends validate_pmid_references.py)
- [ ] Read-list deepening, batch 4: PUS3, CFAP418, SOCS4/SOCS5, RFT1, pef-1, fshr-1, alo1
- [ ] Decide unit granularity (per-gap vs per-gene narrative)
- [x] Decide home: standalone register vs knowledge_gaps schema element — done: added a first-class KnowledgeGap schema class (gene/annotation/core-function/module/module-node), with the structured register rendered from it
- [ ] Conservation / disease prioritization pass over candidates
Notes
- 2026-06: Project initiated. Brainstorm grounded in (a) the unknome literature, (b) the
dismech curation model, and (c) how gaps are recorded in this KB today. Key finding from a
throwaway triage script: apparent "missing molecular function" is dominated (~64% of the
conserved subset) by GO's inability to describe complex subunits, not by true ignorance —
reinforcing that the gap call must be made by reading, not counting. CFAP300 curated as the
first worked exemplar.
- 2026-06: Eight seed-list genes curated into worked gap entries by reading their
-deep-research-*.md / notes / GOA for the field's own ignorance statements (verbatim quotes).
Established a reproducible "ignorance-signal grep" over deep-research files (~60 hits) as a
standing read-list. Verified every promoted PMID against PubMed and caught a deep-research file
(KCTD14 cyberian) citing PMIDs that resolve to unrelated papers — codified the verify-or-anchor-
to-file-path rule. Span of gap types now covered: enzyme-adjacent bacterial (swrD, mxaC),
moonlighting eukaryotic (TRAPPC12), noncanonical-fold (AGR3), sequence orphan (tam10),
microprotein (P3R3URF), unstudied family member (KCTD14), and a closing gap (C18orf21).
- 2026-06: Second curation batch — eight read-list leads (MTC7, RAB9A, RASA1, BAIAP2L2, SCGB1C1,
FGFRL1, atg101, irg-1) each adjudicated by reading their deep-research/GOA files; all eight were
genuine gaps (no spurious calls). Surfaced the residual sub-gap category (RAB9A GEF/GAP, RASA1
scaffolding, atg101 WF-finger) — a sharp mechanistic hole inside an otherwise well-characterized
gene. PMID verification again paid off: PMID:36323259, cited by a deep-research summary as the
RASA1 tandem-SH2 reference, resolves to an unrelated stem-cell/ILC paper and was dropped;
RASA1's domain architecture is anchored to the file path instead. Seventeen worked entries now
span bacterial, fungal (budding + fission yeast), nematode, and human genes.
- 2026-06: Third curation batch (5 entries) completes the taxonomy. Added the first worked
ontology-dominant gap (POLE4 — a structural Polε subunit whose apparent MF-darkness is GO's
inability to say "structural constituent / scaffold," not real ignorance; the fix is curation +
a structural-molecule/histone-chaperone MF term, not an experiment), a consolidated cluster
entry for the M. extorquens lanthanophore (mll) accessory genes (unreconstituted enzymology,
carefully excluding the now-solved XoxG/XoxJ/MxcQE/MxbDM and the 2024 methylolanthanin discovery,
PMID:39078674), and three residual sub-gaps (MAP7D1, AP3B2, atg2). PMID verification caught two
more mis-citations in a deep-research file (AP3B2 cyberian: "PMID:7545544" → nNOS/dystrophin;
"PMID:19116307" → ADAMTS13), so AP3B2 was initially flagged. 22 worked entries now
cover all three gap kinds plus the residual-sub-gap framing.
- 2026-06: Re-sourced the two summary-only entries (KCTD14, AP3B2) from primary literature after a
review flagged that anchoring a gap to an LLM-generated -deep-research-*.md — especially one
whose PMIDs were just shown to be hallucinated — is near-circular. KCTD14 now rests on the
PubMed-verified KCTD family review (PMID:24268103) plus the repo's own GOA record (no experimental
MF); AP3B2 on Assoum 2016 (PMID:27889060), Newell-Litwa 2009/2010 (PMID:19144828, PMID:20089890),
Dilber 2022 (PMID:36356440) plus its GOA record (no experimental cargo annotation). The
unlocatable KCTD14 CUL3-non-binding claim was dropped. General principle adopted: provenance for
a gap must be a verified primary source or a checkable repo fact (GOA evidence codes), never an
unverified deep-research summary — see the new "Reviewing deep-research output" section.