Function Knowledge Gaps

What does biology genuinely not know about how a gene works — and how do we state it
rigorously enough that someone could close it?

This project builds a curated, literature-grounded register of function knowledge gaps:
specific, defensible statements of what is unknown about a gene product's molecular function,
mechanism, partners, or biological role. It is the inverse of the rest of this repository.
Everywhere else we adjudicate what is known; here we adjudicate, with the same evidentiary
discipline, what is not.

The framing is deliberately ambitious. If a foundation set out to systematically eliminate the
conserved "unknome," the first deliverable would not be more experiments — it would be an honest
map of where the real gaps are, sharp enough to fund. That map is the goal.

Why this matters: the unknome

Decades into the genomic era, a large fraction of even well-studied proteomes remains
functionally dark.

Genes stay dark largely for sociological reasons — research concentrates on what is already
famous (the streetlight effect / Matthew effect; Stoeger et al. 2018, PLOS Biology,
PMID:30226837; Edwards et al. 2011, Nature, "Too many roads not taken", PMID:21307913). The
same neglected set is over-represented among unsolved Mendelian disease, unannotated GWAS
peaks, and the druggable-but-unexplored proteome (the NIH IDG program; Oprea et al. 2018,
Nat Rev Drug Discov, PMID:29472638; Kustatscher et al. 2022, Nat Methods, "Understudied
proteins", PMID:35534633). Closing function gaps is the denominator problem under all of these.

Core principle: a knowledge gap is a curated judgment, not a metric

The central methodological commitment of this project: a knowledge gap is determined by
reading the primary literature and exercising judgment, not by a pattern in the annotations.

We tested the alternative and it failed instructively. A score that flags genes with no
specific molecular function (à la Unknome, but using our adjudicated core_functions) marks
~14% of our reviewed genes as "MF-dark" — reassuringly close to Wood's ~20% and Unknome's 23%.
But when the conserved subset of those was inspected, "no molecular function" decomposed into
four completely different things:

| What the score called a "gap" | Share | What it actually is | Owner |
|---|---|---|---|
| Structural/accessory subunits (TOM, TRAPP, ESCRT, V-ATPase…) | ~64% | Function is "be part of the machine"; the GO MF aspect can't express it | Ontology |
| Function-kind known, MF term simply not filled (e.g. SOX9) | ~14% | Curation incompleteness | Curation |
| Stub / incomplete reviews (TODO descriptions) | ~3% | Data hygiene | |
| Genuinely unknown mechanism (CFAP300, tam10, P3R3URF…) | ~18% | The true unknome | Experiment |

(Snapshot over ~2,100 reviewed genes; proportions are triage-grade, not curated.) A heavily
annotated gene can hide a gaping mechanistic hole; a sparsely annotated one can be perfectly
understood and merely under-curated. Only reading tells them apart. The metric is therefore
demoted to a back-of-house triage aid that produces a read-list, and never appears in the
product.
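
The triage score described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `ReviewedGene` class, its `core_function_mf_terms` field, and the `GENERIC_MF` set are hypothetical names, and the example genes carry illustrative term lists only.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for a reviewed gene; the real review YAML is richer.
@dataclass
class ReviewedGene:
    symbol: str
    core_function_mf_terms: list[str] = field(default_factory=list)

# Terms too generic to count as a specific molecular function (illustrative choice):
# GO:0003674 is the molecular_function root, GO:0005515 is protein binding.
GENERIC_MF = {"GO:0003674", "GO:0005515"}

def is_mf_dark(gene: ReviewedGene) -> bool:
    """True when no adjudicated core function carries a specific MF term."""
    return not any(term not in GENERIC_MF for term in gene.core_function_mf_terms)

def triage_read_list(genes):
    """Return symbols to read next. The output is a read-list, not a verdict."""
    return sorted(g.symbol for g in genes if is_mf_dark(g))

# Illustrative inputs only; term lists are assumptions for the sketch.
genes = [
    ReviewedGene("CFAP300"),                # no MF term: flagged, and a real gap
    ReviewedGene("SOX9"),                   # no MF term: flagged, but a curation gap
    ReviewedGene("RAB9A", ["GO:0003924"]),  # GTPase activity: not flagged
]
read_list = triage_read_list(genes)
```

Here CFAP300 and SOX9 both land on the read-list even though only one is a true biology gap, and RAB9A is missed despite its residual sub-gap. That is the failure the paragraph above describes, and why the score stays a back-of-house aid.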

A taxonomy of function knowledge gaps

Two axes. First, what kind of ignorance (this determines who can fix it):

  1. Biology gap — nobody knows. Resolved by experiments. This is the unknome and the
    project's primary target.
  2. Curation gap — the knowledge exists in the literature but is not yet annotated, or is
    annotated too generically. Resolved by curation (much of it in-house).
  3. Ontology gap — the knowledge exists but no GO term can express it (e.g. "structural
    subunit of complex X", or a novel activity). Resolved by ontology development;
    tracked via proposed_new_terms. Worked exemplar: POLE4 below.

Most real entries are a blend (CFAP300 below is biology-dominant with an ontology shadow),
and naming the blend is the actionable part.

A useful third framing emerged from curation, cutting across the above: the residual sub-gap,
a gene whose core function is textbook-solid but which still hides one sharp, load-bearing
mechanistic hole (e.g. RAB9A's unidentified GEF/GAP, RASA1's catalysis-independent scaffolding,
atg101's WF-finger recruit). These are easy to miss precisely because the gene looks finished;
flagging them is high-value because the hole is often the rate-limiting unknown for the pathway.

Second, which GO aspect is dark — most "dark" genes are not uniformly dark:

The unit of work: anatomy of a gap entry

Each gap is a small, defensible scholarly object. Required elements:

This is strictly more than the existing suggested_questions field (which asks the right things
but cites nothing): the added value is the adjudicated boundary plus provenance for the
unknown.

Structured curation: the KnowledgeGap schema class

The anatomy above is now a first-class schema object, KnowledgeGap, so a gap can be curated
in the gene/module YAML itself rather than only as prose on this page. This mirrors the way
the Monarch dismech knowledge base promotes
unknowns to a first-class Discussion object (with kind / status / proposed_experiment),
and goes further by demanding provenance for the unknown and a biology/curation/ontology
typing.

A KnowledgeGap (see src/ai_gene_review/schema/gene_review.yaml) carries:

It can be attached at five levels — the whole gene (GeneReview.knowledge_gaps), a single
annotation (ExistingAnnotation.review.knowledge_gaps, ideal for residual sub-gaps), a
core function (CoreFunction.knowledge_gaps), a whole module (ModuleReview.knowledge_gaps),
or a single module step (ModuleNode.knowledge_gaps).
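
As a rough picture of what such an object holds, here is a Python sketch built from the elements this page names (gap statement, boundary, provenance quotes, kind/aspect/status typing). The authoritative definition is the schema file cited above. The class name `KnowledgeGapSketch`, all field names, and the `"open"` and `"molecular_function"` values are assumptions made for illustration.

```python
from dataclasses import dataclass, field

ALLOWED_KINDS = {"biology", "curation", "ontology"}  # the three kinds in the taxonomy

@dataclass
class ProvenanceQuote:
    source: str  # a PMID string or a repo file path
    quote: str   # verbatim text showing the field itself treats this as unknown

@dataclass
class KnowledgeGapSketch:
    statement: str    # what is unknown
    boundary: str     # what is firmly known
    kinds: list[str]  # dominant kind first, e.g. ["biology", "ontology"]
    aspect: str       # which GO aspect is dark (value format is an assumption)
    status: str       # lifecycle marker (value format is an assumption)
    provenance: list[ProvenanceQuote] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable reasons this entry is not yet defensible."""
        issues = []
        if not self.statement.strip():
            issues.append("missing gap statement")
        if not self.boundary.strip():
            issues.append("missing boundary")
        if not self.kinds or any(k not in ALLOWED_KINDS for k in self.kinds):
            issues.append("kinds must be a non-empty subset of biology/curation/ontology")
        if not any(p.source.strip() and p.quote.strip() for p in self.provenance):
            issues.append("no provenance quote with a source")
        return issues

# Values condensed from the CFAP300 worked example on this page.
cfap300 = KnowledgeGapSketch(
    statement="The biochemical activity of CFAP300 is unknown.",
    boundary="Required for cytoplasmic preassembly of axonemal dyneins; interacts with DNAAF2.",
    kinds=["biology", "ontology"],
    aspect="molecular_function",
    status="open",
    provenance=[ProvenanceQuote(
        source="PMID:29727692, PMID:29727693",
        quote="supporting our hypothesis that C11orf70 is a preassembly factor",
    )],
)
```

The `problems()` check encodes the one requirement that separates a gap entry from a bare question: an entry without a sourced quote is reported as incomplete.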

Worked YAML exemplars in the KB: genes/human/CFAP300/ (gene-level biology+ontology gap,
two PMID provenance quotes) and genes/human/RAB9A/ (annotation-level residual sub-gap on the
GTPase-activity review, the missing GEF/GAP).

Rendered from data: running just aggregate-knowledge-gaps (or
python scripts/aggregate_knowledge_gaps.py) walks every gene/module YAML and regenerates the
Structured Knowledge-Gap Register plus
reports/knowledge_gaps.tsv. The prose worked-entries below remain the curated narrative; the
register is their queryable, schema-backed counterpart and will grow as gaps are recorded in the
YAML.
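
The walk-and-aggregate step can be pictured with the sketch below. It is not the contents of scripts/aggregate_knowledge_gaps.py: it operates on already-parsed documents (plain dicts, as a YAML loader would return), and the `kind`, `statement`, and `existing_annotations` field names and the TSV column layout are assumptions. Only the `knowledge_gaps` key comes from this page.

```python
import csv
import io

def iter_knowledge_gaps(node, path=""):
    """Yield (location, gap) for every entry under any 'knowledge_gaps' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "knowledge_gaps" and isinstance(value, list):
                for gap in value:
                    yield child, gap
            else:
                yield from iter_knowledge_gaps(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from iter_knowledge_gaps(item, f"{path}[{i}]")

def gaps_to_tsv(reviews):
    """reviews: mapping of gene symbol -> parsed review document (a dict)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "location", "kind", "statement"])
    for gene, doc in sorted(reviews.items()):
        for location, gap in iter_knowledge_gaps(doc):
            writer.writerow([gene, location, gap.get("kind", ""), gap.get("statement", "")])
    return out.getvalue()

# Two toy documents: one gene-level gap, one annotation-level residual sub-gap.
reviews = {
    "RAB9A": {"existing_annotations": [
        {"review": {"knowledge_gaps": [
            {"kind": "biology", "statement": "Cognate GEF/GAP unidentified"}]}}]},
    "CFAP300": {"knowledge_gaps": [
        {"kind": "biology", "statement": "Biochemical activity unknown"}]},
}
tsv = gaps_to_tsv(reviews)
```

Because the walk is recursive, the same function picks up gaps attached at any of the five levels without needing to know the schema's nesting in advance.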

Worked example: the CFAP300 molecular-function gap

CFAP300 (formerly C11orf70) is a dynein axonemal assembly factor; loss-of-function causes
primary ciliary dyskinesia (PCD). It is in the KB (genes/human/CFAP300/) and is an ideal
exemplar: clear biomedical importance, a clearly established role, and a sharp, durable
mechanistic silence.

Boundary (what is firmly known):
- LOF causes PCD with combined outer + inner dynein arm (ODA+IDA) loss; CFAP300 is required
for cytoplasmic preassembly of axonemal dyneins and their IFT-dependent delivery into the
axoneme (Höben et al. 2018, AJHG, PMID:29727693;
Fassad et al. 2018, AJHG, PMID:29727692).
- Localizes mainly to cytoplasm, moves into cilia by IFT, and interacts with the preassembly
factor DNAAF2 (Höben 2018).
- Re-confirmed in 2025: LOF → no CFAP300 protein → total ODA+IDA loss (Demchenko et al. 2025,
IJMS, PMID:40806783).

Gap statement: The biochemical activity of CFAP300 is unknown. It is unresolved whether
it acts as a chaperone/co-chaperone, a scaffold, or an adaptor; what its client/substrate is;
and whether it acts at one common step upstream of both arm types or in parallel ODA- and
IDA-specific steps.

Provenance that the gap is real (the judgment):
1. Both founding papers describe the gene as "uncharacterized" and frame the mechanism as a
hypothesis ("supporting our hypothesis that C11orf70 is a preassembly factor") — they
establish requirement and localization, never a biochemical activity (PMID:29727692,
PMID:29727693).
2. Its sole domain is DUF4498 — a "domain of unknown function"
(genes/human/CFAP300/CFAP300-deep-research-falcon.md).
3. The trajectory is the strongest evidence: seven years and many cohorts on (Slavic 2019
PMID:30916986; Cypriot 2021 PMID:33715250; Russian 2024 PMID:39180133; ALI-culture 2025
PMID:40806783), every follow-up is diagnostic — confirming loss — not mechanistic. A
durable mechanistic silence despite clear motivation is the signature of a genuine gap, and
reading the arc of the literature is the only way to see it.

Type judgment: biology gap (dominant) with an ontology shadow — even "a dynein-preassembly
factor" has no adequate GO MF term, which is why the gene reads as MF-dark.

Significance: this is the assembly step whose failure removes all axonemal dynein motors —
mechanistically central to motile ciliopathy.

What would resolve it: proximity/affinity proteomics of the CFAP300–DNAAF2 module;
in vitro reconstitution of dynein assembly intermediates; structural characterization of
DUF4498.

Worked gap entries (from the seed read-list)

The following entries were curated by reading each gene's -deep-research-*.md and notes/GOA
files for the field's own statements of ignorance, in the CFAP300 format but condensed. Every
PMID cited below was independently verified against PubMed; ignorance quotes are reproduced
verbatim and attributed to the source file. Where a gene's deep-research file cited sources only
by DOI, or by PMIDs that failed verification, the gap is provenance-anchored to the file path
rather than to an unverified PMID (see the KCTD14 caution).

swrD (BACSU) — ~71-aa swarming-motility factor

mxaC (METEA) — methanol-oxidation auxiliary VWA protein

TRAPPC12 (human) — moonlighting TRAPP subunit

AGR3 (human) — noncanonical PDI-family protein

tam10 (SCHPO) — meiotic sequence orphan

P3R3URF (human) — uORF microprotein (non-canonical-ORF dark matter)

KCTD14 (human) — least-studied KCTD-family BTB protein

C18orf21 (human) — ORF-named gene, a closing gap

Worked gap entries — second batch (read-list deepening)

Curating the un-vetted leads surfaced a useful third category alongside the wholly-dark genes:
the residual sub-gap — a gene whose core function is textbook-solid, but which still hides a
sharp, specific mechanistic hole (RAB9A's missing GEF/GAP, RASA1's catalysis-independent
scaffolding, atg101's WF-finger recruit). These matter because a heavily annotated gene can look
finished while a load-bearing mechanism is undetermined — exactly the failure mode the project's
core principle warns about. All eight leads adjudicated to real gaps (none were spurious);
every PMID below was PubMed-verified, and one mis-attributed citation was caught and dropped.

MTC7 (yeast / S. cerevisiae) — telomere-capping sequence orphan

RAB9A (human) — known Rab, unknown switch (residual sub-gap)

RASA1 (human) — catalysis solved, scaffolding unsolved (residual sub-gap)

BAIAP2L2 (human) — Pinkbar, dark in its native tissue

SCGB1C1 (human) — orphan secretoglobin

FGFRL1 (human) — a receptor that signals without a kinase

atg101 (SCHPO) — known subunit, unknown recruit (residual sub-gap)

irg-1 (worm) — famous reporter, mystery protein

Worked gap entries — third batch (cluster + subunit cases)

This batch closes out the taxonomy: it adds the project's first worked ontology-dominant entry
(POLE4 — the structural-subunit pattern that the core-principle table estimates at ~64% of
apparent MF-darkness), a consolidated cluster gap (the M. extorquens lanthanophore accessory
genes), and three more residual sub-gaps. With these, all three gap kinds (biology, curation,
ontology) and the residual-sub-gap framing have curated exemplars.

M. extorquens methanol/lanthanide cluster — uncharacterized accessory genes (consolidated)

MAP7D1 (human) — paralog-specific mechanism unknown (residual sub-gap)

POLE4 (human) — a structural subunit GO can't describe (ontology-dominant)

AP3B2 (human) — neuronal adaptor, unknown cargo (residual sub-gap)

atg2 (SCHPO) — known lipid bridge, unresolved directionality (residual sub-gap)

EryCIV / EryCV (S. erythraea) — desosamine-pathway enzymes with unresolved catalytic identity (curation+experiment gap)

From the BGC erythromycin curation (terms/erythromycin_biosynthesis/). These are not unknown
genes — both are clearly TDP-D-desosamine biosynthesis enzymes of the ery cluster — but the
precise catalytic reaction of each is undetermined, and the existing annotations actively
conflict.

Methodology

For each candidate gene:

  1. Deep literature search — use cached publications, the gene's -deep-research-*.md,
    PubMed, and full text where available. Read primarily for the boundary and for the field's
    own statements of ignorance.
  2. Harvest ignorance signals — author hedges ("remains unknown / unclear / to be
    determined"), DUF / "uncharacterized" / hypothetical-protein naming, orphan-activity vs
    orphan-gene framing, and the diagnostic-not-mechanistic trajectory.
  3. Judge the type — separate biology gaps from curation gaps from ontology gaps. This is
    the irreducible human/curatorial step.
  4. Write the gap entry with full provenance.
  5. Route it — experiment (suggested_experiments), ontology (proposed_new_terms), or
    curation (a normal review action).
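
Step 5 is mechanical once step 3 has been done by a curator. A small sketch of the routing, assuming a gap's kinds are recorded dominant-first: the destination names come from the list above, while `GapKind`, `ROUTES`, and `route` are hypothetical names.

```python
from enum import Enum

class GapKind(Enum):
    BIOLOGY = "biology"
    CURATION = "curation"
    ONTOLOGY = "ontology"

# Destination for each kind, as listed in step 5 of the methodology.
ROUTES = {
    GapKind.BIOLOGY: "suggested_experiments",
    GapKind.ONTOLOGY: "proposed_new_terms",
    GapKind.CURATION: "review action",
}

def route(kinds):
    """Map a (possibly blended) gap to its destinations, dominant kind first."""
    destinations = []
    for kind in kinds:
        dest = ROUTES[kind]
        if dest not in destinations:  # keep order, drop repeats
            destinations.append(dest)
    return destinations

# CFAP300 is described on this page as biology-dominant with an ontology shadow.
cfap300_routes = route([GapKind.BIOLOGY, GapKind.ONTOLOGY])
```

A blended entry therefore produces more than one work item, which matches the point that naming the blend is the actionable part.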

Selection of which genes to read is curatorial, in the spirit of PomBase's "priority
unstudied genes" determination (Wood 2019) — informed, but not decided, by the triage
read-list.

Reviewing deep-research output

Deep-research -deep-research-{provider}.md files are LLM-generated and are leads, not sources.
Curating this project surfaced a consistent set of failure modes; the table below is grounded in
actual errors caught during this work, because the cost of not catching them is a curated artifact
that cites a real-looking paper that says something else.

| Failure mode | Why it's dangerous | Examples caught here |
|---|---|---|
| Mismatched-but-real PMID (well-formed, resolves, wrong paper) | Passes format validation; only a content check catches it | RASA1 "36323259" (→ILC paper); KCTD14 "30929316/36362138" (→SGLT/xylanase); AP3B2 "7545544/19116307" (→nNOS/ADAMTS13); Edwards/Oprea/Kustatscher in the intro |
| Citation by DOI / PMC / citekey only (no PMID) | Not verifiable by PMID tooling; easy to copy a wrong one | falcon files broadly; mxaC (Zhou 2025), tam10, C18orf21, MTC7 (Addinall) |
| Over-extrapolation (real PMID, but it doesn't support the claim) | The citation "checks out" yet the claim is unsupported | RAB9A: DENND2-as-GEF cited to a general DENN-domain paper on other Rabs |
| Unlocatable citation (claimed paper not findable at all) | Can't be confirmed or refuted | KCTD14's reported CUL3-non-binding "2024 paper" / PMC10856315 |
| Cross-provider disagreement | One file's confident claim contradicts another's | tam10 domain content; C18orf21 "characterized" vs "uncharacterized" |

Mechanisms, layered from automated to manual:

  1. Automated PMID resolution + title-match (the missing layer). Today
    src/ai_gene_review/tools/validate_pmid_references.py checks only PMID format and that cited
    PMIDs appear in the review YAML — it cannot catch a real PMID pointing at the wrong paper. The
    high-leverage addition: for each PMID in a deep-research file, fetch metadata (reusing the
    refresh_pmid_titles / publication-cache machinery) and compare the fetched title/first-author/
    year against the text adjacent to the citation; flag low-similarity matches and emit a per-file
    citation-reliability score. This directly catches the dominant failure mode.
  2. Ground claims against authoritative repo data, not prose. Each gene folder already ships
    non-LLM ground truth — -goa.tsv (GO terms + evidence codes) and -uniprot.txt. Any claim about
    annotation/function state should be checked against these, and a gap's provenance should prefer a
    checkable repo fact (e.g. "GOA molecular_function is ND / IEA-only") over an LLM hedge. (This is
    exactly how KCTD14 and AP3B2 were re-sourced above.)
  3. Cross-provider corroboration. Treat a claim asserted independently by ≥2 providers as
    stronger; treat a single-provider mechanistic claim as a lead; route disagreements to a human.
  4. Provider-reliability tracking. Log citation-accuracy per provider over time (with a
    small-sample caveat) so curators can weight accordingly — e.g. in this small sample the
    cyberian files produced mismatched PMIDs twice (KCTD14, AP3B2), and falcon tends to cite by
    DOI rather than PMID.
  5. Promotion rule (human-in-the-loop). Nothing enters a curated artifact (review YAML, project
    page) unless its citation is independently PubMed-verified or anchored to a checkable repo
    fact. Verbatim LLM hedges are starting points, never the provenance of record.
  6. Quote-and-source discipline. Require a verbatim supporting quote plus a locatable source for
    each promoted claim; a claim with no findable quote+source is, by construction, unverified.
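
Mechanism 3 reduces to a small grouping rule once claims have been extracted and keyed. The sketch below assumes that hard part is already done: each claim arrives as a `(provider, claim_key, value)` tuple with a normalized key. The provider names and claim keys here are placeholders, not a record of which file said what.

```python
from collections import defaultdict

def corroborate(claims):
    """claims: iterable of (provider, claim_key, value) tuples.

    Returns {claim_key: "corroborated" | "lead" | "disagreement"}.
    """
    by_key = defaultdict(lambda: defaultdict(set))
    for provider, key, value in claims:
        by_key[key][value].add(provider)

    verdicts = {}
    for key, values in by_key.items():
        if len(values) > 1:
            verdicts[key] = "disagreement"   # conflicting values: route to a human
        elif len(next(iter(values.values()))) >= 2:
            verdicts[key] = "corroborated"   # same value from two or more providers
        else:
            verdicts[key] = "lead"           # single-provider claim
    return verdicts

claims = [
    ("provider_a", "C18orf21:status", "characterized"),
    ("provider_b", "C18orf21:status", "uncharacterized"),
    ("provider_a", "geneX:localization", "cytoplasm"),
    ("provider_b", "geneX:localization", "cytoplasm"),
    ("provider_a", "geneY:partner", "proteinZ"),
]
verdicts = corroborate(claims)
```

Even a "corroborated" verdict is only a stronger lead: providers can repeat the same error, so the promotion rule in mechanism 5 still applies.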

Recommendation: implement mechanism 1 as a validate-deep-research check (or an opt-in,
network-gated mode of the existing tool), reporting a reliability score and a list of
mismatched/unresolvable PMIDs per file. Mechanisms 2, 5 and 6 are process rules already adopted in
this project; 3 and 4 fall out for free once 1 produces structured per-file output.

Relation to existing KB machinery

The building blocks already exist and are reused, not replaced:

| Signal | Existing field | Gap role |
|---|---|---|
| Open questions | suggested_questions | Seed for biology-gap statements (to be elevated with provenance) |
| Proposed experiments | suggested_experiments | The "what would resolve it" |
| Missing GO terms | proposed_new_terms | Ontology-gap entries |
| Unresolvable annotation | action: UNDECIDED | Often a curation gap (esp. literature inaccessible) |
| protein binding avoidance rule | curation guideline | Flags MF-mechanism gaps |

Schema direction (implemented): this has now been elevated into a provenance-bearing
knowledge_gaps structure (gap statement + boundary + verbatim provenance quotes + kind/aspect/
status typing) so gaps are first-class and queryable; see the Structured curation section above
for the KnowledgeGap class. The project first proved the unit and method on the curated examples
above; the schema element now makes them machine-tractable.

Prioritization (curatorial, not computed)

When choosing what to read next, weight toward gaps that are both high-value and tractable:

Seed read-list (triage candidates — to be read, not yet curated)

From the conserved-MF-dark triage, after stripping structural-subunit and curation-completeness
false positives. These are candidates for deep reading; those marked worked (currently all 22
rows) now have gap entries above.

| Gene | Org | Status | Why interesting |
|---|---|---|---|
| CFAP300 | human | worked | Dynein preassembly; mechanism unknown; DUF4498 (full exemplar) |
| P3R3URF | human | worked | 95-aa microprotein from a uORF; non-canonical-ORF dark matter |
| tam10 | SCHPO | worked | Meiotic sequence orphan; only ISO/ISS transfers |
| AGR3 | human | worked | Noncanonical PDI-family; catalytic activity / clients unknown |
| swrD | BACSU | worked | ~71-aa swarming-motility enhancer; mechanism unknown |
| mxaC | METEA | worked | VWA auxiliary protein for methanol oxidation; mechanism unknown |
| TRAPPC12 | human | worked | Moonlighting TRAPP factor; mitotic function ill-defined |
| KCTD14 | human | worked | Least-studied KCTD BTB protein; no experimental function in GOA |
| C18orf21 | human | worked | ORF-named; a closing gap (2025 RNase MRP preprints) |
| MTC7 | yeast | worked | Telomere-capping sequence orphan; all GO aspects ND |
| RAB9A | human | worked | Residual sub-gap: known Rab, cognate GEF/GAP unidentified |
| RASA1 | human | worked | Residual sub-gap: catalysis solved, scaffolding mechanism unsolved |
| BAIAP2L2 | human | worked | Pinkbar; native epithelial function unmeasured (KO normal) |
| SCGB1C1 | human | worked | Orphan secretoglobin; ligand + receptor unidentified |
| FGFRL1 | human | worked | Kinase-dead FGFR; signaling mechanism unresolved |
| atg101 | SCHPO | worked | Residual sub-gap: WF-finger recruitment partner unknown |
| irg-1 | worm | worked | BP-known / MF-dark; IRG-1 protein activity undefined (MF ND) |
| METEA mll cluster | METEA | worked | Consolidated: lanthanophore accessory enzymes unreconstituted |
| MAP7D1 | human | worked | Residual sub-gap: paralog-specific MT-acetylation mechanism |
| POLE4 | human | worked | Ontology gap exemplar: structural Polε subunit GO can't express |
| AP3B2 | human | worked | Residual sub-gap: β3B vs β3A neuron-specific cargo undefined |
| atg2 | SCHPO | worked | Residual sub-gap: directionality of Atg2–Atg9 lipid flow |

Reproducible read-list (ignorance-signal scan)

A grep of every genes/**/*-deep-research-*.md for author hedges (remains unknown / unclear /
undetermined / elusive / uncharacterized / to be determined) and for the more precise pattern
(function|role|mechanism) ... (unknown|unclear|not) surfaced ~60 files carrying explicit ignorance
statements, a reproducible candidate pool beyond the triage table. The strongest leads from that
scan have now been curated (the second- and third-batch entries above). Remaining un-vetted leads
worth reading next, spotted in the same scan: human/PUS3, human/FGFRL1-adjacent CFAP418,
human/SOCS4/SOCS5, human/RFT1, worm/pef-1, worm/fshr-1, SCHPO/alo1, and DESVH/Q72DT1.
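
For readers who want to rerun a scan of this kind, here is an approximate Python equivalent. The exact grep expressions used are not recorded on this page, so both regular expressions are reconstructions from the description above, and the 80-character window between the two word groups is an arbitrary choice.

```python
import re
from pathlib import Path

# Author hedges, reconstructed from the list in the paragraph above.
HEDGE = re.compile(
    r"\b(remains? unknown|unclear|undetermined|elusive|uncharacterized|to be determined)\b",
    re.IGNORECASE,
)
# "function/role/mechanism ... unknown/unclear/not" within one sentence-sized window.
PRECISE = re.compile(
    r"\b(function|role|mechanism)\b[^.\n]{0,80}?\b(unknown|unclear|not)\b",
    re.IGNORECASE,
)

def ignorance_hits(text: str) -> list[str]:
    """Return the lines of text that match either ignorance pattern."""
    return [
        line.strip()
        for line in text.splitlines()
        if HEDGE.search(line) or PRECISE.search(line)
    ]

def scan_repo(root: str) -> dict[str, list[str]]:
    """Map each deep-research file under root to its matching lines (files with hits only)."""
    results = {}
    for path in sorted(Path(root).glob("genes/**/*-deep-research-*.md")):
        hits = ignorance_hits(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results

sample = (
    "The biochemical function of the protein is not established.\n"
    "The enzyme hydrolyses GTP.\n"
    "Its binding partner remains unknown."
)
hits = ignorance_hits(sample)
```

Matches from such a scan are candidates for reading only. A hit in an LLM-generated file says nothing about whether the underlying literature supports the hedge, which is the caution stated just below.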

Curation caution (learned here): deep-research files frequently cite sources only by DOI,
and some cite PMIDs that do not resolve to the right paper (e.g. the KCTD14 cyberian file). Every
PMID promoted into a gap entry must be verified against PubMed before use; otherwise anchor the
gap to the deep-research file path and verbatim quote.

Prior art and references

Status

Notes