Allergens Project

Allergenicity is a cross-species immunological property — IgE reactivity in
a sensitized human — not the protein's evolved molecular function. GO, and
this project's reviews, stay focused on evolved function in the source organism.
The "allergen" label is used here only as a prioritization bucket: a cohort
of proteins worth reviewing because their function is medically actionable and
often poorly understood.

Why an allergens cohort

This is not a new kind of annotation. It is a triage layer that selects genes for
ordinary function review. The motivation is concrete and clinical:

We increasingly intervene on allergens — so we should know what they do first.
Major allergens are being knocked out, neutralized, or engineered:

CRISPR knockout of the Fel d 1 genes (CH1/CH2) to make hypoallergenic cats
[PMID:35386981].
Antibody neutralization (anti–Fel d 1 in cat food; therapeutic IgG4 in
patients).
Allergen-specific immunotherapy with recombinant/peptide allergens.

Each of these abolishes or suppresses a protein. The obvious safety question —
"what physiological function do we lose?" — is exactly a gene-function-review
question. For the cat allergen Fel d 1 the honest answer is largely unknown, and
that is the single most decision-relevant finding, captured in each review's
knowledge_gaps.

Scope and unit of analysis

The reviews are per gene / UniProt entry, but allergen databases are organized
per allergen molecule (e.g. WHO/IUIS Fel d 1, which spans two genes,
FELCA/CH1 + FELCA/CH2). The project therefore needs an index/mapping layer
that bridges that one-to-many relationship — the allergen molecule is the cohort
member; the genes are the review units.

Prioritization metric

Rank candidates within the bucket on intervention pressure × function
uncertainty:

Intervention pressure — is the protein being drugged, knocked out, or
engineered? (IgE prevalence, epitope load, "therapeutic target" flag.)
Function uncertainty — how confidently is the evolved function known? This is
the axis already modelled in the Function Knowledge Gaps
project.

Allergens score high on both, which is why they are a natural first cohort. The
per-gene deliverable is "here is the evolved function, or here is the
sharply-bounded gap" — not "is it an allergen" (WHO/IUIS already says so).

flowchart TD A[WHO/IUIS registry
= cohort membership] --> B{Intervention pressure?
IEDB epitopes, IgE %, drug/KO/engineering} B -->|high| C{Evolved function known?} B -->|low| Z[Deprioritize] C -->|confident| D[Document function
→ safe-to-target call] C -->|gap| E[Curate + record knowledge gap
→ flag risk of unknown loss] D --> F[Gene reviews
per UniProt entry] E --> F

External databases

These are triage inputs and cross-references only — they never change a GO
annotation. Build ETL only for sources with genuine APIs; for the rest, parse
downloads (do not write code that just points at a website).

Facet	Database	Provides	Access
Identity / nomenclature	WHO/IUIS Allergen Nomenclature	Official allergen name, source taxon, MW, route, isoallergens	Download tables (no real API)
Aggregation	Allergome	Per-molecule records, isoforms, refs (already in UniProt DR lines)	Limited/legacy
Epitopes	IEDB	T-cell & B-cell/IgE epitopes mapped to sequence, assays, MHC	REST/Query API + exports
Structure + IgE epitopes + cross-reactivity	SDAP	Structures, IgE epitopes, FAO/WHO similarity tools	Web tools + downloads
Sequence allergenicity / cross-reactivity	AllergenOnline, COMPARE	Curated allergen FASTA sets	Registered download → local BLAST
Structure / fold	PDB, AlphaFoldDB	3D structure	API (via UniProt)

Anchors: WHO/IUIS (defines the cohort) + IEDB (rich, API-accessible epitope
data). Layer SDAP / AllergenOnline later for cross-reactivity. Treat Allergome as a
cross-reference resolver since UniProt already links it.

These fit the repo's existing patterns: cache-per-record (like publications/,
reactome/), an index TSV (like gocams/index.tsv) keying allergen → UniProt →
review, and SSSOM mapping sets (like the ARO→GO / RHEA→GO projects) for
allergen→UniProt links. Allergen-specific metadata (route of exposure, IgE
prevalence, epitopes) would live in a schema side-block, not in GO — note that
UniProt keyword KW-0020 Allergen exists but there is no clean "allergen activity"
GO molecular function, which is precisely why the side-channel is needed.

Case study: the secretoglobin allergens

The first genes curated under this lens are all secretoglobins, which makes a
sharp point: the family shares a "PLA2 modulation + hydrophobic-ligand binding +
immunomodulation" theme, yet no family member has a confirmed endogenous
function — including the best-studied one.

Gene	Allergen / protein	Evolved-function status (this project)
FELCA/CH1	Fel d 1 chain 1 (major cat allergen)	Ca²⁺ binding (structural); LPS binding → TLR4/TLR2 enhancement; steroid/fatty-acid (pheromone?) binding. Native cat role unknown.
FELCA/CH2	Fel d 1 chain 2 (glycosylated chain)	Same complex-level activities; contributes Ca²⁺-coordinating residues. Native cat role unknown.
mouse/Scgb1a1	Uteroglobin / CC10 / CC16 (family prototype)	Potent phospholipase A2 inhibitor; phospholipid/PCB binding; suppresses Th2 cytokines (IL-4/5/13) via GATA-3 mRNA destabilization. Best-characterized member — yet its primary physiological role is still debated.

The contrast is instructive for the "know before you knock out" thesis: even
uteroglobin, the archetype with decades of study, is annotated mainly through
PLA2 inhibition and Th2 suppression rather than a settled endogenous purpose;
the uteroglobin knockout mouse shows inflammation/cancer susceptibility, so
"harmless to remove" is not a safe default for Fel d 1 either.

The cat Fel d 1 reviews additionally drew on FutureHouse Falcon deep research,
which surfaced the experimentally-grounded LPS-binding / TLR4-enhancement activity
(Herre et al. 2013, [PMID:23878318]) and ligand-binding data
([PMID:34026578]); all such claims were verified against the cited primary
literature before annotation.

Allergen → UniProt index

The cohort membership and the molecule→gene bridge are maintained as a generated
TSV, ALLERGENS/allergen_index.tsv, built by
ALLERGENS/build_allergen_index.py:

uv run python projects/ALLERGENS/build_allergen_index.py \
    -o projects/ALLERGENS/allergen_index.tsv

The builder is deliberately download-honest: it derives membership from the
already-cached UniProt records (the Allergen= name, the Allergen keyword, and
Allergome cross-references) and joins them to the local reviews. It does not
call or fake a WHO/IUIS API; to fold in the official registry, drop its downloaded
table into the folder and extend the merge step. Re-running picks up every allergen
gene present under genes/, so the index grows automatically as the cohort expands.

Columns: allergen_molecule (the WHO/IUIS unit), allergome_id, source_taxon_id,
species_code, gene_symbol, uniprot, uniprot_allergen_name, review_path,
review_status, n_core_functions, n_knowledge_gaps, function_gap_flagged,
iedb_epitopes, iedb_has_ige (the last two merged from the IEDB ETL below).

Membership is detected from the cached UniProt records either by the reviewed
Allergen keyword/Allergen= name or by an Allergome cross-reference (so
unreviewed TrEMBL allergens such as Fel d 7 and Fel d 8 are included). The index
currently holds 32 genes across 31 allergen molecules — the cat, dog, horse,
cow, mouse and rat danders, the headline mite (Der p 1/2/23) and birch (Bet v 1/2)
allergens, and allergens already in the repo for other reasons (e.g. human GBA1,
INS, GLA; yeast SOD2).

Cross-cohort priority (function gap × IEDB load)

With both axes populated, the index ranks the whole curated cohort. The highest-value
review targets are allergens that are both heavily IgE-targeted and of
uncertain evolved function:

allergen	IEDB epitopes (IgE)	function gap	note
Bet v 1	450 (IgE+)	yes	PR-10 promiscuous ligand carrier; true in-planta ligand unresolved
Fel d 1	127 (IgE+)	yes	secretoglobin; native cat role unknown
Can f 1	83 (IgE+)	yes	tear-lipocalin homolog; specific ligand unknown
Der p 23	8 (IgE+)	yes	peritrophin domain but does not bind chitin
Der p 1	347 (IgE+)	no	characterized cysteine protease
Der p 2	210 (IgE+)	no	NPC2/MD-2-mimic auto-adjuvant (TLR4)

The deprioritized rows (Der p 1, Der p 2) carry the largest epitope loads yet have
well-defined functions — exactly the "high data, low uncertainty" quadrant the metric
is meant to filter out. Conversely Bet v 1 rises to the top: the single
most-IgE-targeted allergen in the set whose physiological function is still unresolved.

The index now carries both axes of the prioritization metric: function_gap_flagged
(uncertainty) and the IEDB epitope counts (iedb_epitopes, iedb_has_ige —
intervention pressure; see below). The complete domestic-cat set, ranked by the two
axes together:

allergen molecule	genes (UniProt)	family	function gap?	IEDB epitopes (IgE)	priority
Fel d 1	CH1 (P30438) + CH2 (P30440)	secretoglobin	yes — native role unknown	127 (IgE+)	highest
Fel d 7	Feld7 (E5D2Z5)	lipocalin	yes — specific ligand unknown	14	high
Fel d 8	Feld8 (F6K0R4)	BPI/LBP/PLUNC	yes — ligand family-inferred	0	medium (gap, low data)
Fel d 4	Feld4 (Q5VFH6)	lipocalin	no — pheromone carrier	14 (IgE+)	low (characterized)
Fel d 3	CSTA (Q8WNR9)	cystatin	no — cysteine-protease inhibitor	6	low (characterized)
Fel d 2	ALB (P49064)	serum albumin	no — multi-ligand carrier	4	low (characterized)

(Fel d 5/6 are cat immunoglobulins, out of scope.) The ranking falls out cleanly:
Fel d 1 tops it — heavily IgE-targeted (127 epitopes) and of unknown native
function — the textbook "know before you knock out" case. The two least-characterized
members (Fel d 7, Fel d 8, both unreviewed TrEMBL) also carry gaps, while the
well-understood Fel d 2/3/4 families are deprioritized despite real epitope load.
mouse/Scgb1a1 is intentionally absent — it is the secretoglobin comparator, not a
registered allergen, so it does not appear in the membership-derived index.

Registry coverage and fetch worklist

WHO/IUIS publishes no stable API, but UniProt's Allergen keyword (KW-0020) is
a curated, API-accessible proxy: each reviewed allergen entry carries its WHO/IUIS
designation inline in its protein names as (allergen <name>)
(e.g. (allergen Fel d 1-A)). ALLERGENS/fetch_uniprot_allergens.py
snapshots that registry and cross-references it against genes/ to produce a
prioritizable backlog:

uv run python projects/ALLERGENS/fetch_uniprot_allergens.py

Outputs (UniProt release 2026_02):

ALLERGENS/uniprot_allergens.tsv — the registry
snapshot: 1020 reviewed allergen entries spanning 624 allergen molecules,
with accession, source organism/taxon, gene, WHO/IUIS name, molecule and Allergome id.
ALLERGENS/allergen_worklist.tsv — the 1014
registry members not yet fetched, each with a ready-to-run fetch-gene command.

Coverage of this reviewed registry: 6 / 1020 entries (Fel d 1 chains
P30438/P30440; cat Fel d 2/3/4; human thioredoxin). Note this differs from the
local index count (16) above: the registry/worklist tracks only reviewed
UniProt entries, whereas the local index also counts unreviewed TrEMBL allergens
(Fel d 7, Fel d 8) and Allergome-listed entries that lack the UniProt Allergen
keyword. This calls a real API (UniProt REST) and records the release for
provenance — it does not fabricate or fake-fetch a WHO/IUIS table.

The worklist is currently ordered by organism then allergen name; true
intervention-pressure ranking (IgE prevalence, epitope load) awaits the IEDB
epitope step. It is the backlog from which the cohort is grown by running the listed
fetch-gene commands and then reviewing each gene.

IEDB epitopes (intervention-pressure axis)

ALLERGENS/fetch_iedb_epitopes.py populates the
second axis of the metric from the IEDB IQ-API (query-api.iedb.org, a real
PostgREST API), writing ALLERGENS/iedb_epitopes.tsv
and merging epitope counts into the main index:

uv run python projects/ALLERGENS/fetch_iedb_epitopes.py
uv run python projects/ALLERGENS/build_allergen_index.py   # re-merge into the index

Per allergen molecule it records distinct epitope, B-cell-assay, T-cell-assay and
reference counts and an IgE flag (has_ige — the most allergy-relevant signal).

Join caveat (handled honestly): IEDB keys allergens under its own UniProt
accessions (Fel d 1 = UNIPROT:A0ABI7XLA3), which differ from the Swiss-Prot
accessions used here (P30438). IEDB does label them by WHO/IUIS allergen name,
so the ETL joins by allergen-molecule name within source taxon rather than by
accession. This works for WHO/IUIS-style names (Fel d 1); it does not match
allergens that IEDB labels by ordinary protein name (e.g. human self-allergens
catalogued here as Hom s …), which therefore show 0 — meaning not matched, not
necessarily no epitopes. The cat cohort matches cleanly:

allergen	IEDB epitopes	IgE	refs
Fel d 1	127	yes	26
Fel d 4	14	yes	2
Fel d 7	14	no	2
Fel d 3	6	no	3
Fel d 2	4	no	1
Fel d 8	0	no	0

The numbers track clinical reality (Fel d 1 dominates) and complete the metric:
crossing IEDB epitope load with the function-gap flag yields the priority column in
the cat table above.

Worklist in action — the dog cohort (Can f 1, 2, 3, 6). Working the worklist by
intervention pressure (rather than alphabetically) picked the dog allergens next.
The same two-axis ranking applies, and Can f 1 is the standout:

allergen	gene	family	function gap?	IEDB epitopes (IgE)	priority
Can f 1	Canf1 (O18873)	lipocalin (tear-lipocalin homolog)	yes — specific ligand unknown	83 (IgE+)	high
Can f 2	Canf2 (O18874)	lipocalin	no — odorant binding	9	low
Can f 6	Canf6 (H2B3G5)	lipocalin	no — odorant binding	5 (IgE+)	low
Can f 3	ALB (P49822)	serum albumin	no — multi-ligand carrier	0*	low

Can f 1, the dominant dog allergen, mirrors Fel d 1 exactly: heavily IgE-targeted yet
of unknown specific ligand/function — the highest-value review target. (*Can f 3 = 0
is the same name-join limitation: IEDB labels dog albumin epitopes under other names.)

Two robustness fixes were needed to cover dog: the ETL now paginates via the
PostgREST Range header (the API rejects offset; some taxa exceed 1000 antigens),
and matches the allergen designation by regex so embedded IEDB names
("Major allergen Can f 1", "Lipocalin Can f 6.0101") reduce correctly to the molecule.

Status

SCOPING. Architecture and first secretoglobin cohort drafted.
Curated: FELCA/CH1, FELCA/CH2, mouse/Scgb1a1.
Done: allergen→UniProt index (molecule↔gene bridge, now 16 genes / 15 molecules)
and a UniProt-KW-0020 registry snapshot + fetch worklist (6/1020 reviewed-registry
covered). The full domestic-cat allergen set (Fel d 1, 2, 3, 4, 7, 8) is curated.
Done: IEDB epitope ETL — both axes of the prioritization metric are now live
(function-gap flag × IEDB epitope/IgE load), realized in the cat priority ranking.
Done: worked the worklist by priority — dog cohort (Can f 1/2/3/6), mammalian
inhalants (horse Equ c 1/2/3/4, cow Bos d 2, mouse Mus m 1, rat Rat n 1), and the
headline environmental allergens (mite Der p 1/2/23, birch Bet v 1/2). 32 genes /
31 molecules now in the index with full IEDB load.
Notable findings: Der p 2 is an MD-2-mimic TLR4 auto-adjuvant (direct Fel d 1
parallel); MUP (Mus m 1 / Rat n 1) metabolic annotations are ISS over-propagation;
Bet v 1 ABA-receptor annotations are fold-based over-propagation; Der p 23 chitin
binding is correctly a negated GOA annotation.
Next: extend the IEDB name-join to protein-name-labelled allergens (human Hom s …),
then continue the backlog (other pollens, foods, molds, insects) by priority.