Behaviour Annotation Project

IN_PROGRESS PIPELINE FLAGSHIP

Species: mouse, human, rat, worm, yeast, DANRE, DROME, DAPPU

Genes: App, STAT3, nphp-1, Casp3, Drd1, CRY, lov-1, pkd-2, GCG, daf-2, trpm7, Tuba1a, Agtr1a, Mtor, Fyn

When a knockout changes how an animal behaves, the gene gets annotated to
behavior (GO:0007610) — even when its molecular function lives many causal
steps upstream. This is a textbook over-annotation scenario, and this project
characterises it across the review corpus.

Motivation

The Gene Ontology behavior branch (GO:0007610) sits at the extreme
organismal end of the biological-process hierarchy. A behaviour is an
integrated, whole-animal output — locomotion, feeding, mating, grooming,
circadian rhythm, a fear response. Almost any perturbation that reaches the
nervous system, or that compromises development, metabolism, or basic
cell biology, can shift one of these readouts.

The annotation chain that produces a behaviour term is the same convergent,
distal pattern flagged in ASSAY_TO_FUNCTION and
OVER_ANNOTATION_PATTERNS:

perturb gene G → animal behaves differently → annotate G to behaviour B

The evidence codes that dominate behaviour annotations (IMP from a mutant,
IGI from a genetic interaction) record that the phenotype is real and
reproducible, but they say nothing about how proximal the gene is to the
behaviour. A tubulin, a lysosomal peptidase, an angiotensin receptor, and a
ciliary scaffold can all earn a locomotory behavior annotation from a
phenotype assay, yet none of them is a "behaviour gene" in any mechanistic
sense. The behaviour is a downstream readout of a much more specific molecular
defect.

So the goal here is not to delete behaviour annotations wholesale. A
behaviour phenotype is legitimate evidence, and for a handful of genes
(neurotransmitter receptors, neuropeptides, circadian clock components) a
behaviour term genuinely is close to the core function. The goal is to
separate the core from the consequence: keep well-supported behaviour
annotations as non-core context, flag the distal ones as
over-annotations, and reserve REMOVE for the cases that are actually
contradicted (wrong paralog, wrong gene) rather than merely distal.

What the corpus already shows

The figures below were mined with BEHAVIOR/mine_behavior.py over every
*-goa.tsv (source annotations) and every *-ai-review.yaml (reviewer
decisions). The full tables regenerate into BEHAVIOR/reports/REPORT.md.

Source surface. Behaviour terms in the corpus GOA files are overwhelmingly
phenotype-driven: IMP + IGI account for the large majority of behaviour
annotations, with only a few IDA (direct assay) annotations. The most common
terms are the broad ones: locomotory behavior (GO:0007626) by a wide margin,
followed by behavioral response to pain, mating behavior, social behavior,
circadian behavior, and adult locomotory behavior.
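
As a rough illustration of how such an evidence-code tally could be computed, the sketch below counts evidence codes over behaviour rows in a GOA-like TSV. The column names (gene, go_id, evidence), the sample rows, and the two-item behaviour set are all assumptions made for the example; the real *-goa.tsv layout and the logic in mine_behavior.py may differ.

```python
import csv
import io
from collections import Counter

# Made-up rows with hypothetical column names; not taken from the corpus.
SAMPLE_TSV = (
    "gene\tgo_id\tevidence\n"
    "geneA\tGO:0007626\tIMP\n"
    "geneB\tGO:0007626\tIGI\n"
    "geneC\tGO:0007610\tIMP\n"
    "geneD\tGO:0007626\tIDA\n"
    "geneE\tGO:0005515\tIPI\n"
)

# Stand-in for "all GO:0007610 descendants"; a real run would take the
# full descendant set from the ontology rather than this two-item set.
BEHAVIOUR_IDS = {"GO:0007610", "GO:0007626"}


def tally_evidence(tsv_text, behaviour_ids):
    """Count evidence codes over rows whose GO id is a behaviour term."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(
        row["evidence"] for row in reader if row["go_id"] in behaviour_ids
    )


def phenotype_share(counts):
    """Fraction of behaviour annotations backed by IMP or IGI."""
    total = sum(counts.values())
    return (counts["IMP"] + counts["IGI"]) / total if total else 0.0


counts = tally_evidence(SAMPLE_TSV, BEHAVIOUR_IDS)
print(dict(counts), phenotype_share(counts))
```

On the made-up rows, three of the four behaviour annotations are IMP or IGI, so the phenotype-driven share is 0.75.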

Reviewer decisions. Of the behaviour annotations reviewers have adjudicated
as core-vs-not (146, excluding the 9 NEW proposed terms, which add rather than
downgrade), ~81% were downgraded — kept as non-core, marked as
over-annotated, or removed — and only a minority were ACCEPTed as a core
function:

Action                  Meaning for a behaviour term                                     Share
----------------------  ---------------------------------------------------------------  ---------------
KEEP_AS_NON_CORE        Real phenotype, distal to molecular function                     dominant (~60%)
ACCEPT                  Behaviour genuinely near the core (e.g. receptors, clock genes)  minority
MARK_AS_OVER_ANNOTATED  Too broad / too distal to be useful                              small
REMOVE                  Contradicted: wrong gene/paralog or not supported                small

This is exactly the signature of a benign-but-pervasive over-annotation
pattern: the annotations are mostly not wrong, but they are mostly not core.

Exemplars from completed reviews

These are real decisions already in the corpus — concrete illustrations of the
"keep as non-core, it's a downstream readout" call:

The contrast between Tuba1a/tpp1 (distal → non-core) and Agtr1a (wrong paralog →
remove) is the core curation distinction this project sharpens.

Curation guidance (working rubric)

For a behavior (GO:0007610-descendant) annotation, ask:

  1. Is it contradicted? Wrong gene, wrong paralog, or the cited evidence
    actually attributes the behaviour elsewhere → REMOVE. (Verify the
    paralog/organism before claiming this — see CLAUDE.md; do not REMOVE an
    experimental annotation just because the cached abstract foregrounds another
    gene.)
  2. Is the gene proximal to the behaviour? Neurotransmitter receptor,
    neuropeptide/hormone, ion channel, or circadian-clock component acting
    directly in the relevant circuit → behaviour may be near-core → ACCEPT
    or capture a more specific behaviour term.
  3. Is it a real but distal phenotype? (the common case — structural,
    metabolic, developmental, or ciliary gene whose knockout perturbs behaviour
    indirectly) → KEEP_AS_NON_CORE, with reason naming the proximal
    molecular defect the behaviour is downstream of.
  4. Is the term uselessly broad (adult behavior, behavior) or the
    phenotype barely connected? → MARK_AS_OVER_ANNOTATED.

The default for a phenotype-driven behaviour annotation on a
molecular/structural gene is KEEP_AS_NON_CORE, not REMOVE: the phenotype
is genuine data about the gene, just not its core function.
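
The rubric can be read as a small decision function. The sketch below is one possible encoding, not the project's own implementation: the BehaviourAnnotation fields are hypothetical curator judgements, and question 4 is checked before falling through to the KEEP_AS_NON_CORE default, which is an assumption about precedence that the rubric does not spell out.

```python
from dataclasses import dataclass


@dataclass
class BehaviourAnnotation:
    # Illustrative curator judgements; these are not corpus fields.
    contradicted: bool = False      # wrong gene/paralog, or evidence points elsewhere
    proximal: bool = False          # receptor, neuropeptide, channel, clock component
    too_broad: bool = False         # e.g. "behavior", "adult behavior"
    barely_connected: bool = False  # phenotype only loosely tied to the gene


def triage(ann):
    """Apply the rubric questions and return a review action."""
    if ann.contradicted:                        # question 1
        return "REMOVE"
    if ann.proximal:                            # question 2
        return "ACCEPT"
    if ann.too_broad or ann.barely_connected:   # question 4
        return "MARK_AS_OVER_ANNOTATED"
    return "KEEP_AS_NON_CORE"                   # question 3, the default


print(triage(BehaviourAnnotation()))
print(triage(BehaviourAnnotation(contradicted=True, proximal=True)))
```

An annotation with no flags set lands on KEEP_AS_NON_CORE, matching the stated default, and a contradicted annotation is removed even if the gene would otherwise count as proximal.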

Reproducing the analysis

uv run python projects/BEHAVIOR/mine_behavior.py \
    --genes-dir genes --out-dir projects/BEHAVIOR/reports

Outputs (regenerated, not hand-edited):

Spot-check of the ACCEPTed annotations

Applying the rubric to every behaviour annotation that a reviewer had ACCEPTed
as a core function sorts them cleanly into genuinely-proximal cases and missed
downgrades.

Genuinely proximal — ACCEPT upheld:

Missed downgrades — corrected to KEEP_AS_NON_CORE:

This moved 9 annotations from core to non-core, raising the downgrade rate among
adjudicated behaviour annotations from ~81% to 87% (127 of 146; now only 19
ACCEPTed as core). Borderline cases left as-is (documented, not changed): daf-2
feeding/eating (the pleiotropic insulin receptor — feeding is one of many
outputs) and trpm7 swimming (a channel-kinase whose swimming phenotype is
plausibly a distal developmental consequence) — defensible either way and not
clear-cut enough to overturn.
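
The before and after rates follow from the three counts quoted above. The short check below reproduces the arithmetic, assuming the 9 moved annotations were the only decisions that changed; the variable names are illustrative.

```python
ADJUDICATED = 146      # behaviour annotations adjudicated core-vs-not
ACCEPTED_AFTER = 19    # still ACCEPTed as core after the spot-check
MOVED = 9              # ACCEPT -> KEEP_AS_NON_CORE in the spot-check

accepted_before = ACCEPTED_AFTER + MOVED             # 28
downgraded_before = ADJUDICATED - accepted_before    # 118
downgraded_after = ADJUDICATED - ACCEPTED_AFTER      # 127

rate_before = downgraded_before / ADJUDICATED
rate_after = downgraded_after / ADJUDICATED
print(f"{rate_before:.1%} -> {rate_after:.1%}")  # 80.8% -> 87.0%
```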

ASSAY_TO_FUNCTION frames over-annotation risk on two
axes — proximity (does the readout measure the gene product's own molecular
activity, or a downstream cellular consequence?) and convergence (is the
readout a specific signature of process P, or a hub that many inputs feed into?).
A whole-animal behaviour is the extreme case on both axes, a maximally
phenotypic, high-convergence readout: it integrates the entire nervous system
plus development, metabolism and basic cell biology, so almost any perturbation
can move it. That is exactly why ~87% of adjudicated behaviour annotations are
downgraded.

Behaviour has now been added as a first-class readout in that project's catalogue
(BEHAVIORAL_ASSAY in readout_catalog.yaml), with the test names (Morris Water
Maze, open field, rotarod, fear conditioning, …) as match patterns. The Casp3
swimming behavior case above is the emblematic failure mode: the assay modality
is mistaken for the gene's function. The Morris Water Maze is a swimming-based
test of spatial memory; a gene merely measured in it (caspase-3, as an apoptosis
marker) gets mis-annotated to swimming behavior.
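
To make the idea of test names as match patterns concrete, here is a minimal sketch of detecting assay names in evidence text. The four assay names come from the text; the regular expressions, the ASSAY_PATTERNS name, and the find_assays helper are illustrative assumptions and are not copied from readout_catalog.yaml.

```python
import re

# Illustrative patterns only; the real catalogue entries may differ.
ASSAY_PATTERNS = {
    "Morris Water Maze": re.compile(r"\bmorris\s+water\s+maze\b", re.I),
    "open field": re.compile(r"\bopen[\s-]+field\b", re.I),
    "rotarod": re.compile(r"\brota[\s-]?rod\b", re.I),
    "fear conditioning": re.compile(r"\bfear[\s-]+conditioning\b", re.I),
}


def find_assays(evidence_text):
    """Return the assay names mentioned in a piece of evidence text."""
    return [
        name
        for name, pattern in ASSAY_PATTERNS.items()
        if pattern.search(evidence_text)
    ]


text = "Spatial memory was assessed in the Morris water maze and open-field test."
print(find_assays(text))  # ['Morris Water Maze', 'open field']
```

A hit only says which assay produced the readout. Whether the annotated GO term is something that assay can support is a separate question.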

Standardized behavioural-assay resources

There is no single canonical ontology that maps a behavioural assay to the GO
process it licenses (which is the gap readout_catalog.yaml fills by hand).
The landscape is split across three complementary layers:

For this project's purposes the practical takeaway is that an assay (IMPReSS/OBI)
reports a phenotype (MP/NBO), which is at best weak, non-core evidence for a GO
process (NBO/GO) — and never for a molecular function. Tightening the
behaviour branch's GO↔NBO alignment, and recording which assay drove each
behaviour annotation, would let the over-annotation check run automatically.

The IMPReSS standardized assay battery is ingested under BEHAVIOR/impress/: a
reproducible pull across 5 IMPC pipelines (287 procedures → 15 canonical
behavioural/neurological assay types, including Rotarod, Hole-board, Hot Plate,
Tail Suspension, Von Frey and Sleep-Wake, which the core pipeline omits), plus a
hand-curated behavioural_assay_go_map.yaml mapping each assay to the GO
behaviour term it can support as KEEP_AS_NON_CORE (QuickGO-verified ids). That
map closes the missing assay→GO link and fences off the traps: Grip Strength
(neuromuscular, no behaviour term), Tail Suspension (no GO term for
depression-like immobility) and Auditory Brain Stem Response
(electrophysiology, a hearing term at most, not auditory behavior).

The map is wired into the over-annotation mining in two ways: a BEHAVIORAL_ASSAY
readout class in ASSAY_TO_FUNCTION/readout_catalog.yaml (generic readout↔action
cross-tab via mine_readouts.py), and a dedicated check_behaviour_assays.py that
verifies the specific GO term against the specific assay named in an
annotation's evidence. The checker independently re-derived the Casp3 swimming
behavior over-annotation (Morris Water Maze is a spatial-memory test; swimming
is only the modality), confirming that fix from the assay side.
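
The assay-side check can be pictured as a lookup of the annotated term against what the named assay is allowed to support. The sketch below is a hypothetical stand-in, not the contents of behavioural_assay_go_map.yaml or the logic of check_behaviour_assays.py: the three empty entries mirror the traps described above, while the Morris Water Maze entry and its "learning or memory" label are assumed placeholders, as are the result labels.

```python
# Hypothetical map: assay name -> GO term labels it can support as non-core.
ASSAY_SUPPORTS = {
    "Morris Water Maze": {"learning or memory"},  # assumed placeholder label
    "Grip Strength": set(),                   # neuromuscular, no behaviour term
    "Tail Suspension": set(),                 # no GO term for the readout
    "Auditory Brain Stem Response": set(),    # not "auditory behavior"
}


def check(assay, annotated_term):
    """Classify an (assay, annotated GO term label) pair against the map."""
    if assay not in ASSAY_SUPPORTS:
        return "UNKNOWN_ASSAY"
    if annotated_term in ASSAY_SUPPORTS[assay]:
        return "KEEP_AS_NON_CORE"
    return "MISMATCH"


print(check("Morris Water Maze", "swimming behavior"))  # MISMATCH
print(check("Grip Strength", "locomotory behavior"))    # MISMATCH
```

Under these assumptions, a swimming behavior annotation whose evidence names the Morris Water Maze is flagged as a mismatch, which is the shape of the Casp3 finding.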

Status & next steps