Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override
Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the local rule does not overwrite the old meaning; the pretrained prior keeps operating underneath, and its strength still shows through. We test this with a Stroop-style paradigm: a remapping rule (doctor means forest) pitted against the query word's lexical-prior distractor (hospital), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five models then locates where the override is repaired internally. Restoring three source positions that carry the redefinition (the definition subject, its new target, and the query word) almost fully recovers the effect (aggregate $R \in [0.92, 1.06]$). The repair works by protecting the contextual target rather than by silencing the prior; the distractor's probability falls whenever these positions are perturbed, but the target survives only when the redefinition is restored intact. Behavior and mechanism converge on the same channel: the prior's strength both predicts which overrides fail and marks where the causal repair lands.
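As a rough illustration of the paradigm, the sketch below builds one conflict/neutral prompt pair and computes two summary quantities. Everything here beyond the doctor/forest/hospital example is an assumption for illustration: the prompt wording, the neutral control word "table", the `interference` score (neutral minus conflict target probability), and the `recovery_ratio` normalization (a common activation-patching convention, under which values slightly above 1 are possible). The abstract does not define $R$ or the interference measure, so these may differ from the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StroopItem:
    query: str       # word being redefined, e.g. "doctor"
    target: str      # meaning assigned by the local rule, e.g. "forest"
    distractor: str  # answer favored by the pretrained prior, e.g. "hospital"
    neutral: str     # hypothetical control query with no strong prior toward the distractor


def build_prompt(item: StroopItem, conflict: bool = True) -> str:
    """Render a remapping rule followed by a query (wording is illustrative only)."""
    word = item.query if conflict else item.neutral
    return f'Rule: "{word}" means "{item.target}". What does "{word}" mean here? Answer:'


def interference(p_target_neutral: float, p_target_conflict: float) -> float:
    """Assumed score: drop in target probability when the prior competes with the rule."""
    return p_target_neutral - p_target_conflict


def recovery_ratio(clean: float, corrupted: float, patched: float) -> float:
    """Assumed normalization: fraction of the clean-vs-corrupted gap restored by patching."""
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted runs do not differ")
    return (patched - corrupted) / gap


# "table" is a made-up neutral control, not taken from the paper.
item = StroopItem("doctor", "forest", "hospital", "table")
print(build_prompt(item))
print(build_prompt(item, conflict=False))
print(recovery_ratio(clean=0.8, corrupted=0.2, patched=0.8))
```

In practice the probabilities passed to `interference` and `recovery_ratio` would come from a model's next-token distribution over the target word; they are plain floats here so the sketch runs without any model.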