Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
Pith reviewed 2026-05-21 09:17 UTC · model grok-4.3
The pith
Sparse entity-selective neurons causally support factual recall for specific entities in language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that ranking MLP neurons by their activation consistency across varied prompts about the same entity isolates entity cells whose activity is necessary and, for most entities, sufficient for factual recall. In Qwen2.5-7B base, where the causal evidence is clearest, suppressing a localized cell selectively erases recall for its matched entity, and activating a single cell recovers correct knowledge for most entities even when the entity is absent from the context. These cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns.
What carries the argument
The entity cell: a sparse, entity-selective MLP neuron, localized via activation-consistency ranking, that carries causal responsibility for retrieving facts tied to one specific entity.
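To make the ranking idea concrete, here is a minimal sketch in plain Python. It is not the paper's metric: the abstract does not give the exact consistency statistic, so the score below (mean activation divided by standard deviation across prompts) is an assumed stand-in, and the names `consistency_scores`, `rank_neurons`, and the toy activations are hypothetical.

```python
from statistics import mean, pstdev

def consistency_scores(activations):
    """activations[p][n] = activation of neuron n on prompt p (all prompts about one entity).

    Returns one score per neuron: mean / (std + eps).
    ASSUMPTION: this ratio is an illustrative stand-in, not the paper's statistic.
    """
    eps = 1e-6
    n_neurons = len(activations[0])
    scores = []
    for n in range(n_neurons):
        column = [row[n] for row in activations]
        scores.append(mean(column) / (pstdev(column) + eps))
    return scores

def rank_neurons(activations, top_k=1):
    """Indices of the top_k neurons, highest consistency score first."""
    scores = consistency_scores(activations)
    order = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return order[:top_k]

# Toy data (made up): 4 varied prompts about one entity, 3 neurons.
toy = [
    [0.1, 2.0, 0.9],
    [1.5, 2.1, 0.0],
    [0.0, 1.9, 0.2],
    [0.7, 2.0, 1.1],
]
top = rank_neurons(toy, top_k=1)  # neuron 1 fires strongly and steadily on every prompt
```

A score of this kind would also reward a neuron that fires steadily for every entity, so a real procedure presumably needs some contrast against other entities; that step is left out here.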
If this is right
- Suppressing one localized cell erases recall selectively for its matched entity while leaving others intact.
- Activating one cell recovers correct knowledge for most entities even absent from the input context.
- The same cells respond to aliases, acronyms, misspellings, and multilingual forms of the entity.
- Localized cells predominantly cluster in early layers of the models.
- Causal effects differ across model families.
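The suppression and activation interventions in the first two bullets can be sketched on a toy two-layer MLP. Everything here is hypothetical: the weights, the `clamp` argument, and the two "entities" are invented for illustration and say nothing about how the paper implements its interventions on a real transformer.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def mlp_forward(x, w_in, w_out, clamp=None):
    """Toy MLP: hidden = relu(w_in @ x), output = w_out @ hidden.

    clamp maps a hidden-neuron index to a forced activation value,
    applied after the nonlinearity (0.0 suppresses, a positive value activates).
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if clamp:
        for idx, value in clamp.items():
            hidden[idx] = value
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

# Hypothetical weights: neuron 0 carries entity A's fact, neuron 1 carries entity B's.
W_IN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [[2.0, 0.0], [0.0, 2.0]]

x_a, x_b, x_empty = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

baseline_a = mlp_forward(x_a, W_IN, W_OUT)                     # fact A is recalled
suppressed_a = mlp_forward(x_a, W_IN, W_OUT, clamp={0: 0.0})   # fact A is erased
suppressed_b = mlp_forward(x_b, W_IN, W_OUT, clamp={0: 0.0})   # fact B is untouched
activated = mlp_forward(x_empty, W_IN, W_OUT, clamp={0: 1.0})  # fact A without the entity in the input
```

In this toy setting the selectivity is built into the weights by construction; the empirical question in the paper is whether real models contain neurons that behave this way.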
Where Pith is reading between the lines
- If these cells prove general, targeted activation or suppression could allow fine-grained control over model knowledge.
- The early layer concentration points to entity processing happening before deeper integration.
- Stability through tuning suggests the cells capture core identity representations.
- Extending the localization method to other knowledge types could map broader internal structures.
Load-bearing premise
That consistent activation across varied prompts reliably identifies neurons causally responsible for factual recall rather than those merely associated with entity-related language.
What would settle it
A test where suppressing the top-ranked neuron for an entity shows no selective reduction in recall accuracy for that entity compared to others.
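One way to phrase that test as a number is a selectivity gap: the accuracy drop for the matched entity minus the mean drop for all other entities. The sketch below is an assumed formulation, not taken from the paper; the function name `selectivity_gap` and the accuracy values are hypothetical.

```python
def selectivity_gap(acc_before, acc_after, target):
    """Accuracy drop for `target` minus the mean drop over all other entities.

    A gap near zero would mean suppression is not selective for the target.
    """
    drops = {e: acc_before[e] - acc_after[e] for e in acc_before}
    others = [d for e, d in drops.items() if e != target]
    return drops[target] - sum(others) / len(others)

# Made-up recall accuracies before and after suppressing the cell matched to "A".
before = {"A": 0.9, "B": 0.8, "C": 0.85}
after = {"A": 0.1, "B": 0.8, "C": 0.8}
gap = selectivity_gap(before, after, "A")
```

Under this reading, the claim would be undermined if the gap for top-ranked neurons stayed close to zero across entities.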
Original abstract
How do language models retrieve entity-specific facts from their parameters? We investigate this question by searching for sparse, entity-selective MLP neurons - which we call entity cells, by analogy to the "grandmother cell" hypothesis in neuroscience - and testing whether they play a causal role in factual recall. We localize candidate entity cells by ranking MLP neurons for activation consistency across varied prompts about the same entity, applying this procedure across seven models on a curated subset of PopQA. In all models, localized neurons cluster predominantly in early layers, an empirical pattern not imposed by the architecture. Using Qwen2.5-7B base as a model organism, we find the clearest causal evidence: suppressing a localized cell selectively erases recall for its matched entity while leaving others intact, and activating a single cell is sufficient to recover correct knowledge for most entities - even when the entity is absent from the context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns. Causal signals vary across model families, pointing to architectural differences in how entity knowledge is organized. These findings offer concrete, interpretable access points for understanding, controlling, and correcting factual knowledge in language models, and draw a surprising empirical parallel to longstanding questions in neuroscience about sparse coding of concepts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper localizes sparse 'entity cells' (MLP neurons selective for specific entities) in language models by ranking neurons according to activation consistency across varied prompts about the same entity. The procedure is applied to seven models on a curated PopQA subset; neurons cluster in early layers across models. On Qwen2.5-7B, causal interventions demonstrate that suppressing a localized cell selectively erases recall for its matched entity while sparing others, and that activating a single cell recovers correct knowledge for most entities even when the entity is absent from context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual forms and remain stable after instruction tuning.
Significance. If the central causal claims hold, the work supplies concrete, interpretable access points for factual knowledge in LMs and a direct empirical parallel to the grandmother-cell hypothesis. The multi-model localization, stability across surface forms, and successful single-cell activation even without entity context are notable strengths; the causal interventions on Qwen2.5-7B provide falsifiable, mechanistic evidence that strengthens the contribution beyond correlational localization.
major comments (2)
- [§4] §4 (causal interventions on Qwen2.5-7B): the claim that consistency-ranked neurons are causally responsible for entity recall is load-bearing, yet the manuscript reports intervention outcomes only for these ranked neurons. No results are shown for random neurons in the same layers or for neurons ranked by alternative statistics (mean activation magnitude, prompt-specific gradients). Without these baselines it remains possible that the observed selective erasure and recovery effects are not special to the consistency metric.
- [§3] §3 (localization procedure) and Appendix on PopQA curation: the subset is described as 'curated' but exclusion criteria, number of entities retained, and any filtering on prompt difficulty or entity frequency are not specified. This affects the generalizability of the early-layer clustering pattern and the cross-model comparison.
minor comments (2)
- [Figure 3] Figure 3 and associated text: error bars or confidence intervals are not shown for the activation-consistency scores or intervention success rates; adding them would clarify the reliability of the reported patterns.
- [§2] Notation: the term 'entity cell' is introduced by analogy but the precise operational definition (e.g., exact ranking threshold, number of top neurons retained per entity) should be stated explicitly in the main text rather than only in the appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report, including the positive assessment of the work's potential significance. We address each major comment below and describe the changes we will make to the manuscript.
Point-by-point responses
- Referee: [§4] §4 (causal interventions on Qwen2.5-7B): the claim that consistency-ranked neurons are causally responsible for entity recall is load-bearing, yet the manuscript reports intervention outcomes only for these ranked neurons. No results are shown for random neurons in the same layers or for neurons ranked by alternative statistics (mean activation magnitude, prompt-specific gradients). Without these baselines it remains possible that the observed selective erasure and recovery effects are not special to the consistency metric.
  Authors: We agree that the lack of explicit baselines for random neurons and alternative ranking statistics limits the strength of the causal claim. The consistency metric is central to our localization procedure because it directly tests for neurons that respond reliably across surface variations of the same entity. Nevertheless, we recognize that additional controls are needed to demonstrate that the observed selective effects are not produced by any sufficiently active neuron. In the revised manuscript we will add intervention results for randomly selected neurons from the same layers as well as for neurons ranked by mean activation magnitude. These baselines will be reported in §4 alongside the existing results.
  Revision: yes
- Referee: [§3] §3 (localization procedure) and Appendix on PopQA curation: the subset is described as 'curated' but exclusion criteria, number of entities retained, and any filtering on prompt difficulty or entity frequency are not specified. This affects the generalizability of the early-layer clustering pattern and the cross-model comparison.
  Authors: We agree that a more explicit description of the curation process is required for readers to evaluate the generalizability of the early-layer clustering and cross-model results. In the revised manuscript we will expand the relevant paragraph in §3 and the appendix to state the exclusion criteria, the number of entities retained after curation, and any filters applied on the basis of prompt difficulty or entity frequency.
  Revision: yes
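The random-neuron baseline requested in the first major comment could take roughly the following shape: compare the ranked neuron's intervention effect against the effects of randomly drawn neurons from the same layer and report an empirical p-value. This is a sketch under stated assumptions; `effect_of` is a hypothetical callable standing in for a real suppression experiment, and the effect sizes in the toy example are invented.

```python
import random

def random_baseline_pvalue(effect_of, ranked_neuron, layer_neurons,
                           n_samples=100, seed=0):
    """Empirical p-value: share of random same-layer neurons whose effect
    is at least as large as the ranked neuron's (with add-one smoothing)."""
    rng = random.Random(seed)
    pool = [n for n in layer_neurons if n != ranked_neuron]
    sample = [rng.choice(pool) for _ in range(n_samples)]
    target = effect_of(ranked_neuron)
    hits = sum(1 for n in sample if effect_of(n) >= target)
    return (hits + 1) / (n_samples + 1)

# Toy stand-in: neuron 7 is the ranked cell with a large (made-up) effect,
# every other neuron in the layer has a small one.
def toy_effect(neuron):
    return 0.8 if neuron == 7 else 0.05

p = random_baseline_pvalue(toy_effect, ranked_neuron=7, layer_neurons=range(64))
```

The same function could be reused for the alternative rankings the referee mentions by drawing the comparison pool from neurons ranked by mean activation magnitude instead of at random.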
Circularity Check
No significant circularity; empirical interventions remain independent of localization metric
The paper defines entity cells via a ranking procedure on activation consistency across prompts for a given entity, then reports separate causal intervention results (suppression erasing recall, activation recovering knowledge) on held-out prompts and under surface-form variations. These intervention outcomes are measured directly and do not reduce by any equation or self-citation to quantities defined from the same consistency scores used for ranking. The derivation chain consists of an empirical localization step followed by independent experimental tests; no load-bearing premise collapses into a fitted parameter or prior self-citation that encodes the target result. The work is therefore self-contained against external benchmarks of factual recall.
Axiom & Free-Parameter Ledger
free parameters (1)
- activation consistency ranking threshold
axioms (1)
- domain assumption: Consistent activation across varied prompts indicates the neuron encodes entity identity rather than surface features.
invented entities (1)
- entity cell: no independent evidence