
arxiv: 2604.01404 · v2 · pith:HU7LQKBWnew · submitted 2026-04-01 · 💻 cs.CL · cs.AI

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Pith reviewed 2026-05-21 09:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords entity cells · factual recall · MLP neurons · neuron localization · language model interpretability · knowledge localization · sparse representations · grandmother cells

The pith

Sparse entity-selective neurons causally support factual recall for specific entities in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper locates sparse neurons in language model MLP layers that activate selectively for facts about particular entities. It identifies candidate cells by ranking neurons on activation consistency over many prompt variations for one entity. In intervention tests, suppressing one such cell removes the ability to recall facts about its entity while leaving other entities unaffected. Activating the cell alone restores the correct information for most entities, even when the entity name is omitted from the prompt. The cells remain effective across aliases, misspellings, and other languages, and persist after instruction tuning.

Core claim

The central discovery is that ranking MLP neurons by their activation consistency across varied prompts about the same entity isolates entity cells whose activity is necessary and sufficient for factual recall. Suppressing a localized cell selectively erases recall for its matched entity, and activating a single cell suffices to recover correct knowledge for most entities even when the entity is absent from the context. These cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, indicating they encode canonical entity identity rather than surface token patterns.
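
The ranking step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not give the paper's exact consistency statistic, so the score used here (mean activation divided by standard deviation across prompts) is a stand-in, and the names `consistency_scores`, `rank_neurons`, and the toy activations are hypothetical.

```python
import statistics

def consistency_scores(activations, eps=1e-6):
    """Score each neuron by how strongly and steadily it fires across prompts.

    activations: list of rows, one row per prompt, one value per neuron.
    The score mean / (std + eps) is an assumed stand-in for the paper's metric.
    """
    n_neurons = len(activations[0])
    scores = []
    for j in range(n_neurons):
        column = [row[j] for row in activations]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        scores.append(mean / (std + eps))
    return scores

def rank_neurons(activations):
    """Return neuron indices sorted from most to least consistent."""
    scores = consistency_scores(activations)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy data (invented): 4 prompt variations about one entity, 3 neurons.
# Neuron 0 fires strongly but erratically, neuron 1 fires steadily,
# neuron 2 stays near zero.
toy = [
    [5.0, 3.0, 0.10],
    [0.0, 3.1, -0.10],
    [6.0, 2.9, 0.00],
    [0.5, 3.0, 0.05],
]
ranking = rank_neurons(toy)
print(ranking)  # [1, 0, 2]
```

The steady neuron outranks the one with the larger peak activation, which is the behaviour a consistency-based ranking is meant to capture. In a real model the rows would be MLP activations collected at a fixed token position, a detail this sketch leaves out.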

What carries the argument

The entity cell: a sparse, entity-selective MLP neuron, localized via activation-consistency ranking, that carries causal responsibility for retrieving facts tied to one specific entity.

If this is right

  • Suppressing one localized cell erases recall selectively for its matched entity while leaving others intact.
  • Activating one cell recovers correct knowledge for most entities, even when the entity is absent from the input context.
  • The same cells respond to aliases, acronyms, misspellings, and multilingual forms of the entity.
  • Localized cells predominantly cluster in early layers of the models.
  • Causal effects differ across model families.
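
The two interventions in the list above, suppression and activation of a single neuron, can be illustrated on a toy two-layer MLP. This is a hedged sketch and not the paper's code: the function `mlp_forward`, its `multiplier` and `clamp` parameters, and the hand-set weights are all assumptions chosen so that hidden unit 0 acts as the cell for a hypothetical entity A.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def mlp_forward(x, w_in, w_out, neuron=None, multiplier=1.0, clamp=None):
    """Toy two-layer MLP with an optional single-neuron intervention.

    neuron:     index of the hidden unit to intervene on (None = no change)
    multiplier: factor applied to that unit's activation (0.0 suppresses it)
    clamp:      if given, overrides the activation with a fixed value
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if neuron is not None:
        hidden[neuron] = clamp if clamp is not None else hidden[neuron] * multiplier
    n_out = len(w_out[0])
    return [sum(hidden[h] * w_out[h][o] for h in range(len(hidden)))
            for o in range(n_out)]

# Hand-set weights (invented): input i drives hidden unit i,
# and hidden unit i drives the "fact about entity i" output.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]

baseline = mlp_forward([1.0, 1.0], w_in, w_out)
suppressed = mlp_forward([1.0, 1.0], w_in, w_out, neuron=0, multiplier=0.0)
injected = mlp_forward([0.0, 0.0], w_in, w_out, neuron=0, clamp=1.0)

print(baseline)    # [1.0, 1.0]  both facts available
print(suppressed)  # [0.0, 1.0]  entity A erased, entity B intact
print(injected)    # [1.0, 0.0]  entity A recovered with no entity in the input
```

In a transformer the same idea would normally be applied through a forward hook on the MLP activation. The hook mechanics depend on the framework and are deliberately omitted here.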

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If these cells prove general, targeted activation or suppression could allow fine-grained control over model knowledge.
  • The early layer concentration points to entity processing happening before deeper integration.
  • Stability through tuning suggests the cells capture core identity representations.
  • Extending the localization method to other knowledge types could map broader internal structures.

Load-bearing premise

That consistent activation across varied prompts reliably identifies neurons causally responsible for factual recall rather than those merely associated with entity-related language.

What would settle it

A test where suppressing the top-ranked neuron for an entity shows no selective reduction in recall accuracy for that entity compared to others.
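
One way to make this test concrete is to compare the accuracy drop on the target entity with the average drop on control entities. The sketch below is an assumption-laden illustration: the helper names, the `margin` of 0.2, and every accuracy value are invented for the example and are not results from the paper.

```python
def selective_drop(acc_before, acc_after, target):
    """Return (target_drop, mean_control_drop) in recall accuracy."""
    target_drop = acc_before[target] - acc_after[target]
    controls = [e for e in acc_before if e != target]
    control_drop = sum(acc_before[e] - acc_after[e] for e in controls) / len(controls)
    return target_drop, control_drop

def is_selective(acc_before, acc_after, target, margin=0.2):
    """True if the target loses at least `margin` more accuracy than controls.

    The margin is an arbitrary illustrative threshold, not a value from the paper.
    """
    target_drop, control_drop = selective_drop(acc_before, acc_after, target)
    return (target_drop - control_drop) >= margin

# Invented accuracies, for illustration only.
before = {"London": 0.9, "Paris": 0.9, "Tokyo": 0.8}
after_selective = {"London": 0.1, "Paris": 0.85, "Tokyo": 0.8}
after_global = {"London": 0.6, "Paris": 0.6, "Tokyo": 0.5}

print(is_selective(before, after_selective, "London"))  # True
print(is_selective(before, after_global, "London"))     # False
```

The second case is the falsifying pattern: recall falls for every entity by a similar amount, so the suppressed neuron shows no entity-specific effect.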

Figures

Figures reproduced from arXiv: 2604.01404 by Dan Barzilay, Itay Yona, Michael Karasik, Mor Geva.

Figure 1: We identify sparse, entity-selective MLP neurons, termed …
Figure 2: Layer of the top localized cell for each …
Figure 3: Entity-specific amnesia under negative abla…
Figure 5: Variant robustness for “Barack Obama”: most …
Figure 6: Acronym robustness (FBI): variants localize …
Figure 8: Factual modification via latent steering …
Figure 11: Qwen2.5-7B-Instruct replication of …
Figure 10: Qwen2.5-7B-Instruct replication of Figure 3. Negative ablation of the localized entity cell again causes a strong drop for the target entity while leaving the control entity comparatively stable, supporting preservation of the same causal pattern after post-training.
[Adjacent plot, text only: "Entity Injection", Qwen2.5-7B-Instruct; y-axis: Pass@5 Accuracy; bar labels: Entity Present, Mean Entity, Correct Cell, Wrong Cell.]
Figure 13: Qwen2.5-7B-Instruct replication of …
Figure 15: Qwen3-8B within-family replication of Figure 2. As in the main-paper Qwen2.5 result, top localized cells remain concentrated in early layers, suggesting that the coarse localization pattern persists within the Qwen family.
[Adjacent plot, text only: "Entity-Specific Amnesia via Neuron Ablation", Qwen/Qwen3-8B; x-axis: Multiplier for Neuron 3037 (Layer 0); y-axis: Relative Knowledge Score; series: Target (London), Control (Par…]
Figure 14: Qwen2.5-7B-Instruct replication of …
Figure 19: Qwen3-8B within-family replication of Figure 6. The acronym and expanded form still localize to closely aligned cells, supporting partial preservation of the entity-cell map within the Qwen family.
[Adjacent plot, text only: panel "Paris (Latin)", relative stability of top cells: L2-10547 1.00, L2-9535 0.12, L2-6175 0.09, L2-6539 0.08, L2-1543 0.07, L2-7963 0.07. A second panel (title truncated) lists L35-6856, L4-2573, L2-12223, L2-10547, L35-2212, L0-7861 with values 1.00, 0.58, …]
Figure 20: Qwen3-8B within-family replication of Figure 7. Cross-script forms continue to recover similar top cells, though the pattern is noisier than in Qwen2.5.
[Adjacent table, truncated in the source:]
  Entity   Layer  Neuron  Notes
  Obama    0      2883    localized (injection weak in this subset)
  Trump    3      9290    localized; injection success
  Paris    2      10547   localized
  London   0      3037    localized; strong ablation drop
  Beijing  4      5431    localized; strong ablation drop
  Tokyo    3      223     lo…
Figure 22: Controlled injection on trustworthy cells …
Figure 21: [caption not captured; adjacent text reads:] Figure 21 provides a localization-only comparison across four non-Qwen model families. Relative to the dedicated Qwen-family plots and to the main-paper localization result in …
Figure 24: OLMo-7B cross-family replication of Figure 3. Negative ablation reduces the target-entity curve, but the control curve is also affected, so this is not the same clean entity-specific amnesia pattern seen in Qwen2.5. This suggests that OLMo’s localized cells may participate in retrieval differently, or less selectively, than the Qwen-family cells.
Original abstract

How do language models retrieve entity-specific facts from their parameters? We investigate this question by searching for sparse, entity-selective MLP neurons - which we call entity cells, by analogy to the "grandmother cell" hypothesis in neuroscience - and testing whether they play a causal role in factual recall. We localize candidate entity cells by ranking MLP neurons for activation consistency across varied prompts about the same entity, applying this procedure across seven models on a curated subset of PopQA. In all models, localized neurons cluster predominantly in early layers, an empirical pattern not imposed by the architecture. Using Qwen2.5-7B base as a model organism, we find the clearest causal evidence: suppressing a localized cell selectively erases recall for its matched entity while leaving others intact, and activating a single cell is sufficient to recover correct knowledge for most entities - even when the entity is absent from the context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns. Causal signals vary across model families, pointing to architectural differences in how entity knowledge is organized. These findings offer concrete, interpretable access points for understanding, controlling, and correcting factual knowledge in language models, and draw a surprising empirical parallel to longstanding questions in neuroscience about sparse coding of concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper localizes sparse 'entity cells' (MLP neurons selective for specific entities) in language models by ranking neurons according to activation consistency across varied prompts about the same entity. The procedure is applied to seven models on a curated PopQA subset; neurons cluster in early layers across models. On Qwen2.5-7B, causal interventions demonstrate that suppressing a localized cell selectively erases recall for its matched entity while sparing others, and that activating a single cell recovers correct knowledge for most entities even when the entity is absent from context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual forms and remain stable after instruction tuning.

Significance. If the central causal claims hold, the work supplies concrete, interpretable access points for factual knowledge in LMs and a direct empirical parallel to the grandmother-cell hypothesis. The multi-model localization, stability across surface forms, and successful single-cell activation even without entity context are notable strengths; the causal interventions on Qwen2.5-7B provide falsifiable, mechanistic evidence that strengthens the contribution beyond correlational localization.

major comments (2)
  1. [§4] §4 (causal interventions on Qwen2.5-7B): the claim that consistency-ranked neurons are causally responsible for entity recall is load-bearing, yet the manuscript reports intervention outcomes only for these ranked neurons. No results are shown for random neurons in the same layers or for neurons ranked by alternative statistics (mean activation magnitude, prompt-specific gradients). Without these baselines it remains possible that the observed selective erasure and recovery effects are not special to the consistency metric.
  2. [§3] §3 (localization procedure) and Appendix on PopQA curation: the subset is described as 'curated' but exclusion criteria, number of entities retained, and any filtering on prompt difficulty or entity frequency are not specified. This affects the generalizability of the early-layer clustering pattern and the cross-model comparison.
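
The random-neuron baseline requested in major comment 1 could take roughly the following form. This is a sketch under assumptions, not a description of the authors' planned analysis: `effect_of` stands for any caller-supplied function that measures the recall drop after suppressing a given neuron, and `random_baseline_pvalue`, `toy_effect`, and the numbers inside it are hypothetical.

```python
import random

def random_baseline_pvalue(effect_of, ranked_neuron, layer_size,
                           n_samples=200, seed=0):
    """Empirical p-value: how often a random same-layer neuron matches
    or exceeds the intervention effect of the ranked neuron.

    effect_of: callable mapping a neuron index to a measured effect size
               (hypothetical; in practice this would run the ablation).
    """
    rng = random.Random(seed)
    candidates = [j for j in range(layer_size) if j != ranked_neuron]
    sampled = rng.sample(candidates, min(n_samples, len(candidates)))
    observed = effect_of(ranked_neuron)
    at_least = sum(1 for j in sampled if effect_of(j) >= observed)
    # Add-one correction keeps the estimate away from an exact zero.
    return (at_least + 1) / (len(sampled) + 1)

def toy_effect(neuron):
    # Invented numbers: neuron 7 plays the role of the consistency-ranked cell.
    return 0.8 if neuron == 7 else 0.01 * (neuron % 5)

p = random_baseline_pvalue(toy_effect, ranked_neuron=7, layer_size=100)
print(p)  # 0.01
```

The same function could be reused for the other baselines the referee names by swapping the random sample for neurons ranked by mean activation magnitude.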
minor comments (2)
  1. [Figure 3] Figure 3 and associated text: error bars or confidence intervals are not shown for the activation-consistency scores or intervention success rates; adding them would clarify the reliability of the reported patterns.
  2. [§2] Notation: the term 'entity cell' is introduced by analogy but the precise operational definition (e.g., exact ranking threshold, number of top neurons retained per entity) should be stated explicitly in the main text rather than only in the appendix.
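
For the confidence intervals requested in minor comment 1, a percentile bootstrap over per-entity intervention outcomes is one simple option. The sketch below is illustrative only: the function name `bootstrap_ci`, the resample count, and the outcome vector are assumptions, and the outcomes are invented rather than taken from the paper.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes.

    outcomes: list of 1 (intervention succeeded) or 0 (failed), one per entity.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented outcomes: 14 successes out of 20 entities.
outcomes = [1] * 14 + [0] * 6
lower, upper = bootstrap_ci(outcomes)
print(f"success rate {sum(outcomes) / len(outcomes):.2f}, "
      f"95% CI [{lower:.2f}, {upper:.2f}]")
```

With only a handful of entities the interval will be wide, which is itself useful information when comparing intervention success across model families.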

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report, including the positive assessment of the work's potential significance. We address each major comment below and describe the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (causal interventions on Qwen2.5-7B): the claim that consistency-ranked neurons are causally responsible for entity recall is load-bearing, yet the manuscript reports intervention outcomes only for these ranked neurons. No results are shown for random neurons in the same layers or for neurons ranked by alternative statistics (mean activation magnitude, prompt-specific gradients). Without these baselines it remains possible that the observed selective erasure and recovery effects are not special to the consistency metric.

    Authors: We agree that the lack of explicit baselines for random neurons and alternative ranking statistics limits the strength of the causal claim. The consistency metric is central to our localization procedure because it directly tests for neurons that respond reliably across surface variations of the same entity. Nevertheless, we recognize that additional controls are needed to demonstrate that the observed selective effects are not produced by any sufficiently active neuron. In the revised manuscript we will add intervention results for randomly selected neurons from the same layers as well as for neurons ranked by mean activation magnitude. These baselines will be reported in §4 alongside the existing results. revision: yes

  2. Referee: [§3] §3 (localization procedure) and Appendix on PopQA curation: the subset is described as 'curated' but exclusion criteria, number of entities retained, and any filtering on prompt difficulty or entity frequency are not specified. This affects the generalizability of the early-layer clustering pattern and the cross-model comparison.

    Authors: We agree that a more explicit description of the curation process is required for readers to evaluate the generalizability of the early-layer clustering and cross-model results. In the revised manuscript we will expand the relevant paragraph in §3 and the appendix to state the exclusion criteria, the number of entities retained after curation, and any filters applied on the basis of prompt difficulty or entity frequency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical interventions remain independent of localization metric

Full rationale

The paper defines entity cells via a ranking procedure on activation consistency across prompts for a given entity, then reports separate causal intervention results (suppression erasing recall, activation recovering knowledge) on held-out prompts and under surface-form variations. These intervention outcomes are measured directly and do not reduce by any equation or self-citation to quantities defined from the same consistency scores used for ranking. The derivation chain consists of an empirical localization step followed by independent experimental tests; no load-bearing premise collapses into a fitted parameter or prior self-citation that encodes the target result. The work is therefore self-contained against external benchmarks of factual recall.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the empirical procedure for identifying entity cells and the assumption that activation consistency plus causal intervention effects demonstrate encoding of canonical entity identity. No free parameters are explicitly fitted in the abstract summary, but neuron ranking thresholds and prompt curation choices function as implicit selection parameters.

free parameters (1)
  • activation consistency ranking threshold
    Used to select candidate entity cells from MLP neurons; exact cutoff not specified in abstract.
axioms (1)
  • domain assumption: Consistent activation across varied prompts indicates the neuron encodes entity identity rather than surface features.
    Invoked when interpreting the localization results as evidence for canonical entity cells.
invented entities (1)
  • entity cell (no independent evidence)
    purpose: Sparse MLP neuron hypothesized to encode a specific entity for factual recall.
    New term introduced by analogy to grandmother cell hypothesis; no independent evidence outside the localization and intervention experiments.

pith-pipeline@v0.9.0 · 5783 in / 1337 out tokens · 40319 ms · 2026-05-21T09:17:09.043618+00:00 · methodology

