Exemplar Partitioning for Mechanistic Interpretability
Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3
The pith
Exemplar Partitioning builds activation dictionaries for language models as Voronoi regions around observed exemplars, supporting interpretability and causal interventions with far less data than sparse autoencoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exemplar Partitioning forms dictionaries as Voronoi partitions of activation space through leader clustering of streamed activations at a chosen distance threshold, with each region anchored by an observed exemplar that acts simultaneously as membership criterion and intervention vector. In Gemma-2-2B, dictionary regions are interpretable, refusal concentrates in one region whose exemplar ablation collapses held-out refusal, and cross-checkpoint matching distinguishes directions preserved through instruction tuning from those added by it. EP one-hot probes retain about 97 percent of raw-activation probe accuracy at ℓ0 = 1 and reach mean AUROC 0.881 on AxBench at p1, outperforming the
What carries the argument
The exemplar: an observed activation vector that anchors a Voronoi cell, serving as both the geometric center for assigning new activations to the region and the direction vector for causal interventions.
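The construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes Euclidean distance (the paper's metric is not stated here), and the function names `build_ep_dictionary` and `assign_region` are hypothetical.

```python
import numpy as np

def build_ep_dictionary(stream, threshold):
    """Leader clustering: an activation becomes a new exemplar when it lies
    farther than `threshold` from every exemplar kept so far."""
    exemplars = []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if not exemplars:
            exemplars.append(x)  # the first activation always leads a region
            continue
        dists = np.linalg.norm(np.stack(exemplars) - x, axis=1)
        if dists.min() > threshold:  # assumption: Euclidean distance
            exemplars.append(x)
    return np.stack(exemplars)

def assign_region(exemplars, x):
    """Voronoi assignment: index of the nearest exemplar and the distance to it."""
    dists = np.linalg.norm(exemplars - np.asarray(x, dtype=float), axis=1)
    idx = int(dists.argmin())
    return idx, float(dists[idx])

# Toy stream with two well-separated groups of points.
acts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
exemplars = build_ep_dictionary(acts, threshold=1.0)
region, distance = assign_region(exemplars, [4.8, 5.2])
```

Note that the dictionary size (two exemplars here) is never passed in; it falls out of the data and the threshold, which is the property the summary emphasises.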
If this is right
- Dictionaries built from the same data stream become directly comparable across layers, models, and training checkpoints without additional alignment steps.
- Refusal behavior can be localized to a specific region and removed via exemplar ablation while leaving other behaviors intact.
- Approximately 20 percent of EP regions match SAE features at F1 greater than 0.5, indicating partial overlap in how the two methods decompose activation space.
- Nearest-exemplar distance at inference time supplies an immediate out-of-distribution signal without extra machinery.
- EP one-hot probes achieve nearly the accuracy of full raw activations at sparsity level 1 and exceed the canonical GemmaScope SAE on AxBench.
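Two of the consequences listed above, one-hot probe features and the nearest-exemplar out-of-distribution signal, both come from the same nearest-exemplar lookup. The sketch below assumes Euclidean distance and uses made-up toy vectors; the helper names and the cutoff of 1.0 are illustrative stand-ins, not values from the paper.

```python
import numpy as np

def nearest_exemplar(exemplars, acts):
    """Return (region index, distance to nearest exemplar) for each activation row."""
    # Pairwise Euclidean distances, shape (n_acts, n_exemplars).
    d = np.linalg.norm(acts[:, None, :] - exemplars[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(acts)), idx]

def one_hot_features(exemplars, acts):
    """Encode each activation as a one-hot vector over regions (sparsity 1)."""
    idx, _ = nearest_exemplar(exemplars, acts)
    feats = np.zeros((len(acts), len(exemplars)))
    feats[np.arange(len(acts)), idx] = 1.0
    return feats

exemplars = np.array([[0.0, 0.0], [5.0, 5.0]])
acts = np.array([[0.2, -0.1], [4.9, 5.3], [40.0, -30.0]])

feats = one_hot_features(exemplars, acts)   # input to a linear probe
_, dist = nearest_exemplar(exemplars, acts)
ood_flags = dist > 1.0                      # 1.0 stands in for the build threshold
```

The third toy activation is still assigned to a region, as every point in a Voronoi partition is, but its distance is far above the cutoff, which is what makes the distance usable as a flag.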
Where Pith is reading between the lines
- The fixed-threshold construction could allow systematic tracking of how individual regions persist or split across successive training checkpoints.
- The same leader-clustering procedure might be applied to activations from non-transformer architectures to test whether similar geometric partitions emerge.
- Because exemplars remain tied to concrete data points, the method could support direct editing of model behavior by swapping or perturbing specific observed vectors rather than learned directions.
- Extending the approach to multiple distance thresholds in parallel might capture both coarse and fine-grained features within a single dictionary.
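Tracking regions across checkpoints presupposes some rule for deciding that two exemplars are "the same". One plausible rule, offered here only as an assumption, is nearest-neighbour cosine similarity with a cutoff; the paper's actual matching criterion is not given in this summary, and `match_exemplars` and `min_cos` are hypothetical names.

```python
import numpy as np

def match_exemplars(base, tuned, min_cos=0.9):
    """For each tuned-model exemplar, find the most cosine-similar base exemplar.

    Returns the index of the best base match, its cosine similarity, and a
    boolean mask that is True where the similarity reaches `min_cos`.
    """
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    t = tuned / np.linalg.norm(tuned, axis=1, keepdims=True)
    cos = t @ b.T                                # shape (n_tuned, n_base)
    best = cos.argmax(axis=1)
    best_cos = cos[np.arange(len(tuned)), best]
    return best, best_cos, best_cos >= min_cos

# Toy exemplars: the first tuned vector is close to a base vector, the second is new.
base = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tuned = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 2.0]])
best, best_cos, preserved = match_exemplars(base, tuned)
```

Under this rule, exemplars with `preserved == False` would be candidates for directions introduced by tuning, though a real analysis would need to check that the cutoff is not doing the work.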
Load-bearing premise
A single fixed distance threshold for leader clustering on streamed activations produces regions that are both geometrically coherent and causally meaningful for model behavior without post-hoc adjustment to observed results.
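A threshold rule that is fixed before any downstream result is inspected could look like the sketch below: the median nearest-neighbour distance on a pilot sample, multiplied by a constant. This mirrors the kind of rule the simulated rebuttal proposes, but the function name, the `scale` parameter, and the toy pilot points are all assumptions for illustration.

```python
import numpy as np

def median_nn_threshold(pilot, scale=1.0):
    """Median distance from each pilot activation to its nearest neighbour,
    multiplied by a constant `scale` chosen in advance."""
    d = np.linalg.norm(pilot[:, None, :] - pilot[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbour
    return scale * float(np.median(d.min(axis=1)))

# Toy pilot sample; nearest-neighbour distances are 1, 1, 2 and sqrt(34).
pilot = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
threshold = median_nn_threshold(pilot, scale=2.0)
```

The full pairwise distance matrix is quadratic in the pilot size, so this form only suits a small pilot sample.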
What would settle it
Ablating the exemplar direction of the identified refusal region in a fresh set of held-out refusal prompts and measuring whether refusal rates drop substantially would directly test the causal claim.
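One common reading of "ablating a direction" is projecting it out of every activation, so that the component along the exemplar becomes zero. The sketch below shows only that arithmetic, on toy vectors; whether the paper ablates in exactly this way is an assumption, and the name `ablate_direction` is hypothetical.

```python
import numpy as np

def ablate_direction(acts, exemplar):
    """Remove the component of each activation along the exemplar direction:
    x' = x - (x . u) u, where u is the unit vector along the exemplar."""
    unit = exemplar / np.linalg.norm(exemplar)
    return acts - np.outer(acts @ unit, unit)

exemplar = np.array([3.0, 4.0, 0.0])
acts = np.array([[1.0, 2.0, 3.0],
                 [6.0, 8.0, 0.0]])   # second row is parallel to the exemplar
ablated = ablate_direction(acts, exemplar)
```

In the test described above, this projection would be applied to the model's activations at the chosen layer during generation, and refusal rates on the held-out prompts would be compared with and without it.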
Original abstract
We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim$20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at $\sim 10^3\times$ less build compute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from LLM activations. EP builds a Voronoi partition of activation space via leader-clustering of streamed activations within a single distance threshold; each region is anchored by an observed exemplar that serves as both membership criterion and intervention direction. Dictionary size emerges from the activation geometry rather than being prespecified. The work demonstrates that EP regions in Gemma-2-2B are interpretable, support causal interventions (refusal concentrates in one region whose exemplar ablation collapses held-out refusal), enable cross-checkpoint matching between base and instruction-tuned models, and achieve competitive performance on AxBench (mean AUROC 0.881 at p1, +0.126 over the canonical GemmaScope SAE) while using ~10^3× fewer tokens than comparable SAEs. EP one-hot probes retain ~97% of raw-activation probe accuracy at ℓ0=1, and nearest-exemplar distance supplies a free OOD signal.
Significance. If the central claims hold, EP offers a computationally lightweight, directly comparable alternative to SAEs for mechanistic interpretability. Strengths include the use of observed exemplars (enabling cross-model and cross-checkpoint matching without learned parameters), explicit causal intervention results on refusal behavior, and a head-to-head numeric benchmark against GemmaScope SAEs on AxBench. These properties could lower the barrier to building and validating feature dictionaries at scale.
major comments (2)
- [Methods (threshold selection procedure)] The distance threshold for leader clustering is stated to be 'determined by the activation geometry,' yet no a-priori selection rule, sensitivity analysis, or independent geometric criterion (e.g., target region count, median nearest-neighbor distance, or cross-validated reconstruction error) is supplied. This choice is load-bearing for the unsupervised claim and for the reported causal concentration of refusal behavior; without an explicit protocol it remains possible that the threshold was adjusted after inspecting interpretability or intervention outcomes.
- [AxBench evaluation (results paragraph)] The AxBench benchmark reports mean AUROC 0.881 for EP at p1 (+0.126 over the canonical GemmaScope SAE) but provides no details on statistical significance testing, controls for multiple comparisons, or variance across runs. Because the threshold itself is the sole free parameter, these omissions make it difficult to assess whether the performance margin is robust.
minor comments (2)
- [Abstract and AxBench section] Define the precise meaning of 'p1' and its relation to the distance threshold used for the reported AUROC.
- [Comparison with GemmaScope SAE] Clarify whether the ~20% overlap (F1 > 0.5) between EP regions and SAE features is computed on the same activation stream used to build both dictionaries or on held-out data.
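The overlap statistic in question compares two binary signals over the same tokens: whether a token falls in a given EP region, and whether a given SAE feature fires on it. A minimal sketch of that F1 computation follows; binarising the SAE feature as "activation greater than zero" is an assumption, and the token masks are invented toy data.

```python
import numpy as np

def f1_binary(a, b):
    """F1 between two boolean masks over the same tokens: 2*TP / (|a| + |b|)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    tp = np.sum(a & b)
    denom = a.sum() + b.sum()
    return float(2.0 * tp / denom) if denom else 0.0

# Toy masks over eight tokens.
region_member = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=bool)  # token in EP region
sae_active = np.array([1, 1, 0, 0, 0, 0, 1, 1], dtype=bool)     # SAE activation > 0
score = f1_binary(region_member, sae_active)
```

Whichever token set these masks are computed on (the build stream or held-out data) determines how the resulting overlap figure should be read, which is the point of the referee's question.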
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of Exemplar Partitioning as a lightweight, comparable alternative to SAEs. We address the two major comments below and will incorporate revisions to improve methodological transparency and statistical reporting.
point-by-point responses
- Referee: [Methods (threshold selection procedure)] The distance threshold for leader clustering is stated to be 'determined by the activation geometry,' yet no a-priori selection rule, sensitivity analysis, or independent geometric criterion (e.g., target region count, median nearest-neighbor distance, or cross-validated reconstruction error) is supplied. This choice is load-bearing for the unsupervised claim and for the reported causal concentration of refusal behavior; without an explicit protocol it remains possible that the threshold was adjusted after inspecting interpretability or intervention outcomes.
  Authors: We agree that the manuscript does not currently supply an explicit a-priori protocol, which leaves open the possibility of post-hoc adjustment and weakens the unsupervised framing. The threshold in the reported experiments was chosen on the basis of a small pilot sample to yield a dictionary size on the order of typical SAE feature counts while respecting the observed nearest-neighbor distance distribution; however, this heuristic was not documented. In revision we will add a dedicated subsection that (i) states a reproducible geometric rule (median nearest-neighbor distance computed on a fixed pilot stream, scaled by a constant factor), (ii) reports a sensitivity sweep over a range of thresholds bracketing the chosen value, and (iii) shows that the refusal-region concentration and AxBench AUROC remain qualitatively stable across that range. These additions will make the selection procedure transparent and demonstrate that the central claims do not hinge on a single tuned threshold.
  Revision: yes
- Referee: [AxBench evaluation (results paragraph)] The AxBench benchmark reports mean AUROC 0.881 for EP at p1 (+0.126 over the canonical GemmaScope SAE) but provides no details on statistical significance testing, controls for multiple comparisons, or variance across runs. Because the threshold itself is the sole free parameter, these omissions make it difficult to assess whether the performance margin is robust.
  Authors: We concur that the AxBench paragraph lacks the statistical detail needed to evaluate robustness. Because the distance threshold is the only free parameter, we will augment the results with (i) AUROC means and standard deviations obtained by constructing EP dictionaries on three independent activation streams drawn from the same data distribution, (ii) bootstrap confidence intervals for the reported mean AUROC, and (iii) a paired statistical test (with multiplicity correction if additional metrics are added) against the GemmaScope baseline. These quantities will be included in a revised results table and accompanying text, allowing readers to judge whether the observed margin is reliable given the single hyperparameter.
  Revision: yes
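The statistical additions promised in the second response can be illustrated with a percentile bootstrap for a mean and a paired sign-flip permutation test. The sketch below is one possible way to do this, not the authors' analysis: the per-concept scores are random placeholders generated in the code, not results from the paper, and both function names are hypothetical.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def paired_sign_flip_pvalue(a, b, n_perm=5000, seed=0):
    """Two-sided paired permutation test: randomly flip the sign of each
    per-item difference and compare the mean difference to the observed one."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diff)))
    null = np.abs((signs * diff).mean(axis=1))
    return float((np.sum(null >= observed) + 1) / (n_perm + 1))

# Placeholder per-concept scores for two methods (synthetic, for illustration only).
rng = np.random.default_rng(42)
scores_a = rng.uniform(0.7, 1.0, size=50)
scores_b = scores_a - rng.uniform(0.0, 0.2, size=50)

ci_low, ci_high = bootstrap_mean_ci(scores_a)
p_value = paired_sign_flip_pvalue(scores_a, scores_b)
```

Resampling over concepts captures variation across benchmark items only; variation across independently built dictionaries, which the response also mentions, would need the separate-stream runs.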
Circularity Check
No significant circularity; construction and evaluation remain independent
full rationale
The EP method builds Voronoi regions directly from streamed activations via leader clustering at a single distance threshold whose value is stated to be fixed by activation geometry rather than by any downstream interpretability or causal metric. Reported results (refusal ablation on held-out prompts, AxBench AUROC, probe accuracy retention, cross-checkpoint matching) are measured on separate data or external benchmarks and do not reduce, by any equation or self-citation in the provided text, to quantities fitted on the same activations used to select exemplars or the threshold. No load-bearing step equates a claimed prediction to its own input by construction, and the single distance threshold is presented as an a-priori geometric choice rather than a post-selected hyperparameter.
Axiom & Free-Parameter Ledger
free parameters (1)
- distance threshold
axioms (1)
- domain assumption: Leader clustering on streamed activations yields regions whose exemplars serve as both membership criteria and causally relevant intervention directions.