
arxiv: 2605.14347 · v2 · pith:2SQRZQWVnew · submitted 2026-05-14 · 💻 cs.LG

Exemplar Partitioning for Mechanistic Interpretability

Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords exemplar partitioning · mechanistic interpretability · sparse autoencoders · activation space partitioning · causal interventions · voronoi regions · llm feature dictionaries · refusal behavior

The pith

Exemplar Partitioning builds activation dictionaries for language models as Voronoi regions around observed exemplars, supporting interpretability and causal interventions with far less data than sparse autoencoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Exemplar Partitioning creates feature dictionaries from LLM activations by streaming data and applying leader clustering within a fixed distance threshold to form a Voronoi partition. Each region is defined by a real observed activation exemplar that determines membership and serves as an intervention direction. In Gemma-2-2B experiments, these regions prove interpretable and enable targeted edits, such as ablating a single region to collapse refusal behavior on held-out prompts. The construction allows direct matching of regions across model checkpoints to separate preserved and fine-tuning-induced directions. On concept detection benchmarks, the approach matches or approaches SAE performance while using roughly a thousand times fewer tokens to build the dictionary.
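
The streaming construction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the function names `build_ep_dictionary` and `assign_region` are hypothetical, and Euclidean distance is assumed because the summary does not state which metric the paper uses.

```python
import numpy as np

def build_ep_dictionary(stream, threshold):
    """Leader clustering over a stream of activation vectors.

    An activation becomes a new exemplar only if it lies farther than
    `threshold` from every exemplar kept so far (Euclidean distance is an
    assumption made for this sketch).
    """
    exemplars = []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if exemplars:
            dists = np.linalg.norm(np.stack(exemplars) - x, axis=1)
            if dists.min() <= threshold:
                continue  # covered by an existing region, nothing to add
        exemplars.append(x)
    if not exemplars:
        raise ValueError("stream was empty, no exemplars collected")
    return np.stack(exemplars)

def assign_region(exemplars, x):
    """Voronoi membership: index of the nearest exemplar."""
    dists = np.linalg.norm(exemplars - np.asarray(x, dtype=float), axis=1)
    return int(dists.argmin())
```

Note that the dictionary size is never passed in: it falls out of the threshold and the order and geometry of the stream, which is the property the summary highlights.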

Core claim

Exemplar Partitioning forms dictionaries as Voronoi partitions of activation space through leader clustering of streamed activations at a chosen distance threshold, with each region anchored by an observed exemplar that acts simultaneously as membership criterion and intervention vector. In Gemma-2-2B, dictionary regions are interpretable, refusal concentrates in one region whose exemplar ablation collapses held-out refusal, and cross-checkpoint matching distinguishes directions preserved through instruction tuning from those added by it. EP one-hot probes retain about 97 percent of raw-activation probe accuracy at ℓ0 = 1 and reach mean AUROC 0.881 on AxBench at p1, outperforming the …

What carries the argument

The exemplar: an observed activation vector that anchors a Voronoi cell, serving as both the geometric center for assigning new activations to the region and the direction vector for causal interventions.
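
To make the "intervention direction" role concrete, here is a minimal sketch of projecting an exemplar direction out of a batch of activations (Figure 4 describes the refusal experiment as a region "projected out"). The function name `ablate_direction` is hypothetical, and how the paper hooks this into the model's forward pass is not covered by this summary.

```python
import numpy as np

def ablate_direction(acts, exemplar):
    """Remove the component of each activation along the exemplar direction.

    acts:     array of shape (n, d), one activation per row
    exemplar: array of shape (d,), the region's anchoring activation
    """
    u = exemplar / np.linalg.norm(exemplar)   # unit vector along the exemplar
    return acts - np.outer(acts @ u, u)       # subtract each row's projection on u
```

After this operation every row is orthogonal to the exemplar, while components in all other directions are left unchanged.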

If this is right

  • Dictionaries built from the same data stream become directly comparable across layers, models, and training checkpoints without additional alignment steps.
  • Refusal behavior can be localized to a specific region and removed via exemplar ablation while leaving other behaviors intact.
  • Approximately 20 percent of EP regions match SAE features at F1 greater than 0.5, indicating partial overlap in how the two methods decompose activation space.
  • Nearest-exemplar distance at inference time supplies an immediate out-of-distribution signal without extra machinery.
  • EP one-hot probes achieve nearly the accuracy of full raw activations at sparsity level 1 and exceed the canonical GemmaScope SAE on AxBench.
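
Two of the points above, the one-hot code at sparsity level 1 and the nearest-exemplar out-of-distribution signal, both reduce to a single nearest-neighbour lookup. The sketch below assumes Euclidean distance and a hypothetical decision rule (flag an input when its nearest-exemplar distance exceeds a chosen cutoff); neither detail is confirmed by this summary.

```python
import numpy as np

def nearest_exemplar(exemplars, x):
    """Return (region index, distance) for the closest exemplar to x."""
    dists = np.linalg.norm(exemplars - x, axis=1)
    idx = int(dists.argmin())
    return idx, float(dists[idx])

def one_hot_code(exemplars, x):
    """Encode x as a one-hot vector over regions (exactly one active entry)."""
    idx, _ = nearest_exemplar(exemplars, x)
    code = np.zeros(len(exemplars))
    code[idx] = 1.0
    return code

def is_out_of_distribution(exemplars, x, threshold):
    """Hypothetical rule: x is OOD if no exemplar lies within `threshold`."""
    _, dist = nearest_exemplar(exemplars, x)
    return dist > threshold
```

A linear probe trained on `one_hot_code` outputs sees only the region identity, which is what makes the reported accuracy retention notable.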

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-threshold construction could allow systematic tracking of how individual regions persist or split across successive training checkpoints.
  • The same leader-clustering procedure might be applied to activations from non-transformer architectures to test whether similar geometric partitions emerge.
  • Because exemplars remain tied to concrete data points, the method could support direct editing of model behavior by swapping or perturbing specific observed vectors rather than learned directions.
  • Extending the approach to multiple distance thresholds in parallel might capture both coarse and fine-grained features within a single dictionary.
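
As a rough illustration of how regions might be tracked across checkpoints, the sketch below matches each exemplar in a tuned-model dictionary to its most similar exemplar in a base-model dictionary. Everything here is an assumption made for illustration: the name `match_dictionaries`, the use of cosine similarity, and the `min_cosine` cutoff are not taken from the paper, and both dictionaries are assumed to live in the same activation space.

```python
import numpy as np

def match_dictionaries(base, tuned, min_cosine=0.9):
    """Split tuned exemplars into 'preserved' and 'introduced'.

    base, tuned: arrays of shape (n_base, d) and (n_tuned, d)
    Returns a dict {tuned index: matched base index} and a list of
    tuned indices with no sufficiently similar base exemplar.
    """
    base_n = base / np.linalg.norm(base, axis=1, keepdims=True)
    tuned_n = tuned / np.linalg.norm(tuned, axis=1, keepdims=True)
    sims = tuned_n @ base_n.T              # cosine similarities, (n_tuned, n_base)
    best = sims.argmax(axis=1)
    best_sim = sims.max(axis=1)
    preserved = {int(i): int(best[i]) for i in np.flatnonzero(best_sim >= min_cosine)}
    introduced = [int(i) for i in np.flatnonzero(best_sim < min_cosine)]
    return preserved, introduced
```

Running the same matching between successive checkpoints would give a per-region persistence record, at the cost of one similarity matrix per checkpoint pair.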

Load-bearing premise

A single fixed distance threshold for leader clustering on streamed activations produces regions that are both geometrically coherent and causally meaningful for model behavior without post-hoc adjustment to observed results.

What would settle it

Ablating the exemplar direction of the identified refusal region on a fresh set of held-out refusal prompts and measuring whether refusal rates drop substantially would directly test the causal claim.

Figures

Figures reproduced from arXiv: 2605.14347 by Jessica Rumbelow.

Figure 1: An Exemplar Partitioning dictionary built from Gemma-2-2B L12 activations at …
Figure 2: Partition neighbourhood between two anchor partitions across three resolutions of the …
Figure 3: Per-(layer, domain) saturation under a single Pile-calibrated threshold (p …
Figure 4: Single-seed (seed 0) refusal-ablation ∆ versus calibration percentile, Gemma-2-2B-it L20, K = 1 region projected out, evaluated on the held-out n = 50 harmful set (baseline refusal 0.98). Exemplar basis (red) outperforms mean-member basis (blue) by 0.4–0.6 across the working range. Two failure modes: p = 8 fragmentation (cluster split across multiple sub-cones); p = 20 contamination (single region broader…
Figure 5: EP-region correspondence to GemmaScope canonical 16k SAE on Gemma-2-2B L12, …

Original abstract

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim$20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at $\sim 10^3\times$ less build compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from LLM activations. EP builds a Voronoi partition of activation space via leader-clustering of streamed activations within a single distance threshold; each region is anchored by an observed exemplar that serves as both membership criterion and intervention direction. Dictionary size emerges from the activation geometry rather than being prespecified. The work demonstrates that EP regions in Gemma-2-2B are interpretable, support causal interventions (refusal concentrates in one region whose exemplar ablation collapses held-out refusal), enable cross-checkpoint matching between base and instruction-tuned models, and achieve competitive performance on AxBench (mean AUROC 0.881 at p1, +0.126 over the canonical GemmaScope SAE) while using ~10^3× fewer tokens than comparable SAEs. EP one-hot probes retain ~97% of raw-activation probe accuracy at ℓ0=1, and nearest-exemplar distance supplies a free OOD signal.

Significance. If the central claims hold, EP offers a computationally lightweight, directly comparable alternative to SAEs for mechanistic interpretability. Strengths include the use of observed exemplars (enabling cross-model and cross-checkpoint matching without learned parameters), explicit causal intervention results on refusal behavior, and a head-to-head numeric benchmark against GemmaScope SAEs on AxBench. These properties could lower the barrier to building and validating feature dictionaries at scale.

major comments (2)
  1. [Methods (threshold selection procedure)] The distance threshold for leader clustering is stated to be 'determined by the activation geometry,' yet no a-priori selection rule, sensitivity analysis, or independent geometric criterion (e.g., target region count, median nearest-neighbor distance, or cross-validated reconstruction error) is supplied. This choice is load-bearing for the unsupervised claim and for the reported causal concentration of refusal behavior; without an explicit protocol it remains possible that the threshold was adjusted after inspecting interpretability or intervention outcomes.
  2. [AxBench evaluation (results paragraph)] The AxBench benchmark reports mean AUROC 0.881 for EP at p1 (+0.126 over the canonical GemmaScope SAE) but provides no details on statistical significance testing, controls for multiple comparisons, or variance across runs. Because the threshold itself is the sole free parameter, these omissions make it difficult to assess whether the performance margin is robust.
minor comments (2)
  1. [Abstract and AxBench section] Define the precise meaning of 'p1' and its relation to the distance threshold used for the reported AUROC.
  2. [Comparison with GemmaScope SAE] Clarify whether the ~20% overlap (F1 > 0.5) between EP regions and SAE features is computed on the same activation stream used to build both dictionaries or on held-out data.
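
For readers unfamiliar with the overlap statistic in the second minor comment, the sketch below shows one way such a match rate could be computed from binary masks over a shared set of tokens. This is an assumption-laden illustration: the summary does not say how SAE activations are binarized or which tokens are used, and the names `f1_binary` and `fraction_matched` are hypothetical.

```python
import numpy as np

def f1_binary(region_member, feature_active):
    """F1 between two boolean masks over the same tokens."""
    a = np.asarray(region_member, dtype=bool)
    b = np.asarray(feature_active, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 0.0
    tp = np.sum(a & b)                  # tokens in the region AND firing the feature
    return float(2 * tp / denom)        # equivalent to 2PR / (P + R)

def fraction_matched(region_masks, feature_masks, min_f1=0.5):
    """Share of regions whose best-matching feature exceeds min_f1."""
    hits = sum(
        any(f1_binary(r, f) > min_f1 for f in feature_masks)
        for r in region_masks
    )
    return hits / len(region_masks)
```

Whether the masks come from the build stream or from held-out tokens changes what the resulting fraction means, which is the point of the referee's question.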

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of Exemplar Partitioning as a lightweight, comparable alternative to SAEs. We address the two major comments below and will incorporate revisions to improve methodological transparency and statistical reporting.

  1. Referee: [Methods (threshold selection procedure)] The distance threshold for leader clustering is stated to be 'determined by the activation geometry,' yet no a-priori selection rule, sensitivity analysis, or independent geometric criterion (e.g., target region count, median nearest-neighbor distance, or cross-validated reconstruction error) is supplied. This choice is load-bearing for the unsupervised claim and for the reported causal concentration of refusal behavior; without an explicit protocol it remains possible that the threshold was adjusted after inspecting interpretability or intervention outcomes.

    Authors: We agree that the manuscript does not currently supply an explicit a-priori protocol, which leaves open the possibility of post-hoc adjustment and weakens the unsupervised framing. The threshold in the reported experiments was chosen on the basis of a small pilot sample to yield a dictionary size on the order of typical SAE feature counts while respecting the observed nearest-neighbor distance distribution; however, this heuristic was not documented. In revision we will add a dedicated subsection that (i) states a reproducible geometric rule (median nearest-neighbor distance computed on a fixed pilot stream, scaled by a constant factor), (ii) reports a sensitivity sweep over a range of thresholds bracketing the chosen value, and (iii) shows that the refusal-region concentration and AxBench AUROC remain qualitatively stable across that range. These additions will make the selection procedure transparent and demonstrate that the central claims do not hinge on a single tuned threshold.

    Revision: yes
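
The geometric rule named in this response (median nearest-neighbor distance on a pilot stream, scaled by a constant) could look roughly like the sketch below. The function name `median_nn_threshold`, the default `scale`, and the use of Euclidean distance are assumptions; the rebuttal does not give the constant or the metric.

```python
import numpy as np

def median_nn_threshold(pilot, scale=1.0):
    """Threshold = scale * median distance from each pilot point to its nearest neighbour.

    pilot: array of shape (n, d) with n >= 2 pilot activations.
    Builds the full n x n distance matrix, so it is only suitable for
    small pilot samples.
    """
    pilot = np.asarray(pilot, dtype=float)
    diffs = pilot[:, None, :] - pilot[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)          # ignore each point's distance to itself
    return scale * float(np.median(dists.min(axis=1)))
```

A sensitivity sweep of the kind promised in the response would then rebuild the dictionary for several values of `scale` around the chosen one and compare the downstream results.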

  2. Referee: [AxBench evaluation (results paragraph)] The AxBench benchmark reports mean AUROC 0.881 for EP at p1 (+0.126 over the canonical GemmaScope SAE) but provides no details on statistical significance testing, controls for multiple comparisons, or variance across runs. Because the threshold itself is the sole free parameter, these omissions make it difficult to assess whether the performance margin is robust.

    Authors: We concur that the AxBench paragraph lacks the statistical detail needed to evaluate robustness. Because the distance threshold is the only free parameter, we will augment the results with (i) AUROC means and standard deviations obtained by constructing EP dictionaries on three independent activation streams drawn from the same data distribution, (ii) bootstrap confidence intervals for the reported mean AUROC, and (iii) a paired statistical test (with multiplicity correction if additional metrics are added) against the GemmaScope baseline. These quantities will be included in a revised results table and accompanying text, allowing readers to judge whether the observed margin is reliable given the single hyperparameter.

    Revision: yes
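
A percentile bootstrap for a mean AUROC, as promised in item (ii), can be sketched as follows. The input is assumed to be a list of per-concept AUROC values; the function name `bootstrap_mean_ci` and the defaults for `n_boot`, `alpha`, and `seed` are illustrative choices, not details from the paper or the rebuttal.

```python
import numpy as np

def bootstrap_mean_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample concepts with replacement, n_boot times.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(scores.mean()), float(lower), float(upper)
```

This interval captures variation across concepts only; variation across independently built dictionaries, item (i) in the response, would need a separate outer loop over activation streams.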

Circularity Check

0 steps flagged

No significant circularity; construction and evaluation remain independent


The EP method builds Voronoi regions directly from streamed activations via leader clustering at a single distance threshold whose value is stated to be fixed by activation geometry rather than by any downstream interpretability or causal metric. Reported results (refusal ablation on held-out prompts, AxBench AUROC, probe accuracy retention, cross-checkpoint matching) are measured on separate data or external benchmarks and do not reduce, by any equation or self-citation in the provided text, to quantities fitted on the same activations used to select exemplars or the threshold. No load-bearing step equates a claimed prediction to its own input by construction, and the single distance threshold is presented as an a-priori geometric choice rather than a post-selected hyperparameter.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on geometric assumptions about activation space and the utility of a single distance threshold; no new physical entities are postulated.

free parameters (1)
  • distance threshold
    Controls region size and dictionary cardinality; chosen to produce useful partitions but its selection procedure is not detailed in the abstract.
axioms (1)
  • domain assumption: Leader clustering on streamed activations yields regions whose exemplars serve as both membership criteria and causally relevant intervention directions.
    Invoked when claiming that exemplar ablation collapses refusal behavior.

pith-pipeline@v0.9.0 · 5867 in / 1405 out tokens · 63949 ms · 2026-05-20T20:47:02.225337+00:00 · methodology

