Structuring Sparsity: Block-Sparse Featurizers Capture Visual Concept Manifolds

Aiden Swann; Atticus Geiger; Can Rager; Curt Tigges; Daniel Wurgaft; Dron Hazra; Ekdeep Singh Lubana; Fenil Doshi; Jack Merullo; Lee Sharkey

arxiv: 2606.25234 · v1 · pith:WCHWDGAHnew · submitted 2026-06-23 · 💻 cs.CV

Structuring Sparsity: Block-Sparse Featurizers Capture Visual Concept Manifolds

Thomas Fel , Matthew Kowal , Mozes Jacobs , Dron Hazra , Usha Bhalla , Lee Sharkey , Lucius Bushnaq , Satchel Grant

show 17 more authors

Tal Haklay Thomas Icard Can Rager Michael Pearce Daniel Wurgaft Aiden Swann Fenil Doshi Siddharth Boppana Curt Tigges Nick Cammarata Thomas Serre Vasudev Shyam Owen Lewis Thomas McGrath Jack Merullo Ekdeep Singh Lubana Atticus Geiger

This is my paper

Pith reviewed 2026-06-25 23:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords block sparsityvisual concept manifoldsneural interpretabilitystructured sparsityactivation space geometrymanifold steeringdiffusion models

0 comments

The pith

Block-sparse featurizers recover visual concepts as low-dimensional manifolds instead of isolated directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that treating concepts as single directions in neural activations overlooks their structure as low-dimensional manifolds. Block sparsity provides a prior that groups directions into blocks, allowing representations to be modeled as sparse sums of these manifolds. This aligns with the idea from visual neuroscience that features are carried by coordinated groups of neurons. Using minimum description length analysis, the authors demonstrate that block-sparse featurizers describe the activations more compactly, with most concepts spanning two to four dimensions. They apply this to reinterpret existing detectors, find new ones, and steer generation in diffusion models.

Core claim

Block sparsity, which groups directions into blocks, is the prior matched to a generative model in which a representation is a sparse sum of low-dimensional manifolds. This is the modern, learned form of a classical idea in visual neuroscience, where a visual feature is carried by a coordinated group of neurons rather than a single tuned one. All three variants of block-sparse featurizers describe activations more compactly than direction-based featurizers, with the recovered concepts typically two- to four-dimensional.

What carries the argument

Block-sparse featurizers that enforce sparsity over blocks of directions to model low-dimensional manifolds in activation space.

If this is right

Prior curve detectors in InceptionV1 are shown to read from a single continuous curve manifold.
New manifolds for shadows and lighting are discovered in DINOv3.
Manifold steering provides interpretable control over image generation in SDXL diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that interpretability methods should shift from individual directions to structured groups.
Similar structured sparsity approaches might improve analysis in non-visual domains.
Testing other block sizes or sparsity patterns could yield even more efficient descriptions of activations.

Load-bearing premise

The minimum-description-length comparison between block-sparse and direction-based featurizers directly measures which generative model better matches the true structure of the activations.

What would settle it

If direction-based featurizers were found to yield lower minimum description lengths than block-sparse featurizers across the tested models and layers, the superiority of the block-sparse approach would be challenged.

Figures

Figures reproduced from arXiv: 2606.25234 by Aiden Swann, Atticus Geiger, Can Rager, Curt Tigges, Daniel Wurgaft, Dron Hazra, Ekdeep Singh Lubana, Fenil Doshi, Jack Merullo, Lee Sharkey, Lucius Bushnaq, Matthew Kowal, Michael Pearce, Mozes Jacobs, Nick Cammarata, Owen Lewis, Satchel Grant, Siddharth Boppana, Tal Haklay, Thomas Fel, Thomas Icard, Thomas McGrath, Thomas Serre, Usha Bhalla, Vasudev Shyam.

**Figure 1.** Figure 1: Block-sparse featurizers (BSF) capture the internal geometry of concepts. Sparse autoencoders (SAEs) are a popular method for decomposing neural representations using isolated directions as primitive atoms. On the right, we show how two SAE features activate over the image of a rabbit where darker red means a higher activation; observe that one feature picks out the rabbit’s ears and the other its face. Ye… view at source ↗

**Figure 2.** Figure 2: Activation x as a sparse sum of points drawn from a few lowdimensional regions Mi , x lives in a Minkowski sum of manifolds (Def. 1). There is a deep duality between the architecture of a featurizer and the assumed geometry of neural representations (Hindupur et al., 2025). To design a new featurizer, we first hypothesize a data-generating process (DGP) based on recent work on representation manifolds (F… view at source ↗

**Figure 3.** Figure 3: The block soft-threshold. The shrinkage operator (applied on the norm of the group) against ReLU and JumpReLU (Rajamanoharan et al., 2024). A few remarks follow. First, the penalty sees each block only through its norm ∥zg∥2 because the same point mg = zgDg can be written in any rotated frame, zgDg = (zgQ⊤)(QDg) for Q ∈ O(b), so a penalty intrinsic to the feature must be “blind” to the choice of basis, wh… view at source ↗

**Figure 4.** Figure 4: Block sparsity recovers an additive manifold superposition. A controlled instance of Def. 1: M known low-dimensional manifolds embedded in R d and summed |S| at a time (here six factors, one per row). The leftmost column is the ground-truth contribution mi ; each remaining column is the contribution recovered by a featurizer, projected into the true factor’s 3-D principal frame and colored by that frame (h… view at source ↗

**Figure 5.** Figure 5: Per-block recovery. Per-block R2 , averaged over factors. The block-sparse featurizers (purple) sit near the oracle ceiling; the two repaired variants (bottom) recover well above their original forms. The same principle repairs SMixAE and MFA. The toy also lets us probe two recent featurizers that, like ours, take a multi-dimensional unit but are built on different assumptions: SMixAE (Francel, 2026) and … view at source ↗

**Figure 6.** Figure 6: Distortion estimation and reconstruction sparsity trade-off. (top) Estimating the reconstruction fidelity each task requires. DINOv3 activations are degraded by progressive quantization and passed to a frozen linear probe; the relative task metric is plotted against the fidelity R2 (x, xq) of the corrupted features. Performance is essentially unaffected down to R2 ≈ 0.8 for classification and segmentation… view at source ↗

**Figure 7.** Figure 7: Block structure yields a lower description length than SAEs. Description length Lδ(x) (Eq. 5) for the Grassmannian BSF at the 20% distortion floor, against block dimension b, with one panel per dictionary width G and one curve per block sparsity k. Codes with b>1 describe DINOv3 activations in fewer bits than the b=1 SAE across every width and sparsity, the minimum falling at a moderate b between 2 and 4 t… view at source ↗

**Figure 8.** Figure 8: Block dimensionality stabilizes between two and four. Mean stable rank of the per-block code, against the block dimension b the featurizer was granted. Although blocks are allotted up to b = 16 coordinates, the dimension they occupy saturates near 3, indicating that DINOv3 concepts are on average between two- and four-dimensional regardless of the room made available to them [PITH_FULL_IMAGE:figures/full_… view at source ↗

**Figure 9.** Figure 9: A block carries a concept together with its intrinsic geometry. Six blocks from a Grassmann featurizer on DINOv3. For every block we overlay the patches it fires on a sample of images, coloring each patch by the first three principal coordinates of its contribution ug mapped to the red, green, and blue channels, so that hue reports where on the concept a patch lies rather than merely how strongly it is pre… view at source ↗

**Figure 10.** Figure 10: Feature visualization on one recovered concept manifold. A single Grassmannian BSF block trained on DINOv3 recovers a cabbage/cauliflower manifold. The central colored point cloud is a projection of the block’s active contributions, with hue encoding the first three principal coordinates of the block’s intrinsic geometry. White points mark locations sampled along paths through this cloud, and the images a… view at source ↗

**Figure 11.** Figure 11: A single block recovers the InceptionV1 curve manifold, and discovers higher-order Fourier modes. (a) Radial tuning curves on synthetic oriented stimuli: individual neurons (Cammarata et al., 2020) and Top-k SAE atoms (Gorton, 2024) each fire for a narrow wedge of orientations, shattering the family into many petals, whereas a single BSF block covers all orientations as one connected region. (b) Decompos… view at source ↗

**Figure 12.** Figure 12: Block structure yields class selective, spatially coherent concepts. Best single-concept detector F1 (left) and concept-map smoothness measured by total variation (center) and Dirichlet energy (right), against sparsity k on the native W=32,768 grid; across every k the block-SAEs (b>1) reach higher F1 and lower total variation and Dirichlet energy than the b=1 SAE, the gain widening with b [PITH_FULL_IMAG… view at source ↗

**Figure 13.** Figure 13: Recovered concepts track scene illumination, consistently across objects. A Grassmannian BSF on DINOv3 is probed on a series of twelve Blender models rendered while the sun sweeps in azimuth and elevation. (Top left) The block norm ∥zg∥ of the luminance block against sun azimuth, one curve per elevation in {15◦ , . . . , 90◦}; (Top middle) the ground-truth luminance over the same sweep, which the block n… view at source ↗

**Figure 14.** Figure 14: Steering the pretzel manifold uncovered in SDXL by BSF. We steer an SDXL block from the component that represents the concept of a pretzel. We prompt SDXL for an image of a pretzel and then run four steps of diffusion while fixing the pretzel block to a particular value. We visualize the points within a block with UMAP because the b = 16, but color the plot according to the principal components of the blo… view at source ↗

**Figure 15.** Figure 15: Steering blocks in layer down.2.1 of SDXL. Generations produced by setting the block’s contribution and fitting a Kohonen map to fit the natural geometry of the subspace ( [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

**Figure 16.** Figure 16: A gallery of feature visualizations on recovered concept manifolds. Representative blocks from a Grassmannian BSF trained on DINOv3, visualized with the same convention as [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: A gallery of feature visualizations on recovered concept manifolds. Follow up of Fig.16. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: Reconstruction improves monotonically with every structural parameter, for all three featurizers. Held-out R2 against block sparsity ℓ, one row per featurizer (top to bottom: Block, Grassmannian, Group Lasso), one column per dictionary width G (expansions 8× to 64×), with curves colored by block dimension k (k=1, in black, is the directional SAE). Quality rises with ℓ, with G, and with k throughout, and t… view at source ↗

**Figure 19.** Figure 19: Grassmannian featurizer, full description-length landscape. Description length Lδ(x) (Eq. 5) against block dimension k, with one panel per dictionary width G and one curve per block sparsity ℓ, at each distortion level. The minimum of each curve marks the description-optimal block dimension, which sits at an interior k ≈ 3 and eases downward as G widens. C.3 Description length favours structure at every d… view at source ↗

**Figure 20.** Figure 20: Block featurizer, full description-length landscape. As in [PITH_FULL_IMAGE:figures/full_fig_p030_20.png] view at source ↗

**Figure 21.** Figure 21: Group Lasso featurizer, full description-length landscape. As in [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗

**Figure 22.** Figure 22: Rank measures and utilization against block dimension. (top) Mean stable rank, participation ratio, and effective rank of the per-block code, against the allotted block dimension k, for the three featurizers; the dashed line is full occupancy (rank = k). All three measures grow far more slowly than the diagonal, so a block uses only a fraction of the dimensions it is given. (bottom) Utilization, the effec… view at source ↗

**Figure 23.** Figure 23: The higher harmonics survive a matched null. Share of the AC variance of the recovered curve manifold mi(θ) carried by each Fourier mode k as the stimulus orientation sweeps [0, 360◦ ). The measured manifold (blue) retains 18.3% and 12.0% of its variance at k=2 and k=3, while the two nulls leave these modes empty: a pure first-harmonic input (pink, k=1 only) places all of its power on the orientation circ… view at source ↗

**Figure 24.** Figure 24: Linear Probe Evaluation. We train linear probes on codes extracted by Vanilla SAEs and BSFs (Block-SAEs) on ImageNet-1k (classification), ADE20k (semantic segmentation) and NYUv2 (monocular depth estimation). We report validation set performance on each dataset. 1 2 3 4 group size K 0.1 0.2 0.3 0.4 0.5 0.6 0.7 cos(probe direction, k-atom subspace) Classification 1 2 3 4 group size K Segmentation 1 2 3 4 g… view at source ↗

**Figure 25.** Figure 25: Probe Recovery. We treat linear probes as activations and measure how well different SAEs can reconstruct them. 8 16 32 64 sparsity L0 (active groups/token) 4 5 6 7 Total Variation 4,096 concepts 8 16 32 64 sparsity L0 (active groups/token) 8,192 concepts 8 16 32 64 sparsity L0 (active groups/token) 16,384 concepts 8 16 32 64 sparsity L0 (active groups/token) 32,768 concepts SAE (K=1) K=2 K=3 K=4 K=6 K=8 … view at source ↗

**Figure 26.** Figure 26: BSFs learn more spatially coherent concepts. We show total variation as a function of number of concepts predicted for a sweep of 96 SAEs. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_26.png] view at source ↗

**Figure 27.** Figure 27: Minimal description length comparison between Grassmannian BSFs and MFAs. MFAs are naturally dense, resulting in high description lengths during inference. Hard top-k thresholding can be applied to MFAs to induce sparsity, as shown by the open circles (sparsity of k=1), significantly reducing the MDL without impacting R2 significantly. Adapting SMixAE and MFA. SMixAE encodes through a rectified expert sp… view at source ↗

read the original abstract

What is the geometry of a visual percept? The most widely used protocols for decomposing neural network representations into interpretable parts treat concepts as isolated directions, yet recent work shows that concepts are often realized as geometric structures in low dimensional regions of activation space. We turn to the literature of Structured sparsity to close this gap, and show that block sparsity, which groups directions into blocks, is the prior matched to a generative model in which a representation is a sparse sum of low-dimensional manifolds: the modern, learned form of a classical idea in visual neuroscience, where a visual feature is carried by a coordinated group of neurons rather than a single tuned one. We implement three variants of block-sparse featurizers (BSFs) and, through a minimum-description-length analysis, show that all three describe activations more compactly than direction-based featurizers, with the recovered concepts typically two- to four-dimensional. We then use BSFs to (i) recontextualize prior work, showing that curve detectors in InceptionV1 actually read from a single continuous curve manifold, (ii) discover novel manifolds including shadows and lighting in DINOv3, and (iii) support interpretable control of image generation in diffusion models (SDXL) via manifold steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Block-sparse featurizers recover 2-4D structures more compactly than directions via MDL, with useful recontextualizations and a control demo, though the manifold generative model claim rests on suggestive rather than isolating evidence.

read the letter

The key takeaway is that block-sparse featurizers recover low-dimensional manifold structures in visual representations more compactly than direction-based methods, according to minimum description length comparisons.

The paper introduces three variants of these featurizers and applies them to models like InceptionV1, DINOv3, and SDXL. It shows that the blocks are usually 2 to 4 dimensional, reinterprets curve detectors as reading from a single curve manifold, identifies new manifolds for shadows and lighting, and uses manifold steering for interpretable control in image generation.

This approach builds on structured sparsity literature and connects to classical ideas in visual neuroscience about features being carried by groups of neurons. The empirical MDL results and the practical control demo are solid contributions that move beyond abstract claims.

The softer part is the link to the specific generative model of sparse sums of low-dimensional manifolds. The MDL advantage shows better compression, but it does not directly test whether the data follows that process versus other forms of local structure or correlations that blocks might capture more efficiently. Without experiments on synthetic data generated from each model, the claim that block sparsity is the matched prior remains suggestive rather than conclusive.

This work is for researchers in mechanistic interpretability who care about the geometry of concepts rather than just linear directions. Readers working on vision models or representation learning could find the methods and examples useful. It has enough new empirical content and applications to merit a serious referee, even if some theoretical framing needs more support.

I would send it to peer review.

Referee Report

1 major / 0 minor

Summary. The paper claims that block sparsity is the prior matched to a generative model of neural activations as sparse sums of low-dimensional manifolds. It implements three block-sparse featurizer (BSF) variants and shows via minimum-description-length (MDL) analysis that they yield lower description lengths than direction-based featurizers, with recovered blocks typically 2-4 dimensional. Applications include recontextualizing curve detectors in InceptionV1 as reading from a single continuous manifold, discovering novel manifolds (e.g., shadows/lighting) in DINOv3, and enabling manifold-based steering in SDXL diffusion models.

Significance. If the central MDL result holds and the generative-model interpretation is supported, the work would advance interpretability by supplying a structured-sparsity prior aligned with manifold geometry rather than isolated directions, extending classical neuroscience ideas to modern networks. The cross-model applications and control demonstration would be concrete strengths.

major comments (1)

[MDL analysis] MDL analysis section: the claim that lower description length establishes that block sparsity is 'the prior matched to' the sparse-sum-of-manifolds generative model is not isolated by the reported experiments. The comparison is consistent with BSFs simply capturing local correlations or clustered directions more efficiently; a controlled test on synthetic data generated from each process is needed to distinguish the two stories.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

read point-by-point responses

Referee: MDL analysis section: the claim that lower description length establishes that block sparsity is 'the prior matched to' the sparse-sum-of-manifolds generative model is not isolated by the reported experiments. The comparison is consistent with BSFs simply capturing local correlations or clustered directions more efficiently; a controlled test on synthetic data generated from each process is needed to distinguish the two stories.

Authors: We agree that the MDL comparison performed on real network activations does not isolate the sparse-sum-of-manifolds generative model from alternative explanations such as efficient capture of local correlations or clustered directions. The manuscript's theoretical section motivates block sparsity from the manifold model, and the recovered blocks are observed to be low-dimensional (typically 2-4D) with applications that align with manifold geometry, but these observations remain correlational. A controlled synthetic experiment comparing description lengths under data generated from each process would strengthen the causal claim. We will add such an experiment to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; central claim rests on independent empirical MDL comparison

full rationale

The paper advances its claim that block sparsity matches a sparse-sum-of-manifolds generative model solely via an empirical minimum-description-length analysis comparing BSF variants to direction-based featurizers on real activations. No equations, fitting procedures, or self-citations are shown that would reduce any prediction or uniqueness result to the inputs by construction. The MDL metric functions as an external benchmark rather than a self-referential definition, and the recovered block dimensionalities (2-4) are reported outcomes rather than fitted inputs renamed as predictions. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central modeling choice is the generative story of sparse sums of low-dimensional manifolds; no free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption Concepts are realized as geometric structures in low dimensional regions of activation space rather than isolated directions
Stated as the premise drawn from recent work that the block-sparse prior is intended to match.

pith-pipeline@v0.9.1-grok · 5855 in / 1088 out tokens · 13239 ms · 2026-06-25T23:52:22.312789+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

152 extracted references · 14 canonical work pages

[1]

and Gillis, N

Abdolali, M. and Gillis, N. Beyond linear subspace clustering: A comparative study of nonlinear manifold clustering algorithms. Computer Science Review, 2021

2021
[2]

Sanity checks for saliency maps

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. Advances in Neural Information Processing Systems (NIPS), 2018

2018
[3]

Adelson, E. H. and Bergen, J. R. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2 0 (2): 0 284--299, 1985. doi:10.1364/JOSAA.2.000284

work page doi:10.1364/josaa.2.000284 1985
[4]

T., and Sharkey, L

Ayonrinde, K., Pearce, M. T., and Sharkey, L. Interpretability as compression: Reconsidering sae explanations of neural activations with mdl-saes. arXiv preprint arXiv:2410.11179, 2024

arXiv 2024
[5]

Structured sparsity through convex optimization

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Structured sparsity through convex optimization. Statistical Science, 2012

2012
[6]

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. Public Library of Science (PloS One), 2015

2015
[7]

Bao, P., She, L., McGill, M., and Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature, 583 0 (7814): 0 103--108, 2020. doi:10.1038/s41586-020-2350-5

work page doi:10.1038/s41586-020-2350-5 2020
[8]

G., Cevher, V., Duarte, M

Baraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. Model-based compressive sensing. IEEE Transactions on information theory, 2010

2010
[9]

Barlow, H. B. et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1961

1961
[10]

and Niyogi, P

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems (NeurIPS), 2001

2001
[11]

Do sparse autoencoders capture concept manifolds? A r X iv e-print , 2026 a

Bhalla, U., Fel, T., Rager, C., Feucht, S., Haklay, T., Wurgaft, D., Boppana, S., Kowal, M., Shyam, V., Merullo, J., et al. Do sparse autoencoders capture concept manifolds? A r X iv e-print , 2026 a

2026
[12]

M., Lakkaraju, H., and Calmon, F

Bhalla, U., Oesterling, A., Verdun, C. M., Lakkaraju, H., and Calmon, F. P. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. 2026 b . URL https://arxiv.org/abs/2511.05541

arXiv 2026
[13]

E., Hume, T., Carter, S., Henighan, T., and Olah, C

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing languag...

2023
[14]

Batchtopk sparse autoencoders

Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders. A r X iv e-print , 2024

2024
[15]

Cadieu, C. F. and Olshausen, B. A. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24 0 (4): 0 827--866, 2012. doi:10.1162/NECO_a_00247

work page doi:10.1162/neco_a_00247 2012
[16]

Curve detectors

Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M., and Olah, C. Curve detectors. Distill.pub, 2020

2020
[17]

Penalized regression, standard errors, and bayesian lassos

Casella, G., Ghosh, M., Gill, J., and Kyung, M. Penalized regression, standard errors, and bayesian lassos. 2010

2010
[18]

and Tsao, D

Chang, L. and Tsao, D. Y. The code for facial identity in the primate brain. Cell, 169 0 (6): 0 1013--1028, 2017. doi:10.1016/j.cell.2017.05.011

work page doi:10.1016/j.cell.2017.05.011 2017
[19]

and Abbott, L

Chung, S. and Abbott, L. F. Neural population geometry: An approach for understanding biological and artificial neural networks. Current opinion in neurobiology, 70: 0 137--144, 2021

2021
[20]

D., and Sompolinsky, H

Chung, S., Lee, D. D., and Sompolinsky, H. Classification and geometry of general perceptual manifolds. Physical Review X, 8 0 (3): 0 031003, 2018

2018
[21]

M., Cunningham, J

Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., and Shenoy, K. V. Neural population dynamics during reaching. Nature, 2012

2012
[22]

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 2006

2006
[23]

What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods

Colin, J., Fel, T., Cad \`e ne, R., and Serre, T. What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[24]

S., Tolooshams, B., and Ba, D

Costa, V., Fel, T., Lubana, E. S., Tolooshams, B., and Ba, D. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093, 2025

arXiv 2025
[25]

Sparse autoencoders find highly interpretable features in language models

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

Pith/arXiv arXiv 2023
[26]

Dalili, S. A. and Mahdavi, M. Subspace-aware sparse autoencoders for effective mechanistic interpretability. A r X iv e-print , 2026

2026
[27]

Donoho, D. L. and Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 2003

2003
[28]

Doshi, F. R. and Konkle, T. Cortical topographic motifs emerge in a self-organized map of object space. Science Advances, 2023

2023
[29]

and Irofti, P

Dumitrescu, B. and Irofti, P. Dictionary learning algorithms and applications. 2018

2018
[30]

Ebitz, R. B. and Hayden, B. Y. The population doctrine in cognitive neuroscience. Neuron, 2021

2021
[31]

Sparse and redundant representations: from theory to applications in signal and image processing

Elad, M. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010

2010
[32]

Eldar, Y. C. and Bolcskei, H. Block-sparsity: Coherence and efficient recovery. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

2009
[33]

and Vidal, R

Elhamifar, E. and Vidal, R. Sparse manifold clustering and embedding. Advances in Neural Information Processing Systems (NeurIPS), 2011

2011
[34]

and Vidal, R

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 2013

2013
[35]

J., Liao, I., Gurnee, W., and Tegmark, M

Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M. Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860, 2024

arXiv 2024
[36]

Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis

Fel, T., Cadene, R., Chalvidal, M., Cord, M., Vigouroux, D., and Serre, T. Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[37]

Unlocking feature visualization for deeper networks with magnitude constrained optimization

Fel, T., Boissin, T., Boutin, V., Picard, A., Novello, P., Colin, J., Linsley, D., Rousseau, T., Cadène, R., Gardes, L., and Serre, T. Unlocking feature visualization for deeper networks with magnitude constrained optimization. Advances in Neural Information Processing Systems (NeurIPS), 2023 a

2023
[38]

A holistic approach to unifying automatic concept extraction and concept importance estimation

Fel, T., Boutin, V., Moayeri, M., Cadene, R., Bethune, L., Chalvidal, M., and Serre, T. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems (NeurIPS), 2023 b

2023
[39]

Craft: Concept recursive activation factorization for explainability

Fel, T., Picard, A., Bethune, L., Boissin, T., Vigouroux, D., Colin, J., Cadène, R., and Serre, T. Craft: Concept recursive activation factorization for explainability. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 c

2023
[40]

A., Kowal, M., Lee, A., Balestriero, R., Joseph, S., Lubana, E

Fel, T., Wang, B., Lepori, M. A., Kowal, M., Lee, A., Balestriero, R., Joseph, S., Lubana, E. S., Konkle, T., Ba, D., et al. Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638, 2025

Pith/arXiv arXiv 2025
[41]

S., et al

Feucht, S., Haklay, T., Bhalla, U., Wurgaft, D., Rager, C., Sarfati, R., Merullo, J., McGrath, T., Lewis, O., Lubana, E. S., et al. Arithmetic in the wild: Llama uses base-10 addition to reason about cyclic concepts. A r X iv e-print , 2026

2026
[42]

and Endres, D

Foldiak, P. and Endres, D. M. Sparse coding. A r X iv e-print , 2008

2008
[43]

Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

2017
[44]

Smixae: Towards unsupervised manifold discovery in language models

Francel, C. Smixae: Towards unsupervised manifold discovery in language models. A r X iv e-print , 2026

2026
[45]

Convergent evolution: How different language models learn similar number representations

Fu, D., Zhou, T., Belkin, M., Sharan, V., and Jia, R. Convergent evolution: How different language models learn similar number representations. A r X iv e-print , 2026

2026
[46]

K., and Rigotti, M

Fusi, S., Miller, E. K., and Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37: 0 66--74, 2016. doi:10.1016/j.conb.2016.01.010

work page doi:10.1016/j.conb.2016.01.010 2016
[47]

A., Perich, M

Gallego, J. A., Perich, M. G., Miller, L. E., and Solla, S. A. Neural manifolds for the control of movement. Neuron, 2017

2017
[48]

D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

Pith/arXiv arXiv 2024
[49]

S., Fel, T., Merullo, J., Lewis, O., and McGrath, T

Geiger, A., Lubana, E. S., Fel, T., Merullo, J., Lewis, O., and McGrath, T. The world inside neural networks: How neural geometry will unlock understanding and control of ai. Goodfire, May 2026

2026
[50]

P., Schwartz, A

Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. Neuronal population coding of movement direction. Science, 233 0 (4771): 0 1416--1419, 1986

1986
[51]

Interpretation of neural networks is fragile

Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017

2017
[52]

Y., and Kim, B

Ghorbani, A., Wexler, J., Zou, J. Y., and Kim, B. Towards automatic concept-based explanations. Advances in Neural Information Processing Systems (NeurIPS), 2019

2019
[53]

The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision

Gorton, L. The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision. A r X iv e-print , 2024

2024
[54]

and LeCun, Y

Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. Proceedings of the International Conference on Machine Learning (ICML), 2010

2010
[55]

When models manipulate manifolds: The geometry of a counting task

Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., and Batson, J. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/linebreaks/index.html

2025
[56]

Guthikonda, S. M. Kohonen self-organizing maps. Wittenberg University, 2005

2005
[57]

The out-of-distribution problem in explainability and search methods for feature importance explanations

Hase, P., Xie, H., and Bansal, M. The out-of-distribution problem in explainability and search methods for feature importance explanations. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[58]

Hindupur, S. S. R., Lubana, E. S., Fel, T., and Ba, D. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025

arXiv 2025
[59]

Evaluations and methods for explanation through robustness analysis

Hsieh, C.-Y., Yeh, C.-K., Liu, X., Ravikumar, P., Kim, S., Kumar, S., and Hsieh, C.-J. Evaluations and methods for explanation through robustness analysis. Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021
[60]

Learning with structured sparsity

Huang, J., Zhang, T., and Metaxas, D. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.\ 417--424, 2009

2009
[61]

and Hoyer, P

Hyv \"a rinen, A. and Hoyer, P. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12 0 (7): 0 1705--1720, 2000. doi:10.1162/089976600300015312

work page doi:10.1162/089976600300015312 2000
[62]

O., and Inki, M

Hyv \"a rinen, A., Hoyer, P. O., and Inki, M. Topographic independent component analysis. Neural Computation, 13 0 (7): 0 1527--1558, 2001. doi:10.1162/089976601750264992

work page doi:10.1162/089976601750264992 2001
[63]

Structured sparse principal component analysis

Jenatton, R., Obozinski, G., and Bach, F. Structured sparse principal component analysis. International Conference on Artificial Intelligence and Statistics, 2010

2010
[64]

Deep subspace clustering networks

Ji, P., Zhang, T., Li, H., Salzmann, M., and Reid, I. Deep subspace clustering networks. Advances in Neural Information Processing Systems (NeurIPS), 2017

2017
[65]

and Tegmark, M

Kantamneni, S. and Tegmark, M. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873, 2025 a

arXiv 2025
[66]

and Tegmark, M

Kantamneni, S. and Tegmark, M. Language models use trigonometry to do addition. A r X iv e-print , 2025 b

2025
[67]

J., Nava, A., Wyart, M., and Bahri, Y

Karkada, D., Korchinski, D. J., Nava, A., Wyart, M., and Bahri, Y. Symmetry in language statistics shapes the geometry of model representations. arXiv preprint arXiv:2602.15029, 2026

arXiv 2026
[68]

and Lewicki, M

Karklin, Y. and Lewicki, M. S. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457 0 (7225): 0 83--86, 2009. doi:10.1038/nature07481

work page doi:10.1038/nature07481 2009
[69]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp.\ 2668--2677. PMLR, 2018

2018
[70]

Identifying interpretable visual features in artificial and biological neural systems

Klindt, D., Sanborn, S., Acosta, F., Poitevin, F., and Miolane, N. Identifying interpretable visual features in artificial and biological neural systems. A r X iv e-print , 2023

2023
[71]

From superposition to sparse codes: interpretable representations in neural networks

Klindt, D., O'Neill, C., Reizinger, P., Maurer, H., and Miolane, N. From superposition to sparse codes: interpretable representations in neural networks. arXiv preprint arXiv:2503.01824, 2025

arXiv 2025
[72]

Emergent organization of multiple visuotopic maps without a feature hierarchy

Konkle, T. Emergent organization of multiple visuotopic maps without a feature hierarchy. bioRxiv, 2021

2021
[73]

G., and Tokmakov, P

Kowal, M., Dave, A., Ambrus, R., Gaidon, A., Derpanis, K. G., and Tokmakov, P. Understanding video transformers via universal concept discovery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 a

2024
[74]

P., and Derpanis, K

Kowal, M., Wildes, R. P., and Derpanis, K. G. Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 b

2024
[75]

Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework

Li, C.-G., You, C., and Vidal, R. Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework. IEEE Transactions on Image Processing, 2017

2017
[76]

Li, Z., Chen, Y., LeCun, Y., and Sommer, F. T. Neural manifold clustering and embedding. A r X iv e-print , 2022

2022
[77]

Robust subspace segmentation by low-rank representation

Liu, G., Lin, Z., and Yu, Y. Robust subspace segmentation by low-rank representation. Proceedings of the International Conference on Machine Learning (ICML), 2010

2010
[78]

Robust recovery of subspace structures by low-rank representation

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 2012

2012
[79]

S., Rager, C., Hindupur, S

Lubana, E. S., Rager, C., Hindupur, S. S. R., Costa, V., Tuckute, G., Patel, O., Murthy, S. K., Fel, T., Wurgaft, D., Bigelow, E. J., et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025

arXiv 2025
[80]

Sparse modeling for image and vision processing

Mairal, J., Bach, F., and Ponce, J. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 2014

2014

Showing first 80 references.

[1] [1]

and Gillis, N

Abdolali, M. and Gillis, N. Beyond linear subspace clustering: A comparative study of nonlinear manifold clustering algorithms. Computer Science Review, 2021

2021

[2] [2]

Sanity checks for saliency maps

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. Advances in Neural Information Processing Systems (NIPS), 2018

2018

[3] [3]

Adelson, E. H. and Bergen, J. R. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2 0 (2): 0 284--299, 1985. doi:10.1364/JOSAA.2.000284

work page doi:10.1364/josaa.2.000284 1985

[4] [4]

T., and Sharkey, L

Ayonrinde, K., Pearce, M. T., and Sharkey, L. Interpretability as compression: Reconsidering sae explanations of neural activations with mdl-saes. arXiv preprint arXiv:2410.11179, 2024

arXiv 2024

[5] [5]

Structured sparsity through convex optimization

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Structured sparsity through convex optimization. Statistical Science, 2012

2012

[6] [6]

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. Public Library of Science (PloS One), 2015

2015

[7] [7]

Bao, P., She, L., McGill, M., and Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature, 583 0 (7814): 0 103--108, 2020. doi:10.1038/s41586-020-2350-5

work page doi:10.1038/s41586-020-2350-5 2020

[8] [8]

G., Cevher, V., Duarte, M

Baraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. Model-based compressive sensing. IEEE Transactions on information theory, 2010

2010

[9] [9]

Barlow, H. B. et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1961

1961

[10] [10]

and Niyogi, P

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems (NeurIPS), 2001

2001

[11] [11]

Do sparse autoencoders capture concept manifolds? A r X iv e-print , 2026 a

Bhalla, U., Fel, T., Rager, C., Feucht, S., Haklay, T., Wurgaft, D., Boppana, S., Kowal, M., Shyam, V., Merullo, J., et al. Do sparse autoencoders capture concept manifolds? A r X iv e-print , 2026 a

2026

[12] [12]

M., Lakkaraju, H., and Calmon, F

Bhalla, U., Oesterling, A., Verdun, C. M., Lakkaraju, H., and Calmon, F. P. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. 2026 b . URL https://arxiv.org/abs/2511.05541

arXiv 2026

[13] [13]

E., Hume, T., Carter, S., Henighan, T., and Olah, C

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing languag...

2023

[14] [14]

Batchtopk sparse autoencoders

Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders. A r X iv e-print , 2024

2024

[15] [15]

Cadieu, C. F. and Olshausen, B. A. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24 0 (4): 0 827--866, 2012. doi:10.1162/NECO_a_00247

work page doi:10.1162/neco_a_00247 2012

[16] [16]

Curve detectors

Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M., and Olah, C. Curve detectors. Distill.pub, 2020

2020

[17] [17]

Penalized regression, standard errors, and bayesian lassos

Casella, G., Ghosh, M., Gill, J., and Kyung, M. Penalized regression, standard errors, and bayesian lassos. 2010

2010

[18] [18]

and Tsao, D

Chang, L. and Tsao, D. Y. The code for facial identity in the primate brain. Cell, 169 0 (6): 0 1013--1028, 2017. doi:10.1016/j.cell.2017.05.011

work page doi:10.1016/j.cell.2017.05.011 2017

[19] [19]

and Abbott, L

Chung, S. and Abbott, L. F. Neural population geometry: An approach for understanding biological and artificial neural networks. Current opinion in neurobiology, 70: 0 137--144, 2021

2021

[20] [20]

D., and Sompolinsky, H

Chung, S., Lee, D. D., and Sompolinsky, H. Classification and geometry of general perceptual manifolds. Physical Review X, 8 0 (3): 0 031003, 2018

2018

[21] [21]

M., Cunningham, J

Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., and Shenoy, K. V. Neural population dynamics during reaching. Nature, 2012

2012

[22] [22]

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 2006

2006

[23] [23]

What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods

Colin, J., Fel, T., Cad \`e ne, R., and Serre, T. What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[24] [24]

S., Tolooshams, B., and Ba, D

Costa, V., Fel, T., Lubana, E. S., Tolooshams, B., and Ba, D. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093, 2025

arXiv 2025

[25] [25]

Sparse autoencoders find highly interpretable features in language models

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

Pith/arXiv arXiv 2023

[26] [26]

Dalili, S. A. and Mahdavi, M. Subspace-aware sparse autoencoders for effective mechanistic interpretability. A r X iv e-print , 2026

2026

[27] [27]

Donoho, D. L. and Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 2003

2003

[28] [28]

Doshi, F. R. and Konkle, T. Cortical topographic motifs emerge in a self-organized map of object space. Science Advances, 2023

2023

[29] [29]

and Irofti, P

Dumitrescu, B. and Irofti, P. Dictionary learning algorithms and applications. 2018

2018

[30] [30]

Ebitz, R. B. and Hayden, B. Y. The population doctrine in cognitive neuroscience. Neuron, 2021

2021

[31] [31]

Sparse and redundant representations: from theory to applications in signal and image processing

Elad, M. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010

2010

[32] [32]

Eldar, Y. C. and Bolcskei, H. Block-sparsity: Coherence and efficient recovery. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

2009

[33] [33]

and Vidal, R

Elhamifar, E. and Vidal, R. Sparse manifold clustering and embedding. Advances in Neural Information Processing Systems (NeurIPS), 2011

2011

[34] [34]

and Vidal, R

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 2013

2013

[35] [35]

J., Liao, I., Gurnee, W., and Tegmark, M

Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M. Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860, 2024

arXiv 2024

[36] [36]

Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis

Fel, T., Cadene, R., Chalvidal, M., Cord, M., Vigouroux, D., and Serre, T. Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[37] [37]

Unlocking feature visualization for deeper networks with magnitude constrained optimization

Fel, T., Boissin, T., Boutin, V., Picard, A., Novello, P., Colin, J., Linsley, D., Rousseau, T., Cadène, R., Gardes, L., and Serre, T. Unlocking feature visualization for deeper networks with magnitude constrained optimization. Advances in Neural Information Processing Systems (NeurIPS), 2023 a

2023

[38] [38]

A holistic approach to unifying automatic concept extraction and concept importance estimation

Fel, T., Boutin, V., Moayeri, M., Cadene, R., Bethune, L., Chalvidal, M., and Serre, T. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems (NeurIPS), 2023 b

2023

[39] [39]

Craft: Concept recursive activation factorization for explainability

Fel, T., Picard, A., Bethune, L., Boissin, T., Vigouroux, D., Colin, J., Cadène, R., and Serre, T. Craft: Concept recursive activation factorization for explainability. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 c

2023

[40] [40]

A., Kowal, M., Lee, A., Balestriero, R., Joseph, S., Lubana, E

Fel, T., Wang, B., Lepori, M. A., Kowal, M., Lee, A., Balestriero, R., Joseph, S., Lubana, E. S., Konkle, T., Ba, D., et al. Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638, 2025

Pith/arXiv arXiv 2025

[41] [41]

S., et al

Feucht, S., Haklay, T., Bhalla, U., Wurgaft, D., Rager, C., Sarfati, R., Merullo, J., McGrath, T., Lewis, O., Lubana, E. S., et al. Arithmetic in the wild: Llama uses base-10 addition to reason about cyclic concepts. A r X iv e-print , 2026

2026

[42] [42]

and Endres, D

Foldiak, P. and Endres, D. M. Sparse coding. A r X iv e-print , 2008

2008

[43] [43]

Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

2017

[44] [44]

Smixae: Towards unsupervised manifold discovery in language models

Francel, C. Smixae: Towards unsupervised manifold discovery in language models. A r X iv e-print , 2026

2026

[45] [45]

Convergent evolution: How different language models learn similar number representations

Fu, D., Zhou, T., Belkin, M., Sharan, V., and Jia, R. Convergent evolution: How different language models learn similar number representations. A r X iv e-print , 2026

2026

[46] [46]

K., and Rigotti, M

Fusi, S., Miller, E. K., and Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37: 0 66--74, 2016. doi:10.1016/j.conb.2016.01.010

work page doi:10.1016/j.conb.2016.01.010 2016

[47] [47]

A., Perich, M

Gallego, J. A., Perich, M. G., Miller, L. E., and Solla, S. A. Neural manifolds for the control of movement. Neuron, 2017

2017

[48] [48]

D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

Pith/arXiv arXiv 2024

[49] [49]

S., Fel, T., Merullo, J., Lewis, O., and McGrath, T

Geiger, A., Lubana, E. S., Fel, T., Merullo, J., Lewis, O., and McGrath, T. The world inside neural networks: How neural geometry will unlock understanding and control of ai. Goodfire, May 2026

2026

[50] [50]

P., Schwartz, A

Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. Neuronal population coding of movement direction. Science, 233 0 (4771): 0 1416--1419, 1986

1986

[51] [51]

Interpretation of neural networks is fragile

Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017

2017

[52] [52]

Y., and Kim, B

Ghorbani, A., Wexler, J., Zou, J. Y., and Kim, B. Towards automatic concept-based explanations. Advances in Neural Information Processing Systems (NeurIPS), 2019

2019

[53] [53]

The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision

Gorton, L. The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision. A r X iv e-print , 2024

2024

[54] [54]

and LeCun, Y

Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. Proceedings of the International Conference on Machine Learning (ICML), 2010

2010

[55] [55]

When models manipulate manifolds: The geometry of a counting task

Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., and Batson, J. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/linebreaks/index.html

2025

[56] [56]

Guthikonda, S. M. Kohonen self-organizing maps. Wittenberg University, 2005

2005

[57] [57]

The out-of-distribution problem in explainability and search methods for feature importance explanations

Hase, P., Xie, H., and Bansal, M. The out-of-distribution problem in explainability and search methods for feature importance explanations. Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[58] [58]

Hindupur, S. S. R., Lubana, E. S., Fel, T., and Ba, D. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025

arXiv 2025

[59] [59]

Evaluations and methods for explanation through robustness analysis

Hsieh, C.-Y., Yeh, C.-K., Liu, X., Ravikumar, P., Kim, S., Kumar, S., and Hsieh, C.-J. Evaluations and methods for explanation through robustness analysis. Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[60] [60]

Learning with structured sparsity

Huang, J., Zhang, T., and Metaxas, D. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.\ 417--424, 2009

2009

[61] [61]

and Hoyer, P

Hyv \"a rinen, A. and Hoyer, P. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12 0 (7): 0 1705--1720, 2000. doi:10.1162/089976600300015312

work page doi:10.1162/089976600300015312 2000

[62] [62]

O., and Inki, M

Hyv \"a rinen, A., Hoyer, P. O., and Inki, M. Topographic independent component analysis. Neural Computation, 13 0 (7): 0 1527--1558, 2001. doi:10.1162/089976601750264992

work page doi:10.1162/089976601750264992 2001

[63] [63]

Structured sparse principal component analysis

Jenatton, R., Obozinski, G., and Bach, F. Structured sparse principal component analysis. International Conference on Artificial Intelligence and Statistics, 2010

2010

[64] [64]

Deep subspace clustering networks

Ji, P., Zhang, T., Li, H., Salzmann, M., and Reid, I. Deep subspace clustering networks. Advances in Neural Information Processing Systems (NeurIPS), 2017

2017

[65] [65]

and Tegmark, M

Kantamneni, S. and Tegmark, M. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873, 2025 a

arXiv 2025

[66] [66]

and Tegmark, M

Kantamneni, S. and Tegmark, M. Language models use trigonometry to do addition. A r X iv e-print , 2025 b

2025

[67] [67]

J., Nava, A., Wyart, M., and Bahri, Y

Karkada, D., Korchinski, D. J., Nava, A., Wyart, M., and Bahri, Y. Symmetry in language statistics shapes the geometry of model representations. arXiv preprint arXiv:2602.15029, 2026

arXiv 2026

[68] [68]

and Lewicki, M

Karklin, Y. and Lewicki, M. S. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457 0 (7225): 0 83--86, 2009. doi:10.1038/nature07481

work page doi:10.1038/nature07481 2009

[69] [69]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp.\ 2668--2677. PMLR, 2018

2018

[70] [70]

Identifying interpretable visual features in artificial and biological neural systems

Klindt, D., Sanborn, S., Acosta, F., Poitevin, F., and Miolane, N. Identifying interpretable visual features in artificial and biological neural systems. A r X iv e-print , 2023

2023

[71] [71]

From superposition to sparse codes: interpretable representations in neural networks

Klindt, D., O'Neill, C., Reizinger, P., Maurer, H., and Miolane, N. From superposition to sparse codes: interpretable representations in neural networks. arXiv preprint arXiv:2503.01824, 2025

arXiv 2025

[72] [72]

Emergent organization of multiple visuotopic maps without a feature hierarchy

Konkle, T. Emergent organization of multiple visuotopic maps without a feature hierarchy. bioRxiv, 2021

2021

[73] [73]

G., and Tokmakov, P

Kowal, M., Dave, A., Ambrus, R., Gaidon, A., Derpanis, K. G., and Tokmakov, P. Understanding video transformers via universal concept discovery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 a

2024

[74] [74]

P., and Derpanis, K

Kowal, M., Wildes, R. P., and Derpanis, K. G. Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 b

2024

[75] [75]

Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework

Li, C.-G., You, C., and Vidal, R. Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework. IEEE Transactions on Image Processing, 2017

2017

[76] [76]

Li, Z., Chen, Y., LeCun, Y., and Sommer, F. T. Neural manifold clustering and embedding. A r X iv e-print , 2022

2022

[77] [77]

Robust subspace segmentation by low-rank representation

Liu, G., Lin, Z., and Yu, Y. Robust subspace segmentation by low-rank representation. Proceedings of the International Conference on Machine Learning (ICML), 2010

2010

[78] [78]

Robust recovery of subspace structures by low-rank representation

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 2012

2012

[79] [79]

S., Rager, C., Hindupur, S

Lubana, E. S., Rager, C., Hindupur, S. S. R., Costa, V., Tuckute, G., Patel, O., Murthy, S. K., Fel, T., Wurgaft, D., Bigelow, E. J., et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025

arXiv 2025

[80] [80]

Sparse modeling for image and vision processing

Mairal, J., Bach, F., and Ponce, J. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 2014

2014