arXiv: 2509.03738 · v4 · submitted 2025-09-03 · cs.LG · cs.AI · eess.SP · stat.ML

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Pith reviewed 2026-05-18 18:56 UTC · model grok-4.3

classification: cs.LG · cs.AI · eess.SP · stat.ML
keywords: sparse autoencoders · neural operators · mechanistic interpretability · functional representations · Fourier neural operators · concept sparsity · domain sparsity

The pith

Sparse autoencoder neural operators represent concepts as functions to capture where and how they appear across an input domain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces sparse autoencoder neural operators, which work directly in function space rather than on fixed-dimensional vectors. The authors formalize the idea that data arise from sparse combinations of structured functions, and implement it by parameterizing each concept as a function that can vary over the domain. Using Fourier neural operators as the base, the models apply joint sparsity to choose which concepts are active and where each one is expressed. On vision data these SAE-FNOs learn localized patterns, require fewer active concepts, and keep concept properties stable when sparsity changes. They also adapt to new domain sizes and continue to work at resolutions never seen in training, settings where standard sparse autoencoders fail.

Core claim

Moving from vector-valued to functional parameterizations, together with joint concept and domain sparsity, extends sparse autoencoders from merely indicating concept presence to modeling the structured spatial or spectral expression of those concepts, as shown by improved localization, efficiency, stability, and generalization across discretizations on vision tasks.

What carries the argument

SAE-FNOs, which instantiate sparse autoencoders with Fourier neural operators so that each concept is an integral operator in the Fourier domain, controlled by separate sparsity penalties on which concepts activate and where they act across the input domain.
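
To make the two levels of sparsity concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's method: the summary above mentions separate sparsity penalties, whereas this sketch substitutes hard top-k selection, and the helper name `joint_sparsify` and its two budget parameters are hypothetical.

```python
import numpy as np

def joint_sparsify(z, k_concepts, k_domain):
    """Hypothetical helper: hard top-k stand-in for joint concept/domain sparsity.

    z has shape (num_concepts, num_points): one activation function per concept,
    sampled on a grid over the input domain.
    """
    energy = np.sum(z ** 2, axis=1)                   # per-concept energy over the domain
    active = np.argsort(energy)[-k_concepts:]         # concept sparsity: which concepts fire
    out = np.zeros_like(z)
    for c in active:
        where = np.argsort(np.abs(z[c]))[-k_domain:]  # domain sparsity: where each one fires
        out[c, where] = z[c, where]
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))                      # 8 concepts on a 32-point grid
code = joint_sparsify(z, k_concepts=2, k_domain=5)
active_concepts = np.flatnonzero(np.abs(code).sum(axis=1))
print(len(active_concepts), np.count_nonzero(code))
```

With a single grid point the domain axis collapses and only the concept selection remains, which corresponds to the presence-only picture attributed here to standard SAEs.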

If this is right

  • SAE-FNOs learn localized patterns on vision data.
  • They activate fewer concepts than standard SAEs while maintaining performance.
  • Concept properties remain stable when the sparsity level is varied.
  • The models automatically adapt when the size of the input domain changes.
  • They continue to operate correctly at grid resolutions higher than those used during training.
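
The last two points can be illustrated with a small NumPy sketch of a truncated Fourier multiplier evaluated on two grids. Everything here is an assumption made for illustration: the function `spectral_operator`, the random weights standing in for learned parameters, and the band-limited test signal are not taken from the paper.

```python
import numpy as np

def spectral_operator(u, weights):
    """Multiply the lowest Fourier modes of a periodic signal by fixed complex weights.

    u: samples on a uniform grid over [0, 1); weights: (modes,) complex array.
    norm="forward" makes the coefficients approximate continuous Fourier
    coefficients, so they do not scale with the grid size.
    """
    n = u.shape[-1]
    modes = weights.shape[0]
    u_hat = np.fft.rfft(u, norm="forward")
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = weights * u_hat[:modes]        # all higher modes are truncated
    return np.fft.irfft(out_hat, n=n, norm="forward")

def sample(n):
    """Sample a band-limited test signal on an n-point uniform grid."""
    x = np.arange(n) / n
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(2 * np.pi * 3 * x)

rng = np.random.default_rng(0)
weights = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # stand-in for learned weights

coarse = spectral_operator(sample(64), weights)      # "training" resolution
fine = spectral_operator(sample(128), weights)       # unseen, finer resolution
max_gap = np.max(np.abs(fine[::2] - coarse))         # compare on the shared grid points
print(max_gap)
```

The two outputs agree to floating-point precision only because the test signal is band-limited below the truncation point. On real data, aliasing would make the agreement approximate, which is why the referee report below asks for quantitative cross-resolution metrics.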

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same functional approach could be applied to time-series or physical simulation data where spatial or frequency structure is central.
  • Standard vector SAEs may be fundamentally limited when the underlying data vary continuously across a domain.
  • Choosing different operator bases could let practitioners inject domain knowledge directly into the interpretability model.

Load-bearing premise

Data are generated by sparse compositions of structured functions rather than by scalar activations inside a fixed-dimensional vector space.

What would settle it

A direct comparison on the same vision benchmark in which SAE-FNOs either fail to generalize to resolutions outside the training grid or require as many or more active concepts as a standard SAE to reach the same reconstruction quality.

Figures

Figures reproduced from arXiv: 2509.03738 by Ailsa Shen, Anima Anandkumar, Bahareh Tolooshams.

Figure 1: Model Recovery with SAEs. a) Architectural comparison of SAE, lifted SAE, and SAE Neural Operators. b) Learning in sampled Euclidean spaces vs. function spaces.
Figure 2: SAE-CNN vs. SAE-FNO. a) Lifting accelerates learning. b) SAE-FNO’s superiority in recovering smooth concepts via truncated Fourier modes. c) Equivalent learning when SAE-FNO uses all Fourier modes and matched spatial receptive field of SAE-CNNs.
Figure 3: SAE-FNO Upsampling Robustness Across Resolutions. SAE-FNO successfully infers the underlying sparse representations and reconstructs data across multiple discretization levels. The left panels show inference of 1-sparse code supports across 5 kernels, and the right panels display spatial-domain signal reconstruction.
Figure 4: Lifting as a preconditioner. Lifting accelerates learning. [Plot: dictionary recovery error vs. training iterations for SAE-CNN and L-SAE-CNN.]
Figure 5: Lifting. When the lifting operator satisfies the orthogonal condition L⊤L = I, the lifted SAE-CNN (L-SAE-CNN) exhibits equivalent learning dynamics to the SAE-CNN.
Figure 6: SAEs vs. L-SAEs. Learning the lifting operator L accelerates model recovery. (a) Dictionary recovery error converges faster for L-SAE-MLP, confirming the preconditioning effect of lifting; (b) Reconstruction loss follows similar convergence trends as dictionary recovery error; (c) Lifting encourages the effective dictionary D to learn more orthogonal (less correlated) atoms early in training, creating a mo…
Figure 7: SAE-FNO Upsampling Robustness Across Resolutions. SAE-FNO successfully infers the underlying sparse representations and reconstructs data across multiple discretization levels. The left panels show inference of 1-sparse code supports across 5 kernels, while the right panels display spatial-domain signal reconstruction. (a) Original resolution (1×): Baseline performance at training resolution. (b-d) Higher …
Original abstract

We introduce sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces rather than fixed-dimensional Euclidean representations. We formalize the functional representation hypothesis, where data are explained through sparse compositions of structured functions. Unlike standard SAEs that represent concepts with scalar activations, SAE-NOs parameterize concepts as functions, enabling representations that capture not only a concept's presence, but also how and where it is expressed across the input domain. We achieve this through joint sparsity: concept sparsity selects active concepts, while domain sparsity governs where they are expressed. We instantiate this framework using Fourier neural operators (SAE-FNOs), parameterizing concepts as integral operators in the Fourier domain. This functional and spectral parameterization is particularly advantageous when data exhibit spatial structure across scales or when concepts are frequency-structured. We characterize SAE-FNO on vision data and demonstrate that it learns localized patterns, uses concepts more efficiently, and exhibits stable concept characteristics across sparsity levels. We further show that SAE-FNO adapts to changes in domain size and generalizes across discretizations, operating at resolutions beyond those seen during training, where standard SAEs fail. We also introduce lifting into SAEs and show theoretically and empirically that it acts as a preconditioner that accelerates optimization. Overall, our results show that moving from vector-valued to functional parameterizations, with concept and domain sparsity, extends SAEs from representing concept presence to modeling structured concept expression, highlighting the importance of parameterization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces. It formalizes the functional representation hypothesis, positing that data are explained through sparse compositions of structured functions rather than scalar activations. The framework employs joint sparsity (concept sparsity to select active concepts and domain sparsity to control where they are expressed) and instantiates the approach via Fourier neural operators as SAE-FNOs, parameterizing concepts as integral operators in the Fourier domain. Empirical results on vision data claim that SAE-FNOs learn localized patterns, use concepts more efficiently, exhibit stable concept characteristics across sparsity levels, adapt to domain size changes, and generalize across discretizations where standard SAEs fail. The work also introduces lifting into SAEs and provides theoretical and empirical support that it acts as a preconditioner accelerating optimization.

Significance. If the results hold, the work offers a substantive extension of mechanistic interpretability by shifting from vector-valued to functional representations, enabling modeling of structured concept expression across domains. This is especially relevant for spatially structured data. Strengths include the explicit parameterization choice, the joint sparsity mechanism, and the theoretical motivation for lifting as a preconditioner. The reported resolution-invariance properties, if quantitatively validated, would distinguish the approach from standard SAEs and support broader applicability in scientific machine learning.

major comments (3)
  1. [Abstract and Results section] The central claims that SAE-FNOs adapt to domain size changes and generalize across discretizations (where standard SAEs fail) are load-bearing for the contribution, yet rest on qualitative observations without reported quantitative metrics such as reconstruction error, concept stability scores, or cross-resolution ablation results. This weakens the ability to evaluate whether the Fourier parameterization and joint sparsity truly deliver the asserted invariance.
  2. [§3 (Method, joint sparsity and SAE-FNO definition)] The interaction between the domain sparsity mask and the discrete Fourier integral operator is not shown to preserve resolution invariance under changes in grid size or sampling; if the learned spectral coefficients encode grid-specific artifacts via the FFT implementation, the cross-discretization generalization would be an artifact rather than a property of the functional form. A concrete test (e.g., explicit quadrature or mode truncation analysis) is needed.
  3. [Theoretical section on lifting] The claim that lifting acts as a preconditioner is presented as both theoretical and empirical, but the specific derivation (e.g., the relevant equation showing the preconditioning effect on the optimization landscape) is not clearly isolated, making it hard to verify the acceleration result independently of the empirical curves.
minor comments (2)
  1. [Introduction] Clarify the precise distinction and notation between the general SAE-NO framework and the specific SAE-FNO instantiation at first use to improve readability.
  2. [Figures] Add error bars or statistical details to any quantitative plots in the experimental figures, even when the primary emphasis is qualitative.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the presentation and empirical support for our claims. We address each major comment below and have revised the manuscript to incorporate quantitative evidence, additional analysis, and clearer derivations where needed.

Point-by-point responses
  1. Referee: [Abstract and Results section] The central claims that SAE-FNOs adapt to domain size changes and generalize across discretizations (where standard SAEs fail) are load-bearing for the contribution, yet rest on qualitative observations without reported quantitative metrics such as reconstruction error, concept stability scores, or cross-resolution ablation results. This weakens the ability to evaluate whether the Fourier parameterization and joint sparsity truly deliver the asserted invariance.

    Authors: We agree that quantitative metrics are necessary to substantiate the resolution-invariance claims. In the revised manuscript we have added reconstruction error tables across multiple grid resolutions, concept stability scores computed via functional cosine similarity between concepts learned at different discretizations, and explicit cross-resolution ablation experiments comparing SAE-FNOs against standard SAEs. These additions demonstrate that the observed generalization is accompanied by measurable improvements in reconstruction fidelity and concept consistency, directly supporting the contribution of the Fourier parameterization and joint sparsity. revision: yes

  2. Referee: [§3 (Method, joint sparsity and SAE-FNO definition)] The interaction between the domain sparsity mask and the discrete Fourier integral operator is not shown to preserve resolution invariance under changes in grid size or sampling; if the learned spectral coefficients encode grid-specific artifacts via the FFT implementation, the cross-discretization generalization would be an artifact rather than a property of the functional form. A concrete test (e.g., explicit quadrature or mode truncation analysis) is needed.

    Authors: We appreciate this observation and have added a dedicated analysis subsection in §3. Because concepts are represented by a fixed set of Fourier modes whose coefficients are learned independently of the spatial grid, the parameterization is theoretically resolution-invariant; the domain sparsity mask is applied in the spatial domain after the inverse Fourier transform and therefore does not introduce grid-dependent artifacts into the spectral coefficients. We now include a mode-truncation study and quadrature-error bounds showing that reconstruction error remains stable under grid refinement, together with an empirical test that trains on one discretization and evaluates on another with different sampling density. These results indicate that the generalization arises from the functional form rather than implementation artifacts. revision: yes

  3. Referee: [Theoretical section on lifting] The claim that lifting acts as a preconditioner is presented as both theoretical and empirical, but the specific derivation (e.g., the relevant equation showing the preconditioning effect on the optimization landscape) is not clearly isolated, making it hard to verify the acceleration result independently of the empirical curves.

    Authors: We acknowledge that the derivation of the preconditioning effect was not sufficiently isolated. In the revised theoretical section we have extracted and numbered the key equations that demonstrate how lifting improves the conditioning of the loss landscape (specifically, how it reduces the Lipschitz constant of the gradient with respect to the encoder parameters). A step-by-step derivation is now provided, showing the relationship between the lifted representation and the Hessian spectrum, followed by the empirical curves that corroborate the predicted acceleration. This separation allows readers to verify the theoretical argument independently. revision: yes
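
For readers who want to check the preconditioning claim numerically, the following NumPy sketch shows one way an L⊤L factor can arise. The setup is an assumption, not the paper's derivation: a fixed linear lifting `L`, an effective dictionary `D = L.T @ W`, codes `Z` held fixed, and a squared-error loss. The helper names `grad_D` and `lifted_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_lift, k, n = 6, 10, 4, 32            # data dim, lifted dim, atoms, samples
X = rng.standard_normal((m, n))           # data
Z = rng.standard_normal((k, n))           # codes, held fixed for this check

def grad_D(D):
    """Gradient of 0.5 * ||X - D Z||_F^2 with respect to the dictionary D."""
    return (D @ Z - X) @ Z.T

def lifted_step(L, W, eta):
    """One gradient step on lifted weights W, returned as the effective dictionary L.T @ W."""
    D = L.T @ W
    W_next = W - eta * (L @ grad_D(D))    # chain rule: dLoss/dW = L @ dLoss/dD
    return L.T @ W_next

eta = 1e-2
W = rng.standard_normal((m_lift, k))

# Generic lifting: the effective update on D is preconditioned by L.T @ L.
L = rng.standard_normal((m_lift, m))
D = L.T @ W
preconditioned = D - eta * (L.T @ L) @ grad_D(D)
gap_generic = np.max(np.abs(lifted_step(L, W, eta) - preconditioned))

# Orthonormal columns (L.T @ L = I): the lifted step equals a plain gradient step.
Q, _ = np.linalg.qr(rng.standard_normal((m_lift, m)))
D_q = Q.T @ W
plain = D_q - eta * grad_D(D_q)
gap_orthogonal = np.max(np.abs(lifted_step(Q, W, eta) - plain))
print(gap_generic, gap_orthogonal)
```

Under these assumptions the second check is consistent with the Figure 5 caption, where L⊤L = I yields learning dynamics equivalent to the unlifted model. The sketch says nothing about whether a learned lifting accelerates training in practice.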

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical tests of a design choice rather than definitional reduction

full rationale

The paper defines SAE-NOs by extending existing SAE and FNO architectures with joint sparsity and a functional representation hypothesis that is explicitly formalized within the work. Generalization across discretizations is presented as an empirical outcome demonstrated on vision data (training on one resolution, testing on others), leveraging the known resolution-invariance properties of Fourier neural operators rather than deriving it tautologically from fitted parameters or self-citations. No equation or result is shown to equal its inputs by construction; the lifting preconditioner claim is supported by both theory and experiments. The derivation chain remains self-contained against external benchmarks of neural operator behavior.
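
One way to quantify the train-on-one-resolution, test-on-another comparison is a cosine similarity computed between functions rather than between vectors of different lengths. The simulated rebuttal names a "functional cosine similarity" without defining it, so the sketch below is a hypothetical definition: it compares two periodic signals through their lowest Fourier coefficients, and `functional_cosine` and its `modes` parameter are assumptions.

```python
import numpy as np

def fourier_coeffs(f, modes):
    """Lowest `modes` Fourier coefficients of a real periodic signal on a uniform grid."""
    return np.fft.rfft(f, norm="forward")[:modes]

def functional_cosine(f, g, modes=8):
    """Hypothetical metric: cosine similarity in L2([0, 1)) via Parseval.

    f and g may be sampled on different uniform grids; both are compared
    through their lowest `modes` coefficients (modes must stay below Nyquist).
    """
    def inner(a, b):
        # For real signals, negative frequencies mirror positive ones, hence the factor 2.
        return (a[0] * np.conj(b[0])).real + 2.0 * np.sum(a[1:] * np.conj(b[1:])).real
    a, b = fourier_coeffs(f, modes), fourier_coeffs(g, modes)
    return inner(a, b) / np.sqrt(inner(a, a) * inner(b, b))

def grid(n):
    return np.arange(n) / n

concept_64 = np.sin(2 * np.pi * grid(64))     # a concept sampled at 64 points
concept_128 = np.sin(2 * np.pi * grid(128))   # the same concept at 128 points
other_128 = np.cos(2 * np.pi * grid(128))     # an orthogonal concept

same = functional_cosine(concept_64, concept_128)
different = functional_cosine(concept_64, other_128)
print(same, different)
```

A score near 1 across grids would indicate that a concept is stable under re-discretization, and a score near 0 that two concepts are unrelated as functions. The choice of `modes` sets which frequencies the comparison can see.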

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the newly stated functional representation hypothesis and the choice of Fourier parameterization; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)
  • Functional representation hypothesis (domain assumption): data are explained through sparse compositions of structured functions.
    Stated in the abstract as the foundation for moving from scalar to functional concept representations.
invented entities (1)
  • SAE-NO / SAE-FNO (no independent evidence)
    purpose: Sparse autoencoder that operates in function space with joint concept and domain sparsity.
    New class introduced by the paper; no independent evidence outside the work itself.

pith-pipeline@v0.9.0 · 5809 in / 1308 out tokens · 31826 ms · 2026-05-18T18:56:12.309628+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We formalize the functional representation hypothesis, where data are explained through sparse compositions of structured functions... SAE-FNOs... parameterizing concepts as integral operators in the Fourier domain... lifting... acts as a preconditioner that accelerates optimization.

  • IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Lifting... has the effective update... L⊤L acts as a preconditioner... SAE-FNO with truncated modes exhibits an inductive bias that favours recovery of smooth concepts

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 6 internal anchors

  1. [1]

    Brain-score: Which artificial neural network for object recognition is most brain-like?,

    M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajalingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, F. Geiger,et al., “Brain-score: Which artificial neural network for object recognition is most brain-like?,”BioRxiv, p. 407007, 2018

  2. [2]

    The topology and geometry of neural representations,

    B. Lin and N. Kriegeskorte, “The topology and geometry of neural representations,”Proceedings of the National Academy of Sciences, vol. 121, no. 42, p. e2317881121, 2024

  3. [3]

    High-level visual representations in the human brain are aligned with large language models,

    A. Doerig, T. C. Kietzmann, E. Allen, Y . Wu, T. Naselaris, K. Kay, and I. Charest, “High-level visual representations in the human brain are aligned with large language models,”Nature Machine Intelligence, pp. 1–15, 2025

  4. [4]

    Stabilization of a brain–computer interface via the alignment of low-dimensional spaces of neural activity,

    A. D. Degenhart, W. E. Bishop, E. R. Oby, E. C. Tyler-Kabara, S. M. Chase, A. P. Batista, and B. M. Yu, “Stabilization of a brain–computer interface via the alignment of low-dimensional spaces of neural activity,”Nature biomedical engineering, vol. 4, no. 7, pp. 672–685, 2020

  5. [5]

    Universality and individuality in neural dynamics across large populations of recurrent networks,

    N. Maheswaranathan, A. H. Williams, M. D. Golub, S. Ganguli, and D. Sussillo, “Universality and individuality in neural dynamics across large populations of recurrent networks,”Advances in Neural Information Processing Systems, vol. 32, 2019

  6. [6]

    Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis,

    A. H. Williams, “Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis,” inProceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, pp. 10–23, PMLR, 2024

  7. [7]

    Soft matching distance: A metric on neural representations that captures single-neuron tuning,

    M. Khosla and A. H. Williams, “Soft matching distance: A metric on neural representations that captures single-neuron tuning,” inProceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, pp. 326–341, PMLR, 2024

  8. [8]

    Representation topology divergence: A method for comparing neural network representations.,

    S. Barannikov, I. Trofimov, N. Balabin, and E. Burnaev, “Representation topology divergence: A method for comparing neural network representations.,” inInternational Conference on Machine Learning, pp. 1607–1626, PMLR, 2022

  9. [9]

    Representational similarity analysis–connecting the branches of systems neuroscience,

    N. Kriegeskorte, M. Mur, and P. A. Bandettini, “Representational similarity analysis–connecting the branches of systems neuroscience,”Frontiers in Systems Neuroscience, vol. 2, p. 4, 2008

  10. [10]

    Position: The platonic representation hypothesis,

    M. Huh, B. Cheung, T. Wang, and P. Isola, “Position: The platonic representation hypothesis,” inForty-first International Conference on Machine Learning, 2024

  11. [11]

    Proof of a perfect platonic representation hypothesis,

    L. Ziyin and I. Chuang, “Proof of a perfect platonic representation hypothesis,”arXiv preprint arXiv:2507.01098, 2025

  12. [12]

    Neural Operator: Graph Kernel Network for Partial Differential Equations

    Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Graph kernel network for partial differential equations,”arXiv preprint arXiv:2003.03485, 2020

  13. [13]

    Fourier neural operator for parametric partial differential equations,

    Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. liu, K. Bhattacharya, A. Stuart, and A. Anand- kumar, “Fourier neural operator for parametric partial differential equations,” inInternational Conference on Learning Representations, 2021

  14. [14]

    Neural operator: Learning maps between function spaces with applications to pdes,

    N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Learning maps between function spaces with applications to pdes,”Journal of Machine Learning Research, vol. 24, no. 89, pp. 1–97, 2023

  15. [15]

    Neural operators for accelerating scientific simulations and design,

    K. Azizzadenesheli, N. Kovachki, Z. Li, M. Liu-Schiaffini, J. Kossaifi, and A. Anandkumar, “Neural operators for accelerating scientific simulations and design,”Nature Reviews Physics, pp. 1–9, 2024

  16. [16]

    Vars-fusi: Variable sampling for fast and efficient functional ultrasound imaging using neural operators,

    B. Tolooshams, L. Lydia, T. Callier, J. Wang, S. Pal, A. Chandrashekar, C. Rabut, Z. Li, C. Blagden, S. L. Norman, K. Azizzadenesheli, C. Liu, M. G. Shapiro, R. A. Andersen, and A. Anandkumar, “Vars-fusi: Variable sampling for fast and efficient functional ultrasound imaging using neural operators,”bioRxiv, pp. 2025–04, 2025. 6

  17. [17]

    Noble–neural operator with biologically-informed latent embeddings to capture experimental variability in biological neuron models,

    L. Ghafourpour, V . Duruisseaux*, B. Tolooshams*, P. H. Wong, C. A. Anastassiou, and A. Anandkumar, “Noble–neural operator with biologically-informed latent embeddings to capture experimental variability in biological neuron models,”arXiv:2506.04536, 2025

  18. [18]

    FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

    J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli,et al., “Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators,”arXiv preprint arXiv:2202.11214, 2022

  19. [19]

    Geometry-informed neural operator for large-scale 3d pdes,

    Z. Li, N. Kovachki, C. Choy, B. Li, J. Kossaifi, S. Otta, M. A. Nabian, M. Stadler, C. Hundt, K. Azizzadenesheli,et al., “Geometry-informed neural operator for large-scale 3d pdes,”Ad- vances in Neural Information Processing Systems, vol. 36, 2024

  20. [20]

    Unify- ing subsampling pattern variations for compressed sensing mri with neural operators,

    A. S. Jatyani, J. Wang, Z. Wu, M. Liu-Schiaffini, B. Tolooshams, and A. Anandkumar, “Unify- ing subsampling pattern variations for compressed sensing mri with neural operators,”IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  21. [21]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),

    B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas,et al., “Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),” inInternational conference on machine learning, pp. 2668–2677, PMLR, 2018

  22. [22]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images,

    B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,”Nature, vol. 381, no. 6583, pp. 607–609, 1996

  23. [23]

    Sparse coding with an overcomplete basis set: A strategy employed by v1?,

    B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?,”Vision research, vol. 37, no. 23, pp. 3311–3325, 1997

  24. [24]

    Toy Models of Superposition

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen,et al., “Toy models of superposition,”arXiv:2209.10652, 2022

  25. [25]

    Sparse autoencoders find highly interpretable features in language models,

    R. Huben, H. Cunningham, L. R. Smith, A. Ewart, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” inThe Twelfth International Conference on Learning Representations, 2023

  26. [26]

    Towards monosemanticity: Decomposing language models with dictionary learning,

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell,et al., “Towards monosemanticity: Decomposing language models with dictionary learning,”Transformer Circuits Thread, vol. 2, 2023

  27. [27]

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

    S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V . Varma, J. Kramár, and N. Nanda, “Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders,”arXiv preprint arXiv:2407.14435, 2024

  28. [28]

    The linear representation hypothesis and the geometry of large language models,

    K. Park, Y . J. Choe, and V . Veitch, “The linear representation hypothesis and the geometry of large language models,” inInternational Conference on Machine Learning, pp. 39643–39666, PMLR, 2024

  29. [29]

    Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet,

    A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan, “Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet,”Transfo...

  30. [30]

    Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2,

    T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V . Varma, J. Kramar, A. Dragan, R. Shah, and N. Nanda, “Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2,” inThe 7th BlackboxNLP Workshop, 2024

  31. [31]

    Scaling and evaluating sparse autoencoders,

    L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu, “Scaling and evaluating sparse autoencoders,” inThe Thirteenth International Conference on Learning Representations, 2025

  32. [32]

    Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models

    T. Fel, E. S. Lubana, J. S. Prince, M. Kowal, V . Boutin, I. Papadimitriou, B. Wang, M. Wat- tenberg, D. Ba, and T. Konkle, “Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models,”arXiv preprint arXiv:2502.12892, 2025. 7

  33. [33]

    Sparse feature circuits: Discovering and editing interpretable causal graphs in language models,

    S. Marks, C. Rager, E. J. Michaud, Y . Belinkov, D. Bau, and A. Mueller, “Sparse feature circuits: Discovering and editing interpretable causal graphs in language models,” inThe Thirteenth International Conference on Learning Representations, 2025

  34. [34]

    SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,

    A. Karvonen, C. Rager, J. Lin, C. Tigges, J. I. Bloom, D. Chanin, Y .-T. Lau, E. Farrell, C. S. Mc- Dougall, K. Ayonrinde, D. Till, M. Wearden, A. Conmy, S. Marks, and N. Nanda, “SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,” in Forty-second International Conference on Machine Learning, 2025

  35. [35]

    C. W. Groetsch and C. Groetsch,Inverse problems in the mathematical sciences, vol. 52. Springer, 1993

  36. [36]

    Hastie, R

    T. Hastie, R. Tibshirani, and M. Wainwright,Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015

  37. [37]

    Compressed sensing,

    D. L. Donoho, “Compressed sensing,”IEEE Transactions on information theory, vol. 52, no. 4, pp. 1289–1306, 2006

  38. [38]

    Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,

    E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory, vol. 52, no. 2, pp. 489–509, 2006

  39. [39]

    An introduction to compressive sampling,

    E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE signal processing magazine, vol. 25, no. 2, pp. 21–30, 2008

  40. [40]

    Estimating unknown sparsity in compressed sensing,

    M. Lopes, “Estimating unknown sparsity in compressed sensing,” in International Conference on Machine Learning, pp. 217–225, PMLR, 2013

  41. [41]

    Online dictionary learning for sparse coding,

    J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding,” in Proceedings of the 26th annual international conference on machine learning, pp. 689–696, 2009

  42. [42]

    Learning sparsely used overcomplete dictionaries via alternating minimization,

    A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli, “Learning sparsely used overcomplete dictionaries via alternating minimization,” SIAM Journal on Optimization, vol. 26, no. 4, pp. 2775–2799, 2016

  43. [43]

    Alternating minimization for dictionary learning: Local convergence guarantees,

    N. S. Chatterji and P. L. Bartlett, “Alternating minimization for dictionary learning: Local convergence guarantees,” arXiv preprint arXiv:1711.03634, 2017

  44. [44]

    Deep Learning for Inverse Problems in Engineering and Science

    B. Tolooshams, Deep Learning for Inverse Problems in Engineering and Science. PhD thesis, Harvard University, 2023

  45. [45]

    Learning fast approximations of sparse coding,

    K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the International Conference on Machine Learning, pp. 399–406, 2010

  46. [46]

    Learning step sizes for unfolded sparse coding,

    P. Ablin, T. Moreau, M. Massias, and A. Gramfort, “Learning step sizes for unfolded sparse coding,” in Proceedings of Advances in Neural Information Processing Systems, vol. 32, pp. 1–11, 2019

  47. [47]

    Understanding approximate and unrolled dictionary learning for pattern recovery,

    B. Malézieux, T. Moreau, and M. Kowalski, “Understanding approximate and unrolled dictionary learning for pattern recovery,” in International Conference on Learning Representations, 2022

  48. [48]

    Stable and interpretable unrolled dictionary learning,

    B. Tolooshams and D. E. Ba, “Stable and interpretable unrolled dictionary learning,” Transactions on Machine Learning Research, 2022

  49. [49]

    On the dynamics of gradient descent for autoencoders,

    T. V. Nguyen, R. K. Wong, and C. Hegde, “On the dynamics of gradient descent for autoencoders,” in Proceedings of International Conference on Artificial Intelligence and Statistics, pp. 2858–2867, PMLR, 2019

  50. [50]

    Simple, efficient, and neural algorithms for sparse coding,

    S. Arora, R. Ge, T. Ma, and A. Moitra, “Simple, efficient, and neural algorithms for sparse coding,” in Proceedings of Conference on Learning Theory (P. Grünwald, E. Hazan, and S. Kale, eds.), vol. 40 of Proceedings of Machine Learning Research, (Paris, France), pp. 113–149, PMLR, 03–06 Jul 2015

  51. [51]

    Theoretical linear convergence of unfolded ista and its practical weights and thresholds,

    X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ista and its practical weights and thresholds,” in Proceedings of Advances in Neural Information Processing Systems, vol. 31, pp. 1–11, 2018

  52. [52]

    Sparse coding and autoencoders,

    A. Rangamani, A. Mukherjee, A. Basu, A. Arora, T. Ganapathi, S. Chin, and T. D. Tran, “Sparse coding and autoencoders,” in Proceedings of IEEE International Symposium on Information Theory (ISIT), pp. 36–40, 2018

  53. [53]

    Convolutional dictionary learning based auto-encoders for natural exponential-family distributions,

    B. Tolooshams, A. Song, S. Temereanca, and D. Ba, “Convolutional dictionary learning based auto-encoders for natural exponential-family distributions,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 9493–9503, PMLR, 7 2020

  54. [54]

    Noodl: Provable online dictionary learning and sparse coding,

    S. Rambhatla, X. Li, and J. Haupt, “Noodl: Provable online dictionary learning and sparse coding,” in Proceedings of International Conference on Learning Representations, pp. 1–11, 2018

  55. [55]

    Projecting assumptions: The duality between sparse autoencoders and concept geometry,

    S. S. R. Hindupur, E. S. Lubana, T. Fel, and D. Ba, “Projecting assumptions: The duality between sparse autoencoders and concept geometry,” arXiv preprint arXiv:2503.01822, 2025

  56. [56]

    Sparse and redundant representations: from theory to applications in signal and image processing

    M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010

  57. [57]

    K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,

    M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006

  58. [58]

    Efficient generation of transcriptomic profiles by random composite measurements,

    B. Cleary, L. Cong, A. Cheung, E. S. Lander, and A. Regev, “Efficient generation of transcriptomic profiles by random composite measurements,” Cell, vol. 171, no. 6, pp. 1424–1436.e18, 2017

  59. [59]

    Compressed sensing for highly efficient imaging transcriptomics,

    B. Cleary, B. Simonton, J. Bezney, E. Murray, S. Alam, A. Sinha, E. Habibi, J. Marshall, E. S. Lander, F. Chen, et al., “Compressed sensing for highly efficient imaging transcriptomics,” Nature Biotechnology, pp. 1–7, 2021

  60. [60]

    Regression shrinkage and selection via the lasso,

    R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996

  61. [61]

    Atomic decomposition by basis pursuit,

    S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM review, vol. 43, no. 1, pp. 129–159, 2001

  62. [62]

    Proximal algorithms,

    N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in optimization, vol. 1, no. 3, pp. 127–239, 2014

  63. [63]

    An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,

    I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004

  64. [64]

    A fast iterative shrinkage-thresholding algorithm for linear inverse problems,

    A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009

  65. [65]

    Efficient learning of sparse representations with an energy-based model,

    M. a. Ranzato, C. Poultney, S. Chopra, and Y. Cun, “Efficient learning of sparse representations with an energy-based model,” in Advances in Neural Information Processing Systems, vol. 19, MIT Press, 2007

  66. [66]

    Sparse feature learning for deep belief networks,

    M. a. Ranzato, Y.-l. Boureau, and Y. Cun, “Sparse feature learning for deep belief networks,” in Proceedings of Advances in Neural Information Processing Systems (J. Platt, D. Koller, Y. Singer, and S. Roweis, eds.), vol. 20, 2008

  67. [67]

    Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

    J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” preprint arXiv:1409.2574, 2014

  68. [68]

    Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,

    V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Processing Magazine, vol. 38, no. 2, pp. 18–44, 2021

  69. [69]

    Convolutional neural networks analyzed via convolutional sparse coding,

    V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” Journal of Machine Learning Research, vol. 18, no. 83, pp. 1–52, 2017

  70. [70]

    Working locally thinking globally: Theoretical guarantees for convolutional sparse coding,

    V. Papyan, J. Sulam, and M. Elad, “Working locally thinking globally: Theoretical guarantees for convolutional sparse coding,” IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687–5701, 2017

  71. [71]

    Deeply-sparse signal representations (ds2p),

    D. Ba, “Deeply-sparse signal representations (ds2p),” IEEE Transactions on Signal Processing, vol. 68, pp. 4727–4742, 2020

  72. [72]

    Towards A Rigorous Science of Interpretable Machine Learning

    F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” preprint arXiv:1702.08608, 2017

    A Appendix - Acknowledgments

    A.S. conducted this work as a Dale and Suzanne Burger SURF Fellow through the Summer Undergraduate Research Fellowship (SURF) program at Caltech and gratefully acknowledges its funding. A.A. was supp...

  73. [73]

    From Proposition D.4, the architectural inference of an SAE-FNO is equivalent to SAE-CNN

  74. [74]

    From Proposition D.1, the architectural inference of an SAE-CNN is equivalent to L-SAE-CNN

  75. [75]

    From Proposition D.5, the architectural inference of an L-SAE-CNN is equivalent to L-SAE-FNO. By the transitive property of these equivalences, we can establish a direct architectural inference equivalence between SAE-FNO and L-SAE-FNO under the same lifting-projection conditions. ■