A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Dianbo Liu; Harshvardhan Saini; Jingyi Cui; Mengnan Du; Yiming Tang; Yisen Wang; Yizhen Liao; Zhaoqian Yao; Zheng Lin

arxiv: 2512.05534 · v6 · submitted 2025-12-05 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-17 01:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords sparse dictionary learning · mechanistic interpretability · piecewise biconvexity · spurious minima · feature absorption · dead neurons · identifiability · linear representation bench

The pith

All major sparse dictionary learning methods reduce to one piecewise biconvex optimization problem that explains their spurious solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper unifies sparse autoencoders, transcoders, and crosscoders under a single theoretical model by showing they all amount to piecewise biconvex optimization. This framing lets the authors map out the full set of global solutions, identify sources of non-identifiability, and locate the spurious local minima that produce polysemantic features. The same analysis supplies direct accounts for why feature absorption and dead neurons appear during training. To test these predictions with full ground-truth access, the authors release the Linear Representation Bench and introduce feature anchoring as a practical correction that restores identifiability.

Core claim

We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

What carries the argument

the piecewise biconvex optimization problem that unifies all major sparse dictionary learning variants and exposes their global solution set
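
A minimal sketch of why such a problem is called piecewise biconvex, under assumptions that are not taken from the paper's exact formulation: a ReLU encoder without biases, a squared reconstruction loss, and the sparsity penalty omitted. The activation mask m(s) is notation introduced here for illustration only.

```latex
% Assumed setting: ReLU encoder, no biases, squared reconstruction loss only.
% m(s) is an illustrative activation mask, not a symbol from the paper.
\[
m(s) = \mathbb{1}\!\left[ W_E\, x_p(s) > 0 \right], \qquad
x_q(s) = \operatorname{diag}\!\bigl(m(s)\bigr)\, W_E\, x_p(s)
\]
\[
\ell(W_E, W_D) = \bigl\| x_r(s) - W_D \operatorname{diag}\!\bigl(m(s)\bigr)\, W_E\, x_p(s) \bigr\|_2^2
\]
```

On any region of encoder weights where the mask m(s) does not change, the loss is a convex quadratic in W_D for fixed W_E, and a convex quadratic in W_E for fixed W_D. It is generally not jointly convex, because the two matrices enter through the product W_D diag(m(s)) W_E. Each region of constant mask is one "piece".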

If this is right

Feature absorption arises as a direct consequence of the non-identifiable global solution set.
Dead neurons correspond to particular spurious optima in the biconvex landscape.
The Linear Representation Bench provides ground-truth evaluation of recovery quality.
Feature anchoring restores identifiability and improves recovery on both synthetic and real data.
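
One way to make "recovery quality" concrete when ground-truth directions are known is a best-match cosine score. The sketch below is an assumption for illustration: the function name `recovery_score` and the choice of signed cosine similarity are hypothetical and are not the metric reported in the paper.

```python
import numpy as np

def recovery_score(W_true, W_learned):
    """Mean, over ground-truth features, of the best cosine match among learned features.

    W_true:    (d, n) array whose columns are ground-truth feature directions.
    W_learned: (d, m) array whose columns are learned feature directions.
    Hypothetical metric for illustration; not taken from the paper.
    """
    A = W_true / np.linalg.norm(W_true, axis=0, keepdims=True)
    B = W_learned / np.linalg.norm(W_learned, axis=0, keepdims=True)
    cos = A.T @ B                      # (n, m) pairwise cosine similarities
    return float(cos.max(axis=1).mean())

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 5))

# A permuted, positively rescaled copy of the ground truth should score 1.
perfect = recovery_score(W_true, 3.0 * W_true[:, ::-1])

# Unrelated random directions should score lower.
random_score = recovery_score(W_true, rng.normal(size=(8, 5)))
```

The score is invariant to permutation and positive rescaling of the learned features, which are the symmetries one would usually not want to penalize.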

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same biconvex structure may appear in other auxiliary models used for representation analysis.
Training schedules could be redesigned to steer away from the identified spurious basins.
The framework offers a route to compare SDL variants by their induced solution sets rather than by empirical performance alone.

Load-bearing premise

Real training dynamics and failure modes in neural networks are captured without major distortion by the piecewise biconvex formulation.

What would settle it

A trained sparse dictionary on a known linear representation where the observed features violate the predicted spurious minima or absorption patterns derived from the piecewise biconvex model.

Figures

Figures reproduced from arXiv: 2512.05534 by Dianbo Liu, Harshvardhan Saini, Jingyi Cui, Mengnan Du, Yiming Tang, Yisen Wang, Yizhen Liao, Zhaoqian Yao, Zheng Lin.

**Figure 1.** Sparse Autoencoder: encoder W_E maps x_p to sparse latents x_q, decoder W_D reconstructs from x_q.

Transcoders. Transcoders (Dunefsky et al., 2024; Paulo et al., 2025) capture interpretable features in layer-to-layer transformations. Unlike SAEs, transcoders approximate the input-output function of a target component, such as an MLP, using a sparse bottleneck. In our proposed theoretical framework, transcoders s…

**Figure 2.** Transcoder: encoder W_E maps x_p(s) to sparse latents x_q(s), decoder W_D gives x_r(s) as a prediction of the MLP's output.

Crosscoders. Crosscoders (Lindsey et al., 2024) discover shared features across multiple representation sources by jointly encoding and reconstructing concatenated representations. In our framework, crosscoders set x_p = [x_p^(1); …; x_p^(m)] and x_r = [x_r^(1); …; x_r^(m)] where…

**Figure 3.** Crosscoder: encoder W_E maps concatenated multi-layer input x_p to x_q, decoder W_D reconstructs multi-layer output x_r.

Variants of SDL Methods. Various SDL methods fit into our theoretical framework but differ in their choices of activation functions and loss designs. Bricken et al. (2023) and Templeton et al. (2024) use ReLU activation σ_ReLU(z) = max(0, z) with L1 regularization on latents: L = E_{s∼D} ∥x_r(s) − …
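
The three captions above describe the same encode-then-decode pattern with different choices of input x_p and target x_r. The sketch below illustrates that shared pattern with a ReLU encoder and an L1 penalty. It rests on assumptions: biases are omitted, the loss formula is truncated in this copy so the standard squared-error-plus-L1 form is assumed, and the names `sdl_forward`, `sdl_loss`, and `l1_coeff` are hypothetical.

```python
import numpy as np

def sdl_forward(W_E, W_D, x_p):
    """ReLU encoder followed by a linear decoder. x_p has shape (batch, n_p)."""
    x_q = np.maximum(0.0, x_p @ W_E.T)   # sparse latents, shape (batch, n_q)
    x_hat = x_q @ W_D.T                   # prediction of x_r, shape (batch, n_r)
    return x_q, x_hat

def sdl_loss(W_E, W_D, x_p, x_r, l1_coeff=1e-3):
    """Assumed loss: mean squared reconstruction error plus an L1 penalty on latents."""
    x_q, x_hat = sdl_forward(W_E, W_D, x_p)
    recon = np.mean(np.sum((x_r - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(x_q), axis=1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
n_p, n_q = 6, 16
W_E = 0.1 * rng.normal(size=(n_q, n_p))
W_D = 0.1 * rng.normal(size=(n_p, n_q))
x = rng.normal(size=(32, n_p))

# Sparse autoencoder: the target is the input itself.
sae_loss = sdl_loss(W_E, W_D, x_p=x, x_r=x)

# Transcoder: the target is the output of some component.
# np.tanh is only a stand-in for an MLP here.
mlp_out = np.tanh(x)
transcoder_loss = sdl_loss(W_E, W_D, x_p=x, x_r=mlp_out)
```

A crosscoder would fit the same two functions by passing concatenated multi-layer activations as `x_p` and `x_r`, with W_E and W_D sized accordingly.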

**Figure 4.** Zero reconstruction loss without recovering ground-truth features. We design the Linear Representation Bench, which enables full knowledge of the ground-truth features, to study SDL methods. We observe a concerning phenomenon: these methods can achieve zero loss without recovering any ground-truth features. Left: Four ground-truth feature directions. Middle: Learned encoder directions fail to align with …
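
A small numerical illustration of how zero reconstruction loss can coexist with unrecovered features. This is a hand-built construction, not the paper's experiment: it assumes a ReLU encoder with 2d latents, no biases, and it ignores the sparsity penalty entirely. The dimensions d = 3 and n = 4 and all variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4

# Ground-truth unit-norm feature directions (columns of W_p).
W_p = rng.normal(size=(d, n))
W_p /= np.linalg.norm(W_p, axis=0, keepdims=True)

# Data with one active ground-truth feature per sample.
batch = 200
codes = np.zeros((batch, n))
codes[np.arange(batch), rng.integers(0, n, size=batch)] = rng.uniform(0.5, 1.5, size=batch)
X = codes @ W_p.T                                  # shape (batch, d)

# Pick an arbitrary orthogonal basis Q, unrelated to W_p.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Encoder rows are +Q and -Q; the decoder undoes them.
# relu(z) - relu(-z) = z, so the reconstruction is exact for any Q.
W_E = np.vstack([Q, -Q])                           # shape (2d, d)
W_D = np.hstack([Q.T, -Q.T])                       # shape (d, 2d)

latents = np.maximum(0.0, X @ W_E.T)
X_hat = latents @ W_D.T
max_err = float(np.max(np.abs(X - X_hat)))

# Best cosine match of any encoder row to each ground-truth feature.
alignment = (W_E @ W_p).max(axis=0)                # rows of W_E are unit norm
```

Here `max_err` is zero up to floating-point error for any choice of Q, while `alignment` depends only on how Q happens to sit relative to W_p. The latents in this construction are not sparse, so it speaks to the reconstruction term alone.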

**Figure 5.** Feature absorption emerges from hierarchical concept structure. Left: Ideal SDL features without absorption. Right: hierarchical concept structure exists and only a proportion of the sub-concepts of "Dog" can activate the SDL feature.

where K ⊂ [n] is a randomly selected subset of size k. Subpopulation Mean Embeddings. For real-world datasets where ground-truth features are unknown, we identify semantic s…

**Figure 6.** Feature resampling accelerates convergence and improves final loss. Training curves on Llama 3.1 8B comparing standard SAE training (blue) with periodic dead neuron resampling after 5000 steps (in the plot, the x-axis is scaled by 100).

6. Conclusion. We develop the first unified theoretical framework for Sparse Dictionary Learning in mechanistic interpretability, demonstrating how diverse SDL methods instanti…
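
The caption mentions periodic dead neuron resampling but this copy does not describe the procedure. The sketch below shows one common heuristic, stated as an assumption rather than the authors' method: latents that never fire on a batch are re-initialized toward randomly chosen data points. The function names, the tied decoder re-initialization, and the toy data are all hypothetical, and the decoder update assumes an autoencoder where input and output dimensions match.

```python
import numpy as np

def find_dead_latents(W_E, X):
    """Indices of latents whose ReLU activation is zero on every row of X."""
    acts = np.maximum(0.0, X @ W_E.T)
    return np.where(acts.max(axis=0) <= 0.0)[0]

def resample_dead(W_E, W_D, X, dead, rng):
    """Point each dead latent at a randomly chosen, normalized data sample.

    Assumed heuristic for illustration; returns new copies of W_E and W_D.
    """
    W_E, W_D = W_E.copy(), W_D.copy()
    if len(dead) == 0:
        return W_E, W_D
    picks = X[rng.integers(0, len(X), size=len(dead))]
    picks = picks / np.linalg.norm(picks, axis=1, keepdims=True)
    W_E[dead] = picks          # encoder rows for the dead latents
    W_D[:, dead] = picks.T     # matching decoder columns (SAE case, n_r == n_p)
    return W_E, W_D

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(64, 5)))   # toy data with strictly positive entries

W_E = np.eye(4, 5)                     # latents 1..3 fire on positive data
W_E[0] = -1.0                          # latent 0 can never fire on this data
W_D = W_E.T.copy()

dead_before = find_dead_latents(W_E, X)
W_E_new, W_D_new = resample_dead(W_E, W_D, X, dead_before, rng)
dead_after = find_dead_latents(W_E_new, X)
```

In a training loop, a check like this would run every fixed number of steps; the interval is a tuning choice.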

**Figure 7.** Hierarchical taxonomy of Sparse Dictionary Learning research in Mechanistic Interpretability.

**Figure 8.** Visualization of the Linear Representation Bench. The figure illustrates a D = 3 dimensional representation space generated using N = 4 feature directions (W_p).

C.1. Ground-Truth Feature Matrix Generation. We construct the feature matrix W_p ∈ R^{n_p × n} to satisfy the unit-norm and bounded interference conditions (Assumption 2.4). Initialization. Initialize W_p with random Gaussian entries and normalize each col…
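
The initialization step quoted above can be sketched as follows. Only the Gaussian initialization and normalization come from the text; everything after that point is cut off in this copy, so the interference measurement and the k-sparse sampling routine below are assumptions added for illustration, with hypothetical function names and a hypothetical uniform coefficient distribution.

```python
import numpy as np

def make_feature_matrix(n_p, n, rng):
    """Random Gaussian feature matrix with unit-norm columns, shape (n_p, n)."""
    W = rng.normal(size=(n_p, n))
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def max_interference(W_p):
    """Largest absolute inner product between two distinct feature directions."""
    G = W_p.T @ W_p
    np.fill_diagonal(G, 0.0)
    return float(np.max(np.abs(G)))

def sample_representations(W_p, k, batch, rng):
    """Assumed generator: each sample activates a random subset K of k features."""
    n = W_p.shape[1]
    coeffs = np.zeros((batch, n))
    for i in range(batch):
        K = rng.choice(n, size=k, replace=False)
        coeffs[i, K] = rng.uniform(0.0, 1.0, size=k)
    return coeffs @ W_p.T, coeffs

rng = np.random.default_rng(0)
W_p = make_feature_matrix(n_p=3, n=4, rng=rng)      # D = 3, N = 4 as in the figure
interference = max_interference(W_p)
X, coeffs = sample_representations(W_p, k=2, batch=100, rng=rng)
```

With more features than dimensions, as here, the interference cannot be zero, which is why a bounded-interference condition is needed rather than exact orthogonality.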

**Figure 9.** Features learned with feature anchoring exhibit monosemanticity. Each row shows the top-activating images for a single feature from Matryoshka SAE trained on CLIP embeddings with feature anchoring.

Feature from Matryoshka SAE without Feature Anchoring: Various Boxes (dishwasher, file, chest). Feature from Matryoshka SAE without Feature Anchoring: Various Screens (slot, scoreboard, television). Feature from …

**Figure 10.** Features learned without feature anchoring exhibit polysemanticity. Each row shows the top-activating images for a single feature from Matryoshka SAE trained without feature anchoring.

Original abstract

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Unified piecewise biconvex framing for SDL methods explains some failure modes and adds anchoring, but real-network deviations remain the key uncertainty.

This paper's main contribution is casting the major sparse dictionary learning methods (sparse autoencoders, transcoders, and crosscoders) as a single piecewise biconvex optimization problem. From there it characterizes the global solution set, non-identifiability, and spurious optima, which it uses to explain feature absorption and dead neurons. The authors back this with a new Linear Representation Bench that allows full ground-truth testing, and they introduce feature anchoring as a technique to restore identifiability, with reported improvements on synthetic and real data.

What works well is the extension of theory to the broader family of methods. Prior work was mostly stuck on tied-weight autoencoders, so this unification is a genuine step if the math checks out. The bench is a practical tool for the field, and the anchoring idea is a direct application of the theory.

The soft spot is the connection to actual neural network training. The piecewise biconvex model is linear and simplified. Real setups involve non-linear activations in the source model and gradient flow through the full network, which can introduce higher-order terms or layer-specific effects not in the formulation. This could mean the characterized spurious minima do not exactly match what happens in practice, weakening the explanations for the observed issues. The paper would benefit from more discussion or experiments on how robust the results are to these deviations.

This is for mechanistic interpretability folks who use or develop dictionary learning methods and want theoretical insight into why they sometimes fail. A reader interested in formal analysis of these tools would find it worthwhile. It deserves serious peer review because it brings new formal structure to an empirically driven area. Even with the applicability questions, the core ideas are worth referee scrutiny. Recommendation: send it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript develops the first unified theoretical framework for sparse dictionary learning (SDL) methods in mechanistic interpretability, including sparse autoencoders, transcoders, and crosscoders. It casts all major variants as instances of a single piecewise biconvex optimization problem, characterizes the global solution set along with non-identifiability and spurious optima, and uses this analysis to explain feature absorption and dead neurons. The authors introduce the Linear Representation Bench to evaluate under full ground-truth access and propose feature anchoring as a technique to restore identifiability, with supporting experiments on synthetic benchmarks and real neural representations.

Significance. If the central claims hold, the work supplies the first formal grounding for the full family of SDL methods beyond tied-weight autoencoders and supplies principled accounts of two widely observed failure modes. The Linear Representation Bench and feature-anchoring method constitute concrete, testable contributions that could improve feature recovery; the paper also ships reproducible synthetic benchmarks and explicit optimization formulations that facilitate direct verification.

major comments (2)

[§3.2, Eq. (8)–(11)] The claim that every major SDL variant reduces exactly to the stated piecewise biconvex objective is load-bearing for the global-solution and spurious-minima characterizations, yet the derivation does not explicitly incorporate the non-linear activations or layer-wise gradient flow present in the source models; any higher-order terms would shift the location of the identified spurious minima.
[Theorem 5.1 and §6.3] The explanation that dead neurons arise from the spurious minima of the piecewise biconvex problem assumes the training dynamics on real representations match the linear formulation exactly; the Linear Representation Bench experiments use synthetic linear data, so it remains open whether the same minima dominate under the non-linear statistics of actual network activations.

minor comments (2)

[§2] Notation for the different SDL variants (e.g., the precise definition of the crosscoder objective) is introduced in §2 without a side-by-side comparison to the original papers, which would aid readers.
[Figure 3] Figure 3 caption does not state the precise hyper-parameter settings used for the feature-anchoring ablation, making the quantitative gains harder to reproduce from the figure alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Referee: [§3.2, Eq. (8)–(11)] The claim that every major SDL variant reduces exactly to the stated piecewise biconvex objective is load-bearing for the global-solution and spurious-minima characterizations, yet the derivation does not explicitly incorporate the non-linear activations or layer-wise gradient flow present in the source models; any higher-order terms would shift the location of the identified spurious minima.

Authors: We agree that the derivation in §3.2 focuses on the linear components of the SDL objectives as typically formulated in the literature. The piecewise biconvexity arises from the structure of the loss with respect to the dictionary and coefficients, even when non-linear activations like ReLU are present in the encoder, because the non-linearity is applied after the linear transformation. However, we acknowledge that layer-wise gradient flow in deeper models could introduce additional complexities not captured here. In the revised manuscript, we will explicitly state the assumptions under which the equivalence holds and discuss the potential impact of higher-order terms as a limitation, without claiming exact reduction for all possible non-linear extensions. This preserves the validity of our characterizations for the standard linear and piecewise-linear cases central to the paper.

revision: partial

Referee: [Theorem 5.1 and §6.3] The explanation that dead neurons arise from the spurious minima of the piecewise biconvex problem assumes the training dynamics on real representations match the linear formulation exactly; the Linear Representation Bench experiments use synthetic linear data, so it remains open whether the same minima dominate under the non-linear statistics of actual network activations.

Authors: The referee correctly notes that the Linear Representation Bench is designed with synthetic linear data to enable full ground-truth access and isolate the effects predicted by the theory. We also include experiments on real neural representations from language models in §6.4, where feature anchoring improves recovery and reduces dead neurons, consistent with the theory's predictions. That said, we concede that a full verification under non-linear activation statistics would require additional benchmarks, which is beyond the current scope. In the revision, we will add a paragraph in §6.3 clarifying that while the linear model provides a principled explanation and matches observations in practice, direct confirmation on non-linear data remains an open question for future investigation.

revision: partial

Circularity Check

0 steps flagged

No circularity: piecewise biconvex formulation is independently derived from SDL objectives

The paper defines the unified piecewise biconvex optimization problem directly from the standard SDL loss functions (sparse autoencoders, transcoders, crosscoders) without reducing any claimed prediction or global solution set to a fitted parameter or prior self-citation. The characterization of non-identifiability, spurious minima, feature absorption, and dead neurons follows from analyzing the mathematical structure of this biconvex problem under the stated assumptions, which are external to the target phenomena. No load-bearing step relies on renaming a known result or importing a uniqueness theorem from overlapping authors; the Linear Representation Bench and feature anchoring are presented as downstream applications rather than definitional inputs. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the modeling assumption that all major SDL variants share a common piecewise biconvex structure whose global properties can be characterized analytically; no free parameters or new entities are mentioned in the abstract.

axioms (1)

domain assumption: All major sparse dictionary learning variants can be expressed as instances of a single piecewise biconvex optimization problem.
This modeling step is the foundation of the unified framework and the subsequent characterization of solution sets and spurious minima.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark per...

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

pub/2023/monosemantic-features

URL https://transformer-circuits.pub/2023/monosemantic-features. Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders, 2024. URL https://arxiv.org/abs/2412.06410. Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning multi-level features with matryoshka sparse autoencoders, 2025. URL https://arxiv.org/abs/2503.17547. Chani...

work page doi:10.1561/2200000058 2023
[2]

The Mythos of Model Interpretability

URL https://transformer-circuits.pub/2024/crosscoders/index.html. Lipton, Z. C. The mythos of model interpretability, 2017. URL https://arxiv.org/abs/1606.03490. Lundberg, S. and Lee, S.-I. A unified approach to interpreting model predictions, 2017. URL https://arxiv.org/abs/1705.07874. Luo, Y., An, R., Zou, B., Tang, Y., Liu, J., and Zhang, S. Llm a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.23915/distill.00024.001 2024
[3]

arXiv preprint arXiv:2506.19823 , year =

URL https://transformer-circuits.pub/2024/scaling-monosemanticity/. Visweswaran, V. and Floudas, C. A global optimization algorithm (gop) for certain classes of nonconvex nlps—ii. application of theory and test problems. Computers & chemical engineering, 14(12):1419–1434, 1990. Wang, M., la Tour, T. D., Watkins, O., Makelov, A., Chi, R. A., Miserendino,...

work page arXiv 2024
[4]

African Grey

established foundational methods for learning overcomplete dictionaries, while theoretical work in compressed sensing (Donoho, 2006) characterized recovery conditions, with Spielman et al. (2012) providing polynomial-time algorithms for exact reconstruction under sparsity assumptions. Safran & Shamir (2018) demonstrated that spurious local minima are comm...

work page 2006
