Structural Instability of Feature Composition
Pith reviewed 2026-05-10 06:36 UTC · model grok-4.3
The pith
Feature unions in sparse autoencoders collapse beyond a threshold set by the statistical dimension of the signal cone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work models the activation space as a high-dimensional sparse cone manifold and derives an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width of the signal cone. In the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. Experiments on structured semantic features from CLEVR confirm that hierarchical correlations accelerate the transition to collapse relative to random baselines.
What carries the argument
The sparse cone manifold representation of activation space, whose Gaussian mean width (statistical dimension) determines the asymptotic compositional-collapse threshold in the spherical dictionary model.
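For readers outside convex geometry, the standard definitions behind the characterizing quantity (following the convex-geometry literature the abstract invokes; the paper's own threshold formula is not reproduced here) are, for a convex signal cone $C \subseteq \mathbb{R}^n$ and $g \sim \mathcal{N}(0, I_n)$:

```latex
% Gaussian mean width of the signal cone C:
w(C) = \mathbb{E}_g \Big[ \sup_{x \in C \cap S^{n-1}} \langle g, x \rangle \Big],
% statistical dimension, via the Euclidean projection \Pi_C onto C:
\delta(C) = \mathbb{E}_g \big[ \lVert \Pi_C(g) \rVert_2^2 \big],
\qquad w(C)^2 \le \delta(C) \le w(C)^2 + 1 .
```

The sandwich inequality is why the two terms are used interchangeably in asymptotic statements: up to an additive constant, the squared mean width and the statistical dimension coincide.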
If this is right
- Union-based steering remains stable only up to the statistical dimension of the signal cone before interference dominates.
- ReLU introduces accumulating one-way drift from small correlations, so the instability grows with each additional composed feature.
- Hierarchical correlations in real data lower the effective collapse threshold compared with independent random features.
- Composition mechanisms must explicitly manage interference rather than relying on naive linear superposition.
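A minimal numerical sketch of the claimed ratchet (not the paper's derivation; the fluctuation scale 0.05 and the equicorrelated-Gaussian model of interference are illustrative assumptions): zero-mean interference fluctuations, once rectified at the ReLU threshold, acquire a strictly positive mean that grows with the number of composed features, and grows faster when the fluctuations are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

def mean_drift(k, rho):
    # k interference terms, pairwise correlation rho, zero mean, unit variance,
    # built as equicorrelated Gaussians via a shared latent factor.
    shared = rng.standard_normal(n_samples)
    private = rng.standard_normal((n_samples, k))
    eps = np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * private
    pre = eps.sum(axis=1) * 0.05  # microscopic fluctuations (illustrative scale)
    # High-bias regime proxy: the pre-activation sits at the ReLU threshold,
    # so zero-mean fluctuations rectify into a positive mean (the "drift").
    return np.maximum(pre, 0.0).mean()

for k in (1, 4, 16):
    # drift grows with k, and faster when rho > 0 (correlated interference)
    print(k, mean_drift(k, 0.0), mean_drift(k, 0.3))
```

Under these assumptions the drift scales with the standard deviation of the summed interference, so independent terms give roughly square-root growth while correlated terms give faster, near-linear growth: the one-way accumulation the bullet above describes.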
Where Pith is reading between the lines
- Current SAE steering may therefore be limited to small numbers of simultaneous edits even when individual features are clean.
- The ratchet mechanism could explain why certain feature combinations fail in practice despite working alone.
- Testing the same cone-width threshold on non-ReLU activations would isolate whether the drift is activation-specific.
- Adding explicit interference cancellation during composition could extend the usable range of union steering.
Load-bearing premise
The activation space can be accurately modeled as a high-dimensional sparse cone manifold and the dictionary follows a spherical model.
What would settle it
Measure the number of features at which union steering collapses in an SAE and check whether that count scales with the Gaussian mean width of the corresponding signal cone as predicted.
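The measurement half of that test can be sketched numerically. A hedged simplification: the code below Monte Carlo-estimates the statistical dimension of the linear span of k random unit atoms, a tractable upper-bound proxy for the nonnegative signal cone (whose exact projection would need a nonnegative least-squares solve); the prediction would then be checked by comparing this estimate against the feature count at which union steering collapses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 512, 32, 200

# k random dictionary atoms; their span is a proxy (upper bound)
# for the signal cone they generate.
D = rng.standard_normal((n, k))
Q, _ = np.linalg.qr(D)                # orthonormal basis of span(D)

g = rng.standard_normal((trials, n))  # Gaussian probe vectors
proj_sq = ((g @ Q) ** 2).sum(axis=1)  # ||Pi_span(g)||^2 per probe

delta_hat = proj_sq.mean()            # Monte Carlo statistical dimension
print(round(delta_hat, 1))            # concentrates near k = 32
```

For a k-dimensional subspace the statistical dimension is exactly k, so the estimate doubles as a sanity check; the true cone's value lies below it, and the gap is where the correlation structure of real SAE atoms would enter.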
Original abstract
Sparse Autoencoders (SAEs) have emerged as a powerful paradigm for disentangling feature superposition in transformer-based architectures, enabling precise control via activation steering. However, the theoretical foundations of compositional steering -- the simultaneous activation of distinct semantic latents -- remain under-explored. The prevailing Linear Representation Hypothesis often abstracts away non-linear interference effects that arise in overcomplete dictionaries. We present a geometric framework for analyzing the instability of feature unions. Modeling the activation space as a high-dimensional sparse cone manifold, we derive an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width (statistical dimension) of the signal cone. We further show that, in the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. We validate the predicted scaling trends on structured semantic features extracted from CLEVR, where hierarchical correlations accelerate the transition relative to random baselines. Together, our results highlight geometric constraints on the scalability of union-based steering and motivate composition mechanisms that explicitly manage interference beyond naive linear superposition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a geometric framework for feature composition instability in Sparse Autoencoders. It models activation space as a high-dimensional sparse cone manifold, derives an asymptotic compositional-collapse threshold under a spherical dictionary model characterized by the Gaussian mean width (statistical dimension) of the signal cone, and shows that ReLU rectification in the high-bias regime produces a ratchet effect by converting microscopic variance fluctuations into accumulating systematic drift. Predictions are tested on structured semantic features from the CLEVR dataset, where hierarchical correlations are reported to accelerate the transition relative to random baselines.
Significance. If the derivations prove rigorous and the sparse-cone plus spherical-dictionary assumptions are shown to hold for real SAE latents, the work would supply a useful theoretical constraint on the scalability of union-based steering, connecting high-dimensional geometry to practical limits of linear superposition. The explicit use of Gaussian mean width as the characterizing quantity is a positive link to existing statistical-dimension literature. The CLEVR validation, while limited, at least demonstrates an empirical effect of hierarchical structure.
major comments (3)
- [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.
- [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently, the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.
- [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.
minor comments (2)
- [Abstract] The terms 'compositional-collapse threshold' and 'ratchet effect' are introduced without even a one-sentence gloss, reducing accessibility for readers outside the immediate sub-area.
- [Introduction] The manuscript would benefit from a brief comparison table or paragraph situating the Gaussian-mean-width approach against prior uses of statistical dimension in sparse recovery and dictionary learning.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies important gaps in the presentation of our theoretical results and empirical validation. We address each major comment below and will make substantial revisions to the manuscript to provide the requested derivations, equations, and clarifications.
Point-by-point responses
- Referee: [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.
  Authors: We agree that the current manuscript does not include an explicit formula for the compositional-collapse threshold, a proof sketch, or a clear definition within the abstract or main text. In the revised version, we will insert a new subsection in the theoretical framework that states the threshold explicitly as a function of the Gaussian mean width (statistical dimension) of the signal cone under the spherical dictionary model. We will also provide a concise proof outline deriving the asymptotic threshold from the high-dimensional geometry of the sparse cone manifold, demonstrating that it follows parameter-free from the model assumptions rather than from data fitting.
  revision: yes
- Referee: [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently, the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.
  Authors: The CLEVR experiments were intended to illustrate the qualitative effect of hierarchical correlations on accelerating the transition relative to random baselines, consistent with the predicted scaling trends. We acknowledge that they do not constitute a direct quantitative test of the specific mean-width formula or an independent confirmation that real SAE latents satisfy the sparse-cone manifold and spherical-dictionary assumptions at the relevant scales. In revision, we will add an explicit limitations discussion and include supplementary analysis estimating the mean width from the CLEVR feature statistics to compare against observed collapse points. We will also clarify the scope of the validation as supporting the directional predictions rather than fully validating the geometric model for arbitrary SAE latents.
  revision: partial
- Referee: [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.
  Authors: We accept that the manuscript presents the ReLU ratchet-effect claim without the supporting equations or intermediate derivation steps. In the revised manuscript, we will expand the relevant section with a self-contained derivation that begins from the high-bias regime of the ReLU activation, shows how microscopic variance fluctuations induced by feature correlations are rectified into unidirectional drift, and demonstrates the accumulation under repeated composition. This derivation will be presented prior to and independently of the threshold result to eliminate any suggestion of circularity, with all intermediate equations included.
  revision: yes
Circularity Check
No significant circularity; the derivation applies standard geometric tools to the stated model assumptions.
full rationale
The paper explicitly adopts a high-dimensional sparse cone manifold model for activation space and a spherical dictionary model, then derives the compositional-collapse threshold as characterized by the Gaussian mean width (a pre-existing concept from convex geometry) of the signal cone. The ReLU ratchet effect is likewise obtained by analyzing how rectification maps microscopic fluctuations to accumulated drift under the same high-bias regime and geometry. No equation or step reduces the claimed threshold or ratchet to a fitted parameter, self-citation, or redefinition of the inputs; the CLEVR experiments are presented as separate empirical checks on scaling trends rather than as part of the derivation. The chain is therefore grounded in external results from high-dimensional geometry and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Activation space is modeled as a high-dimensional sparse cone manifold.
- domain assumption: The dictionary follows a spherical model.
invented entities (2)
- compositional-collapse threshold (no independent evidence)
- ratchet effect from ReLU rectification (no independent evidence)
Reference graph
Works this paper leans on
- [1] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Dennison, C., et al. Scaling Monosemanticity: Extracting Interpretable Features from. 2024.
- [2] Scaling and Evaluating Sparse Autoencoders. International Conference on Learning Representations (ICLR).
- [3] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.
- [4] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. arXiv:2503.09532.
- [5] Sparse Autoencoders Do Not Find Canonical Units of Analysis. International Conference on Learning Representations (ICLR).
- [6] The Linear Representation Hypothesis and the Geometry of Large Language Models. International Conference on Machine Learning (ICML).
- [7] Not All Language Model Features Are One-Dimensionally Linear. International Conference on Learning Representations (ICLR).
- [8] Birth of a Transformer: A Memory Viewpoint. Advances in Neural Information Processing Systems (NeurIPS).
- [9] Toy Models of Superposition. Transformer Circuits Thread.
- [10] Steering Llama 2 via Contrastive Activation Addition. Annual Meeting of the Association for Computational Linguistics (ACL).
- [11] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
- [12] Steering Without Side Effects: Improving Post-Deployment Control of Language Models. arXiv:2406.15518.
- [13] Polysemanticity and Capacity in Neural Networks. arXiv:2210.01892.
- [14] Finding Neurons in a Haystack: Case Studies with Sparse Probing. Transactions on Machine Learning Research.
- [15] Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2009.
- [16] Amelunxen, D., Lotz, M., McCoy, M. B., Tropp, J. A. Information and Inference: A Journal of the IMA. 2014. doi:10.1093/imaiai/iau005.
- [17] High-Dimensional Probability: An Introduction with Applications in Data Science. 2018.
- [18] Gordon, Y. On Milman's inequality and random subspaces which escape through a mesh in R^n. Geometric Aspects of Functional Analysis. 1988.
- [19] Analyzing the Generalization and Reliability of Steering Vectors. arXiv:2407.12404.
- [20] From Steering Vectors to Conceptors: Compositional Affine Activation Steering for LLMs. NeurIPS 2025 (submission).
- [21] The Remarkable Robustness of LLMs: Stages of Inference? arXiv:2406.19384.
- [22] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning. arXiv:2601.09667.