Structural Instability of Feature Composition
Pith reviewed 2026-05-10 06:36 UTC · model grok-4.3
The pith
Feature unions in sparse autoencoders collapse beyond a threshold set by the statistical dimension of the signal cone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work models the activation space as a high-dimensional sparse cone manifold and derives an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width of the signal cone. In the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. Experiments on structured semantic features from CLEVR confirm that hierarchical correlations accelerate the transition to collapse relative to random baselines.
What carries the argument
The sparse cone manifold representation of activation space, whose Gaussian mean width (statistical dimension) determines the asymptotic compositional-collapse threshold in the spherical dictionary model.
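For readers outside convex geometry, the standard definitions behind the characterizing quantity (following the convex-geometry literature the abstract invokes; the paper's own threshold formula is not reproduced here) are, for a convex signal cone $C \subseteq \mathbb{R}^n$ and $g \sim \mathcal{N}(0, I_n)$:

```latex
% Gaussian mean width of the signal cone C:
w(C) = \mathbb{E}_g \Big[ \sup_{x \in C \cap S^{n-1}} \langle g, x \rangle \Big],
% statistical dimension, via the Euclidean projection \Pi_C onto C:
\delta(C) = \mathbb{E}_g \big[ \lVert \Pi_C(g) \rVert_2^2 \big],
\qquad w(C)^2 \le \delta(C) \le w(C)^2 + 1 .
```

The sandwich inequality is why the two terms are used interchangeably in asymptotic statements: up to an additive constant, the squared mean width and the statistical dimension coincide.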
If this is right
- Union-based steering remains stable only up to the statistical dimension of the signal cone before interference dominates.
- ReLU introduces accumulating one-way drift from small correlations, so the instability grows with each additional composed feature.
- Hierarchical correlations in real data lower the effective collapse threshold compared with independent random features.
- Composition mechanisms must explicitly manage interference rather than relying on naive linear superposition.
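A minimal numerical sketch of the claimed ratchet (not the paper's derivation; the fluctuation scale 0.05 and the equicorrelated-Gaussian model of interference are illustrative assumptions): zero-mean interference fluctuations, once rectified at the ReLU threshold, acquire a strictly positive mean that grows with the number of composed features, and grows faster when the fluctuations are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

def mean_drift(k, rho):
    # k interference terms, pairwise correlation rho, zero mean, unit variance,
    # built as equicorrelated Gaussians via a shared latent factor.
    shared = rng.standard_normal(n_samples)
    private = rng.standard_normal((n_samples, k))
    eps = np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * private
    pre = eps.sum(axis=1) * 0.05  # microscopic fluctuations (illustrative scale)
    # High-bias regime proxy: the pre-activation sits at the ReLU threshold,
    # so zero-mean fluctuations rectify into a positive mean (the "drift").
    return np.maximum(pre, 0.0).mean()

for k in (1, 4, 16):
    # drift grows with k, and faster when rho > 0 (correlated interference)
    print(k, mean_drift(k, 0.0), mean_drift(k, 0.3))
```

Under these assumptions the drift scales with the standard deviation of the summed interference, so independent terms give roughly square-root growth while correlated terms give faster, near-linear growth: the one-way accumulation the bullet above describes.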
Where Pith is reading between the lines
- Current SAE steering may therefore be limited to small numbers of simultaneous edits even when individual features are clean.
- The ratchet mechanism could explain why certain feature combinations fail in practice despite working alone.
- Testing the same cone-width threshold on non-ReLU activations would isolate whether the drift is activation-specific.
- Adding explicit interference cancellation during composition could extend the usable range of union steering.
Load-bearing premise
The activation space can be accurately modeled as a high-dimensional sparse cone manifold and the dictionary follows a spherical model.
What would settle it
Measure the number of features at which union steering collapses in an SAE and check whether that count scales with the Gaussian mean width of the corresponding signal cone as predicted.
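The measurement half of that test can be sketched numerically. A hedged simplification: the code below Monte Carlo-estimates the statistical dimension of the linear span of k random unit atoms, a tractable upper-bound proxy for the nonnegative signal cone (whose exact projection would need a nonnegative least-squares solve); the prediction would then be checked by comparing this estimate against the feature count at which union steering collapses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 512, 32, 200

# k random dictionary atoms; their span is a proxy (upper bound)
# for the signal cone they generate.
D = rng.standard_normal((n, k))
Q, _ = np.linalg.qr(D)                # orthonormal basis of span(D)

g = rng.standard_normal((trials, n))  # Gaussian probe vectors
proj_sq = ((g @ Q) ** 2).sum(axis=1)  # ||Pi_span(g)||^2 per probe

delta_hat = proj_sq.mean()            # Monte Carlo statistical dimension
print(round(delta_hat, 1))            # concentrates near k = 32
```

For a k-dimensional subspace the statistical dimension is exactly k, so the estimate doubles as a sanity check; the true cone's value lies below it, and the gap is where the correlation structure of real SAE atoms would enter.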
Original abstract
Sparse Autoencoders (SAEs) have emerged as a powerful paradigm for disentangling feature superposition in transformer-based architectures, enabling precise control via activation steering. However, the theoretical foundations of compositional steering -- the simultaneous activation of distinct semantic latents -- remain under-explored. The prevailing Linear Representation Hypothesis often abstracts away non-linear interference effects that arise in overcomplete dictionaries. We present a geometric framework for analyzing the instability of feature unions. Modeling the activation space as a high-dimensional sparse cone manifold, we derive an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width (statistical dimension) of the signal cone. We further show that, in the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. We validate the predicted scaling trends on structured semantic features extracted from CLEVR, where hierarchical correlations accelerate the transition relative to random baselines. Together, our results highlight geometric constraints on the scalability of union-based steering and motivate composition mechanisms that explicitly manage interference beyond naive linear superposition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a geometric framework for feature composition instability in Sparse Autoencoders. It models activation space as a high-dimensional sparse cone manifold, derives an asymptotic compositional-collapse threshold under a spherical dictionary model characterized by the Gaussian mean width (statistical dimension) of the signal cone, and shows that ReLU rectification in the high-bias regime produces a ratchet effect by converting microscopic variance fluctuations into accumulating systematic drift. Predictions are tested on structured semantic features from the CLEVR dataset, where hierarchical correlations are reported to accelerate the transition relative to random baselines.
Significance. If the derivations prove rigorous and the sparse-cone plus spherical-dictionary assumptions are shown to hold for real SAE latents, the work would supply a useful theoretical constraint on the scalability of union-based steering, connecting high-dimensional geometry to practical limits of linear superposition. The explicit use of Gaussian mean width as the characterizing quantity is a positive link to existing statistical-dimension literature. The CLEVR validation, while limited, at least demonstrates an empirical effect of hierarchical structure.
major comments (3)
- [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.
- [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently, the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.
- [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.
minor comments (2)
- [Abstract] The terms 'compositional-collapse threshold' and 'ratchet effect' are introduced without even a one-sentence gloss, reducing accessibility for readers outside the immediate sub-area.
- [Introduction] The manuscript would benefit from a brief comparison table or paragraph situating the Gaussian-mean-width approach against prior uses of statistical dimension in sparse recovery and dictionary learning.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies important gaps in the presentation of our theoretical results and empirical validation. We address each major comment below and will make substantial revisions to the manuscript to provide the requested derivations, equations, and clarifications.
Point-by-point responses
- Referee: [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.
  Authors: We agree that the current manuscript does not include an explicit formula for the compositional-collapse threshold, a proof sketch, or a clear definition within the abstract or main text. In the revised version, we will insert a new subsection in the theoretical framework that states the threshold explicitly as a function of the Gaussian mean width (statistical dimension) of the signal cone under the spherical dictionary model. We will also provide a concise proof outline deriving the asymptotic threshold from the high-dimensional geometry of the sparse cone manifold, demonstrating that it follows parameter-free from the model assumptions rather than from data fitting.
  revision: yes
- Referee: [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently, the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.
  Authors: The CLEVR experiments were intended to illustrate the qualitative effect of hierarchical correlations on accelerating the transition relative to random baselines, consistent with the predicted scaling trends. We acknowledge that they do not constitute a direct quantitative test of the specific mean-width formula or an independent confirmation that real SAE latents satisfy the sparse-cone manifold and spherical-dictionary assumptions at the relevant scales. In revision, we will add an explicit limitations discussion and include supplementary analysis estimating the mean width from the CLEVR feature statistics to compare against observed collapse points. We will also clarify the scope of the validation as supporting the directional predictions rather than fully validating the geometric model for arbitrary SAE latents.
  revision: partial
- Referee: [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.
  Authors: We accept that the manuscript presents the ReLU ratchet-effect claim without the supporting equations or intermediate derivation steps. In the revised manuscript, we will expand the relevant section with a self-contained derivation that begins from the high-bias regime of the ReLU activation, shows how microscopic variance fluctuations induced by feature correlations are rectified into unidirectional drift, and demonstrates the accumulation under repeated composition. This derivation will be presented prior to and independently of the threshold result to eliminate any suggestion of circularity, with all intermediate equations included.
  revision: yes
Circularity Check
No significant circularity; the derivation applies standard geometric tools to the stated model assumptions.
full rationale
The paper explicitly adopts a high-dimensional sparse cone manifold model for activation space and a spherical dictionary model, then derives the compositional-collapse threshold as characterized by the Gaussian mean width (a pre-existing concept from convex geometry) of the signal cone. The ReLU ratchet effect is likewise obtained by analyzing how rectification maps microscopic fluctuations to accumulated drift under the same high-bias regime and geometry. No equation or step reduces the claimed threshold or ratchet to a fitted parameter, self-citation, or redefinition of the inputs; the CLEVR experiments are presented as separate empirical checks on scaling trends rather than as part of the derivation. The chain is therefore grounded in external results from high-dimensional geometry and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Activation space is modeled as a high-dimensional sparse cone manifold.
- domain assumption: The dictionary follows a spherical model.
invented entities (2)
- compositional-collapse threshold (no independent evidence)
- ratchet effect from ReLU rectification (no independent evidence)
Reference graph
Works this paper leans on
- [1] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Dennison, C., et al. Scaling Monosemanticity: Extracting Interpretable Features from. 2024.
- [2] Scaling and Evaluating Sparse Autoencoders. International Conference on Learning Representations (ICLR).
- [3] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.
- [4] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. arXiv:2503.09532.
- [5] Sparse Autoencoders Do Not Find Canonical Units of Analysis. International Conference on Learning Representations (ICLR).
- [6] The Linear Representation Hypothesis and the Geometry of Large Language Models. International Conference on Machine Learning (ICML).
- [7] Not All Language Model Features Are One-Dimensionally Linear. International Conference on Learning Representations (ICLR).
- [8] Birth of a Transformer: A Memory Viewpoint. Advances in Neural Information Processing Systems (NeurIPS).
- [9] Toy Models of Superposition. Transformer Circuits Thread.
- [10] Steering Llama 2 via Contrastive Activation Addition. Annual Meeting of the Association for Computational Linguistics (ACL).
- [11] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
- [12] Steering Without Side Effects: Improving Post-Deployment Control of Language Models. arXiv:2406.15518.
- [13] Polysemanticity and Capacity in Neural Networks. arXiv:2210.01892.
- [14] Finding Neurons in a Haystack: Case Studies with Sparse Probing. Transactions on Machine Learning Research.
- [15] Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2009.
- [16] Amelunxen, D., Lotz, M., McCoy, M. B., Tropp, J. A. Information and Inference: A Journal of the IMA. 2014. doi:10.1093/imaiai/iau005.
- [17] High-Dimensional Probability: An Introduction with Applications in Data Science. 2018.
- [18] Gordon, Y. On Milman's inequality and random subspaces which escape through a mesh in R^n. Geometric Aspects of Functional Analysis. 1988.
- [19] Analyzing the Generalization and Reliability of Steering Vectors. arXiv:2407.12404.
- [20] From Steering Vectors to Conceptors: Compositional Affine Activation Steering for LLMs. NeurIPS 2025 (submission).
- [21] The Remarkable Robustness of LLMs: Stages of Inference? arXiv:2406.19384.
- [22] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning. arXiv:2601.09667.