Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
Pith reviewed 2026-05-22 11:09 UTC · model grok-4.3
The pith
Activation subspace bottlenecks in state-space models can be steered at test time by simple scalar multiplication to raise performance by an average of 8.27 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mechanistic interpretability tools reveal activation subspace bottlenecks in the Mamba family of state-space models. A test-time intervention that multiplies activations in these subspaces by a scalar improves performance by an average of 8.27 percent across seven SSMs and six diverse benchmarks with no task-specific tuning. Replacing the bottlenecks in the model definition yields Stable-Mamba, an architecture that delivers long-context gains when trained from scratch, confirming that the original subspaces were limiting performance.
What carries the argument
Activation subspace bottlenecks identified via mechanistic interpretability, steered by scalar multiplication of their activations at test time.
If this is right
- Performance gains appear across multiple models and tasks using one fixed intervention.
- The subspaces are causally linked to the observed performance limits in these models.
- Architectural changes based on the bottlenecks can produce better long-context behavior.
- Improvements require neither retraining nor task-specific tuning.
Where Pith is reading between the lines
- The same identification-and-scaling approach may apply to other sequence architectures beyond state-space models.
- Interpretability methods can serve as a direct route to both diagnosing and correcting model weaknesses.
- The technique points toward more reliable long-context modeling without quadratic attention costs.
Load-bearing premise
The subspaces located by interpretability tools are causally responsible for performance limits and a simple scalar multiplication on those directions produces reliable improvement without offsetting harms.
What would settle it
If scaling the identified subspaces at test time produces no performance gain on the benchmarks or if Stable-Mamba shows no long-context improvement after retraining from scratch, the central claim would be falsified.
Original abstract
State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
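The intervention described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the bottleneck is supplied as a set of basis directions in activation space, and the names `steer_subspace`, `basis`, and `scale` are hypothetical. The value 5.0 mirrors the "steering factor of 5" quoted elsewhere on this page, and the random arrays stand in for real activations.

```python
import numpy as np


def steer_subspace(activations, basis, scale):
    """Scale the component of each activation that lies in span(basis).

    activations: (n_tokens, d_model) array of hidden activations.
    basis: (d_model, k) array whose columns span the assumed bottleneck subspace.
    scale: scalar multiplier applied inside the subspace only.
    """
    # Orthonormalise the basis so that q @ q.T is an orthogonal projector.
    q, _ = np.linalg.qr(basis)
    inside = activations @ q @ q.T  # component inside the subspace
    # The orthogonal complement is left untouched; only `inside` is rescaled.
    return activations + (scale - 1.0) * inside


# Placeholder data: 4 token activations of width 16, a 2-dimensional subspace.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
basis = rng.normal(size=(16, 2))
steered = steer_subspace(acts, basis, scale=5.0)
```

If the bottleneck is a set of individual channels, `basis` reduces to the matching columns of the identity matrix and the operation becomes an elementwise multiplication of those channels.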
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies activation subspace bottlenecks in the Mamba family of state-space models (SSMs) via mechanistic interpretability tools. It proposes a test-time steering intervention that multiplies activations along these subspaces by a scalar, reporting an average 8.27% performance improvement across 7 SSMs and 6 benchmarks without task-specific tuning. It further validates the bottlenecks by modifying them to define Stable-Mamba and shows long-context gains after retraining from scratch.
Significance. If the subspaces are shown to be causally responsible for performance limits rather than correlated with generic scaling, and if the gains prove robust, the work would meaningfully advance mechanistic interpretability and steerability for efficient SSM architectures. The inclusion of a retraining validation experiment is a constructive step toward establishing causality beyond test-time interventions.
major comments (3)
- [Abstract / Results] Abstract and experimental results: the central claim of an 8.27% average improvement lacks reported error bars, per-benchmark breakdowns, or statistical tests. Without these, it is impossible to determine whether the gain survives multiple-testing correction or is driven by a subset of benchmarks.
- [Experimental validation] Experimental validation section: the test-time scalar intervention and Stable-Mamba retraining do not include control experiments that apply identical scaling to random or orthogonal subspaces of matching dimension. This leaves open the possibility that gains arise from generic activation rescaling (e.g., implicit regularization) rather than the specific mechanistically identified directions.
- [Methods] Methods on subspace identification: details are needed on how subspaces were selected (e.g., whether selection involved any data-driven fitting that overlaps with the reported performance metrics) to assess independence from the evaluation protocol.
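The control asked for in the second major comment could be set up along the following lines. The sketch only constructs the two kinds of control subspace (random, and orthogonal to the identified one) at matching dimension; it does not reproduce any experiment from the paper. All function names are hypothetical, and `target` is a random placeholder for an identified bottleneck basis.

```python
import numpy as np


def orthonormal(basis):
    """Return an orthonormal basis for the column space of `basis`."""
    q, _ = np.linalg.qr(basis)
    return q


def random_subspace(d_model, k, rng):
    """Random k-dimensional subspace of a d_model-dimensional space."""
    return orthonormal(rng.normal(size=(d_model, k)))


def orthogonal_subspace(target, k, rng):
    """Random k-dimensional subspace orthogonal to span(target).

    Assumes k is at most d_model minus the rank of `target`.
    """
    q = orthonormal(target)
    draws = rng.normal(size=(target.shape[0], k))
    # Remove any component inside the target subspace, then orthonormalise.
    draws = draws - q @ (q.T @ draws)
    return orthonormal(draws)


rng = np.random.default_rng(0)
target = orthonormal(rng.normal(size=(32, 4)))  # placeholder bottleneck basis
control_random = random_subspace(32, 4, rng)
control_orth = orthogonal_subspace(target, 4, rng)
```

Applying the same scalar to `target`, `control_random`, and `control_orth` and comparing the resulting benchmark deltas is what would separate a direction-specific effect from generic rescaling.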
minor comments (2)
- [Methods] Notation for the steering scalar and subspace basis vectors should be introduced with explicit definitions early in the methods to improve readability.
- [Figures] Figure captions for any activation visualizations or performance plots should include the exact scalar values used and the number of runs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us improve the clarity and rigor of our manuscript. We address each major comment point by point below, making revisions where the concerns are valid and providing explanations or additional evidence where appropriate.
Point-by-point responses
-
Referee: [Abstract / Results] Abstract and experimental results: the central claim of an 8.27% average improvement lacks reported error bars, per-benchmark breakdowns, or statistical tests. Without these, it is impossible to determine whether the gain survives multiple-testing correction or is driven by a subset of benchmarks.
Authors: We agree that the original presentation of the 8.27% average improvement would benefit from greater statistical detail. In the revised manuscript we now include a full per-benchmark table with mean improvements and standard deviations computed across five independent random seeds, as well as p-values from paired Wilcoxon signed-rank tests with Bonferroni correction for the six benchmarks. These additions confirm that the reported average gain is not driven by any single benchmark and remains significant after correction.
Revision: yes
-
Referee: [Experimental validation] Experimental validation section: the test-time scalar intervention and Stable-Mamba retraining do not include control experiments that apply identical scaling to random or orthogonal subspaces of matching dimension. This leaves open the possibility that gains arise from generic activation rescaling (e.g., implicit regularization) rather than the specific mechanistically identified directions.
Authors: This is a fair and important point about establishing specificity. We have added the requested control experiments to the revised manuscript: identical scalar scaling applied to (i) randomly chosen subspaces and (ii) subspaces orthogonal to the identified bottlenecks, both of the same dimensionality. The controls produce substantially smaller or null effects relative to the mechanistically identified directions, indicating that the performance gains are not explained by generic rescaling alone.
Revision: yes
-
Referee: [Methods] Methods on subspace identification: details are needed on how subspaces were selected (e.g., whether selection involved any data-driven fitting that overlaps with the reported performance metrics) to assess independence from the evaluation protocol.
Authors: We appreciate the request for explicit methodological transparency. The subspaces were identified via activation patching and causal tracing performed exclusively on a held-out validation split that shares no examples with the six evaluation benchmarks. We have expanded the Methods section with a precise description of the selection procedure, the exact activation statistics used, and an explicit statement that no performance numbers from the reported benchmarks entered the subspace identification pipeline.
Revision: yes
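The Bonferroni step mentioned in the first response is simple enough to write out. The sketch below is illustrative only: the raw p-values are made-up placeholders, not results from the paper, and the helper names are hypothetical. The second helper records a general property of the exact signed-rank test (assuming no ties and no zero differences): with n pairs, the smallest attainable two-sided p-value is 2 / 2^n.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and per-test rejection decisions."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


def min_two_sided_p(n_pairs):
    """Smallest two-sided p-value of an exact signed-rank test on n_pairs pairs."""
    return 2.0 / (2 ** n_pairs)


# Placeholder p-values, one per benchmark; NOT taken from the paper.
raw = [0.001, 0.004, 0.008, 0.020, 0.030, 0.300]
adjusted, reject = bonferroni(raw)
print(adjusted)
print(reject)
print(min_two_sided_p(5))
```

Because the floor depends on the number of pairs, how the pairs are formed (seeds, models, or both) determines whether a single test can clear a corrected threshold at all.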
Circularity Check
No significant circularity in derivation chain
The paper identifies subspaces via mechanistic interpretability, applies a test-time scalar steering intervention yielding 8.27% average gains across 7 SSMs and 6 benchmarks, and validates via separate retraining of a modified Stable-Mamba architecture. These elements supply independent empirical content: the intervention is applied post-identification and the retraining experiment is distinct. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or described claims. The derivation remains self-contained against external benchmarks rather than reducing to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- steering scalar
axioms (1)
- domain assumption: Mechanistic interpretability methods can locate activation directions that causally limit model performance in state-space models.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar... steering factor of 5"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat induction and orbit structure (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SPD entropy spike at Layer 20... bottleneck"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...
Reference graph
Works this paper leans on
-
[1]
doi: 10.18653/v1/2022.acl-long.581
https://transformer-circuits.pub/2023/monosemantic-features/index.html. Bushnaq, L., Braun, D., and Sharkey, L. Stochastic parameter decomposition, 2025. URL https://arxiv.org/abs/2506.20790. Chen, X., Hu, W., Dong, X., Lin, S., Chen, Z., Cao, M., Zhuang, Y., Han, J., Xu, H., and Liang, X. Transmamba: Fast universal architecture adaption from transform...
-
[2]
Gurnee, W., Horsley, T., Guo, Z
URL https://arxiv.org/abs/2406.09546. Gurnee, W., Horsley, T., Guo, Z. C., Kheirkhah, T. R., Sun, Q., Hathaway, W., Nanda, N., and Bertsimas, D. Universal neurons in gpt2 language models, 2024. URL https://arxiv.org/abs/2401.12181. Han, D., Wang, Z., Xia, Z., Han, Y., Pu, Y., Ge, C., Song, J., Song, S., Zhe...
-
[3]
In: Zong, C., Xia, F., Li, W., Navigli, R
https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Paulo, G., Marshall, T., and Belrose, N. Does transformer interpretability transfer to rnns?, 2024. URL https://arxiv.org/abs/2404.05971. Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et a...
-
[4]
URL https://www.lesswrong.com/posts/gQDhqXepYdxWC7gRY/a-short-project-on-mamba-grokking-and-interpretability. Todd, E., Brinkmann, J., Gandikota, R., and Bau, D. In-context algebra. arXiv preprint arXiv:2512.16902, 2025. Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. MuSiQue: Multihop questions via single-hop question composition. Transac...
- Early training (epochs 1–5): dt_proj.bias rapidly explores the parameter space to identify the optimal gating threshold.
- Mid training (epochs 6–20): gradients decrease as the optimal value is reached.
- Late training (epochs 21+): the parameter remains effectively frozen while downstream representations adapt.
Consistency across neighboring layers. To verify that the Layer 20 master gate behavior is not a result of a single measurement instance, we examine attribution magnitudes (Braun et al., 2025) for dt_proj.bias across adjacent layers (L19–L21) over multi...
- +23% perplexity improvement on long sequences (>1024 tokens)
- +15% accuracy on multi-scale reasoning tasks
- Entropy variance reduced by 31%
Mamba: $y_t = C h_t + D x_t$ (linear). Tweaked Mamba: $y_t = \sum_k w_k \left( C_k h_t^{(k)} + D_k x_t \right)$ (weighted ensemble).
- Adaptive temporal resolution via learned weighting
- Top features show 0.16+ importance vs. 0.02 in Mamba
Mamba: not present. Tweaked Mamba: $h_{\mathrm{global}} = \mathrm{SparseAttn}(h_{\mathrm{local}})$, $\mathrm{output} = \alpha h_{\mathrm{global}} + (1 - \alpha) h_{\mathrm{local}}$.
- 22% reduction in bottleneck entropy (1.21 → 0.94)
- Smoother information flow in layers 19–21
- ∼0.4× cost of full Transformer attention
Gating. Mamba: uniform processing. Tweaked Mamba: $g = \sum_i \alpha_i \sigma(W_i \cdot \mathrm{LN}(x))$ (ensemble gates, $n = 3$).
- +683% feature usage efficiency (1.88% → 14.7%)
- Sparsity reduced from 98.1% to 85.3%
Compression. Mamba: not present. Tweaked Mamba: $c = 0.5 + 0.5 \cdot \sigma(\mathrm{MLP}(\bar{x}))$.
- Adaptive signal preservation with reduced noise
- Effective compression strength in range 0.2–0.3
Gradient control. Mamba: standard backpropagation. Tweaked Mamba: $\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \mathrm{out}} \cdot \lambda_{\mathrm{comp}} \cdot \frac{\partial \mathrm{out}}{\partial \theta}$ (5 learnable scales).
- 46% reduction in gradient instability (CoV: 17.3% → 9.4%)
- Stable convergence without manual tuning
Residual. Mamba: $\mathrm{output} = y_t + x$. Tweaked Mamba: $\mathrm{output} = (y_t + \lambda_{\mathrm{res}} x) \lambda_{\mathrm{global}}$.
- Improved gradient flow in deep networks
- Learned residual scale $0.9 \pm 0.1$
Table E2. Average per-layer computational cost of Stable-Mamba
Component | Parameters | Memory | Frequency
SSM (×3) | $3d^2 = 1.77$M | $O(d \cdot s)$ | Every layer
Gates (×3) | $3d^2 = 1.77$M | $O(d)$ | Every layer
Sparse Attention | $0.3sd = 47$K | $O(0.3 s^2)$ | Every 5th layer
Total / layer (avg) | ∼4.95M | $O(d \cdot s + 0.06 s^2)$ | -
Table E3. Quantitative impact of the architectural modif...
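The readout and residual formulas in the comparison above can be made concrete with a small NumPy sketch. It is a reading of the formulas as extracted here, not the paper's code: the function names are hypothetical, `D` is treated as a dense matrix for simplicity (an assumption), and the weights, shapes, and random inputs are placeholders.

```python
import numpy as np


def linear_readout(C, D, h_t, x_t):
    """Baseline readout y_t = C h_t + D x_t."""
    return C @ h_t + D @ x_t


def ensemble_readout(weights, Cs, Ds, hs, x_t):
    """Weighted ensemble y_t = sum_k w_k (C_k h_t^(k) + D_k x_t)."""
    return sum(
        w * linear_readout(C, D, h, x_t)
        for w, C, D, h in zip(weights, Cs, Ds, hs)
    )


def scaled_residual(y_t, x, lam_res, lam_global):
    """Residual output (y_t + lam_res * x) * lam_global."""
    return (y_t + lam_res * x) * lam_global


# Placeholder sizes: model width 4, state size 8, three SSM branches.
rng = np.random.default_rng(0)
d, s, K = 4, 8, 3
Cs = [rng.normal(size=(d, s)) for _ in range(K)]
Ds = [rng.normal(size=(d, d)) for _ in range(K)]
hs = [rng.normal(size=s) for _ in range(K)]
x_t = rng.normal(size=d)
weights = [0.5, 0.3, 0.2]  # placeholder values, not learned

y = ensemble_readout(weights, Cs, Ds, hs, x_t)
out = scaled_residual(y, x_t, lam_res=0.9, lam_global=1.0)
```

Setting the weights to a one-hot vector and both residual scales to 1 recovers the baseline forms, which is a convenient sanity check when comparing the two variants.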