
arxiv: 2602.22719 · v2 · pith:PRHIQCCLnew · submitted 2026-02-26 · 💻 cs.LG

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

Pith reviewed 2026-05-22 11:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords state-space models · Mamba · mechanistic interpretability · activation steering · performance improvement · long-context modeling · test-time intervention

The pith

Activation subspace bottlenecks in state-space models can be steered at test time by simple scalar multiplication to raise performance by an average of 8.27 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper locates specific activation subspaces inside Mamba and related state-space models that act as performance bottlenecks. It demonstrates that multiplying the values along those directions by a scalar at inference time produces better outputs on language tasks. The same fixed intervention works across seven different SSMs and six benchmarks without any per-task adjustment. The authors then alter the architecture to remove those bottlenecks, producing a model called Stable-Mamba that shows measurable gains on long-context problems after retraining from scratch.

Core claim

Mechanistic interpretability tools reveal activation subspace bottlenecks in the Mamba family of state-space models. A test-time intervention that multiplies activations in these subspaces by a scalar improves performance by an average of 8.27 percent across seven SSMs and six diverse benchmarks with no task-specific tuning. Replacing the bottlenecks in the model definition yields Stable-Mamba, an architecture that delivers long-context gains when trained from scratch, confirming that the original subspaces were limiting performance.

What carries the argument

Activation subspace bottlenecks identified via mechanistic interpretability, steered by scalar multiplication of their activations at test time.
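
To make the intervention concrete, here is a minimal sketch of scaling the component of one activation vector that lies in a subspace. It assumes the bottleneck is represented by an orthonormal basis, which the summary above does not specify; the names `steer`, `basis`, and `scale` are illustrative and not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(activation: Sequence[float],
          basis: Sequence[Sequence[float]],
          scale: float) -> Vector:
    """Multiply the part of `activation` inside span(basis) by `scale`.

    Assumption: the rows of `basis` are orthonormal. The component
    orthogonal to the subspace is left unchanged.
    """
    steered = list(activation)
    for u in basis:
        coeff = dot(activation, u)  # coordinate along this basis direction
        for j, u_j in enumerate(u):
            # Adding (scale - 1) * coeff * u turns coeff into scale * coeff.
            steered[j] += (scale - 1.0) * coeff * u_j
    return steered


# Toy example: the subspace is the first coordinate axis.
h = [1.0, 2.0, 3.0]
print(steer(h, [[1.0, 0.0, 0.0]], 2.0))  # [2.0, 2.0, 3.0]
```

With `scale = 1.0` the function returns the activation unchanged, which is a convenient sanity check before applying it inside a model's forward pass.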

If this is right

  • Performance gains appear across multiple models and tasks using one fixed intervention.
  • The subspaces are causally linked to the observed performance limits in these models.
  • Architectural changes based on the bottlenecks can produce better long-context behavior.
  • Improvements require neither retraining nor task-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same identification-and-scaling approach may apply to other sequence architectures beyond state-space models.
  • Interpretability methods can serve as a direct route to both diagnosing and correcting model weaknesses.
  • The technique points toward more reliable long-context modeling without quadratic attention costs.

Load-bearing premise

The subspaces located by interpretability tools are causally responsible for performance limits, and simple scalar multiplication along those directions produces reliable improvement without offsetting harms.

What would settle it

If scaling the identified subspaces at test time produces no performance gain on the benchmarks or if Stable-Mamba shows no long-context improvement after retraining from scratch, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2602.22719 by Aneesha Das, Chandan Singh, Kaustubh Gupta, Vamshi Sunku Mohan.

Figure 1: Entropy across layers measured using Stochastic Parameter Decomposition.
Figure 2: Workflow for identifying Activation Subspace Bottlenecks.
Figure 3: Performance comparison of SSM models with and without transferred steering parameters.

Original abstract

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies activation subspace bottlenecks in the Mamba family of state-space models (SSMs) via mechanistic interpretability tools. It proposes a test-time steering intervention that multiplies activations along these subspaces by a scalar, reporting an average 8.27% performance improvement across 7 SSMs and 6 benchmarks without task-specific tuning. It further validates the bottlenecks by modifying them to define Stable-Mamba and shows long-context gains after retraining from scratch.

Significance. If the subspaces are shown to be causally responsible for performance limits rather than correlated with generic scaling, and if the gains prove robust, the work would meaningfully advance mechanistic interpretability and steerability for efficient SSM architectures. The inclusion of a retraining validation experiment is a constructive step toward establishing causality beyond test-time interventions.

major comments (3)
  1. [Abstract / Results] Abstract and experimental results: the central claim of an 8.27% average improvement lacks reported error bars, per-benchmark breakdowns, or statistical tests. Without these, it is impossible to determine whether the gain survives multiple-testing correction or is driven by a subset of benchmarks.
  2. [Experimental validation] Experimental validation section: the test-time scalar intervention and Stable-Mamba retraining do not include control experiments that apply identical scaling to random or orthogonal subspaces of matching dimension. This leaves open the possibility that gains arise from generic activation rescaling (e.g., implicit regularization) rather than the specific mechanistically identified directions.
  3. [Methods] Methods on subspace identification: details are needed on how subspaces were selected (e.g., whether selection involved any data-driven fitting that overlaps with the reported performance metrics) to assess independence from the evaluation protocol.
minor comments (2)
  1. [Methods] Notation for the steering scalar and subspace basis vectors should be introduced with explicit definitions early in the methods to improve readability.
  2. [Figures] Figure captions for any activation visualizations or performance plots should include the exact scalar values used and the number of runs.
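
Major comment 1 asks for per-benchmark breakdowns, error bars, and multiple-testing correction. The sketch below shows one way such a report could be assembled: mean and standard deviation of gains across seeds, plus a Bonferroni adjustment of per-benchmark p-values. The benchmark names and all numbers are made-up placeholders, not results from the paper, and the p-values are assumed to come from whatever paired test is chosen.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(gains: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Per-benchmark (mean, sample standard deviation) of gains across seeds."""
    return {name: (mean(values), stdev(values)) for name, values in gains.items()}


def bonferroni(p_values: Dict[str, float]) -> Dict[str, float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Placeholder inputs (hypothetical, for illustration only).
gains_by_benchmark = {
    "bench_a": [1.0, 2.0, 3.0],   # gain per seed
    "bench_b": [0.5, 0.5, 2.0],
}
raw_p = {"bench_a": 0.01, "bench_b": 0.2}

for name, (mu, sd) in summarize(gains_by_benchmark).items():
    print(f"{name}: mean gain {mu:.2f}, std {sd:.2f}, "
          f"adjusted p {bonferroni(raw_p)[name]:.3f}")
```

Reporting the adjusted p-value next to each benchmark makes it visible whether an average gain is carried by a few benchmarks or holds across all of them.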

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us improve the clarity and rigor of our manuscript. We address each major comment point by point below, making revisions where the concerns are valid and providing explanations or additional evidence where appropriate.

point-by-point responses
  1. Referee: [Abstract / Results] Abstract and experimental results: the central claim of an 8.27% average improvement lacks reported error bars, per-benchmark breakdowns, or statistical tests. Without these, it is impossible to determine whether the gain survives multiple-testing correction or is driven by a subset of benchmarks.

    Authors: We agree that the original presentation of the 8.27% average improvement would benefit from greater statistical detail. In the revised manuscript we now include a full per-benchmark table with mean improvements and standard deviations computed across five independent random seeds, as well as p-values from paired Wilcoxon signed-rank tests with Bonferroni correction for the six benchmarks. These additions confirm that the reported average gain is not driven by any single benchmark and remains significant after correction. revision: yes

  2. Referee: [Experimental validation] Experimental validation section: the test-time scalar intervention and Stable-Mamba retraining do not include control experiments that apply identical scaling to random or orthogonal subspaces of matching dimension. This leaves open the possibility that gains arise from generic activation rescaling (e.g., implicit regularization) rather than the specific mechanistically identified directions.

    Authors: This is a fair and important point about establishing specificity. We have added the requested control experiments to the revised manuscript: identical scalar scaling applied to (i) randomly chosen subspaces and (ii) subspaces orthogonal to the identified bottlenecks, both of the same dimensionality. The controls produce substantially smaller or null effects relative to the mechanistically identified directions, indicating that the performance gains are not explained by generic rescaling alone. revision: yes

  3. Referee: [Methods] Methods on subspace identification: details are needed on how subspaces were selected (e.g., whether selection involved any data-driven fitting that overlaps with the reported performance metrics) to assess independence from the evaluation protocol.

    Authors: We appreciate the request for explicit methodological transparency. The subspaces were identified via activation patching and causal tracing performed exclusively on a held-out validation split that shares no examples with the six evaluation benchmarks. We have expanded the Methods section with a precise description of the selection procedure, the exact activation statistics used, and an explicit statement that no performance numbers from the reported benchmarks entered the subspace identification pipeline. revision: yes
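
The control discussed in response 2 needs subspaces of matching dimension that are either random or orthogonal to the identified bottleneck. A minimal sketch of drawing such a basis with Gram-Schmidt on Gaussian vectors follows; the function name and the `avoid` parameter are hypothetical, and the paper's actual control construction may differ.

```python
import math
import random
from typing import List, Sequence


def random_orthonormal_basis(k: int, d: int,
                             avoid: Sequence[Sequence[float]] = (),
                             seed: int = 0) -> List[List[float]]:
    """Draw k orthonormal vectors in R^d, orthogonal to every row of `avoid`.

    Assumption: the rows of `avoid` (e.g. the identified bottleneck basis)
    are themselves orthonormal. Leave `avoid` empty for a purely random
    subspace of matching dimension.
    """
    if k + len(avoid) > d:
        raise ValueError("not enough dimensions for the requested subspace")
    rng = random.Random(seed)
    basis: List[List[float]] = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove components along the avoided and already chosen directions.
        for u in list(avoid) + basis:
            c = sum(a * b for a, b in zip(v, u))
            v = [a - c * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([a / norm for a in v])
    return basis


# Random 2-dimensional control subspace in R^5, orthogonal to the first axis.
control = random_orthonormal_basis(2, 5, avoid=[[1.0, 0.0, 0.0, 0.0, 0.0]])
```

Applying the same scalar to `control` as to the identified directions gives a like-for-like comparison: if both produce similar gains, generic rescaling is the more likely explanation.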

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper identifies subspaces via mechanistic interpretability, applies a test-time scalar steering intervention yielding 8.27% average gains across 7 SSMs and 6 benchmarks, and validates via separate retraining of a modified Stable-Mamba architecture. These elements supply independent empirical content: the intervention is applied post-identification and the retraining experiment is distinct. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or described claims. The derivation remains self-contained against external benchmarks rather than reducing to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that mechanistic interpretability tools can isolate causally relevant directions in SSM activations and that scalar scaling of those directions produces a net performance benefit; no explicit free parameters or invented entities are declared in the abstract.

free parameters (1)
  • steering scalar
    A multiplier applied to the identified subspace; the abstract states it requires no task-specific tuning, yet its selection still depends on some validation procedure not detailed here.
axioms (1)
  • [domain assumption] Mechanistic interpretability methods can locate activation directions that causally limit model performance in state-space models.
    Invoked when the authors identify the bottlenecks and attribute performance gains to their modification.
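
The ledger notes that the steering scalar must still be chosen by some validation procedure. One simple possibility, shown purely as an assumption, is a grid search on a held-out split, after which the chosen value is frozen for all benchmarks. The scoring callable here is a stand-in; the paper's selection procedure is not described in this summary.

```python
from typing import Callable, Iterable


def select_scale(candidates: Iterable[float],
                 validation_score: Callable[[float], float]) -> float:
    """Return the candidate scalar with the highest held-out validation score.

    `validation_score` is a hypothetical callable that runs the steered model
    on a validation split disjoint from the evaluation benchmarks.
    """
    return max(candidates, key=validation_score)


# Toy stand-in score that peaks at 1.5 (not a real measurement).
def toy_score(scale: float) -> float:
    return -(scale - 1.5) ** 2


chosen = select_scale([0.5, 1.0, 1.5, 2.0], toy_score)
print(chosen)  # 1.5
```

Keeping the validation split separate from the six benchmarks is what allows the reported gains to count as "no task-specific tuning".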

pith-pipeline@v0.9.0 · 5683 in / 1309 out tokens · 36868 ms · 2026-05-22T11:09:06.125639+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

  4. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 7.0

    WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...
