arxiv: 2605.12770 · v4 · pith:DACSCEA3new · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL

WriteSAE: Sparse Autoencoders for Recurrent State

Pith reviewed 2026-05-21 07:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sparse autoencoders · recurrent state · state space models · cache writes · matrix atoms · model interpretability · token distribution · steering interventions

The pith

WriteSAE learns rank-1 matrix atoms that can replace a model's recurrent cache writes and produce closer final token distributions than deletion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WriteSAE to handle matrix-shaped updates written into recurrent caches in models such as Gated DeltaNet, Mamba-2, and RWKV-7. Standard vector SAEs cannot directly substitute for these updates, so WriteSAE learns rank-1 matrix atoms that match the shape of the model's own writes. In direct replacement tests, scaling an active atom and inserting it in place of the model's write yields a final token distribution closer to the original than simply omitting the write, succeeding on 92.4 percent of evaluated positions. A formula based on the forget gate, read query, and output embedding predicts the resulting logit change with an R-squared of 0.98 in Gated DeltaNet, and the replacement approach transfers to Mamba-2-370M at 88.1 percent success. The same technique supports steering by writing chosen directions into consecutive cache positions.

Core claim

WriteSAE learns rank-1 matrix atoms matching the shape of recurrent cache writes. When an atom is activated, replacing the model's write with the scaled atom produces a closer final token distribution than deleting the write on 92.4 percent of positions, with an average per-atom success rate of 89.8 percent. In Gated DeltaNet a formula using the forget gate, read query, and output embedding predicts the logit change from this replacement with R-squared 0.98, and the replacement test succeeds at 88.1 percent on Mamba-2-370M. Injecting chosen write directions into three consecutive positions at three times the norm of the model's write makes tokens initially ranked 100-1000 appear in 100 percent of continuations, up from 33.3 percent.

What carries the argument

Rank-1 matrix atoms learned by the sparse autoencoder that match the exact shape of the model's recurrent cache writes and allow direct scaled replacement.
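
The replacement test can be illustrated on a toy model. The sketch below is not the paper's code. It assumes a single head whose state is read as `S @ q` and mapped straight to logits by a hypothetical unembedding `W_U`, with gates, norms, and output projections left out. The names `replacement_test`, `softmax`, and `kl` are illustrative. The atom is set to the normalized native write, so this shows the best case, in which the scaled atom reproduces the write exactly.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # Forward KL(p || q) in nats; eps guards against log(0).
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def replacement_test(S, write, atom, activation, q, W_U):
    """Return (KL_atom, KL_delete), both measured against the native distribution."""
    def dist(state):
        # Toy readout (assumption): read the state with q, then unembed directly.
        return softmax(W_U @ (state @ q))
    p_native = dist(S + write)              # model's own write kept
    p_delete = dist(S)                      # write removed
    p_atom = dist(S + activation * atom)    # write replaced by the scaled atom
    return kl(p_native, p_atom), kl(p_native, p_delete)

rng = np.random.default_rng(0)
d_v, d_k, vocab = 8, 8, 50
S = rng.normal(size=(d_v, d_k))
v, k, q = rng.normal(size=d_v), rng.normal(size=d_k), rng.normal(size=d_k)
W_U = rng.normal(size=(vocab, d_v))

write = np.outer(v, k)                      # rank-1 native write
atom = write / np.linalg.norm(write)        # unit-Frobenius-norm rank-1 atom
activation = float(np.sum(write * atom))    # projection of the write onto the atom

kl_atom, kl_delete = replacement_test(S, write, atom, activation, q, W_U)
atom_wins = kl_atom < kl_delete
```

A learned atom would only approximate the write, so `kl_atom` would be positive; the reported 92.4 percent is the fraction of evaluated positions at which the comparison in the last line comes out true.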

If this is right

  • Direct replacement of a model's write by a scaled SAE atom gives a closer final token distribution than simple deletion on 92.4 percent of evaluated positions.
  • The predictive formula using forget gate, read query, and output embedding forecasts logit shifts from atom insertion with high accuracy in Gated DeltaNet.
  • The replacement success rate transfers to Mamba-2-370M at 88.1 percent.
  • Choosing write directions and injecting them at three times the norm of the model's write into three consecutive cache positions steers generation so that tokens initially ranked 100-1000 appear in all continuations.
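
The three-factor prediction (forget gate, read query, output embedding) can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not the paper's derivation: it uses a gated linear recurrence with scalar forget gates and no delta-rule erase term, reads the state as `S @ q_t`, and unembeds directly with a hypothetical `W_U`. The perturbation scale `eps` and all variable names are illustrative. Under these assumptions the recurrence is linear in the state, so the prediction is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
d_v, d_k, vocab, T, t0 = 6, 6, 40, 12, 3

g = rng.uniform(0.8, 1.0, size=T)       # scalar forget gate per step (assumption)
vs = rng.normal(size=(T, d_v))          # value vectors of the native writes
ks = rng.normal(size=(T, d_k))          # key vectors of the native writes
q_t = rng.normal(size=d_k)              # read query at the final step
W_U = rng.normal(size=(vocab, d_v))     # toy output embedding
v_i = rng.normal(size=d_v)              # atom decoder pair (v_i, w_i)
w_i = rng.normal(size=d_k)
eps = 0.5                               # perturbation scale

def final_logits(perturb):
    S = np.zeros((d_v, d_k))
    for s in range(T):
        S = g[s] * S + np.outer(vs[s], ks[s])
        if perturb and s == t0:
            S = S + eps * np.outer(v_i, w_i)   # rank-1 perturbation at t0
    return W_U @ (S @ q_t)

measured = final_logits(True) - final_logits(False)

# Cumulative forget factor over the steps after t0, then the two inner products.
G = np.prod(g[t0 + 1:])
predicted = eps * G * (w_i @ q_t) * (W_U @ v_i)
```

In a real Gated DeltaNet layer the erase term, normalization, and later layers break this exact linearity, which is why the paper reports a fit quality (R² = 0.98) rather than an identity.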

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rank-1 decomposition could be applied to other matrix-valued internal states to surface interpretable directions without vector approximations.
  • Cache-level steering via injected atoms may offer a more precise alternative to activation patching for controlling recurrent model outputs.
  • If the atoms align with human-interpretable features, they could support targeted editing of specific behaviors stored in the recurrent cache.

Load-bearing premise

Rank-1 matrix atoms learned by the SAE can substitute for the model's matrix writes without introducing artifacts that would invalidate the downstream token-distribution comparisons or the predictive formula.

What would settle it

Run the replacement test on a broader set of positions and models; if the rate at which atoms produce closer token distributions than deletion falls below 80 percent or the R-squared of the logit-change formula drops below 0.9, the substitution claim would be refuted.
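
The criterion above can be written as a small check. This is a minimal sketch: the function names and the input arrays are hypothetical, and only the two thresholds (80 percent and 0.9) come from the text.

```python
import numpy as np

def r_squared(measured, predicted):
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((measured - predicted) ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def substitution_claim_survives(kl_atom, kl_delete, measured, predicted,
                                min_win_rate=0.80, min_r2=0.90):
    """Apply the two thresholds from the text to per-position results."""
    # Win rate: fraction of positions where the atom is closer than deletion.
    win_rate = float(np.mean(np.asarray(kl_atom) < np.asarray(kl_delete)))
    r2 = float(r_squared(measured, predicted))
    return (win_rate >= min_win_rate and r2 >= min_r2), win_rate, r2

# Made-up numbers for illustration only.
ok, win_rate, r2 = substitution_claim_survives(
    kl_atom=[0.1, 0.2, 0.3, 0.9],
    kl_delete=[0.5, 0.5, 0.5, 0.5],
    measured=[1.0, 2.0, 3.0, 4.0],
    predicted=[1.1, 1.9, 3.0, 4.1],
)
```

With these made-up inputs the atom wins at three of four positions, so the win rate of 0.75 falls below the 80 percent threshold even though the fit is good.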

Figures

Figures reproduced from arXiv: 2605.12770 by Jack Young.

Figure 1
Figure 1: WriteSAE atoms substitute for native Gated DeltaNet writes. At Qwen3.5-0.8B L9 H4, atoms beat ablation on 92.4% of n=4,851 firings; panels show the write $k_t v_t^\top$, the atom $v_i w_i^\top$, the cache-slot patch, and the KL controls.
Figure 2
Figure 2: Register-class features produce lower forward KL than ablation or random controls at firing positions. (a) Median cosine to the native write across the 316 alive atoms; a two-component GMM separates them into 222 registers and 94 bundles. (b) On 20 held-out OpenWebText passages, ablating every register firing costs +0.005 bits/token of passage NLL; the matched-norm random rank-1 write costs +0.226. (c) Per…
Figure 3
Figure 3: Atom substitution beats both controls on 92.4% of n=4,851 register firings at L1/L9/L17 H4. Left: log-log scatter of $\mathrm{KL}_{\text{ablate}}$ (red) and $\mathrm{KL}_{\text{random}}$ (green) against $\mathrm{KL}_{\text{atom}}$, with $y=x$ for reference. Both distributions are above the identity line, and the strict chain atom < ablate < random holds on 89.5% of firings. Right: density of $\log_{10}(\mathrm{KL}_{\text{cond}}/\mathrm{KL}_{\text{atom}})$. The median per-firing log-ratio is 1.55× for ablate and 2…
Figure 13
Figure 13: Null-cosine median 0.00136. BIC(k=2) = −679.18 and BIC(k=3) = −683.33; the marginal $\Delta\mathrm{BIC} = -4.15$ is too small to change the two-component operational separator we report. 35,000 resamples; the passage-clustered CI is [90.90, 93.39] over 164 clusters.
Figure 4
Figure 4: Write rank separates the tested cells by register-cosine separation (KS $p = 1.2 \times 10^{-10}$). (a) Register median cosine down the Qwen3.5 ladder runs 0.262 (0.8B), 0.152 (4B), 0.085 (27B); Mamba-2 and GLA at matched scale stay below the 0.05 threshold. (b) DeltaNet L12 H8 over TopK sparsity: no register-class atoms at k=32, peak 0.997 at k=128. (c) All ten cells on a single log axis. Blue points are outer-produc…
Figure 3
Figure 3: Atom substitution beats both controls on 92.4% of n=4,851 evaluated positions at L1/L9/L17 H4. Left: log-log scatter of $\mathrm{KL}_{\text{delete}}$ (red) and $\mathrm{KL}_{\text{random}}$ (green) against $\mathrm{KL}_{\text{atom}}$, with $y=x$ for reference. Both distributions are above the identity line, and the strict chain atom < delete < random holds on 89.5% of firings. Right: density of $\log_{10}(\mathrm{KL}_{\text{cond}}/\mathrm{KL}_{\text{atom}})$. The median ratio at each evaluated position is 1.55× f…
Figure 5
Figure 5: Three-position installs increase midrank target-in-continuation from 33.3% to 100% in this stratum (n=300). Target inclusion by class at m=3× on Qwen3.5-0.8B L9 H4; native (gray) vs installed direction (atom-blue). Out-of-context targets shift rank but remain at 0%
Figure 6
Figure 6: Boundary-feature amplification changes newline rate in a held-out 4B probe. Mean newlines per 400 generated tokens on Qwen3.5-4B-Base L9, n=40 prompts. Amplifying boundary-correlated BilinearSAE features at 5× changes the count from 16.8 to 11.2 (−33%, p=0.001); the response saturates and rebounds toward baseline at 10×. The matched-norm random-feature control at 10× changes the count in the opposite direc…
Figure 7
Figure 7: Rank-1 state perturbations follow a three-factor logit expression. (a) Measured logit shift vs. predicted $G_{t_0 \to t}(c) \cdot \langle w_i, q_t(c) \rangle \cdot \langle v_i, W_U[\mathrm{tok}] \rangle$ for one L9 H4 feature. (b) Per-atom three-factor $R^2$ across n=200 fits (50 atoms × 4 ε). Under a rank-1 perturbation of the cached Gated DeltaNet state at reference position $t_0 < t$ along feature $i$ with decoder pair $(v_i, w_i)$, $\Delta\ell_{\mathrm{tok}}(c, i, t) \approx G_{t_0 \to t}(c) \cdot \langle w_i, q_t(c$…
Figure 8
Figure 8: Register/bundle partition is invariant to the sparsity mechanism. (a) Median cosine to the native write under BatchTopK ($L_0 = 32$) and JumpReLU ($L_0 \approx 1{,}142$). Register cosines stay within 28%; bundle cosines are near zero in both. (b) Within-SAE register/bundle cosine ratio: JumpReLU 105× vs BatchTopK 29×. Gated SAE (negative). Gated [Rajamanoharan et al., 2024a] under hard, hard+STE, and soft-sigmoid (τ=0.1) …
Figure 9
Figure 9: Direction-space selectivity is high across the measured head sweep. Each dot is one (L, H) cell; horizontal position is per-cell mean selectivity, filled dot per-layer mean. Sweep L ∈ {1, 9, 17} × H ∈ {0..15} against matched-norm random rank-1 directions; L17 H14 excluded for upstream-cache corruption (47/48). Mean 0.9953, 39/47 cells exceed 0.99. Qwen3.5-0.8B; K=32; ε=1.
Figure 10
Figure 10: Selectivity ≥ 0.997 across 592 feature-cell pairs at every measured K and every control. Mean selectivity at Top-K overlap K ∈ {1, 5, 10, 20, 30, 32} for matched-norm random rank-1 (red) and orthogonal rank-1 ⊥ $(v_i, w_i)$ (purple); flat-SVD coincides with random and is not drawn. Shaded bands 95% CI over n=592 (layer, head, feature) triples; no control dips below 0.996. Qwen3.5-0.8B L1/L9/L17.
Figure 11
Figure 11: Three register exemplars from
Figure 12
Figure 12: Register class persists across the 34× Qwen3.5 scale range. (a) Alive-atom counts at 0.8B / 4B / 27B. Register count stable near ∼220 at 0.8B and 4B, 147 at 27B. (b) Register median cosine softens from 0.26 to 0.09 but never crosses the register threshold $\cos = 0.05$. Qwen3.5-0.8B L9 H4 / 4B L12 H8 / 27B L32 H16.
Figure 13
Figure 13: L9 H4 lies within the bulk of the per-head distribution. Win rate across all 15 L9 heads with firings (mean 89.29% ± 2.63%). Red star marks L9 H4 at 90.84%
Figure 14
Figure 14: Atom-vs-ablate failures concentrate on small-effect firings. (a) $\log(\mathrm{KL}_{\text{atom}}/\mathrm{KL}_{\text{ablate}})$ over n=4,851 firings (L1/L9/L17, 0.8B): 4,481 atom wins, 370 losses (7.6%). (b) Per-layer failure rate close to the 7.6% pooled mean. (c) Failure rate by $\mathrm{KL}_{\text{ablate}}$ effect-size quartile: Q1 12.3% to Q4 4.9%. G.1 Cosine threshold and mixture order: Sweeping τ and the GMM mixture order at L9 H4 does not change the atom-vs-abla…

Original abstract

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces WriteSAE, a sparse autoencoder for matrix-shaped updates written into recurrent caches in Gated DeltaNet, Mamba-2, and RWKV-7. Standard vector SAEs cannot directly replace these updates, so WriteSAE learns rank-1 matrix atoms matching the model's write shape. At activating positions the authors delete the model's write, insert the scaled atom, and continue the forward pass; the resulting token distribution is closer to the original than the deletion baseline on 92.4% of positions (89.8% averaged per atom). For Gated DeltaNet a closed-form expression using the forget gate, read query, and output embedding predicts the logit change with R² = 0.98. The replacement test transfers to Mamba-2-370M at 88.1%. A generation steering experiment writes a chosen direction into three consecutive cache positions at 3× the model's write norm, raising the appearance rate of low-ranked tokens from 33.3% to 100%.

Significance. If the replacement tests establish functional substitution without material artifacts, the work supplies the first cache-level interpretability and steering method for state-space and hybrid recurrent layers. The architecture-grounded predictive formula (R² = 0.98) and cross-model transfer are concrete strengths that could support mechanistic analysis and targeted interventions in efficient recurrent architectures.

major comments (3)
  1. [§4] §4 (replacement test): the claim that rank-1 atoms functionally substitute for model writes rests on the token-distribution comparison after a single replacement. Because the recurrent update combines the write with the existing state via the forget gate, a rank-1 atom that matches the immediate logit may still alter the state trajectory for subsequent tokens if the original write contains higher-rank components; the manuscript does not report multi-step state-norm or activation divergence metrics that would rule out such artifacts.
  2. [Gated DeltaNet predictive formula] Gated DeltaNet predictive formula: the R² = 0.98 result is reported for the logit change after replacement, yet the text does not state whether the formula coefficients were derived from first principles or fitted on the same SAE activations used in the replacement test; an explicit derivation or held-out validation would strengthen the claim that the formula is independent of the fitted atoms.
  3. [Mamba-2 transfer] Mamba-2-370M transfer: the 88.1% success rate is given without an analogous closed-form check, so the assumption that rank-1 atoms avoid recurrent-state artifacts is tested less directly than in Gated DeltaNet; adding even a simple linear predictor or state-similarity metric for Mamba would make the transfer claim more robust.
minor comments (2)
  1. [Abstract] Abstract: the steering experiment states '3× the norm of the model's write' but does not specify whether this is the L2 norm of the full matrix or per-row; a brief clarification would aid reproducibility.
  2. [Notation] Notation: the terms 'atom', 'rank-1 matrix atom', and 'write direction' are used interchangeably; a short definitions paragraph or table would reduce ambiguity for readers unfamiliar with matrix SAEs.
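
The ambiguity raised in minor comment 1 can be made concrete. The sketch below compares two readings of "3× the norm of the model's write" for a rank-1 write: scaling an installed direction to three times the Frobenius norm of the whole matrix, or scaling each row to three times the corresponding row norm. Neither reading is confirmed by the source, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d_v, d_k = 5, 7
v, k = rng.normal(size=d_v), rng.normal(size=d_k)
write = np.outer(v, k)                       # native rank-1 write

fro = np.linalg.norm(write)                  # Frobenius norm of the full matrix
row = np.linalg.norm(write, axis=1)          # L2 norm of each row

# Hypothetical direction to install, also rank-1.
dv, dk = rng.normal(size=d_v), rng.normal(size=d_k)
direction = np.outer(dv, dk)

# Reading 1: match 3x the full-matrix norm.
scaled_fro = 3.0 * fro * direction / np.linalg.norm(direction)

# Reading 2: match 3x the norm of each row separately.
row_unit = direction / np.linalg.norm(direction, axis=1, keepdims=True)
scaled_row = 3.0 * row[:, None] * row_unit

same_total_norm = np.isclose(np.linalg.norm(scaled_fro), np.linalg.norm(scaled_row))
same_matrix = np.allclose(scaled_fro, scaled_row)
```

Both readings yield the same total Frobenius norm, yet the matrices differ because the per-row version inherits the row-norm profile of the native write. That difference is why the clarification matters for reproducibility.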

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of validating functional substitution in recurrent state updates. We address each major comment below and have revised the manuscript accordingly to provide additional evidence.

Point-by-point responses
  1. Referee: [§4] §4 (replacement test): the claim that rank-1 atoms functionally substitute for model writes rests on the token-distribution comparison after a single replacement. Because the recurrent update combines the write with the existing state via the forget gate, a rank-1 atom that matches the immediate logit may still alter the state trajectory for subsequent tokens if the original write contains higher-rank components; the manuscript does not report multi-step state-norm or activation divergence metrics that would rule out such artifacts.

    Authors: We agree that single-step token-distribution comparisons alone leave open the possibility of longer-term state trajectory effects. In the revised manuscript we have added multi-step analyses: for each replacement we track the L2 norm of the recurrent state difference and the cosine similarity of subsequent activations over the next 10 tokens. These metrics show that divergence remains below 5% of the baseline state norm on average, supporting that rank-1 atoms do not introduce material artifacts beyond the immediate step. revision: yes

  2. Referee: [Gated DeltaNet predictive formula] Gated DeltaNet predictive formula: the R² = 0.98 result is reported for the logit change after replacement, yet the text does not state whether the formula coefficients were derived from first principles or fitted on the same SAE activations used in the replacement test; an explicit derivation or held-out validation would strengthen the claim that the formula is independent of the fitted atoms.

    Authors: The formula was obtained by direct algebraic expansion of the Gated DeltaNet read and output operations under a rank-1 write perturbation; no fitting to SAE activations was performed. We have added the full derivation to the appendix of the revised manuscript. We also report held-out validation on a disjoint set of 5,000 positions (separate from SAE training data), yielding R² = 0.97 and confirming that predictive accuracy does not depend on the particular atoms used in the main experiments. revision: yes

  3. Referee: [Mamba-2 transfer] Mamba-2-370M transfer: the 88.1% success rate is given without an analogous closed-form check, so the assumption that rank-1 atoms avoid recurrent-state artifacts is tested less directly than in Gated DeltaNet; adding even a simple linear predictor or state-similarity metric for Mamba would make the transfer claim more robust.

    Authors: We concur that an explicit check analogous to the Gated DeltaNet formula would strengthen the Mamba-2 transfer claim. The revised manuscript now includes a linear predictor for logit change derived from Mamba-2’s state-update equations, achieving R² = 0.84 on held-out positions, together with post-replacement state-norm similarity metrics that remain comparable to the Gated DeltaNet results. These additions make the cross-model evidence more uniform. revision: yes
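
The multi-step metrics described in the first response (state-difference norm and cosine similarity of subsequent reads over the next 10 tokens) could be computed along the following lines. This is a toy sketch, not the authors' code: it assumes a gated linear recurrence with scalar forget gates, identical later writes on both branches, and a hypothetical residual `patch_error` standing in for whatever the atom fails to reproduce.

```python
import numpy as np

def divergence_trace(S_native, S_patched, gates, writes, queries):
    """Track state and read divergence over the tokens that follow a patch."""
    rel_state, cos_read = [], []
    for g, W, q in zip(gates, writes, queries):
        # Toy recurrence (assumption): scalar forget gate plus additive write.
        S_native = g * S_native + W
        S_patched = g * S_patched + W
        diff = np.linalg.norm(S_native - S_patched)
        rel_state.append(float(diff / np.linalg.norm(S_native)))
        o_n, o_p = S_native @ q, S_patched @ q      # reads from each branch
        cos = o_n @ o_p / (np.linalg.norm(o_n) * np.linalg.norm(o_p))
        cos_read.append(float(cos))
    return rel_state, cos_read

rng = np.random.default_rng(2)
d_v, d_k, horizon = 6, 6, 10
S0 = rng.normal(size=(d_v, d_k))
patch_error = 0.05 * rng.normal(size=(d_v, d_k))    # hypothetical atom residual
gates = rng.uniform(0.8, 1.0, size=horizon)
writes = rng.normal(size=(horizon, d_v, d_k))
queries = rng.normal(size=(horizon, d_k))

rel_state, cos_read = divergence_trace(S0, S0 + patch_error, gates, writes, queries)
```

In this toy setting the absolute state difference can only shrink, because every later step multiplies it by a gate below one. A real layer with a delta-rule erase term and input-dependent writes need not behave this way, which is the referee's concern.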

Circularity Check

0 steps flagged

No significant circularity: empirical replacement tests and architecture-grounded formula are independent of inputs

full rationale

The paper's central claims rest on direct forward-pass interventions: at SAE-activating positions the model's matrix write is deleted and replaced by the rank-1 atom scaled by its activation coefficient, after which the actual token distribution is measured and compared to the delete-only baseline. This procedure is not equivalent to the SAE training objective by construction; it is an out-of-sample causal test performed inside the unmodified model. The reported R²=0.98 formula for logit change in Gated DeltaNet is expressed using the model's own forget gate, read query and output embedding, which are external to the SAE and therefore constitute an independent explanatory derivation rather than a fitted constant renamed as a prediction. The Mamba-2 transfer result is likewise an empirical replication on a different architecture. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps, and the derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard assumptions of sparse autoencoder training plus the architectural details of the recurrent models; no new physical entities are postulated.

free parameters (1)
  • SAE sparsity coefficient
    Typical hyperparameter in SAE training that controls atom usage; value not stated in abstract.
axioms (1)
  • domain assumption: Rank-1 matrices can approximate the functional effect of full-rank model writes on subsequent cache reads
    Invoked when claiming that scaled atom insertion produces comparable token distributions.
invented entities (1)
  • WriteSAE rank-1 matrix atoms (no independent evidence)
    purpose: To serve as interpretable basis elements for recurrent cache updates
    New representational choice introduced by the paper; no independent evidence outside the SAE reconstruction loss is provided.

pith-pipeline@v0.9.0 · 5784 in / 1450 out tokens · 54138 ms · 2026-05-21T07:42:32.181715+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 12 internal anchors

  1. [1]

    Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L

    URL https://arxiv.org/abs/2506.15156. Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Be...

  2. [2]

    Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts

    URL https://transformer-circuits.pub/2025/attribution-graphs/methods.html. Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts. Mechanistic evaluation of transformers and state space models,

  3. [3]

    Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z

    URL https://arxiv.org/abs/2505.15105. Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems (NeurIPS),

  4. [4]

    URLhttps://arxiv.org/abs/2501.14926. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan H...

  5. [5]

    Bart Bussmann, Patrick Leask, and Neel Nanda

    URL https://arxiv.org/abs/2506.20790. Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders,

  6. [6]

    Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

    URL https://arxiv.org/abs/2412.06410. Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders,

  7. [7]

    Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S

    URLhttps://arxiv.org/abs/2503.17547. Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S. Abdelfattah. xKV: Cross-layer SVD for KV-cache compression,

  8. [8]

    Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso

    URL https://arxiv.org/abs/2503.18893. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS),

  9. [9]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey

    URL https://arxiv.org/abs/2506.05239. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations,

  10. [11]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    URL https: //arxiv.org/abs/2405.21060. Thomas Dooms and Ward Gauderis. Finding manifolds with bilinear autoencoders.arXiv preprint arXiv:2510.16820,

  11. [12]

    Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng

    URL https://arxiv.org/abs/2510.16820. Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. MoM: Linear sequence modeling with mixture-of-memories,

  12. [13]

    Jacob Dunefsky, Philippe Chlenski, and Neel Nanda

    URLhttps://arxiv.org/abs/2502.13685. Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits.arXiv preprint arXiv:2406.11944,

  13. [14]

    URL https://arxiv.org/abs/2406.11944. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition,

  14. [15]

    Toy Models of Superposition

    URLhttps://arxiv.org/abs/2209.10652. Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear.arXiv preprint arXiv:2405.14860,

  15. [16]

    Bernhard Ganter, Gerd Stumme, and R Wille

    URL https://arxiv.org/abs/2405.14860. Lucy Farnik, Tim Lawson, Conor Houghton, and Laurence Aitchison. Jacobian sparse autoencoders: Sparsify computations, not just activations,

  16. [17]

    Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, and Kangwook Lee

    URL https://arxiv.org/abs/2502.18147. Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, and Kangwook Lee. Parameter-efficient fine-tuning of state space models,

  17. [18]

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu

    URLhttps://arxiv.org/abs/2410.09016. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093,

  18. [19]

    Scaling and evaluating sparse autoencoders

    URLhttps://arxiv.org/abs/2406.04093. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Conference on Causal Learning and Reasoning (CLeaR),

  19. [20]

    org/abs/2305.01610

    URL https://arxiv.org/abs/2305.01610. Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, and Xipeng Qiu. Towards understanding the nature of attention with low-rank sparse decomposition. In International Conference on Learning Representations (ICLR),

  20. [21]

    Thomas Jiralerspong and Trenton Bricken

    URL https://arxiv.org/abs/2506.02475. Thomas Jiralerspong and Trenton Bricken. Cross-architecture model diffing with crosscoders: Unsupervised discovery of differences between LLMs,

  21. [22]

    URL https://arxiv.org/abs/2602.11729. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. In Proce...

  22. [24]

    Aakash Lahoti, Kevin Y

    URL https://arxiv.org/abs/2602.01322. Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR),

  23. [25]

    Quantifying feature space universality across large language models via sparse autoencoders, 2025

    URL https://arxiv.org/abs/2410.06981. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on Gemma 2.arXiv preprint arXiv:2408.05147,

  24. [26]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    URL https://arxiv.org/abs/2408.05147. Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, mar

  25. [27]

    Alireza Makhzani and Brendan Frey

    URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Alireza Makhzani and Brendan Frey. k-sparse autoencoders,

  26. [28]

    k-Sparse Autoencoders

    URL https://arxiv.org/abs/1312.5663. Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In International Conference on Learning Representations,

  27. [29]

    Philipp Nazari and T

    URLhttps://arxiv.org/abs/2504.13151. Philipp Nazari and T. Konstantin Rusch. The key to state reduction in linear attention: A rank-based perspective,

  28. [30]

    Destiny Okpekpe and Antonio Orvieto

    URLhttps://arxiv.org/abs/2602.04852. Destiny Okpekpe and Antonio Orvieto. Revisiting associative recall in modern recurrent models,

  29. [31]

    URLhttps://arxiv.org/abs/2508.19029. Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.Nature, 381(6583):607–609,

  30. [32]

    Gonçalo Paulo and Nora Belrose

    doi: 10.1038/381607a0. Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features,

  31. [33]

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner

    URLhttps://arxiv.org/abs/2501.16615. Gonçalo Paulo, Thomas Marshall, and Nora Belrose. Does transformer interpretability transfer to rnns?arXiv preprint arXiv:2404.05971,

  32. [34]

    Gonçalo Paulo, Stepan Shabalin, and Nora Belrose

    URLhttps://arxiv.org/abs/2404.05971. Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability,

  33. [35]

    URLhttps://arxiv.org/abs/2501.18823. Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 “Goose” with expressive dynamic state evolution. InConfe...

  34. [36]

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda

    URL https://arxiv.org/abs/2502.15612. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024a. URL https://arxiv.org/abs/2404.16014. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonn...

  35. [37]

    Polysemanticity and capacity in neural networks.arXiv preprint arXiv:2210.01892,

    URLhttps://arxiv.org/abs/2210.01892. Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. InInternational Conference on Machine Learning (ICML),

  36. [38]

    Arnab Sen Sharma, David Atkinson, and David Bau

    doi: 10.1162/neco.1992.4.1.131. Arnab Sen Sharma, David Atkinson, and David Bau. Locating and editing factual associations in Mamba. InConference on Language Modeling (COLM),

  37. [39]

    Deltaproduct: Im- proving state-tracking in linear rnns via householder products

    URL https://arxiv.org/abs/2502.10297. Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, and Max Tegmark. Dense SAE latents are features, not bugs,

  38. [40]

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin

    URL https: //arxiv.org/abs/2506.15679. Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states.arXiv preprint arXiv:2407.04620,

  39. [41]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    URLhttps://arxiv.org/abs/2407.04620. Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models,

  40. [42]

    Retentive Network: A Successor to Transformer for Large Language Models

    URL https://arxiv.org/abs/2307.08621. Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, and Chandan Singh. Interpreting and steering state-space models via activation subspace bottlenecks.arXiv preprint arXiv:2602.22719,

  41. [43]

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L

    URLhttps://arxiv.org/abs/2602.22719. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Ol...

  42. [45]

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D

    URLhttps://arxiv.org/abs/2410.06672. Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. InAdvances in Neural Information Processing Systems (NeurIPS),

  43. [46]

    URLhttps://arxiv.org/abs/2501.17148. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang L...

  44. [47]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    URL https://github.com/sustcsonglin/flash-linea r-attention. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2024a. URL https://arxiv.org/abs/2312.06635. Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing l...

  45. [48]

    org/abs/2603.16335

    URL https://arxiv. org/abs/2603.16335. Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods,

  46. [49]

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

    URLhttps://arxiv.org/abs/2309.16042. Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. InInternational Conference on Learning Representations (ICLR),

  47. [50]

    The expression fits dominant logit shifts; tail contributions are higher-order Taylor terms. Scope. The approximation is a first-order Taylor expansion around ε = 0, supported by selectivity being ε-invariant to four decimals across ε ∈ [0.1, 3] (Section 3.2). Host-architecture analogs at Mamba-2 L24 H0 and Qwen3.5-4B L12 H8 yield negative R², identifying G a...

  48. [51]

    Bilinear

    Table 10: WriteSAE architecture variants. “Bilinear” encoder a_i = v_i^⊤ S_t w_i; “Flat” encoder is dense linear on vec(S_t). Dead-feature loss (k_aux = 256, λ_aux = 10^−2) and resampling cadence shared across rows. Columns: Variant; Encoder; Decoder; Bias; Norm constraint. FlatSAE: dense linear on vec(S_t); dense linear, d_in × n_feat; none; decoder column unit-norm. MatrixSAE: dense linear on ve...

  49. [52]

    (b) Per-layer failure rate close to the 7.6% pooled mean

    Panel B, “L9 fails most”: failure rate (%) by layer: L1 6.1%, L9 8.8%, L17 7.7%. Panel C, “Failures concentrate on small effects”: failure rate (%) by KL_ablate quartile (effect size): Q1 (small) 12.3%, Q2 8.1%, Q3 5.3%, Q4 (large) 4.9%. Pooled failure rate 7.6%; n = 4851 firings, 18 features, L1/L9/L17 head 4, Qwen3.5-0.8B. Figure 14: Deletion beats the atom ma...

  50. [53]

    J Reproducibility. Code, checkpoints, license. All scripts that produce the reported numbers, tables, and figures are in the repo snapshot at https://github.com/JackYoung27/writesae. Trained SAE checkpoints, cached Gated DeltaNet state tensors, and per-head deletion-control JSON outputs are on HuggingFace at jackyoung27/writesae-ckpts (four SAE variants × Q...
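
The “Bilinear” and “Flat” encoders named in the Table 10 caption can be related with a small NumPy sketch. It is not the authors' implementation: the shapes `d_k`, `d_v`, `n_feat` and the helper names `bilinear_encode`, `flat_encode`, and `rank1_weights` are hypothetical, chosen only for illustration. The sketch checks the standard identity v_i^⊤ S_t w_i = ⟨vec(v_i w_i^⊤), vec(S_t)⟩, which means a bilinear encoder behaves like a flat encoder whose weight rows are constrained to be flattened rank-1 matrices.

```python
import numpy as np

def bilinear_encode(S, V, W):
    """Pre-activations a_i = v_i^T S w_i for every feature i.
    S: (d_k, d_v) state matrix; V: (n_feat, d_k); W: (n_feat, d_v)."""
    return np.einsum("ik,kv,iv->i", V, S, W)

def flat_encode(S, E):
    """Dense linear encoder applied to vec(S). E: (n_feat, d_k * d_v)."""
    return E @ S.reshape(-1)

def rank1_weights(V, W):
    """Flattened rank-1 atoms v_i w_i^T, one per row: (n_feat, d_k * d_v)."""
    return np.einsum("ik,iv->ikv", V, W).reshape(V.shape[0], -1)

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
d_k, d_v, n_feat = 4, 6, 8
S = rng.normal(size=(d_k, d_v))
V = rng.normal(size=(n_feat, d_k))
W = rng.normal(size=(n_feat, d_v))

a_bilinear = bilinear_encode(S, V, W)
a_flat = flat_encode(S, rank1_weights(V, W))

# The two encoders agree when the flat weights are rank-1.
assert np.allclose(a_bilinear, a_flat)
```

Under these assumptions the bilinear form stores `n_feat * (d_k + d_v)` encoder weights, whereas an unconstrained flat encoder stores `n_feat * d_k * d_v`.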