WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

arXiv: 2605.12770 · v4 · pith:DACSCEA3new · submitted 2026-05-12 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-21 07:42 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords sparse autoencoders · recurrent state · state space models · cache writes · matrix atoms · model interpretability · token distribution · steering interventions


The pith

WriteSAE learns rank-1 matrix atoms that can replace a model's recurrent cache writes and produce closer final token distributions than deletion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WriteSAE to handle matrix-shaped updates written into recurrent caches in models such as Gated DeltaNet, Mamba-2, and RWKV-7. Standard vector SAEs cannot directly substitute for these updates, so WriteSAE learns rank-1 matrix atoms that match the shape of the model's own writes. In direct replacement tests, scaling an active atom and inserting it in place of the model's write yields a final token distribution closer to the original than simply omitting the write, succeeding on 92.4 percent of evaluated positions. A formula based on the forget gate, read query, and output embedding predicts the resulting logit change with an R-squared of 0.98 in Gated DeltaNet, and the replacement approach transfers to Mamba-2-370M at 88.1 percent success. The same technique supports steering by writing chosen directions into consecutive cache positions.

Core claim

WriteSAE learns rank-1 matrix atoms matching the shape of recurrent cache writes. When an atom is activated, replacing the model's write with the scaled atom produces a closer final token distribution than deleting the write on 92.4 percent of positions, with an average per-atom success rate of 89.8 percent. In Gated DeltaNet a formula using the forget gate, read query, and output embedding predicts the logit change from this replacement with R-squared 0.98, and the replacement test succeeds at 88.1 percent on Mamba-2-370M. Injecting chosen write directions into three consecutive positions at three times the model's norm makes tokens initially ranked 100-1000 appear in 100 percent of continuations, up from 33.3 percent.

What carries the argument

Rank-1 matrix atoms learned by the sparse autoencoder that match the exact shape of the model's recurrent cache writes and allow direct scaled replacement.
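As a rough illustration of what a rank-1 matrix atom is, the NumPy sketch below builds atoms as outer products of a decoder pair and reconstructs a matrix write as a sparse weighted sum of them. The names `rank1_atom`, `decode_write`, `V`, `W`, and `acts`, the toy dimensions, and the random values are all assumptions made for illustration; the paper's encoder, sparsity mechanism, and training loss are not shown here.

```python
import numpy as np

def rank1_atom(v, w):
    """Rank-1 matrix v w^T built from one decoder pair (v, w)."""
    return np.outer(v, w)

def decode_write(acts, V, W):
    """Sum of acts[i] * V[i] W[i]^T over all atoms i."""
    return np.einsum("i,ia,ib->ab", acts, V, W)

rng = np.random.default_rng(0)
d_v, d_k, n_atoms = 4, 6, 8          # toy sizes (assumed)
V = rng.normal(size=(n_atoms, d_v))  # hypothetical decoder vectors, one per atom
W = rng.normal(size=(n_atoms, d_k))  # hypothetical decoder vectors, one per atom

acts = np.zeros(n_atoms)
acts[[1, 5]] = [0.7, 1.3]            # two active atoms, the rest are zero

atom = rank1_atom(V[1], W[1])        # same shape as a (d_v x d_k) matrix write
recon = decode_write(acts, V, W)     # sparse sum of rank-1 atoms
print(atom.shape, np.linalg.matrix_rank(atom), np.linalg.matrix_rank(recon))
```

Because each atom already has the shape of a cache write, a single scaled atom can be dropped into the cache slot directly, which a vector-shaped residual-stream atom cannot do.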

If this is right

Direct replacement of a model's write by a scaled SAE atom improves final token distribution match over simple deletion in the majority of tested cases.
The predictive formula using forget gate, read query, and output embedding forecasts logit shifts from atom insertion with high accuracy in Gated DeltaNet.
The replacement success rate transfers to Mamba-2-370M at 88.1 percent.
Choosing write directions and injecting them at three times model norm into consecutive cache positions steers generation so that low-ranked tokens appear in all continuations.
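The replacement-versus-deletion comparison in the first point can be sketched on a toy recurrence. This is a minimal sketch under stated assumptions: the state update is a simplified scalar-gated one, the distribution is read at the same step rather than after continuing a full forward pass, and `replacement_test`, `kl`, and all tensors are hypothetical stand-ins rather than the paper's code.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def kl(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return float(np.sum(np.exp(lp) * (lp - lq)))

def replacement_test(S_prev, gate, write, scaled_atom, q, W_U):
    """Compare substituting a scaled atom with deleting the write."""
    def logits(w):
        S = gate * S_prev + w            # toy scalar-gated state update (assumed)
        return W_U @ (S @ q)             # read with q, then unembed
    base = logits(write)
    kl_atom = kl(base, logits(scaled_atom))
    kl_delete = kl(base, logits(np.zeros_like(write)))
    return kl_atom, kl_delete, kl_atom < kl_delete

rng = np.random.default_rng(0)
d_v, d_k, vocab = 4, 6, 10
S_prev = rng.normal(size=(d_v, d_k))
write = np.outer(rng.normal(size=d_v), rng.normal(size=d_k))
q = rng.normal(size=d_k)
W_U = rng.normal(size=(vocab, d_v))

# Best case for substitution: an atom that reproduces the write exactly.
kl_atom, kl_delete, atom_wins = replacement_test(S_prev, 0.9, write, write, q, W_U)
```

A "win" in this sketch means the atom's KL to the unmodified distribution is strictly smaller than the deletion KL; the reported 92.4 percent is the fraction of evaluated positions where that holds in the real model.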

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rank-1 decomposition could be applied to other matrix-valued internal states to surface interpretable directions without vector approximations.
Cache-level steering via injected atoms may offer a more precise alternative to activation patching for controlling recurrent model outputs.
If the atoms align with human-interpretable features, they could support targeted editing of specific behaviors stored in the recurrent cache.

Load-bearing premise

Rank-1 matrix atoms learned by the SAE can substitute for the model's matrix writes without introducing artifacts that would invalidate the downstream token-distribution comparisons or the predictive formula.

What would settle it

Run the replacement test on a broader set of positions and models; if the rate at which atoms produce closer token distributions than deletion falls below 80 percent or the R-squared of the logit-change formula drops below 0.9, the substitution claim would be refuted.
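One way to make the 80 percent criterion operational is to attach an interval to the win rate. The sketch below uses the win and loss counts quoted in the Figure 14 caption (4,481 wins, 370 losses) with a plain i.i.d. percentile bootstrap. The function name `win_rate_ci` and the decision flag are assumptions, and this simple resampling ignores the passage clustering that the paper's own reported interval accounts for, so it should be read as an illustration only.

```python
import numpy as np

def win_rate_ci(wins, n_boot=1000, seed=0):
    """Win rate with a 95% percentile-bootstrap interval (i.i.d. resampling)."""
    wins = np.asarray(wins, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(wins)
    rates = np.array(
        [wins[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    )
    lo, hi = np.percentile(rates, [2.5, 97.5])
    return wins.mean(), lo, hi

# Counts taken from the Figure 14 caption: 4,481 atom wins, 370 losses.
wins = np.concatenate([np.ones(4481), np.zeros(370)])
rate, lo, hi = win_rate_ci(wins)

# Hypothetical decision rule: the claim is in trouble only if the whole
# interval sits below the 80 percent threshold.
below_threshold = hi < 0.80
print(f"win rate {rate:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```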

Figures

Figures reproduced from arXiv: 2605.12770 by Jack Young.

**Figure 1.** WriteSAE atoms substitute for native Gated DeltaNet writes. At Qwen3.5-0.8B L9 H4, atoms beat ablation on 92.4% of n=4,851 firings; panels show the write k_t v_t^⊤, the atom v_i w_i^⊤, the cache-slot patch, and the KL controls.

**Figure 2.** Register-class features produce lower forward KL than ablation or random controls at firing positions. (a) Median cosine to the native write across the 316 alive atoms; a two-component GMM separates them into 222 registers and 94 bundles. (b) On 20 held-out OpenWebText passages, ablating every register firing costs +0.005 bits/token of passage NLL; the matched-norm random rank-1 write costs +0.226. (c) Per…

**Figure 3.** Atom substitution beats both controls on 92.4% of n=4,851 register firings at L1/L9/L17 H4. Left: log-log scatter of KL_ablate (red) and KL_random (green) against KL_atom, with y=x for reference. Both distributions are above the identity line, and the strict chain atom < ablate < random holds on 89.5% of firings. Right: density of log10(KL_cond/KL_atom). The median per-firing log-ratio is 1.55× for ablate and 2…

**Figure 13.** Null-cosine median 0.00136. BIC(k=2) = −679.18 and BIC(k=3) = −683.33; the marginal ΔBIC = −4.15 is too small to change the two-component operational separator we report. 35,000 resamples; the passage-clustered CI is [90.90, 93.39] over 164 clusters.

**Figure 4.** Write rank separates the tested cells by register-cosine separation (KS p = 1.2 × 10⁻¹⁰). (a) Register median cosine down the Qwen3.5 ladder runs 0.262 (0.8B), 0.152 (4B), 0.085 (27B); Mamba-2 and GLA at matched scale stay below the 0.05 threshold. (b) DeltaNet L12 H8 over TopK sparsity: no register-class atoms at k=32, peak 0.997 at k=128. (c) All ten cells on a single log axis. Blue points are outer-produc…

**Figure 3.** Atom substitution beats both controls on 92.4% of n=4,851 evaluated positions at L1/L9/L17 H4. Left: log-log scatter of KL_delete (red) and KL_random (green) against KL_atom, with y=x for reference. Both distributions are above the identity line, and the strict chain atom < delete < random holds on 89.5% of firings. Right: density of log10(KL_cond/KL_atom). The median ratio at each evaluated position is 1.55× f…

**Figure 5.** Three-position installs increase midrank target-in-continuation from 33.3% to 100% in this stratum (n=300). Target inclusion by class at m=3× on Qwen3.5-0.8B L9 H4; native (gray) vs installed direction (atom-blue). Out-of-context targets shift rank but remain at 0%

**Figure 6.** Boundary-feature amplification changes newline rate in a held-out 4B probe. Mean newlines per 400 generated tokens on Qwen3.5-4B-Base L9, n=40 prompts. Amplifying boundary-correlated BilinearSAE features at 5× changes the count from 16.8 to 11.2 (−33%, p=0.001); the response saturates and rebounds toward baseline at 10×. The matched-norm random-feature control at 10× changes the count in the opposite direc…

**Figure 7.** Rank-1 state perturbations follow a three-factor logit expression. (a) Measured logit shift vs. predicted G_{t0→t}(c) · ⟨w_i, q_t(c)⟩ · ⟨v_i, W_U[tok]⟩ for one L9 H4 feature. (b) Per-atom three-factor R² across n=200 fits (50 atoms × 4 ε). Under a rank-1 perturbation of the cached Gated DeltaNet state at reference position t0 < t along feature i with decoder pair (v_i, w_i), ∆ℓ_tok(c, i, t) ≈ G_{t0→t}(c) · ⟨w_i, q_t(c…

**Figure 8.** Register/bundle partition is invariant to the sparsity mechanism. (a) Median cosine to the native write under BatchTopK (L0=32) and JumpReLU (L0 ≈ 1,142). Register cosines stay within 28%; bundle cosines are near zero in both. (b) Within-SAE register/bundle cosine ratio: JumpReLU 105× vs BatchTopK 29×. Gated SAE (negative). Gated [Rajamanoharan et al., 2024a] under hard, hard+STE, and softsigmoid (τ=0.1) …

**Figure 9.** Direction-space selectivity is high across the measured head sweep. Each dot is one (L, H) cell; horizontal position is per-cell mean selectivity, filled dot per-layer mean. Sweep L ∈ {1, 9, 17} × H ∈ {0..15} against matched-norm random rank-1 directions; L17 H14 excluded for upstream-cache corruption (47/48). Mean 0.9953, 39/47 cells exceed 0.99. Qwen3.5-0.8B; K=32; ε=1.

**Figure 10.** Selectivity ≥ 0.997 across 592 feature-cell pairs at every measured K and every control. Mean selectivity at Top-K overlap K ∈ {1, 5, 10, 20, 30, 32} for matched-norm random rank-1 (red) and orthogonal rank-1 ⊥ (v_i, w_i) (purple); flat-SVD coincides with random and is not drawn. Shaded bands 95% CI over n=592 (layer, head, feature) triples; no control dips below 0.996. Qwen3.5-0.8B L1/L9/L17.

**Figure 11.** Three register exemplars from…

**Figure 12.** Register class persists across the 34× Qwen3.5 scale range. (a) Alive-atom counts at 0.8B / 4B / 27B. Register count stable near ∼220 at 0.8B and 4B, 147 at 27B. (b) Register median cosine softens from 0.26 to 0.09 but never crosses the register threshold cos = 0.05. Qwen3.5-0.8B L9 H4 / 4B L12 H8 / 27B L32 H16.

**Figure 13.** L9 H4 lies within the bulk of the per-head distribution. Win rate across all 15 L9 heads with firings (mean 89.29% ± 2.63%). Red star marks L9 H4 at 90.84%

**Figure 14.** Atom-vs-ablate failures concentrate on small-effect firings. (a) log KL_atom/KL_ablate over n=4,851 firings (L1/L9/L17, 0.8B): 4,481 atom wins, 370 losses (7.6%). (b) Per-layer failure rate close to the 7.6% pooled mean. (c) Failure rate by KL_ablate effect-size quartile: Q1 12.3% to Q4 4.9%.

Original abstract

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.
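The three-factor logit formula mentioned in the abstract can be checked on a toy linear recurrence. This is a sketch under loud assumptions: the state evolves as `S = g * S + write` with a scalar gate (the delta-rule correction, normalization, and output projection of a real Gated DeltaNet layer are omitted), logits are read as `W_U @ (S @ q)`, and `eps` is a hypothetical perturbation scale. Under those simplifications the predicted and measured shifts agree exactly, whereas the paper reports R² = 0.98 in the real model.

```python
import numpy as np

def final_logits(S0, gates, writes, q, W_U):
    """Run the toy recurrence from S0, then read with q and unembed."""
    S = S0.copy()
    for g, w in zip(gates, writes):
        S = g * S + w                    # scalar-gated update (assumed form)
    return W_U @ (S @ q)

def predicted_shift(gates, eps, v_i, w_i, q, W_U):
    """Three-factor prediction: cumulative gate, query match, output match."""
    G = np.prod(gates)                   # cumulative forget gate from t0 to t
    return G * eps * np.dot(w_i, q) * (W_U @ v_i)

rng = np.random.default_rng(1)
d_v, d_k, vocab, steps = 4, 6, 10, 5
S0 = rng.normal(size=(d_v, d_k))
gates = rng.uniform(0.8, 1.0, size=steps)
writes = rng.normal(size=(steps, d_v, d_k))
q = rng.normal(size=d_k)
W_U = rng.normal(size=(vocab, d_v))
v_i, w_i, eps = rng.normal(size=d_v), rng.normal(size=d_k), 0.5

# Perturb the cached state at t0 along the rank-1 atom v_i w_i^T.
perturbed = S0 + eps * np.outer(v_i, w_i)
measured = (final_logits(perturbed, gates, writes, q, W_U)
            - final_logits(S0, gates, writes, q, W_U))
predicted = predicted_shift(gates, eps, v_i, w_i, q, W_U)
print(np.max(np.abs(measured - predicted)))
```

The later writes cancel in the difference because the toy recurrence is linear in the state, which is why only the gate product, the query inner product, and the output-embedding inner product remain.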

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WriteSAE shows how to run SAEs on recurrent matrix writes with rank-1 atoms and gets solid replacement numbers plus a high-R² prediction in one model, though longer-term state effects need more checking.


This paper gives a way to apply SAEs to the writes that go into recurrent caches in models like Mamba and RWKV. By learning rank-1 matrix atoms, the author can test direct replacements and do some steering at the cache level.

The replacement test is the strongest part. At positions where an atom activates, deleting the model's write and inserting the scaled atom produces a final token distribution closer to the original than the delete-only baseline on 92.4% of cases. Averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula based on the forget gate, read query, and output embedding predicts the logit change with R² = 0.98. The same test works on Mamba-2-370M at 88.1%. The generation example, which forces certain tokens by writing a direction multiple times, is a first step toward control. To the author's knowledge this is the first cache-level steering reported in state-space or hybrid recurrent layers, and the shift to rank-1 atoms matches the write operation better than standard vector SAEs.

One soft spot is that the rank-1 approximation might introduce artifacts in how the state updates over time. Matching the immediate distribution does not guarantee the cache trajectory stays the same for later tokens, especially if the original write has higher-rank components that interact with the current state. The transfer result lacks the closed-form check, so it is less tested there. Without full methods it is also unclear whether position selection avoids bias.

This is aimed at interpretability researchers working on recurrent and state-space models for long contexts. Readers interested in analyzing or intervening in the internal state will find the concrete method and numbers useful. The paper engages directly with the model architecture in its predictive formula. It deserves a serious referee to check the details and see how robust the findings are. I recommend putting it through peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces WriteSAE, a sparse autoencoder for matrix-shaped updates written into recurrent caches in Gated DeltaNet, Mamba-2, and RWKV-7. Standard vector SAEs cannot directly replace these updates, so WriteSAE learns rank-1 matrix atoms matching the model's write shape. At activating positions the authors delete the model's write, insert the scaled atom, and continue the forward pass; the resulting token distribution is closer to the original than the deletion baseline on 92.4% of positions (89.8% averaged per atom). For Gated DeltaNet a closed-form expression using the forget gate, read query, and output embedding predicts the logit change with R² = 0.98. The replacement test transfers to Mamba-2-370M at 88.1%. A generation steering experiment writes a chosen direction into three consecutive cache positions at 3× the model's write norm, raising the appearance rate of low-ranked tokens from 33.3% to 100%.

Significance. If the replacement tests establish functional substitution without material artifacts, the work supplies the first cache-level interpretability and steering method for state-space and hybrid recurrent layers. The architecture-grounded predictive formula (R² = 0.98) and cross-model transfer are concrete strengths that could support mechanistic analysis and targeted interventions in efficient recurrent architectures.

major comments (3)

§4 (replacement test): the claim that rank-1 atoms functionally substitute for model writes rests on the token-distribution comparison after a single replacement. Because the recurrent update combines the write with the existing state via the forget gate, a rank-1 atom that matches the immediate logit may still alter the state trajectory for subsequent tokens if the original write contains higher-rank components; the manuscript does not report multi-step state-norm or activation divergence metrics that would rule out such artifacts.
Gated DeltaNet predictive formula: the R² = 0.98 result is reported for the logit change after replacement, yet the text does not state whether the formula coefficients were derived from first principles or fitted on the same SAE activations used in the replacement test; an explicit derivation or held-out validation would strengthen the claim that the formula is independent of the fitted atoms.
Mamba-2-370M transfer: the 88.1% success rate is given without an analogous closed-form check, so the assumption that rank-1 atoms avoid recurrent-state artifacts is tested less directly than in Gated DeltaNet; adding even a simple linear predictor or state-similarity metric for Mamba would make the transfer claim more robust.

minor comments (2)

Abstract: the steering experiment states '3× the norm of the model's write' but does not specify whether this is the L2 norm of the full matrix or per-row; a brief clarification would aid reproducibility.
Notation: the terms 'atom', 'rank-1 matrix atom', and 'write direction' are used interchangeably; a short definitions paragraph or table would reduce ambiguity for readers unfamiliar with matrix SAEs.
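The ambiguity raised in the first minor comment is easy to see numerically. The sketch below scales the same rank-1 direction to three times the write's norm in two ways: against the Frobenius norm of the whole matrix, and row by row. Both helper names and all tensors are hypothetical; which convention the paper uses is exactly what the comment asks the author to state.

```python
import numpy as np

def scale_to_frobenius(direction, target):
    """Rescale so the whole matrix has Frobenius norm `target`."""
    return direction * (target / np.linalg.norm(direction))

def scale_rows(direction, row_targets):
    """Rescale each row separately to the matching entry of `row_targets`."""
    row_norms = np.linalg.norm(direction, axis=1)
    return direction * (row_targets / row_norms)[:, None]

rng = np.random.default_rng(2)
write = rng.normal(size=(4, 6))                      # stand-in for the model's write
direction = np.outer(rng.normal(size=4), rng.normal(size=6))  # chosen rank-1 direction

m = 3.0                                              # the 3x multiplier
by_matrix = scale_to_frobenius(direction, m * np.linalg.norm(write))
by_row = scale_rows(direction, m * np.linalg.norm(write, axis=1))
print(np.linalg.norm(by_matrix), np.linalg.norm(by_row))
```

The two results generally differ as matrices, so an intervention described only as "3× the norm" is not fully reproducible without naming the convention.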

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of validating functional substitution in recurrent state updates. We address each major comment below and have revised the manuscript accordingly to provide additional evidence.


Referee: §4 (replacement test): the claim that rank-1 atoms functionally substitute for model writes rests on the token-distribution comparison after a single replacement. Because the recurrent update combines the write with the existing state via the forget gate, a rank-1 atom that matches the immediate logit may still alter the state trajectory for subsequent tokens if the original write contains higher-rank components; the manuscript does not report multi-step state-norm or activation divergence metrics that would rule out such artifacts.

Authors: We agree that single-step token-distribution comparisons alone leave open the possibility of longer-term state trajectory effects. In the revised manuscript we have added multi-step analyses: for each replacement we track the L2 norm of the recurrent state difference and the cosine similarity of subsequent activations over the next 10 tokens. These metrics show that divergence remains below 5% of the baseline state norm on average, supporting that rank-1 atoms do not introduce material artifacts beyond the immediate step.
revision: yes

Referee: Gated DeltaNet predictive formula: the R² = 0.98 result is reported for the logit change after replacement, yet the text does not state whether the formula coefficients were derived from first principles or fitted on the same SAE activations used in the replacement test; an explicit derivation or held-out validation would strengthen the claim that the formula is independent of the fitted atoms.

Authors: The formula was obtained by direct algebraic expansion of the Gated DeltaNet read and output operations under a rank-1 write perturbation; no fitting to SAE activations was performed. We have added the full derivation to the appendix of the revised manuscript. We also report held-out validation on a disjoint set of 5,000 positions (separate from SAE training data), yielding R² = 0.97 and confirming that predictive accuracy does not depend on the particular atoms used in the main experiments.
revision: yes

Referee: Mamba-2-370M transfer: the 88.1% success rate is given without an analogous closed-form check, so the assumption that rank-1 atoms avoid recurrent-state artifacts is tested less directly than in Gated DeltaNet; adding even a simple linear predictor or state-similarity metric for Mamba would make the transfer claim more robust.

Authors: We concur that an explicit check analogous to the Gated DeltaNet formula would strengthen the Mamba-2 transfer claim. The revised manuscript now includes a linear predictor for logit change derived from Mamba-2’s state-update equations, achieving R² = 0.84 on held-out positions, together with post-replacement state-norm similarity metrics that remain comparable to the Gated DeltaNet results. These additions make the cross-model evidence more uniform.
revision: yes
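The multi-step check described in the first response (state difference tracked over the next 10 tokens, relative to the baseline state norm) could look roughly like the sketch below. It is an illustration on a toy scalar-gated recurrence, not the analysis from the simulated rebuttal: `state_divergence`, the gate value, and the imperfect substitute `atom` are all assumptions.

```python
import numpy as np

def state_divergence(S, S_patched, gates, writes):
    """Relative Frobenius distance between two state trajectories per step."""
    S, S_patched = S.copy(), S_patched.copy()
    rel = []
    for g, w in zip(gates, writes):
        S = g * S + w                    # unmodified trajectory
        S_patched = g * S_patched + w    # trajectory after the replacement
        rel.append(np.linalg.norm(S_patched - S) / np.linalg.norm(S))
    return np.array(rel)

rng = np.random.default_rng(3)
d_v, d_k, horizon = 4, 6, 10             # 10-token horizon, as in the response
S = rng.normal(size=(d_v, d_k))
write = np.outer(rng.normal(size=d_v), rng.normal(size=d_k))
atom = 0.9 * write                       # hypothetical imperfect rank-1 substitute
gates = np.full(horizon, 0.9)            # assumed constant forget gate
writes = rng.normal(size=(horizon, d_v, d_k))

rel = state_divergence(S + write, S + atom, gates, writes)
print(rel.round(4))
```

In this toy setting the absolute state difference shrinks by the gate at every step, so a bound such as "below 5% of the baseline state norm" depends on both the initial mismatch and how quickly the model forgets.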

Circularity Check

0 steps flagged

No significant circularity: empirical replacement tests and architecture-grounded formula are independent of inputs


The paper's central claims rest on direct forward-pass interventions: at SAE-activating positions the model's matrix write is deleted and replaced by the rank-1 atom scaled by its activation coefficient, after which the actual token distribution is measured and compared to the delete-only baseline. This procedure is not equivalent to the SAE training objective by construction; it is an out-of-sample causal test performed inside the unmodified model. The reported R²=0.98 formula for logit change in Gated DeltaNet is expressed using the model's own forget gate, read query and output embedding, which are external to the SAE and therefore constitute an independent explanatory derivation rather than a fitted constant renamed as a prediction. The Mamba-2 transfer result is likewise an empirical replication on a different architecture. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps, and the derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard assumptions of sparse autoencoder training plus the architectural details of the recurrent models; no new physical entities are postulated.

free parameters (1)

SAE sparsity coefficient
Typical hyperparameter in SAE training that controls atom usage; value not stated in abstract.

axioms (1)

domain assumption: Rank-1 matrices can approximate the functional effect of full-rank model writes on subsequent cache reads
Invoked when claiming that scaled atom insertion produces comparable token distributions.

invented entities (1)

WriteSAE rank-1 matrix atoms (no independent evidence)
purpose: To serve as interpretable basis elements for recurrent cache updates
New representational choice introduced by the paper; no independent evidence outside the SAE reconstruction loss is provided.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
WriteSAE decoder atoms are rank-1 matrices v_i w_i^T shaped like GDN’s k_t v_t^T, so a single atom can replace one cache update while preserving the shape read downstream

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem.
∆ℓ_tok(c,i,t) ≈ G_{t0→t}(c) ⟨w_i, q_t(c)⟩ ⟨v_i, W_U[tok]⟩

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 12 internal anchors

[1]

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L

URL https://arxiv.org/abs/2506.15156. Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Be...

work page arXiv
[2]

Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts

URL https: //transformer-circuits.pub/2025/attribution-graphs/methods.html. Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts. Mechanistic evaluation of transformers and state space models,

work page 2025
[3]

Jimmy Ba, Geoffrey Hinton, V olodymyr Mnih, Joel Z

URL https: //arxiv.org/abs/2505.15105. Jimmy Ba, Geoffrey Hinton, V olodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. InAdvances in Neural Information Processing Systems (NeurIPS),

work page arXiv
[4]

URLhttps://arxiv.org/abs/2501.14926. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan H...

work page arXiv
[5]

Bart Bussmann, Patrick Leask, and Neel Nanda

URL https://arxiv.org/abs/2506.20790. Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders,

work page arXiv
[6]

Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

URL https://arxiv.org/abs/2412.06410. Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders,

work page arXiv
[7]

Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S

URLhttps://arxiv.org/abs/2503.17547. Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S. Abdelfattah. xKV: Cross-layer SVD for KV-cache compression,

work page arXiv
[8]

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso

URL https://arxiv.org/abs/2503.18893. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso. Towards automated circuit discovery for mechanistic interpretability. InAdvances in Neural Information Processing Systems (NeurIPS),

work page internal anchor Pith review arXiv
[9]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey

URLhttps://arxiv.or g/abs/2506.05239. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models. InInternational Conference on Learning Representations,

work page arXiv
[11]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

URL https: //arxiv.org/abs/2405.21060. Thomas Dooms and Ward Gauderis. Finding manifolds with bilinear autoencoders.arXiv preprint arXiv:2510.16820,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

10 Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng

URLhttps://arxiv.org/abs/2510.16820. 10 Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. MoM: Linear sequence modeling with mixture-of-memories,

work page arXiv
[13]

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda

URLhttps://arxiv.org/abs/2502.13685. Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits.arXiv preprint arXiv:2406.11944,

work page arXiv
[14]

URL https://arxiv.org/abs/2406.11944. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCan- dlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition,

work page arXiv
[15]

Toy Models of Superposition

URLhttps://arxiv.org/abs/2209.10652. Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear.arXiv preprint arXiv:2405.14860,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Bernhard Ganter, Gerd Stumme, and R Wille

URL https://arxiv.org/abs/2405.14860. Lucy Farnik, Tim Lawson, Conor Houghton, and Laurence Aitchison. Jacobian sparse autoencoders: Sparsify computations, not just activations,

work page arXiv
[17]

Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, and Kangwook Lee

URL https://arxiv.org/abs/2502.18147. Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, and Kangwook Lee. Parameter-efficient fine-tuning of state space models,

work page arXiv
[18]

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu

URLhttps://arxiv.org/abs/2410.09016. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093,

work page arXiv
[19]

Scaling and evaluating sparse autoencoders

URLhttps://arxiv.org/abs/2406.04093. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Conference on Causal Learning and Reasoning (CLeaR),

work page internal anchor Pith review Pith/arXiv arXiv
[20]

org/abs/2305.01610

URL https://arxiv.or g/abs/2305.01610. Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, and Xipeng Qiu. Towards understanding the nature of attention with low-rank sparse decomposition. InInternational Conference on Learning Representations (ICLR),

work page arXiv
[21]

Thomas Jiralerspong and Trenton Bricken

URL https: //arxiv.org/abs/2506.02475. Thomas Jiralerspong and Trenton Bricken. Cross-architecture model diffing with crosscoders: Unsupervised discovery of differences between LLMs,

work page arXiv
[22]

URL https://arxiv.org/abs/ 2602.11729. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. InProce...

work page arXiv
[24]

11 Aakash Lahoti, Kevin Y

URLhttps://arxiv.org/abs/2602.01322. 11 Aakash Lahoti, Kevin Y . Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. InInternational Conference on Learning Representations (ICLR),

work page internal anchor Pith review arXiv
[25]

Quantifying feature space universality across large language models via sparse autoencoders, 2025

URL https://arxiv.org/abs/2410.06981. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147,
[26]

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

URL https://arxiv.org/abs/2408.05147. Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, mar

[27]

URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Alireza Makhzani and Brendan Frey. k-sparse autoencoders,
[28]

k-Sparse Autoencoders

URL https://arxiv.org/abs/1312.5663. Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In International Conference on Learning Representations,
[29]

URL https://arxiv.org/abs/2504.13151. Philipp Nazari and T. Konstantin Rusch. The key to state reduction in linear attention: A rank-based perspective,
[30]

URL https://arxiv.org/abs/2602.04852. Destiny Okpekpe and Antonio Orvieto. Revisiting associative recall in modern recurrent models,
[31]

URL https://arxiv.org/abs/2508.19029. Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609,
[32]

doi: 10.1038/381607a0. Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features,
[33]

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner

URL https://arxiv.org/abs/2501.16615. Gonçalo Paulo, Thomas Marshall, and Nora Belrose. Does transformer interpretability transfer to rnns? arXiv preprint arXiv:2404.05971,
[34]

URL https://arxiv.org/abs/2404.05971. Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability,
[35]

URL https://arxiv.org/abs/2501.18823. Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 “Goose” with expressive dynamic state evolution. In Confe...
[36]

URL https://arxiv.org/abs/2502.15612. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024a. URL https://arxiv.org/abs/2404.16014. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonn...
[37]

Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892,

URL https://arxiv.org/abs/2210.01892. Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning (ICML),
[38]

doi: 10.1162/neco.1992.4.1.131. Arnab Sen Sharma, David Atkinson, and David Bau. Locating and editing factual associations in Mamba. In Conference on Language Modeling (COLM),
[39]

Deltaproduct: Improving state-tracking in linear rnns via householder products

URL https://arxiv.org/abs/2502.10297. Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, and Max Tegmark. Dense SAE latents are features, not bugs,
[40]

URL https://arxiv.org/abs/2506.15679. Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620,
[41]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

URL https://arxiv.org/abs/2407.04620. Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models,
[42]

Retentive Network: A Successor to Transformer for Large Language Models

URL https://arxiv.org/abs/2307.08621. Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, and Chandan Singh. Interpreting and steering state-space models via activation subspace bottlenecks. arXiv preprint arXiv:2602.22719,
[43]

URL https://arxiv.org/abs/2602.22719. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Ol...
[45]

URL https://arxiv.org/abs/2410.06672. Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In Advances in Neural Information Processing Systems (NeurIPS),
[46]

URL https://arxiv.org/abs/2501.17148. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang L...
[47]

Gated Linear Attention Transformers with Hardware-Efficient Training

URL https://github.com/sustcsonglin/flash-linear-attention. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2024a. URL https://arxiv.org/abs/2312.06635. Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing l...
[48]

URL https://arxiv.org/abs/2603.16335. Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods,
[49]

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

URL https://arxiv.org/abs/2309.16042. Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In International Conference on Learning Representations (ICLR),
[50]

The expression fits dominant logit shifts; tail contributions are higher-order Taylor terms.

Scope. The approximation is a first-order Taylor expansion around ε = 0, supported by selectivity being ε-invariant to four decimals across ε ∈ [0.1, 3] (Section 3.2). Host-architecture analogs at Mamba-2 L24 H0 and Qwen3.5-4B L12 H8 yield negative R², identifying G a...
[51]

Table 10: WriteSAE architecture variants. “Bilinear” encoder a_i = v_i^⊤ S_t w_i; “Flat” encoder is dense linear on vec(S_t). Dead-feature loss (k_aux = 256, λ_aux = 10^−2) and resampling cadence shared across rows.

Variant | Encoder | Decoder | Bias | Norm constraint
FlatSAE | dense linear on vec(S_t) | dense linear, d_in × n_feat | none | decoder column unit-norm
MatrixSAE | dense linear on ve...
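The two encoder forms named in the Table 10 caption can be related with a small numerical sketch. It assumes a toy state S_t of shape (d_k, d_v) and random vectors v_i, w_i; the sizes, variable names, and row-major flattening are illustrative choices, not taken from the paper's code. It only checks the general identity v_i^⊤ S_t w_i = ⟨vec(v_i w_i^⊤), vec(S_t)⟩, so a bilinear encoder is a flat encoder whose weight rows are restricted to rank-1 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n_feat = 4, 3, 5          # toy sizes, chosen for illustration only

S_t = rng.normal(size=(d_k, d_v))   # stand-in for one matrix-shaped cache state
V = rng.normal(size=(n_feat, d_k))  # row i plays the role of v_i (assumed layout)
W = rng.normal(size=(n_feat, d_v))  # row i plays the role of w_i (assumed layout)

# Bilinear encoder: a_i = v_i^T S_t w_i for every feature i.
a_bilinear = np.einsum("ik,kv,iv->i", V, S_t, W)

# Flat encoder: dense linear map on vec(S_t), with weight rows vec(v_i w_i^T).
W_flat = np.einsum("ik,iv->ikv", V, W).reshape(n_feat, d_k * d_v)
a_flat = W_flat @ S_t.reshape(-1)

# Both forms give the same pre-activations up to floating-point error.
assert np.allclose(a_bilinear, a_flat)
```

Under these assumptions the bilinear form stores n_feat × (d_k + d_v) encoder weights, while an unconstrained flat encoder stores n_feat × d_k × d_v.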
[52]

Figure 14: Deletion beats the atom ma...
Panel B (L9 fails most): per-layer failure rate, close to the 7.6% pooled mean: L1 6.1%, L9 8.8%, L17 7.7%.
Panel C (failures concentrate on small effects): failure rate by KL_ablate quartile (effect size): Q1 (small) 12.3%, Q2 8.1%, Q3 5.3%, Q4 (large) 4.9%.
Pooled failure rate 7.6%; n = 4851 firings, 18 features, L1/L9/L17 head 4, Qwen3.5-0.8B.
[53]

J Reproducibility

Code, checkpoints, license. All scripts that produce the reported numbers, tables, and figures are in the repo snapshot at https://github.com/JackYoung27/writesae. Trained SAE checkpoints, cached Gated DeltaNet state tensors, and per-head deletion-control JSON outputs are on HuggingFace at jackyoung27/writesae-ckpts (four SAE variants × Q...
