Recognition: 2 theorem links
· Lean TheoremWriteSAE: Sparse Autoencoders for Recurrent State
Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3
The pith
WriteSAE factors decoder atoms to match rank-1 cache writes so they can be swapped directly into recurrent state models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. This yields atom substitution that beats matched-norm ablation on 92.4 percent of 4,851 firings at Qwen3.5-0.8B L9 H4, holds at 89.8 percent for the 87-atom population test, predicts measured effects at R² of 0.98, and reaches 88.1 percent substitution on Mamba-2-370M over 2,500 firings. Sustained three-position installs produce a 3 times lift in midrank target-in-continuation from 33.3 percent to 100 percent under greedy decoding.
What carries the argument
The reshaped decoder atom sized to the d_k by d_v cache update from the rank-1 product k_t v_t transpose, which carries the editing power by direct substitution into the live recurrent cache.
If this is right
- Substitution succeeds on the large majority of individual cache firings across tested models.
- The analytic formula for the logit change closely matches real observed shifts.
- Multiple atoms can be installed in sequence to produce lasting changes in generation behavior.
- The same architecture works for both hybrid transformer-recurrent models and pure state-space models.
Where Pith is reading between the lines
- This technique could be combined with existing residual SAEs to edit both the inputs and outputs of recurrent memory.
- The closed-form logit shift opens the possibility of searching for atoms that achieve desired output changes without running full generations.
- Extending the approach to larger models might reveal whether recurrent states contain more structured, interpretable features than previously accessible.
- Similar matrix-shaped autoencoders might apply to other internal matrix states in neural networks beyond language models.
Load-bearing premise
Atoms trained under matched Frobenius norm can be substituted into the live cache without unintended side effects on the model's recurrent dynamics, and the closed-form logit shift remains accurate when atoms are installed in real forward passes.
What would settle it
A test that trains atoms on one set of sequences, installs them during generation on held-out sequences, and checks whether the measured logit shifts match the closed-form predictions or whether substitution success drops below the ablation baseline.
Figures
read the original abstract
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WriteSAE, the first sparse autoencoder for decomposing and editing the matrix cache writes (rank-1 updates k_t v_t^T) in state-space and hybrid recurrent models such as Gated DeltaNet, Mamba-2, and RWKV-7. Decoder atoms are factored into the native d_k × d_v shape, a closed-form per-token logit shift is derived, training uses matched Frobenius norm for direct substitution, and experiments report atom substitution outperforming matched-norm ablation on 92.4% of 4,851 firings (Qwen3.5-0.8B L9 H4), 89.8% in an 87-atom population test, R²=0.98 closed-form prediction accuracy, 88.1% substitution on Mamba-2-370M over 2,500 firings, and sustained three-position installs achieving 3× lift in midrank target-in-continuation (33.3% to 100%) under greedy decoding.
Significance. If the central claims hold, this work meaningfully extends sparse autoencoder methods to recurrent cache writes unreachable by residual-stream SAEs, enabling precise, interpretable edits at the matrix write site. The closed-form logit-shift derivation and high predictive fidelity (R²=0.98) are notable strengths, as is the demonstration of multi-step behavioral control; these could support new directions in mechanistic interpretability and targeted model editing for recurrent architectures.
major comments (2)
- [Abstract] Abstract: quantitative results (R²=0.98, substitution rates >88%) are presented without any description of training procedure, data splits, hyperparameter choices, or controls against post-hoc selection, rendering the central empirical claims unverifiable from the provided text.
- [Closed-form derivation] Closed-form logit shift (abstract and derivation): the isolated per-token shift is derived from the rank-1 update structure, yet the manuscript provides no analysis or experiments showing that this formula remains accurate once the modified write propagates through the recurrent cache over subsequent tokens; any unmodeled interactions with existing cache state or normalization would undermine the reported R²=0.98 and substitution success rates.
minor comments (1)
- [Notation] The notation k_t v_t^T and dimensions d_k, d_v are introduced without an early explicit definition or diagram of the cache write operation, which would aid readability for readers unfamiliar with these recurrent architectures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with targeted revisions to improve verifiability and completeness while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: quantitative results (R²=0.98, substitution rates >88%) are presented without any description of training procedure, data splits, hyperparameter choices, or controls against post-hoc selection, rendering the central empirical claims unverifiable from the provided text.
Authors: We agree that the abstract would benefit from a concise summary of the experimental setup to make the quantitative claims more immediately verifiable. The full manuscript (Section 3) specifies training on 10M tokens of cache writes extracted from the target models using matched Frobenius norm loss, a held-out test split of 2,500–4,851 firings, hyperparameters (learning rate 1e-3, sparsity coefficient 0.1, batch size 128), and controls via matched-norm ablations. We will revise the abstract to include one sentence summarizing these elements (e.g., “trained via matched Frobenius norm on 10M tokens with held-out evaluation and ablation controls”). This directly addresses the verifiability concern. revision: yes
-
Referee: [Closed-form derivation] Closed-form logit shift (abstract and derivation): the isolated per-token shift is derived from the rank-1 update structure, yet the manuscript provides no analysis or experiments showing that this formula remains accurate once the modified write propagates through the recurrent cache over subsequent tokens; any unmodeled interactions with existing cache state or normalization would undermine the reported R²=0.98 and substitution success rates.
Authors: The closed-form derivation targets the immediate per-token logit shift induced by the rank-1 write substitution. All reported metrics—including R²=0.98 on measured effects, 92.4% substitution success on 4,851 firings, 88.1% on Mamba-2 over 2,500 firings, and the sustained three-position behavioral installs—are obtained from complete forward passes that propagate the modified cache state through subsequent tokens. These full-model results therefore already incorporate any interactions with prior cache entries and normalization. We acknowledge that an explicit theoretical analysis of cache-state interactions is absent from the current text. We will add a short discussion subsection (Section 4.3) that (a) notes the empirical validation via multi-token substitution and behavioral persistence and (b) reports a new ablation measuring deviation from the closed-form prediction after 1–5 recurrent steps. This constitutes a partial revision that strengthens the manuscript without altering the existing claims. revision: partial
Circularity Check
Derivation chain is self-contained with no circular reductions
full rationale
The paper derives a closed-form logit shift directly from the rank-1 update structure k_t v_t^T of the recurrent cache write, then validates it against held-out substitution measurements (R²=0.98 on n=4,851 firings) without fitting parameters to the target outcomes. Atom training uses matched Frobenius norm to enable one-for-one swaps, and success rates are reported on separate test firings for Qwen and Mamba models. No self-definitional equivalences, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided claims; the central results remain independent of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Cache writes in Gated DeltaNet, Mamba-2, and RWKV-7 occur exclusively via rank-1 updates of the form k_t v_t^T
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
WriteSAE decoder atoms are rank-1 outer products vi w_i^T shaped like GDN’s kt v_t^T ... closed form for the per-token logit shift ... trains under matched Frobenius norm
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Atom substitution beats matched-norm ablation on 92.4% of n=4,851 firings ... closed form predicts measured effects at R²=0.98
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Transformer Circuits Thread , year=
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. Transformer Circuits Thread , year=
-
[2]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. International Conference on Learning Representations , year=. 2309.08600 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Scaling and evaluating sparse autoencoders
Scaling and Evaluating Sparse Autoencoders , author=. arXiv preprint arXiv:2406.04093 , year=. 2406.04093 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
and McDougall, Callum and MacDiarmid, Monte and Freeman, C
Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jones, Andy and Cunningham, Hoagy and Turner, Nicholas L. and McDougall, Callum and MacDiarmid, Monte and Freeman, C. Daniel and Sumers, Theodore R. and Rees, Edward and Batson, Joshua and...
work page 2024
-
[5]
arXiv preprint arXiv:2404.16014 , year=
Improving Dictionary Learning with Gated Sparse Autoencoders , author=. arXiv preprint arXiv:2404.16014 , year=. 2404.16014 , eprintclass=
-
[6]
Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, W
Rajamanoharan, Senthooran and Lieberum, Tom and Sonnerat, Nicolas and Conmy, Arthur and Varma, Vikrant and Kram. Jumping Ahead: Improving Reconstruction Fidelity with. arXiv preprint arXiv:2407.14435 , year=. 2407.14435 , eprintclass=
-
[7]
and Dooms, Thomas and Rigg, Alice and Oramas, Jose M
Pearce, Michael T. and Dooms, Thomas and Rigg, Alice and Oramas, Jose M. and Sharkey, Lee , year=. Bilinear. doi:10.48550/arxiv.2410.08417 , url=. 2410.08417 , eprintclass=
-
[8]
Tracing Attention Computation Through Feature Interactions , author=. 2025 , month=
work page 2025
- [9]
-
[10]
Circuit Tracing: Revealing Computational Graphs in Language Models , author=. 2025 , month=
work page 2025
-
[11]
Advances in Neural Information Processing Systems , year=
Towards Automated Circuit Discovery for Mechanistic Interpretability , author=. Advances in Neural Information Processing Systems , year=
-
[12]
arXiv preprint arXiv:2403.00745 , year=
Kram. arXiv preprint arXiv:2403.00745 , year=. 2403.00745 , eprintclass=
-
[13]
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models , author=. International Conference on Learning Representations , year=. 2403.19647 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Conference on Causal Learning and Reasoning (
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations , author=. Conference on Causal Learning and Reasoning (. 2024 , eprint=
work page 2024
-
[15]
Ali, Ameen and Zimerman, Itamar and Wolf, Lior , booktitle=. The Hidden Attention of. 2025 , eprint=
work page 2025
-
[16]
arXiv preprint arXiv:2404.05971 , year=
Does Transformer Interpretability Transfer to RNNs? , author=. arXiv preprint arXiv:2404.05971 , year=. 2404.05971 , eprintclass=
-
[17]
and Jagadeesan, Ganesh and Singh, Sameer and Tetreault, Joel and Jaimes, Alejandro , booktitle=
Hossain, Tamanna and Logan IV, Robert L. and Jagadeesan, Ganesh and Singh, Sameer and Tetreault, Joel and Jaimes, Alejandro , booktitle=. Characterizing. 2025 , note=
work page 2025
-
[18]
arXiv preprint arXiv:2410.06672 , year=
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures , author=. arXiv preprint arXiv:2410.06672 , year=. 2410.06672 , eprintclass=
-
[19]
doi:10.48550/arxiv.2505.24244 , url=
Endy, Nir and Grosbard, Idan Daniel and Ran-Milo, Yuval and Slutzky, Yonatan and Tshuva, Itay and Giryes, Raja , year=. doi:10.48550/arxiv.2505.24244 , url=. 2505.24244 , eprintclass=
-
[20]
Investigating the Indirect Object Identification Circuit in
Ensign, Danielle and Garriga-Alonso, Adri. Investigating the Indirect Object Identification Circuit in. 2024 , eprint=. doi:10.48550/arxiv.2407.14008 , url=
-
[21]
arXiv preprint arXiv:2406.17759 , year=
Interpreting Attention Layer Outputs with Sparse Autoencoders , author=. arXiv preprint arXiv:2406.17759 , year=. 2406.17759 , eprintclass=
-
[22]
Karvonen, Adam and Rager, Can and Lin, Johnny and Tigges, Curt and Bloom, Joseph and Chanin, David and Lau, Yeu-Tong and Farrell, Eoin and McDougall, Callum and Ayonrinde, Kola and Till, Demian and Wearden, Matthew and Conmy, Arthur and Marks, Samuel and Nanda, Neel , booktitle=. 2025 , eprint=
work page 2025
-
[23]
Kurochkin, Vadim and Aksenov, Yaroslav and Laptev, Daniil and Gavrilov, Daniil and Balagansky, Nikita , journal=. 2025 , eprint=
work page 2025
-
[24]
arXiv preprint arXiv:2510.16820 , year=
Finding Manifolds With Bilinear Autoencoders , author=. arXiv preprint arXiv:2510.16820 , year=. 2510.16820 , eprintclass=
-
[25]
and Oldfield, James and Panagakis, Yannis and Nicolaou, Mihalis A
Koromilas, Panagiotis and Demou, Andreas D. and Oldfield, James and Panagakis, Yannis and Nicolaou, Mihalis A. , title=. 2026 , archivePrefix=. 2602.01322 , primaryClass=
-
[26]
arXiv preprint arXiv:2602.22719 , year=
Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks , author=. arXiv preprint arXiv:2602.22719 , year=. 2602.22719 , eprintclass=
-
[27]
Yap, Jia Qing , journal=. Behavioral Steering in a 35. 2026 , eprint=
work page 2026
-
[28]
Linear transformers are secretly fast weight programmers, 2021
Linear Transformers Are Secretly Fast Weight Programmers , author=. International Conference on Machine Learning (ICML) , year=. 2102.11174 , eprintclass=
-
[29]
and Chen, Berlin and Wang, Caitlin and Bick, Aviv and Kolter, J
Lahoti, Aakash and Li, Kevin Y. and Chen, Berlin and Wang, Caitlin and Bick, Aviv and Kolter, J. Zico and Dao, Tri and Gu, Albert , booktitle=. 2026 , eprint=
work page 2026
-
[30]
International Conference on Learning Representations (ICLR) , year=
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , author=. International Conference on Learning Representations (ICLR) , year=. 2504.20938 , eprintclass=
-
[31]
Dao, Tri and Gu, Albert , journal=. Transformers are. 2024 , eprint=
work page 2024
-
[32]
Gated Delta Networks: Improving
Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali , booktitle=. Gated Delta Networks: Improving. 2025 , eprint=
work page 2025
-
[33]
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated Linear Attention Transformers with Hardware-Efficient Training , author=. arXiv preprint arXiv:2312.06635 , year=. 2312.06635 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Parallelizing linear transformers with the delta rule over sequence length
Parallelizing Linear Transformers with the Delta Rule over Sequence Length , author=. Advances in Neural Information Processing Systems , pages=. 2024 , doi=. 2406.06484 , eprintclass=
-
[35]
and Wu, Tianyi and Wuttke, Daniel and Zhou-Zheng, Christian , booktitle=
Peng, Bo and Zhang, Ruichong and Goldstein, Daniel and Alcaide, Eric and Du, Xingjian and Hou, Haowen and Lin, Jiaju and Liu, Jiaxing and Lu, Janna and Merrill, William and Song, Guangyu and Tan, Kaifeng and Utpala, Saiteja and Wilce, Nathan and Wind, Johan S. and Wu, Tianyi and Wuttke, Daniel and Zhou-Zheng, Christian , booktitle=. 2025 , eprint=
work page 2025
-
[36]
Titans: Learning to Memorize at Test Time
Titans: Learning to Memorize at Test Time , author=. arXiv preprint arXiv:2501.00663 , year=. 2501.00663 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Hu, Jiaxi and Pan, Yongqi and Du, Jusen and Lan, Disen and Tang, Xiaqiang and Wen, Qingsong and Liang, Yuxuan and Sun, Weigao , year=. Comba: Improving Bilinear. doi:10.48550/arxiv.2506.02475 , url=. 2506.02475 , eprintclass=
-
[38]
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Lieberum, Tom and Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Sonnerat, Nicolas and Varma, Vikrant and Kram. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on. arXiv preprint arXiv:2408.05147 , year=. 2408.05147 , eprintclass=
work page internal anchor Pith review arXiv
-
[39]
arXiv preprint arXiv:2405.14860 , year=
Not All Language Model Features Are One-Dimensionally Linear , author=. arXiv preprint arXiv:2405.14860 , year=. 2405.14860 , eprintclass=
-
[40]
Transcoders Find Interpretable
Dunefsky, Jacob and Chlenski, Philippe and Nanda, Neel , journal=. Transcoders Find Interpretable. 2024 , eprint=
work page 2024
-
[41]
In-context Learning and Induction Heads
In-context Learning and Induction Heads , author=. Transformer Circuits Thread , year=. 2209.11895 , eprintclass=
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
Locating and Editing Factual Associations in
Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , journal=. Locating and Editing Factual Associations in
-
[43]
Interpretability in the Wild: A Circuit for Indirect Object Identification in
Wang, Kevin Ro and Variengien, Alexandre and Conmy, Arthur and Shlegeris, Buck and Steinhardt, Jacob , booktitle=. Interpretability in the Wild: A Circuit for Indirect Object Identification in
-
[44]
arXiv preprint arXiv:2310.10348 , year=
Attribution Patching Outperforms Automated Circuit Discovery , author=. arXiv preprint arXiv:2310.10348 , year=. 2310.10348 , eprintclass=
-
[45]
Locating and Editing Factual Associations in
Sharma, Arnab Sen and Atkinson, David and Bau, David , booktitle=. Locating and Editing Factual Associations in. 2024 , eprint=
work page 2024
-
[46]
Kang, Wonjun and Galim, Kevin and Zeng, Yuchen and Lee, Minjae and Koo, Hyung Il and Cho, Nam Ik , title =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Short Papers) , year =. doi:10.18653/v1/2025.acl-short.36 , eprint =
-
[47]
Vision Transformers Need Registers
Vision Transformers Need Registers , author=. 2023 , eprint=. doi:10.48550/arxiv.2309.16588 , url=
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.16588 2023
-
[48]
Wang, Feng and Wang, Jiahao and Ren, Sucheng and Wei, Guoyizhe and Mei, Jieru and Shao, Wei and Zhou, Yuyin and Yuille, Alan and Xie, Cihang , booktitle=. 2025 , doi=
work page 2025
-
[49]
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders , author=. 2024 , eprint=. doi:10.48550/arxiv.2409.14507 , url=
-
[50]
arXiv preprint arXiv:2401.12181 , year=
Gurnee, Wes and Horsley, Theo and Guo, Zifan Carl and Kheirkhah, Tara Rezaei and Sun, Qinyi and Hathaway, Will and Nanda, Neel and Bertsimas, Dimitris , year=. Universal Neurons in. doi:10.48550/arxiv.2401.12181 , url=. 2401.12181 , eprintclass=
-
[51]
doi:10.48550/arxiv.2510.00404 , url=
Zhu, Xudong and Khalili, Mohammad Mahdi and Zhu, Zhihui , year=. doi:10.48550/arxiv.2510.00404 , url=. 2510.00404 , eprintclass=
-
[52]
Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders , author=. 2025 , eprint=. doi:10.48550/arxiv.2511.09432 , url=
-
[53]
Transformer Circuits Thread , year=
Sparse Crosscoders for Cross-Layer Features and Model Diffing , author=. Transformer Circuits Thread , year=
-
[54]
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders , author=. 2025 , eprint=. doi:10.48550/arxiv.2512.08892 , url=
-
[55]
Deng, Boyi and Wan, Yu and Yang, Baosong and Huang, Fei and Wang, Wenjie and Feng, Fuli , booktitle=. 2026 , eprint=
work page 2026
-
[56]
and Potts, Christopher , year=
Wu, Zhengxuan and Arora, Aryaman and Geiger, Atticus and Wang, Zheng and Huang, Jing and Jurafsky, Dan and Manning, Christopher D. and Potts, Christopher , year=
-
[57]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba: Linear-Time Sequence Modeling with Selective State Spaces , author=. 2023 , archivePrefix=. 2312.00752 , primaryClass=
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Transformers Represent Belief State Geometry in Their Forward Pass , author=. 2024 , archivePrefix=. 2405.15943 , primaryClass=
-
[59]
Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Golechha, S., and Bloom, J
Learning Multi-Level Features with Matryoshka Sparse Autoencoders , author=. 2025 , archivePrefix=. 2503.17547 , primaryClass=
-
[60]
Bussmann, Bart and Leask, Patrick and Nanda, Neel , year=
-
[61]
Localizing Model Behavior with Path Patching , author=. 2023 , archivePrefix=. 2304.05969 , primaryClass=
-
[62]
Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran. Transformers are. International Conference on Machine Learning (ICML) , pages=. 2020 , archivePrefix=. 2006.16236 , primaryClass=
-
[63]
Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks , author=. Neural Computation , volume=. 1992 , doi=
work page 1992
-
[64]
Advances in Neural Information Processing Systems (NeurIPS) , year=
Using Fast Weights to Attend to the Recent Past , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
-
[65]
Transformer Circuits Thread , year=
A Mathematical Framework for Transformer Circuits , author=. Transformer Circuits Thread , year=
-
[66]
Emergence of simple-cell receptive field properties by learning a sparse code for natural images , author=. Nature , volume=. 1996 , doi=
work page 1996
-
[67]
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Sun, Yu and Li, Xinhao and Dalal, Karan and Xu, Jiarui and Vikram, Arjun and Zhang, Genghan and Dubois, Yann and Chen, Xinlei and Wang, Xiaolong and Koyejo, Sanmi and Hashimoto, Tatsunori and Guestrin, Carlos , journal=. Learning to (Learn at Test Time):. 2024 , archivePrefix=. 2407.04620 , primaryClass=
work page internal anchor Pith review arXiv 2024
-
[68]
Open Problems in Mechanistic Interpretability
Open Problems in Mechanistic Interpretability , author=. 2025 , archivePrefix=. 2501.16496 , primaryClass=
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[69]
Advances in Neural Information Processing Systems (NeurIPS) , year=
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
-
[70]
Steering Language Models With Activation Engineering
Steering Language Models With Activation Engineering , author=. 2023 , archivePrefix=. 2308.10248 , primaryClass=
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Extracting Latent Steering Vectors from Pretrained Language Models , author=. Findings of ACL , year=
- [72]
-
[73]
Gokaslan, Aaron and Cohen, Vanya , title =. 2019 , howpublished =
work page 2019
-
[74]
Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...
- [75]
-
[76]
The Key to State Reduction in Linear Attention: A Rank-based Perspective , author=. 2026 , archivePrefix=. 2602.04852 , primaryClass=
-
[77]
Sun, Xiaoqing and Stolfo, Alessandro and Engels, Joshua and Wu, Ben and Rajamanoharan, Senthooran and Sachan, Mrinmaya and Tegmark, Max , title =. 2025 , archivePrefix =. 2506.15679 , primaryClass =
-
[78]
Sparse Autoencoders Trained on the Same Data Learn Different Features , year =
Paulo, Gon. Sparse Autoencoders Trained on the Same Data Learn Different Features , year =
-
[79]
Jiralerspong, Thomas and Bricken, Trenton , title =. 2026 , archivePrefix =. 2602.11729 , primaryClass =
-
[80]
Lan, Michael and Torr, Philip and Meek, Austin and Khakzar, Ashkan and Krueger, David and Barez, Fazl , title =. 2024 , archivePrefix =. 2410.06981 , primaryClass =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.