Understanding Transformers through the Lens of Pavlovian Conditioning

Mu Qiao

arxiv: 2508.08289 · v2 · submitted 2025-08-05 · 💻 cs.LG · cs.AI · q-bio.NC

Pith reviewed 2026-05-19 00:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.NC

keywords: transformer attention, Pavlovian conditioning, Hebbian rule, associative memory, capacity theorem, linear attention, error propagation, biological plausibility

The pith

Queries, keys, and values in transformer attention map directly to the stimuli and responses of classical conditioning, so each attention step builds a temporary Hebbian associative memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reinterprets the attention mechanism as a computational version of Pavlovian conditioning. Queries act as test stimuli that probe existing associations, keys serve as retrieval cues, and values supply the response information. Under this mapping, each attention operation forms and retrieves associations through a simple Hebbian rule, exactly as linear attention does. This view produces a capacity result: attention heads can hold O(sqrt(d_k)) associations for guaranteed error-free retrieval in the worst case, while average-case fidelity grows as O(d_k). The same lens also shows how error accumulates across layers and points to biologically inspired rules that could strengthen current architectures.

Core claim

Attention's queries, keys, and values map to test stimuli, conditional stimuli, and unconditional stimuli in classical conditioning. Each attention operation therefore constructs a transient associative memory via a Hebbian rule in which CS-US pairs create dynamic associations that later test stimuli can retrieve. The linearized model yields a capacity theorem in which attention heads store O(sqrt(d_k)) associations for worst-case error-free retrieval while average-case fidelity scales as O(d_k), plus an error-propagation analysis that identifies trade-offs among depth, width, and head redundancy, and an account of how biologically plausible learning rules could improve transformers.

What carries the argument

The direct mapping of linear attention to Pavlovian conditioning elements, with queries probing associations, keys acting as retrieval cues, and values supplying response content, all linked by a Hebbian association rule.
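The mapping can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the paper's code: the function names `hebbian_memory` and `linear_attention` are hypothetical, the attention is unnormalized and non-causal, and the random inputs are placeholders. It accumulates one outer product per key-value (CS-US) pair and checks that probing the resulting memory with the queries gives the same result as Q (K^T V).

```python
import numpy as np

def hebbian_memory(K, V):
    """Accumulate one outer product k_j^T v_j per (CS, US) pair."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    for k, v in zip(K, V):
        S += np.outer(k, v)  # Hebbian update: co-active cue and response
    return S

def linear_attention(Q, K, V):
    """Unnormalized, non-causal linear attention: Q (K^T V)."""
    return Q @ (K.T @ V)

# Placeholder data (assumption): 5 tokens, d_k = 16, d_v = 8.
rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = hebbian_memory(K, V)   # transient associative memory
retrieved = Q @ S          # test stimuli probe the memory
print(np.allclose(retrieved, linear_attention(Q, K, V)))
```

A constant scaling such as 1/d_k multiplies both sides equally, so it would not change the equivalence shown here.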

If this is right

Each attention head can store O(sqrt(d_k)) associations with guaranteed error-free retrieval in the worst case.
Average-case retrieval fidelity improves linearly with d_k.
Reliability requires balancing model depth against width and head redundancy to limit error propagation.
Adopting Hebbian-style or other biologically plausible update rules can strengthen transformer performance.
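
A heuristic way to see why the two scalings could differ is to split a retrieval into signal and crosstalk. The block below is a sketch under assumptions that are not taken from the paper: keys are random unit-norm row vectors, the query equals a stored key k_i, and no normalization is applied. It is not the paper's proof of the capacity theorem.

```latex
% Heuristic sketch: unit-norm keys, query q = k_i (row vectors).
\begin{align}
S &= \sum_{j=1}^{n} k_j^\top v_j, \\
k_i S &= \underbrace{(k_i k_i^\top)\, v_i}_{\text{signal } = v_i}
       + \underbrace{\sum_{j \neq i} (k_i k_j^\top)\, v_j}_{\text{crosstalk}}, \\
\mathbb{E}\big[(k_i k_j^\top)^2\big] &= \frac{1}{d_k}
       \quad (i \neq j,\ \text{random unit keys}).
\end{align}
```

Each overlap k_i k_j^T is therefore of typical size 1/sqrt(d_k). If the n - 1 crosstalk terms add coherently, their total is of order (n - 1)/sqrt(d_k), which stays small only for n of order sqrt(d_k). If they add incoherently, the total is of order sqrt((n - 1)/d_k), which stays small for n of order d_k. This matches the shape of the worst-case and average-case statements above.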

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The analogy suggests that transformers succeed in part because they replicate association-forming mechanisms that biology already optimized.
New attention variants could be designed to enforce explicit conditioning-like updates and thereby improve sample efficiency.
Simple conditioning tasks could be used to measure whether real transformers exhibit the predicted capacity and error-scaling limits.

Load-bearing premise

Linear attention faithfully captures the essential associative dynamics of standard scaled dot-product attention without losing critical behaviors.
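
The gap this premise has to bridge can be seen in a toy comparison. The sketch below is an illustration only: the function names are hypothetical, the keys are chosen orthonormal on purpose, and the values are arbitrary. With a query equal to a stored key, unnormalized linear attention returns the paired value exactly, while scaled dot-product attention returns a softmax-weighted mixture of all values.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention."""
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """Unnormalized linear attention: Q (K^T V)."""
    return Q @ (K.T @ V)

d_k = 8
K = np.eye(d_k)[:4]                               # four orthonormal keys (assumption)
V = np.arange(12, dtype=float).reshape(4, 3)      # arbitrary values
Q = K[[2]]                                        # probe with the third key

lin_out = linear_attention(Q, K, V)    # exact recall of V[2]
soft_out = softmax_attention(Q, K, V)  # mixture weighted toward V[2]
print(lin_out, soft_out)
```

Whether the capacity and error results transfer from the first behavior to the second is exactly what the premise assumes.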

What would settle it

A controlled experiment that counts how many distinct associations a single linear-attention head can retrieve without error and finds a scaling that deviates from O(sqrt(d_k)) in the worst case would falsify the capacity claim.
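
A minimal version of such an experiment might look like the sketch below. Everything in it is an assumption made for illustration: random unit-norm keys and values, a probe equal to the first stored key, and the norm of the interference term as the error measure. The paper's own definition of error-free retrieval may differ, so this measures crosstalk growth rather than testing the theorem directly.

```python
import numpy as np

def crosstalk_norm(n, d_k, d_v=16, trials=200, seed=0):
    """Mean and max crosstalk norm when probing with stored key 0."""
    rng = np.random.default_rng(seed)
    norms = np.empty(trials)
    for t in range(trials):
        K = rng.standard_normal((n, d_k))
        K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm cues
        V = rng.standard_normal((n, d_v))
        V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm responses
        S = K.T @ V                                     # Hebbian memory
        retrieved = K[0] @ S
        norms[t] = np.linalg.norm(retrieved - V[0])     # interference only
    return norms.mean(), norms.max()

# Sweep the number of stored associations at fixed d_k = 64.
results = {n: crosstalk_norm(n, d_k=64) for n in (2, 8, 32)}
for n, (mean_err, max_err) in results.items():
    print(n, round(mean_err, 3), round(max_err, 3))
```

Repeating the sweep over several values of d_k and recording the largest n that keeps the maximum error under a chosen threshold would give an empirical scaling curve to compare against O(sqrt(d_k)).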

Figures

Figures reproduced from arXiv: 2508.08289 by Mu Qiao.

**Figure 2.** Higher-order conditioning through stacked circuits. The input sequence is processed through ...

Original abstract

Transformer architectures have revolutionized artificial intelligence (AI) through their attention mechanisms, yet the computational principles underlying their success remain opaque. We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning. Our model finds a direct mathematical analogue in linear attention, which simplifies the analysis of the underlying associative process. We demonstrate that attention's queries, keys, and values can be mapped to the three elements of classical conditioning: test stimuli that probe associations, conditional stimuli (CS) that serve as retrieval cues, and unconditional stimuli (US) that contain response information. Through this lens, we suggest that each attention operation constructs a transient associative memory via a Hebbian rule, where CS-US pairs form dynamic associations that test stimuli can later retrieve. Our framework yields several theoretical insights grounded in this linearized model: (1) a capacity theorem showing that attention heads can store $O(\sqrt{d_k})$ associations for worst-case, error-free retrieval, while average-case retrieval fidelity scales robustly as $O(d_k)$; (2) an error propagation analysis revealing fundamental architectural trade-offs of balancing model depth, width, and head redundancy to maintain reliability; and (3) an understanding of how biologically plausible learning rules could enhance transformer architectures. By establishing this deep connection, we suggest that the success of modern AI may stem not from architectural novelty alone, but from implementing computational principles analogous to those optimized by biology over millions of years of evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps linear attention to Pavlovian conditioning and claims an O(sqrt(d_k)) capacity bound, but the math support is missing from the abstract.

The punchline is that this work gives a specific mapping of attention's Q, K, V to test stimuli, conditional stimuli, and unconditional stimuli, then uses a Hebbian outer-product view to derive a worst-case capacity of O(sqrt(d_k)) associations and some error-propagation trade-offs across depth, width, and heads. That mapping and the resulting bounds are the main new pieces on offer.

The paper does a reasonable job laying out how the analogy might guide thinking about reliability when stacking layers or adding redundant heads, and it keeps the discussion tied to the linearized attention case, which simplifies the associative memory picture. Credit is due for trying to connect the mechanics to a concrete biological learning rule rather than staying at the level of loose metaphors.

The soft spots are more substantial. The abstract states the capacity theorem and error analysis but shows none of the derivations, assumptions, or verification steps, so it is impossible to tell whether the O(sqrt(d_k)) result follows from the attention formula itself or simply restates properties built into the chosen Hebbian model. The stress-test concern about normalization or scaling terms is worth checking directly: if the paper adds any averaging or decay not present in standard Q(K^T V), the claimed architectural trade-offs become specific to the analogy rather than to attention. The circularity risk is real until the proofs are visible.

This paper is mainly for readers who already like biological analogies for transformers and want ideas about capacity planning or error analysis. A serious referee could usefully check the derivations and the exact match to linear attention; without those steps the work stays interpretive. I would send it to review rather than desk-reject if the full text contains the missing math.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a theoretical framework that reinterprets transformer attention as Pavlovian conditioning. It maps queries to test stimuli, keys to conditional stimuli (CS), and values to unconditional stimuli (US), with each attention step constructing a transient associative memory via a Hebbian rule. The framework is specialized to linear attention to derive a capacity theorem (O(sqrt(d_k)) worst-case error-free associations, O(d_k) average-case fidelity) and an error-propagation analysis that identifies architectural trade-offs among depth, width, and head redundancy.

Significance. If the mapping is shown to be faithful to the linear-attention formula and the capacity and error results are derived independently of the analogy, the work could supply a biologically grounded lens on attention dynamics and suggest concrete design principles for balancing capacity and reliability. The explicit capacity bounds and error analysis would be the primary contributions; without rigorous verification that the Hebbian outer-product update reproduces standard linear attention (Q(K^T V) with its usual scaling), the architectural trade-offs remain tied to the model rather than to transformers themselves.

major comments (3)

[Section 3 (Pavlovian Conditioning Framework) and the statement of the capacity theorem] The central mapping (queries as test stimuli, keys as CS, values as US) and the claim that attention implements a Hebbian outer-product update are load-bearing for both the capacity theorem and the error analysis. The manuscript must explicitly state the precise linear-attention formula used (including any scaling, averaging, or decay terms) and demonstrate that it matches the Hebbian rule without additional normalization; any discrepancy would make the O(sqrt(d_k)) bound and the depth-width trade-offs specific to the analogy rather than to attention.
[Section 5 (Capacity Theorem)] The capacity theorem (worst-case O(sqrt(d_k)) error-free retrieval) requires a full derivation that isolates the contribution of the Hebbian rule from the initial analogy. Please supply the key steps or equation that establishes the bound and confirm that it does not restate consequences already built into the conditioning model.
[Section 6 (Error Propagation Analysis)] The error-propagation analysis that yields trade-offs among depth, width, and head redundancy must include the concrete recurrence or matrix equation governing error accumulation across layers. Without it, the claimed architectural recommendations cannot be evaluated for correctness or generality.

minor comments (2)

[Section 3] Notation for the Hebbian update and the retrieval operation should be introduced with a single consistent equation early in the framework section to avoid ambiguity when the same symbols are reused in the capacity and error sections.
[Introduction and Related Work] The manuscript should cite prior work on linear attention (e.g., the original linear-attention formulations and analyses of their approximation error) and on Hebbian associative memory models to situate the novelty of the conditioning lens.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, indicating where revisions will be made to strengthen the presentation of our theoretical framework.

Referee: [Section 3 (Pavlovian Conditioning Framework) and the statement of the capacity theorem] The central mapping (queries as test stimuli, keys as CS, values as US) and the claim that attention implements a Hebbian outer-product update are load-bearing for both the capacity theorem and the error analysis. The manuscript must explicitly state the precise linear-attention formula used (including any scaling, averaging, or decay terms) and demonstrate that it matches the Hebbian rule without additional normalization; any discrepancy would make the O(sqrt(d_k)) bound and the depth-width trade-offs specific to the analogy rather than to attention.

Authors: We agree that an explicit statement of the linear attention formula is essential for establishing the fidelity of the mapping. In Section 3 of the manuscript, we define linear attention as A = Q (K^T V) / d_k, where the division by d_k serves as an averaging term to normalize the outer product. This directly corresponds to the Hebbian update rule in our Pavlovian model, where the association matrix is updated as CS * US^T without additional softmax normalization, as linear attention approximates the attention mechanism in the limit. We will revise the manuscript to include a dedicated paragraph or equation block that derives this equivalence step-by-step, showing that the transient associative memory construction matches exactly. This ensures the capacity bounds apply to the standard linear attention formulation.

revision: yes

Referee: [Section 5 (Capacity Theorem)] The capacity theorem (worst-case O(sqrt(d_k)) error-free retrieval) requires a full derivation that isolates the contribution of the Hebbian rule from the initial analogy. Please supply the key steps or equation that establishes the bound and confirm that it does not restate consequences already built into the conditioning model.

Authors: The capacity theorem is derived independently in Section 5 by analyzing the retrieval error in the Hebbian associative memory model. The key steps involve bounding the interference from multiple stored associations using concentration inequalities on the inner products between query and key vectors in d_k dimensions. Specifically, for error-free retrieval in the worst case, the number of associations m must satisfy m = O(sqrt(d_k)) to ensure that the signal from the correct association dominates the noise from others with high probability. This derivation relies on the properties of random vectors in high dimensions and the outer-product storage, not on the conditioning analogy per se. We will expand Section 5 to include these intermediate equations, such as the expression for the retrieved value and the error term bound.

revision: yes

Referee: [Section 6 (Error Propagation Analysis)] The error-propagation analysis that yields trade-offs among depth, width, and head redundancy must include the concrete recurrence or matrix equation governing error accumulation across layers. Without it, the claimed architectural recommendations cannot be evaluated for correctness or generality.

Authors: We acknowledge that the error propagation analysis in Section 6 would benefit from an explicit recurrence relation. In the revised manuscript, we will introduce the matrix equation for error accumulation: E_{l+1} = E_l * W_l + delta_l, where E_l is the error at layer l, W_l the weight matrix influenced by attention, and delta_l the local perturbation from the Hebbian update. This recurrence allows us to derive the trade-offs by analyzing the spectral norm growth across layers, leading to recommendations for balancing depth (to avoid error amplification) with width and redundant heads (to average out errors). We will add this equation and the subsequent analysis to make the architectural insights fully verifiable.

revision: yes
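
The recurrence quoted in the last response, E_{l+1} = E_l W_l + delta_l, can be explored numerically. The sketch below is only an illustration: the function names are hypothetical, the matrices W_l and perturbations delta_l are random stand-ins rather than quantities from the paper, and E_0 is assumed to be zero. It iterates the recurrence and compares the final spectral norm with the bound that follows from the triangle inequality and submultiplicativity of the norm.

```python
import numpy as np

def propagate_error(Ws, deltas):
    """Iterate E_{l+1} = E_l W_l + delta_l starting from E_0 = 0."""
    E = np.zeros_like(deltas[0])
    history = []
    for W, delta in zip(Ws, deltas):
        E = E @ W + delta
        history.append(np.linalg.norm(E, 2))   # spectral norm per layer
    return E, history

def norm_bound(Ws, deltas):
    """Bound via ||E_{l+1}|| <= ||E_l|| ||W_l|| + ||delta_l||."""
    bound = 0.0
    for W, delta in zip(Ws, deltas):
        bound = bound * np.linalg.norm(W, 2) + np.linalg.norm(delta, 2)
    return bound

# Random stand-ins (assumption): 6 layers, width 8, 4 token positions.
rng = np.random.default_rng(0)
L, d = 6, 8
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
deltas = [0.01 * rng.standard_normal((4, d)) for _ in range(L)]

E_final, history = propagate_error(Ws, deltas)
bound = norm_bound(Ws, deltas)
print(history[-1], bound)
```

When the spectral norms of the W_l exceed 1, the bound grows geometrically with depth, which is one way to read the depth-versus-reliability trade-off described in the response.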

Circularity Check

0 steps flagged

No significant circularity; results are derived consequences within the proposed analogy model

The paper defines a mapping of attention components (Q, K, V) to conditioning elements and posits a Hebbian update rule as the core of linear attention. It then derives capacity bounds O(sqrt(d_k)) and error propagation results explicitly as mathematical consequences of this linearized model. No load-bearing self-citations, fitted parameters renamed as predictions, or definitional equivalences are present in the provided text. The derivation chain remains self-contained: the analogy supplies the model, and the theorems are obtained by analyzing that model rather than restating its construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of equating linear attention to Pavlovian conditioning and on the assumption that this simplified model captures the essential associative behavior of full attention.

axioms (1)

domain assumption: The computation in linear attention directly corresponds to Pavlovian conditioning with queries as test stimuli, keys as CS, and values as US, forming associations via a Hebbian rule.
This mapping is invoked to derive the capacity theorem, error analysis, and biological insights.

pith-pipeline@v0.9.0 · 5785 in / 1371 out tokens · 69036 ms · 2026-05-19T00:05:32.302342+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Theorem 6 (Associative Memory Capacity) ... n < sqrt(ϵ δ d_k)
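
To get a feel for the quoted inequality n < sqrt(ϵ δ d_k), the sketch below evaluates it for a few head dimensions. The excerpt does not define ϵ and δ, so they are treated here as opaque parameters with placeholder values of 0.5 each; the function name `max_associations` is hypothetical.

```python
import math

def max_associations(eps, delta, d_k):
    """Largest integer n with n < sqrt(eps * delta * d_k)."""
    limit = math.sqrt(eps * delta * d_k)
    n = math.ceil(limit) - 1     # strict inequality: exclude n == limit
    return max(n, 0)

# Placeholder parameter values (assumption): eps = delta = 0.5.
for d_k in (64, 256, 1024):
    print(d_k, max_associations(0.5, 0.5, d_k))
```

Under these placeholder values, quadrupling d_k roughly doubles the admissible n, which is the square-root scaling stated in the capacity claim.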

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1] Wickliffe C. Abraham. Metaplasticity: Tuning synapses and networks for plasticity. Nat Rev Neurosci, 9(5):387–387, May 2008.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization, July 2016.
[3] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J Neurosci, 2(1):32–48, January 1982.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
[5] Comment: 40+32 pages
[6] Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nat Rev Neurosci, 13(1):51–62, January 2012.
[7] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking Attention with Performers. In International Conference on Learning Representations, October 2020.
[8] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, January 2024.
[9] Peter Dayan. Twenty-Five Lessons from Computational Neuromodulation. Neuron, 76(1):240–256, October 2012.
[10] Elhage et al. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021.
[11] Wulfram Gerstner and Werner M. Kistler. Mathematical formulations of Hebbian learning. Biol Cybern, 87(5):404–415, December 2002.
[12] Stefano Ghirlanda. Intensity Generalization: Physiology and Modelling of a Neglected Topic. Journal of Theoretical Biology, 214(3):389–404, February 2002.
[13] D. O. Hebb. The Organization of Behavior; a Neuropsychological Theory. Wiley, Oxford, England, 1949.
[14] Jeffry S. Isaacson and Massimo Scanziani. How inhibition shapes cortical activity. Neuron, 72(2):231–243, October 2011.
[15] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, August 2020. Comment: ICML 2020, project at https://linear-transformers.com/
[16] Jack Lindsey†, Wes Gurnee*, Emmanuel Ameisen*, Brian Chen*, Adam Pearce*, Nicholas L. Turner*, Craig Citro*, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimme...
[17] Eve Marder. Neuromodulation of Neuronal Circuits: Back to the Future. Neuron, 76(1):1–11, October 2012.
[18] S. Maren. Neurobiology of Pavlovian fear conditioning. Annu Rev Neurosci, 24:897–931, 2001.
[19] Shreesh P. Mysore and Eric I. Knudsen. The role of a midbrain network in competitive stimulus selection. Curr Opin Neurobiol, 21(4):653–660, August 2011.
[20] E. Oja. A simplified neuron model as a principal component analyzer. J Math Biol, 15(3):267–273, 1982.
[21] Olsson et al. In-context Learning and Induction Heads. Transformer Circuits Thread, 2022.
[22] P Ivan Pavlov (1927). Conditioned reflexes: An investigation of the physiological activity of the cerebral cortex. Ann Neurosci, 17(3):136–141, July 2010.
[23] Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The Devil in Linear Transformer, October 2022. Comment: accepted to EMNLP2022
[24] Hemanth Saratchandran, Damien Teney, and Simon Lucey. Leaner Transformers: More Heads, Less Depth, May 2025.
[25] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers, June 2021.
[26] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, January 2017.
[27] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.
[28] Greg Stuart, Nelson Spruston, and Michael Hausser, editors. Dendrites. Oxford University Press, Oxford, New York, third edition, June 2016.
[29] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive Network: A Successor to Transformer for Large Language Models, August 2023.
[30] Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel, November 2019. Comment: EMNLP 2019
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, 2017.
[32] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models, October 2022. Comment: Transactions on Machine Learning Research (TMLR), 2022
[33] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length, 2024. Comment: Final camera ready
[34] Anthony Zador, Sean Escola, Blake Richards, Bence Ölveczky, Yoshua Bengio, Kwabena Boahen, Matthew Botvinick, Dmitri Chklovskii, Anne Churchland, Claudia Clopath, James DiCarlo, Surya Ganguli, Jeff Hawkins, Konrad Körding, Alexei Koulakov, Yann LeCun, Timothy Lillicrap, Adam Marblestone, Bruno Olshausen, Alexandre Pouget, Cristina Savin, Terrence Sejnow...
[35] Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization, October 2019. Comment: NeurIPS 2019
[36] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-Time Training Done Right, May 2025. Comment: 32 pages, 11 figures.

A Detailed Mathematical Proofs

A.1 Proof of Memory Capacity Theorem 6. Proof. Consider the associative memory formed by $n$ CS-US pairs: $S = \alpha \sum_{j=1}^{n} f(k_j)^\top g(v_j)$ ...

A.2.4 Error Rate Upper Bound. The overall error rate r for the entire deep network is bounded by: r = 1 − P(complete success) < 1 − ...

[1] [1]

Wickliffe C. Abraham. Metaplasticity: Tuning synapses and networks for plasticity. Nat Rev Neurosci, 9(5):387–387, May 2008

work page 2008

[2] [2]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization, July 2016

work page 2016

[3] [3]

E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J Neurosci, 2(1):32–48, January 1982

work page 1982

[4] [4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page

[5] [5]

Comment: 40+32 pages

work page

[6] [6]

Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nat Rev Neurosci, 13(1):51–62, January 2012

work page 2012

[7] [7]

Colwell, and Adrian Weller

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking Attention with Performers. In International Conference on Learning Representations, October 2020

work page 2020

[8] [8]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y . Wu, Zhenda Xie, Y . K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, January 2024

work page 2024

[9] [9]

Twenty-Five Lessons from Computational Neuromodulation

Peter Dayan. Twenty-Five Lessons from Computational Neuromodulation. Neuron, 76(1):240– 256, October 2012

work page 2012

[10] [10]

A Mathematical Framework for Transformer Circuits

Elhage et al. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021

work page 2021

[11] [11]

Wulfram Gerstner and Werner M. Kistler. Mathematical formulations of Hebbian learning. Biol Cybern, 87(5):404–415, December 2002

work page 2002

[12] [12]

Intensity Generalization: Physiology and Modelling of a Neglected Topic

STEFANO Ghirlanda. Intensity Generalization: Physiology and Modelling of a Neglected Topic. Journal of Theoretical Biology, 214(3):389–404, February 2002

work page 2002

[13] [13]

D. O. Hebb. The Organization of Behavior; a Neuropsychological Theory. The Organization of Behavior; a Neuropsychological Theory. Wiley, Oxford, England, 1949

work page 1949

[14] [14]

Isaacson and Massimo Scanziani

Jeffry S. Isaacson and Massimo Scanziani. How inhibition shapes cortical activity. Neuron, 72(2):231–243, October 2011

work page 2011

[15] [15]

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, August 2020

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, August 2020. Comment: ICML 2020, project at https://linear-transformers.com/

work page 2020

[16] [16]

Authors Jack Lindsey†, Wes Gurnee*, Emmanuel Ameisen*, Brian Chen*, Adam Pearce*, Nicholas L. Turner*, Craig Citro*, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cun- ningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimme...

work page 2025

[17] [17]

Neuromodulation of Neuronal Circuits: Back to the Future

Eve Marder. Neuromodulation of Neuronal Circuits: Back to the Future. Neuron, 76(1):1–11, October 2012

work page 2012

[18] [18]

S. Maren. Neurobiology of Pavlovian fear conditioning. Annu Rev Neurosci, 24:897–931, 2001

work page 2001

[19] [19]

Mysore and Eric I

Shreesh P. Mysore and Eric I. Knudsen. The role of a midbrain network in competitive stimulus selection. Curr Opin Neurobiol, 21(4):653–660, August 2011

work page 2011

[20] [20]

E. Oja. A simplified neuron model as a principal component analyzer. J Math Biol, 15(3):267– 273, 1982

work page 1982

[21] [21]

In-context Learning and Induction Heads

Olsson et al. In-context Learning and Induction Heads. Transformer Circuits Thread, 2022

work page 2022

[22] [22]

Conditioned reflexes: An investigation of the physiological activity of the cerebral cortex

P Ivan Pavlov (1927). Conditioned reflexes: An investigation of the physiological activity of the cerebral cortex. Ann Neurosci, 17(3):136–141, July 2010

work page 1927

[23] [23]

The Devil in Linear Transformer, October 2022

Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The Devil in Linear Transformer, October 2022. Comment: accepted to EMNLP2022

[24]

Leaner Transformers: More Heads, Less Depth, May 2025

Hemanth Saratchandran, Damien Teney, and Simon Lucey. Leaner Transformers: More Heads, Less Depth, May 2025

[25]

Linear Transformers Are Secretly Fast Weight Programmers, June 2021

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers, June 2021

[26]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, January 2017

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, January 2017

[27]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

[28]

Dendrites

Greg Stuart, Nelson Spruston, and Michael Häusser, editors. Dendrites. Oxford University Press, Oxford, New York, third edition, June 2016

[29]

Retentive Network: A Successor to Transformer for Large Language Models, August 2023

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive Network: A Successor to Transformer for Large Language Models, August 2023

[30]

Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel, November 2019

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel, November 2019. Comment: EMNLP 2019

[31]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, 2017

[32]

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models, October 2022. Comment: Transactions on Machine Learning Research (TMLR), 2022

[33]

Parallelizing Linear Transformers with the Delta Rule over Sequence Length, 2024

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length, 2024. Comment: Final camera ready

[34]

Tolias, and Doris Tsao

Anthony Zador, Sean Escola, Blake Richards, Bence Ölveczky, Yoshua Bengio, Kwabena Boahen, Matthew Botvinick, Dmitri Chklovskii, Anne Churchland, Claudia Clopath, James DiCarlo, Surya Ganguli, Jeff Hawkins, Konrad Körding, Alexei Koulakov, Yann LeCun, Timothy Lillicrap, Adam Marblestone, Bruno Olshausen, Alexandre Pouget, Cristina Savin, Terrence Sejnow...

work page 2023

[35]

Root Mean Square Layer Normalization, October 2019

Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization, October 2019. Comment: NeurIPS 2019

[36]

... $\left(1 - \frac{n}{\delta d_k}\right)^{H \cdot L}$ (45)

A.2.4 Error Rate Upper Bound

The overall error rate $r$ for the entire deep network is bounded by: $r = 1 - P(\text{complete success}) < 1 -$ ...

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-Time Training Done Right, May 2025. Comment: 32 pages, 11 figures.

A Detailed Mathematical Proofs

A.1 Proof of Memory Capacity Theorem 6

Proof. Consider the associative memory formed by $n$ CS-US pairs: $S = \alpha \sum_{j=1}^{n} f(k_j)^\top g(v_j)$ ...
