Understanding Transformers through the Lens of Pavlovian Conditioning
Pith reviewed 2026-05-19 00:05 UTC · model grok-4.3
The pith
Queries, keys, and values in transformer attention map directly to the stimuli and responses of classical conditioning, so each attention step builds a temporary Hebbian associative memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention's queries, keys, and values map to test stimuli, conditional stimuli (CS), and unconditional stimuli (US) in classical conditioning. Each attention operation therefore constructs a transient associative memory via a Hebbian rule: CS-US pairs create dynamic associations that later test stimuli can retrieve. The linearized model yields three results: a capacity theorem in which attention heads store O(sqrt(d_k)) associations for worst-case error-free retrieval while average-case fidelity scales as O(d_k); an error-propagation analysis that identifies trade-offs among depth, width, and head redundancy; and an account of how biologically plausible learning rules could improve transformers.
What carries the argument
The direct mapping of linear attention to Pavlovian conditioning elements, with queries probing associations, keys acting as retrieval cues, and values supplying response content, all linked by a Hebbian association rule.
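To make the mapping concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes unnormalized linear attention with no feature map, scaling, or decay, and the names hebbian_memory and linear_attention are hypothetical. It checks that summing one outer product per key-value (CS-US) pair and then probing the result with queries (test stimuli) gives the same output as computing Q (K^T V) directly.

```python
import numpy as np

def hebbian_memory(keys, values):
    """Accumulate one outer product k_j v_j^T per CS-US pair: shape (d_k, d_v)."""
    d_k, d_v = keys.shape[1], values.shape[1]
    memory = np.zeros((d_k, d_v))
    for k, v in zip(keys, values):
        memory += np.outer(k, v)  # Hebbian update for a single association
    return memory

def linear_attention(queries, keys, values):
    """Unnormalized linear attention Q (K^T V); an assumed simplification."""
    return queries @ (keys.T @ values)

# Illustrative sizes only.
rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 8
K = rng.standard_normal((n, d_k))   # keys   = conditional stimuli
V = rng.standard_normal((n, d_v))   # values = unconditional stimuli
Q = rng.standard_normal((3, d_k))   # queries = test stimuli

S = hebbian_memory(K, V)
readout = Q @ S                      # probe the transient memory
assert np.allclose(readout, linear_attention(Q, K, V))
```

Any scaling, averaging, or decay term in the paper's actual formula would multiply both sides in the same way only if it is applied identically to the memory and to the attention output, which is the point the referee report asks the authors to verify.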
If this is right
- Each attention head can store O(sqrt(d_k)) associations with guaranteed error-free retrieval in the worst case.
- Average-case retrieval fidelity improves linearly with d_k.
- Reliability requires balancing model depth against width and head redundancy to limit error propagation.
- Adopting Hebbian-style or other biologically plausible update rules can strengthen transformer performance.
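The capacity statements above rest on how much stored associations interfere with one another at retrieval time. As a worked equation under simplifying assumptions (row-vector keys and values, no scaling factor, no feature maps, and a probe equal to a stored key), the readout splits into a signal term and a crosstalk term:

```latex
% Assumption: unnormalized Hebbian memory; k_j is 1 x d_k, v_j is 1 x d_v.
S = \sum_{j=1}^{n} k_j^{\top} v_j ,
\qquad
k_i S
  = \underbrace{\lVert k_i \rVert^{2}\, v_i}_{\text{signal}}
  + \underbrace{\sum_{j \neq i} \bigl(k_i k_j^{\top}\bigr)\, v_j}_{\text{crosstalk}} .
```

Retrieval of v_i is exact only when the crosstalk term vanishes, for example when the stored keys are mutually orthogonal. How large that term can become as n grows relative to d_k is what a capacity bound has to control.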
Where Pith is reading between the lines
- The analogy suggests that transformers succeed in part because they replicate association-forming mechanisms that biology already optimized.
- New attention variants could be designed to enforce explicit conditioning-like updates and thereby improve sample efficiency.
- Simple conditioning tasks could be used to measure whether real transformers exhibit the predicted capacity and error-scaling limits.
Load-bearing premise
Linear attention faithfully captures the essential associative dynamics of standard scaled dot-product attention without losing critical behaviors.
What would settle it
A controlled experiment that counts how many distinct associations a single linear-attention head can retrieve without error and finds a scaling that deviates from O(sqrt(d_k)) in the worst case would falsify the capacity claim.
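A rough sketch of how such a measurement could be set up is shown below. It is an illustration under stated assumptions, not the paper's protocol: keys are random unit vectors, values are Gaussian, the memory is the unnormalized outer-product sum, and the function name mean_crosstalk is hypothetical. Random keys probe typical behavior only, so a genuine worst-case test would need adversarially chosen keys.

```python
import numpy as np

def mean_crosstalk(n, d_k, d_v=32, trials=20, seed=0):
    """Average relative retrieval error when each stored key probes the memory.

    Assumptions (not from the paper): random unit-norm keys, Gaussian values,
    unnormalized Hebbian memory S = sum_j k_j^T v_j.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        K = rng.standard_normal((n, d_k))
        K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
        V = rng.standard_normal((n, d_v))
        S = K.T @ V          # store all n associations at once
        retrieved = K @ S    # row i is v_i plus crosstalk from the other pairs
        rel = np.linalg.norm(retrieved - V, axis=1) / np.linalg.norm(V, axis=1)
        errors.append(rel.mean())
    return float(np.mean(errors))

# Sweep the number of stored associations at a fixed key dimension.
for n in (4, 16, 64):
    print(n, mean_crosstalk(n, d_k=64))
```

Repeating the sweep over several values of d_k and recording the largest n that keeps the error below a chosen threshold would give an empirical scaling curve to compare against the claimed bounds.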
Original abstract
Transformer architectures have revolutionized artificial intelligence (AI) through their attention mechanisms, yet the computational principles underlying their success remain opaque. We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning. Our model finds a direct mathematical analogue in linear attention, which simplifies the analysis of the underlying associative process. We demonstrate that attention's queries, keys, and values can be mapped to the three elements of classical conditioning: test stimuli that probe associations, conditional stimuli (CS) that serve as retrieval cues, and unconditional stimuli (US) that contain response information. Through this lens, we suggest that each attention operation constructs a transient associative memory via a Hebbian rule, where CS-US pairs form dynamic associations that test stimuli can later retrieve. Our framework yields several theoretical insights grounded in this linearized model: (1) a capacity theorem showing that attention heads can store $O(\sqrt{d_k})$ associations for worst-case, error-free retrieval, while average-case retrieval fidelity scales robustly as $O(d_k)$; (2) an error propagation analysis revealing fundamental architectural trade-offs of balancing model depth, width, and head redundancy to maintain reliability; and (3) an understanding of how biologically plausible learning rules could enhance transformer architectures. By establishing this deep connection, we suggest that the success of modern AI may stem not from architectural novelty alone, but from implementing computational principles analogous to those optimized by biology over millions of years of evolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a theoretical framework that reinterprets transformer attention as Pavlovian conditioning. It maps queries to test stimuli, keys to conditional stimuli (CS), and values to unconditional stimuli (US), with each attention step constructing a transient associative memory via a Hebbian rule. The framework is specialized to linear attention to derive a capacity theorem (O(sqrt(d_k)) worst-case error-free associations, O(d_k) average-case fidelity) and an error-propagation analysis that identifies architectural trade-offs among depth, width, and head redundancy.
Significance. If the mapping is shown to be faithful to the linear-attention formula and the capacity and error results are derived independently of the analogy, the work could supply a biologically grounded lens on attention dynamics and suggest concrete design principles for balancing capacity and reliability. The explicit capacity bounds and error analysis would be the primary contributions; without rigorous verification that the Hebbian outer-product update reproduces standard linear attention (Q(K^TV) with its usual scaling), the architectural trade-offs remain tied to the model rather than to transformers themselves.
major comments (3)
- [Section 3 (Pavlovian Conditioning Framework) and the statement of the capacity theorem] The central mapping (queries as test stimuli, keys as CS, values as US) and the claim that attention implements a Hebbian outer-product update are load-bearing for both the capacity theorem and the error analysis. The manuscript must explicitly state the precise linear-attention formula used (including any scaling, averaging, or decay terms) and demonstrate that it matches the Hebbian rule without additional normalization; any discrepancy would make the O(sqrt(d_k)) bound and the depth-width trade-offs specific to the analogy rather than to attention.
- [Section 5 (Capacity Theorem)] The capacity theorem (worst-case O(sqrt(d_k)) error-free retrieval) requires a full derivation that isolates the contribution of the Hebbian rule from the initial analogy. Please supply the key steps or equation that establishes the bound and confirm that it does not restate consequences already built into the conditioning model.
- [Section 6 (Error Propagation Analysis)] The error-propagation analysis that yields trade-offs among depth, width, and head redundancy must include the concrete recurrence or matrix equation governing error accumulation across layers. Without it, the claimed architectural recommendations cannot be evaluated for correctness or generality.
minor comments (2)
- [Section 3] Notation for the Hebbian update and the retrieval operation should be introduced with a single consistent equation early in the framework section to avoid ambiguity when the same symbols are reused in the capacity and error sections.
- [Introduction and Related Work] The manuscript should cite prior work on linear attention (e.g., the original linear-attention formulations and analyses of their approximation error) and on Hebbian associative memory models to situate the novelty of the conditioning lens.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, indicating where revisions will be made to strengthen the presentation of our theoretical framework.
Point-by-point responses
- Referee: [Section 3 (Pavlovian Conditioning Framework) and the statement of the capacity theorem] The central mapping (queries as test stimuli, keys as CS, values as US) and the claim that attention implements a Hebbian outer-product update are load-bearing for both the capacity theorem and the error analysis. The manuscript must explicitly state the precise linear-attention formula used (including any scaling, averaging, or decay terms) and demonstrate that it matches the Hebbian rule without additional normalization; any discrepancy would make the O(sqrt(d_k)) bound and the depth-width trade-offs specific to the analogy rather than to attention.
  Authors: We agree that an explicit statement of the linear attention formula is essential for establishing the fidelity of the mapping. In Section 3 of the manuscript, we define linear attention as A = Q (K^T V) / d_k, where the division by d_k serves as an averaging term to normalize the outer product. This directly corresponds to the Hebbian update rule in our Pavlovian model, where the association matrix is updated as CS * US^T without additional softmax normalization, as linear attention approximates the attention mechanism in the limit. We will revise the manuscript to include a dedicated paragraph or equation block that derives this equivalence step-by-step, showing that the transient associative memory construction matches exactly. This ensures the capacity bounds apply to the standard linear attention formulation.
  revision: yes
- Referee: [Section 5 (Capacity Theorem)] The capacity theorem (worst-case O(sqrt(d_k)) error-free retrieval) requires a full derivation that isolates the contribution of the Hebbian rule from the initial analogy. Please supply the key steps or equation that establishes the bound and confirm that it does not restate consequences already built into the conditioning model.
  Authors: The capacity theorem is derived independently in Section 5 by analyzing the retrieval error in the Hebbian associative memory model. The key steps involve bounding the interference from multiple stored associations using concentration inequalities on the inner products between query and key vectors in d_k dimensions. Specifically, for error-free retrieval in the worst case, the number of associations n must satisfy n = O(sqrt(d_k)) to ensure that the signal from the correct association dominates the noise from others with high probability. This derivation relies on the properties of random vectors in high dimensions and the outer-product storage, not on the conditioning analogy per se. We will expand Section 5 to include these intermediate equations, such as the expression for the retrieved value and the error term bound.
  revision: yes
- Referee: [Section 6 (Error Propagation Analysis)] The error-propagation analysis that yields trade-offs among depth, width, and head redundancy must include the concrete recurrence or matrix equation governing error accumulation across layers. Without it, the claimed architectural recommendations cannot be evaluated for correctness or generality.
  Authors: We acknowledge that the error propagation analysis in Section 6 would benefit from an explicit recurrence relation. In the revised manuscript, we will introduce the matrix equation for error accumulation: E_{l+1} = E_l * W_l + delta_l, where E_l is the error at layer l, W_l the weight matrix influenced by attention, and delta_l the local perturbation from the Hebbian update. This recurrence allows us to derive the trade-offs by analyzing the spectral norm growth across layers, leading to recommendations for balancing depth (to avoid error amplification) with width and redundant heads (to average out errors). We will add this equation and the subsequent analysis to make the architectural insights fully verifiable.
  revision: yes
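The recurrence E_{l+1} = E_l * W_l + delta_l quoted in the last response can be iterated numerically. The sketch below is illustrative only: the shapes (E_l as a tokens-by-width matrix, W_l as a square width-by-width matrix), the random test data, and the function names are assumptions, since the rebuttal does not specify them. It compares the final error norm with the layer-by-layer bound obtained from the triangle inequality and submultiplicativity of the spectral norm.

```python
import numpy as np

def propagate_error(E0, weights, deltas):
    """Iterate E_{l+1} = E_l @ W_l + delta_l and return the final error."""
    E = E0
    for W, delta in zip(weights, deltas):
        E = E @ W + delta
    return E

def spectral_norm_bound(E0, weights, deltas):
    """Upper bound on the final spectral norm, built one layer at a time.

    Uses ||E W + delta|| <= ||E|| * ||W|| + ||delta|| at every layer.
    """
    bound = np.linalg.norm(E0, 2)
    for W, delta in zip(weights, deltas):
        bound = bound * np.linalg.norm(W, 2) + np.linalg.norm(delta, 2)
    return bound

# Hypothetical sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
tokens, width, depth = 4, 8, 6
E0 = 0.01 * rng.standard_normal((tokens, width))
weights = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
deltas = [0.01 * rng.standard_normal((tokens, width)) for _ in range(depth)]

final_error = propagate_error(E0, weights, deltas)
print(np.linalg.norm(final_error, 2), spectral_norm_bound(E0, weights, deltas))
```

Under this recurrence, if every W_l has spectral norm at most 1 the bound grows at most linearly with depth, whereas norms above 1 allow geometric growth. Whether the paper's depth and width recommendations follow this pattern depends on the analysis the authors have promised to add.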
Circularity Check
No significant circularity; results are derived consequences within the proposed analogy model
The paper defines a mapping of attention components (Q, K, V) to conditioning elements and posits a Hebbian update rule as the core of linear attention. It then derives capacity bounds O(sqrt(d_k)) and error propagation results explicitly as mathematical consequences of this linearized model. No load-bearing self-citations, fitted parameters renamed as predictions, or definitional equivalences are present in the provided text. The derivation chain remains self-contained: the analogy supplies the model, and the theorems are obtained by analyzing that model rather than restating its construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The computation in linear attention directly corresponds to Pavlovian conditioning with queries as test stimuli, keys as CS, and values as US, forming associations via a Hebbian rule.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Theorem 6 (Associative Memory Capacity) ... n < sqrt(ϵ δ d_k)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Detailed Mathematical Proofs
A.1 Proof of Memory Capacity Theorem 6
Proof. Consider the associative memory formed by $n$ CS-US pairs: $S = \alpha \sum_{j=1}^{n} f(k_j)^\top g(v_j)$ ...