Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought
Pith reviewed 2026-05-11 01:11 UTC · model grok-4.3
The pith
With the right parameters, chain-of-thought generation in a linear transformer performs repeated temporal difference learning updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a policy evaluation setup with a linear Transformer, the CoT generation process with specific parameters is equivalent to repeatedly executing temporal difference learning updates. The policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. The desired Transformer parameters are a global minimizer of the pretraining loss.
What carries the argument
Linear Transformer parameters that equate chain-of-thought token generation to iterative temporal difference learning updates.
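To make the mechanism concrete, the following minimal NumPy sketch shows one way an unnormalized (linear) attention readout can reproduce a batch TD(0) update direction. The key, query, and value layout (`keys`, `query`, `values`) is an assumption chosen for illustration. It is not the paper's parameterization, which this page does not reproduce.

```python
import numpy as np

def td_update_direct(w, X, X_next, r, gamma):
    """Batch TD(0) direction: (1/n) * sum_j delta_j(w) * x_j."""
    delta = r + gamma * X_next @ w - X @ w        # TD errors, shape (n,)
    return X.T @ delta / len(r)

def td_update_linear_attention(w, X, X_next, r, gamma):
    """Same quantity written as attention without softmax.

    Assumed layout (illustrative only):
      key_j   = [x_j - gamma * x'_j, r_j]
      query   = [-w, 1]
      value_j = x_j
    so that key_j . query equals the TD error delta_j(w).
    """
    keys = np.hstack([X - gamma * X_next, r[:, None]])   # shape (n, d + 1)
    query = np.concatenate([-w, [1.0]])                  # shape (d + 1,)
    values = X                                           # shape (n, d)
    scores = keys @ query                                # linear attention scores
    return values.T @ scores / len(r)

rng = np.random.default_rng(0)
n, d, gamma = 8, 3, 0.9
X, X_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
r, w = rng.normal(size=n), rng.normal(size=d)

direct = td_update_direct(w, X, X_next, r, gamma)
attn = td_update_linear_attention(w, X, X_next, r, gamma)
```

The two functions agree because each attention score is exactly one TD error, so summing value vectors weighted by the scores is the TD update direction. A softmax over the scores would break this identity, which is consistent with the page's note that the result depends on linear attention.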
If this is right
- In-context adaptation to new tasks occurs by executing internal TD updates without any weight changes.
- Longer chain-of-thought sequences improve policy evaluation accuracy at a geometric rate until context length caps the gain.
- Pretraining loss minimization naturally produces parameters that enable this in-context RL behavior.
- Explicit finite-sample bounds quantify how quickly evaluation error converges with additional thought steps.
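The geometric decrease and the context-length floor described in the list above can be illustrated with plain batch TD(0) on a fixed context. This is a hedged sketch rather than the paper's experiment: the 3-state reward process, the one-hot features, and the choices `gamma = 0.5`, `alpha = 1.0`, and `n = 30` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_states, n = 0.5, 1.0, 3, 30

# Hypothetical 3-state Markov reward process with one-hot features.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
reward = np.array([1.0, 0.0, -1.0])
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, reward)

# Context of n transitions; each state appears equally often as a start state.
s = np.arange(n) % n_states
s_next = np.array([rng.choice(n_states, p=P[i]) for i in s])
X, X_next, r = np.eye(n_states)[s], np.eye(n_states)[s_next], reward[s]

# Fixed point of batch TD on this context: the limit of the iteration below.
A_hat = X.T @ (X - gamma * X_next) / n
b_hat = X.T @ r / n
w_ctx = np.linalg.solve(A_hat, b_hat)
floor = np.linalg.norm(w_ctx - v_true)   # error that more iterations cannot remove

w = np.zeros(n_states)
gap_to_ctx, err_to_true = [], []
for _ in range(200):                     # one iteration per "thought" step
    delta = r + gamma * X_next @ w - X @ w
    w = w + alpha * X.T @ delta / n
    gap_to_ctx.append(np.max(np.abs(w - w_ctx)))
    err_to_true.append(np.linalg.norm(w - v_true))
```

In this construction the iteration matrix is (2/3)I + (gamma/3) times the empirical transition matrix, so the max-norm gap to `w_ctx` shrinks by a factor of at most 5/6 per step. The error against `v_true` settles at `floor`, which depends on the sampled context and not on the number of iterations.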
Where Pith is reading between the lines
- The same internal update mechanism may explain why chain-of-thought improves performance on planning and reasoning tasks outside reinforcement learning.
- Approximate versions of the equivalence could appear in nonlinear transformers used in practice, offering a testable prediction for model inspection.
- Pretraining objectives could be modified to encourage parameters that support more iterations or faster internal convergence.
- One could verify the claim by extracting intermediate activations during chain-of-thought and checking whether they match TD value estimates.
Load-bearing premise
The proof assumes a linear attention mechanism and restricts the task to policy evaluation rather than full policy optimization.
What would settle it
Train a linear transformer to the claimed global minimum of pretraining loss, then compare its actual chain-of-thought outputs token-by-token against the sequence of temporal difference updates on held-out tasks.
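A comparison of this kind could be organized as in the sketch below. The helper names `td_trajectory` and `max_deviation` are hypothetical, and `cot_weights` is a placeholder for estimates decoded from a trained model's chain-of-thought tokens. Here it is simulated as the TD iterates plus small noise, purely so the sketch runs.

```python
import numpy as np

def td_trajectory(w0, X, X_next, r, gamma, alpha, steps):
    """Reference batch TD(0) iterates w_0, ..., w_steps on a fixed context."""
    ws = [np.asarray(w0, dtype=float)]
    for _ in range(steps):
        w = ws[-1]
        delta = r + gamma * X_next @ w - X @ w
        ws.append(w + alpha * X.T @ delta / len(r))
    return np.stack(ws)

def max_deviation(cot_weights, reference):
    """Largest per-step Euclidean gap between CoT estimates and TD iterates."""
    return float(np.max(np.linalg.norm(cot_weights - reference, axis=1)))

rng = np.random.default_rng(0)
n, d = 16, 4
X, X_next, r = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=n)
reference = td_trajectory(np.zeros(d), X, X_next, r, gamma=0.9, alpha=0.05, steps=10)

# Placeholder for estimates read off a trained model (hypothetical, simulated here).
cot_weights = reference + 1e-9 * rng.normal(size=reference.shape)
deviation = max_deviation(cot_weights, reference)
```

A deviation near numerical precision on held-out tasks would support the exact-equivalence claim. A deviation that grows with the step index would instead suggest the trained parameters only approximate the TD operator.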
Original abstract
In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding on how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide finite sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding on the empirical emergence of those parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes in-context reinforcement learning (ICRL) with Chain-of-Thought (CoT) in a linear Transformer under a policy-evaluation MDP. It proves that specific Transformer parameters make the CoT generation process algebraically equivalent to repeated temporal-difference (TD) updates, establishes finite-sample bounds showing that the policy-evaluation error contracts geometrically with CoT length before saturating at a statistical floor set by context length, and shows that these parameters are a global minimizer of the pretraining loss.
Significance. If the derivations hold, the work supplies the first rigorous account of how CoT length controls ICRL performance in the linear case and links the emergence of effective parameters directly to pretraining-loss minimization. The explicit equivalence and geometric convergence results are concrete strengths that could guide the choice of CoT length in practice; the global-minimizer argument is especially useful because it explains why the required parameters arise without hand-tuning.
major comments (2)
- [§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.
- [§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.
minor comments (2)
- [Notation] The notation for the linear attention matrix and the TD target should be unified across the equivalence and convergence sections to avoid reader confusion.
- [Figure 2] Figure 2 (convergence curves) would benefit from an additional panel showing the dependence on context length, as the statistical floor is a central claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We respond to each major comment below and indicate the revisions we will make.
- Referee: [§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.
  Authors: The manuscript restricts its claims to linear Transformers under policy evaluation, as stated in the abstract and Section 2. The equivalence is exact only in this setting. The paper does not assert that the algebraic identity carries over to softmax attention or policy improvement; it presents the linear case as a rigorous foundation that can help interpret broader empirical phenomena. We will add explicit scope statements in the introduction and a limitations paragraph in the discussion section to prevent over-interpretation, while leaving approximation arguments for nonlinear cases to future work.
  Revision: partial
- Referee: [§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.
  Authors: Section 4 derives the geometric bound under the assumption that the attention parameters exactly match the TD operator, which is justified by the global-minimizer result of Section 5. We agree that the current statement does not quantify the effect of finite-pretraining deviations. We will revise the theorem to state the exact-parameter assumption explicitly and add a remark that uses matrix perturbation theory to bound the change in contraction rate for small deviations, thereby addressing the additional error term.
  Revision: yes
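One elementary form such a perturbation bound could take follows from the triangle inequality: if an exact step acts on the error through I - alpha*A and the learned operator is A + E, then the norm of I - alpha*(A + E) is at most the norm of I - alpha*A plus alpha times the norm of E. The sketch below checks this numerically. The matrices `A` and `E` and the step size `alpha` are hypothetical, and this is not the authors' argument.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 4, 0.5

# Hypothetical exact operator: symmetric positive definite, largest eigenvalue 1,
# so the spectral norm of I - alpha * A is strictly below 1.
M = rng.normal(size=(d, d))
A = M @ M.T / d + 0.5 * np.eye(d)
A = A / np.linalg.norm(A, 2)
rho_exact = np.linalg.norm(np.eye(d) - alpha * A, 2)

rows = []
for eps in (1e-3, 1e-2, 1e-1):
    E = rng.normal(size=(d, d))
    E = eps * E / np.linalg.norm(E, 2)            # perturbation with spectral norm eps
    rho_perturbed = np.linalg.norm(np.eye(d) - alpha * (A + E), 2)
    bound = rho_exact + alpha * eps               # triangle-inequality bound
    rows.append((eps, rho_perturbed, bound))
    print(f"eps={eps:g}  perturbed={rho_perturbed:.4f}  bound={bound:.4f}")
```

Under this bound the perturbed iteration still contracts whenever alpha times the norm of E is smaller than 1 minus `rho_exact`. A full treatment would also need to track how the fixed point itself moves, which this sketch does not address.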
Circularity Check
No circularity: algebraic equivalence and loss minimization proven directly from linear Transformer equations
The paper's central results consist of an algebraic identity showing that specific linear-attention parameters make the CoT forward pass identical to repeated TD updates, a finite-sample geometric convergence bound on the resulting policy-evaluation error, and a direct proof that those same parameters globally minimize the pretraining loss. All three follow from the closed-form expression for linear attention and the explicit definition of the pretraining objective; neither the TD equivalence nor the minimizer property is obtained by fitting a parameter to the target quantity and relabeling it. No self-citation chain, uniqueness theorem, or ansatz is invoked to close the argument. The analysis is therefore self-contained within its stated linear policy-evaluation setting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Transformer is linear and operates in a policy-evaluation setting.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (J uniqueness)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Theorem 1: with (P, Q) in (9), CoT step (7) on prompt (8) yields exactly $w_{k+1} = w_k + \frac{\alpha}{n} \sum_j \delta_j(w_k)\, x_j$ (batch TD)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  MSPBE $L(w) = \lVert C^{-1/2}(Aw - b) \rVert^2$ and contraction under $\eta \in (0, \mu/L]$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Emergence: $\theta^*$ globally minimizes $J_k(\theta; D)$ as $k \to \infty$ under finite-sample conditions
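The MSPBE expression quoted in the list above can be evaluated from a batch of transitions as in the sketch below. It assumes the standard empirical definitions A = (1/n) Σ x_j (x_j - γ x'_j)^T, b = (1/n) Σ r_j x_j, and C = (1/n) Σ x_j x_j^T, which this page does not spell out. The function name `mspbe` and the 3-state cycle used as data are hypothetical.

```python
import numpy as np

def mspbe(w, X, X_next, r, gamma):
    """Empirical MSPBE ||C^{-1/2}(A w - b)||^2, computed as resid^T C^{-1} resid."""
    n = len(r)
    A = X.T @ (X - gamma * X_next) / n
    b = X.T @ r / n
    C = X.T @ X / n                      # assumed symmetric positive definite
    resid = A @ w - b
    return float(resid @ np.linalg.solve(C, resid))

# Hypothetical data: deterministic 3-state cycle with one-hot features.
gamma = 0.5
X = np.eye(3)
X_next = np.roll(np.eye(3), -1, axis=0)  # state s moves to state (s + 1) mod 3
r = np.array([1.0, 0.0, 0.0])

A = X.T @ (X - gamma * X_next) / 3
b = X.T @ r / 3
w_star = np.linalg.solve(A, b)           # TD fixed point on this data

loss_at_zero = mspbe(np.zeros(3), X, X_next, r, gamma)
loss_at_fixed_point = mspbe(w_star, X, X_next, r, gamma)
```

The loss vanishes at the TD fixed point and is positive elsewhere, which is why contraction of the TD iteration and decrease of the MSPBE are usually discussed together.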
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.