Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

Claire Chen; Rohan Chandra; Shangtong Zhang; Shuze Daniel Liu; Xinyu Liu; Zixuan Xie

arxiv: 2605.07333 · v2 · pith:QVL7Y3HRnew · submitted 2026-05-08 · 💻 cs.LG

Zixuan Xie, Xinyu Liu, Claire Chen, Shuze Daniel Liu, Rohan Chandra, Shangtong Zhang

Pith reviewed 2026-05-20 23:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords: in-context reinforcement learning · softmax attention · temporal difference learning · transformers · policy evaluation · kernel space · pretraining loss

The pith

Softmax Transformers implement in-context reinforcement learning by matching iterative weighted softmax TD updates across layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that standard softmax attention in Transformers can perform in-context policy evaluation for reinforcement learning tasks. By choosing specific parameters, the computation in each Transformer layer corresponds exactly to an update step in a new algorithm called weighted softmax temporal difference learning, which operates in kernel space and includes linear and tabular TD as special cases. Under a contraction condition, the error in estimating the policy value decreases as more layers are added. These same parameters turn out to be the global minimizer of the pretraining loss function, which explains their appearance in experiments. This provides a theoretical basis for why pretrained Transformers can adapt to new RL problems just by conditioning on context data.

Core claim

With certain parameters, the layerwise forward pass of a Transformer with softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Under a certain contraction condition, the policy evaluation error decays as the number of layers grows with these parameters. Those parameters are a global minimizer of a pretraining loss.

What carries the argument

The equivalence of softmax attention to iterative updates in the weighted softmax TD learning algorithm for policy evaluation in kernel space.
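To make the kernel-space update concrete, here is a minimal sketch of one plausible form of such an update. The page does not give the paper's exact recursion, so the update rule below is an assumption: each state's value is moved by a softmax-weighted average of the TD errors observed in the context. The names `softmax_td_step`, `feats`, `alpha`, and `beta` are hypothetical. With one-hot features and a large inverse temperature `beta`, the softmax weights concentrate on transitions that start in the same state, so the update behaves like tabular TD on the toy chain.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_td_step(v, feats, states, rewards, gamma, alpha, beta):
    """One ASSUMED update: v(s) += alpha * sum_j w(s, S_j) * delta_j.

    This is an illustrative form, not the paper's stated algorithm.
    """
    s_t, s_next = states[:-1], states[1:]
    # TD error of every context transition under the current estimate.
    delta = rewards + gamma * v[s_next] - v[s_t]
    # Softmax kernel weights between each state and each context transition.
    weights = softmax(beta * feats @ feats[s_t].T, axis=1)
    return v + alpha * weights @ delta

# Toy 3-state cycle 0 -> 1 -> 2 -> 0 with reward 1 on leaving state 2.
gamma = 0.9
states = np.array([0, 1, 2, 0, 1, 2, 0])
rewards = (states[:-1] == 2).astype(float)
feats = np.eye(3)  # one-hot features (tabular special case)

v = np.zeros(3)
for _ in range(300):  # each iteration plays the role of one layer
    v = softmax_td_step(v, feats, states, rewards, gamma, alpha=1.0, beta=50.0)

# Ground-truth values from the Bellman equation v = r + gamma * P v.
P = np.roll(np.eye(3), 1, axis=1)  # P[i, (i + 1) % 3] = 1
v_true = np.linalg.solve(np.eye(3) - gamma * P, np.array([0.0, 0.0, 1.0]))
print(np.max(np.abs(v - v_true)))
```

Replacing `feats` with dense feature vectors and lowering `beta` gives a smoother kernel, which is the regime where this sketch stops being tabular.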

If this is right

Transformers can adapt to new tasks in-context by performing policy evaluation through their layered computations.
The policy evaluation error decreases with increasing model depth under the contraction condition.
Weighted softmax TD learning generalizes both linear TD and tabular TD methods.
Pretraining naturally leads to parameters that enable this in-context RL behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large language models might implicitly solve RL problems when prompted with task examples.
Architectural choices in attention could be designed to implement other RL algorithms like Q-learning.
Empirical tests could verify the error decay in deep Transformer models on simple MDPs.
This equivalence might extend to other sequence models beyond standard Transformers.

Load-bearing premise

There exist specific parameter settings that make the Transformer's forward pass precisely replicate the updates of the weighted softmax TD algorithm, and a contraction condition holds to ensure error reduction with depth.
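One way to see what a contraction condition buys is to write an update as an affine map v ← Mv + b and inspect the spectral radius of M. The sketch below does this for an assumed softmax-weighted TD update on a toy 3-state cycle. The update form, the matrix `M`, and all parameter values are illustrative assumptions, and the paper's actual contraction condition is not reproduced here.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable row-wise softmax."""
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical toy setup: 3-state cycle, reward 1 on leaving state 2.
gamma, alpha, beta = 0.9, 1.0, 2.0
num_states = 3
states = np.array([0, 1, 2, 0, 1, 2, 0])
s_t, s_next = states[:-1], states[1:]
rewards = (s_t == 2).astype(float)
feats = np.eye(num_states)

# One-hot indicators of the current and next state of each transition.
E_cur = np.eye(num_states)[s_t]
E_next = np.eye(num_states)[s_next]
# Softmax kernel weights between states (rows) and transitions (columns).
W = softmax_rows(beta * feats @ feats[s_t].T)

# Assumed update v <- v + alpha * W (r + gamma * E_next v - E_cur v),
# rewritten as the affine iteration v <- M v + b.
M = np.eye(num_states) + alpha * W @ (gamma * E_next - E_cur)
b = alpha * W @ rewards
rho = np.max(np.abs(np.linalg.eigvals(M)))  # spectral radius of M

# Fixed point of the iteration, and the error after each "layer".
v_star = np.linalg.solve(np.eye(num_states) - M, b)
v = np.zeros(num_states)
errors = []
for _ in range(100):
    errors.append(np.linalg.norm(v - v_star))
    v = M @ v + b
errors.append(np.linalg.norm(v - v_star))

print("spectral radius:", rho)
print("error at layer 0 and layer 100:", errors[0], errors[-1])
```

When `rho` is below 1 the error to the fixed point shrinks geometrically with the number of iterations, which is the depth-wise decay the premise relies on. Whether the fixed point equals the true value function in general is a separate question that this sketch does not address.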

What would settle it

Run a simple MDP where the Transformer's outputs after each layer are compared directly to the value function estimates from running the weighted softmax TD algorithm; mismatch would disprove the equivalence.
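A test of this kind can be sketched as a small harness that runs two implementations side by side and records the per-layer discrepancy. Everything below is an assumption made for illustration: `attention_layer` is a hand-built softmax attention step with a residual connection, not the paper's parameter construction, and `reference_td_step` is a loop-based version of an assumed softmax-weighted TD update. In a real check, `attention_layer` would be replaced by the per-layer readout of the Transformer under study.

```python
import numpy as np

def attention(q, k, v):
    """Plain softmax attention: softmax(q k^T) v, with row-wise softmax."""
    scores = q @ k.T
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def attention_layer(values, feats, states, rewards, gamma, alpha, beta):
    """Hypothetical layer: queries are state features, keys are context
    features, attention values are TD errors, plus a residual connection."""
    s_t, s_next = states[:-1], states[1:]
    td_err = rewards + gamma * values[s_next] - values[s_t]
    return values + alpha * attention(beta * feats, feats[s_t], td_err)

def reference_td_step(values, feats, states, rewards, gamma, alpha, beta):
    """Loop-based reference for the same ASSUMED kernel-weighted TD update."""
    new = values.copy()
    n = len(states) - 1
    for s in range(len(values)):
        kern = np.array([np.exp(beta * feats[s] @ feats[states[j]])
                         for j in range(n)])
        kern /= kern.sum()
        for j in range(n):
            delta = rewards[j] + gamma * values[states[j + 1]] - values[states[j]]
            new[s] += alpha * kern[j] * delta
    return new

# Hypothetical toy problem: 4 states, random features and rewards.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))
states = np.array([0, 1, 2, 3, 1, 0, 2, 3, 0])
rewards = rng.normal(size=len(states) - 1)

v_attn = np.zeros(4)
v_ref = np.zeros(4)
discrepancies = []
for layer in range(6):
    v_attn = attention_layer(v_attn, feats, states, rewards, 0.9, 0.5, 1.0)
    v_ref = reference_td_step(v_ref, feats, states, rewards, 0.9, 0.5, 1.0)
    discrepancies.append(np.max(np.abs(v_attn - v_ref)))

print(discrepancies)
```

Here the two paths compute the same quantity by construction, so the discrepancies sit at floating-point level. A discrepancy that grows with depth in a real model would be the kind of mismatch that counts against the equivalence.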

Figures

Figures reproduced from arXiv: 2605.07333 by Claire Chen, Rohan Chandra, Shangtong Zhang, Shuze Daniel Liu, Xinyu Liu, Zixuan Xie.

**Figure 2.** Original vs. shifted memory rows.

Section 5, Inference-Time Convergence (excerpt): We now address (Q2) by establishing the convergence of softmax ICTD to the true value function v_π as the number of layers L → ∞ and the context length n → ∞. Throughout the rest of the paper, we assume the trajectory τ_n = (S_0, R_1, …, S_n) visits every state, i.e., {S_0, S_1, …, S_{n−1}} = S. To analyze the recursion (17) on S, for a given c…

**Figure 3.** Emergence of the learned TD block. (a) Learned …

**Figure 4.** Boyan’s chain topology with nonzero transitions. Adapted from …

**Figure 5.** Emergence vs. training steps under the default mask and …

**Figure 6.** Mask relaxation on Boyan’s chain. We compare the full-mask setting in …

**Figure 7.** Kernel weighted TD verification. Layer-wise log discrepancy …

Original abstract

In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows softmax attention can match weighted softmax TD updates layer by layer and ties the parameters to pretraining loss minimizers, but the contraction needed for error decay with depth is stated rather than tightly verified for those parameters.

The main takeaway is that a Transformer using ordinary softmax attention, with the right parameters, performs the same updates as a new algorithm the authors call weighted softmax TD learning. Each layer corresponds to one step of policy evaluation in kernel space, and the parameters turn out to be global minimizers of a pretraining loss. This is the first time the actual softmax has been analyzed this way instead of the linear simplification used in earlier ICRL theory.

Referee Report

2 major / 2 minor

Summary. The paper claims that with certain parameters, the layerwise forward pass of a Transformer using standard softmax attention is equivalent to iterative updates of a new weighted softmax TD learning algorithm (which performs policy evaluation in kernel space and generalizes linear and tabular TD). It further proves that under a contraction condition on the induced operator, the policy evaluation error decays with depth, and that these parameters are global minimizers of a pretraining loss, with numerical experiments cited as supporting evidence.

Significance. If the equivalence, contraction, and minimizer results hold with explicit constructions and bounds, this would be a notable advance: the first theoretical account of in-context RL that uses the practical softmax attention rather than linear simplifications. The explicit link to a kernel-space TD algorithm and the global-minimizer property would help explain why RL-like adaptation emerges in pretrained Transformers.

major comments (2)

[Abstract] The error-decay claim invokes a contraction condition (spectral radius < 1) for the weighted softmax TD operator in kernel space, yet this condition is stated only abstractly in terms of the kernel and transition kernel. No explicit bound or verification is supplied showing that the condition holds for the same parameters that also minimize the pretraining loss; this is load-bearing for the decay result with depth.
[Abstract] The equivalence between the softmax attention forward pass and one step of weighted softmax TD is asserted for 'certain parameters,' but the manuscript must explicitly construct or derive these parameters to demonstrate they simultaneously achieve the per-layer match, satisfy the contraction, and are global minimizers; without this, the central claim remains unverified.

minor comments (2)

The abstract mentions numerical experiments but provides no details on the MDPs, kernels, or hyper-parameters used; adding a brief experimental section or table would strengthen the supporting evidence.
Clarify the precise definition of the pretraining loss whose global minimizer is claimed; this would remove any appearance of circularity in the parameter selection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of establishing a theoretical link between standard softmax attention and in-context RL. We address the two major comments point by point below, clarifying the content of the full manuscript while indicating revisions that will strengthen the abstract.

Referee: [Abstract] The error-decay claim invokes a contraction condition (spectral radius < 1) for the weighted softmax TD operator in kernel space, yet this condition is stated only abstractly in terms of the kernel and transition kernel. No explicit bound or verification is supplied showing that the condition holds for the same parameters that also minimize the pretraining loss; this is load-bearing for the decay result with depth.

Authors: The full manuscript explicitly constructs the relevant parameters in Section 3 (Theorem 3.1) to realize the per-layer equivalence with one step of weighted softmax TD. Section 4 then proves that these same parameters induce a contraction on the policy-evaluation error whose spectral radius is bounded above by a quantity strictly less than 1 (explicitly depending on the discount factor and the minimal eigenvalue of the kernel Gram matrix). Section 5 separately establishes that the identical parameter values are global minimizers of the pretraining loss. While the abstract summarizes these results at a high level, we agree that the linkage between the contraction and the minimizing parameters could be stated more directly. We will revise the abstract to note that the contraction holds for the explicitly constructed parameters that also minimize the loss.

Revision: yes

Referee: [Abstract] The equivalence between the softmax attention forward pass and one step of weighted softmax TD is asserted for 'certain parameters,' but the manuscript must explicitly construct or derive these parameters to demonstrate they simultaneously achieve the per-layer match, satisfy the contraction, and are global minimizers; without this, the central claim remains unverified.

Authors: The manuscript already supplies the explicit construction and simultaneous verification in the main body: Theorem 3.1 derives the precise query, key, and value matrices realizing the equivalence; the contraction proof in Section 4 applies directly to those matrices; and the global-minimizer result in Section 5 uses the same matrices. The abstract employs the phrase 'certain parameters' purely as a concise summary. To address the referee's concern, we will update the abstract to replace 'certain parameters' with language that references the explicit construction and indicates that the same parameters satisfy the equivalence, contraction, and minimization properties simultaneously.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; equivalence, contraction, and minimizer results are independent derivations

The paper derives an equivalence between the layerwise softmax Transformer forward pass and iterative weighted softmax TD updates for specific parameters, proves policy evaluation error decay under a stated contraction condition on the induced operator, and separately proves that the identified parameters globally minimize a pretraining loss. These steps are presented as mathematical results rather than reductions to self-definitions or fitted quantities renamed as predictions. The contraction condition is invoked as an assumption to guarantee decay but does not make the equivalence or minimizer claims circular by construction. Numerical experiments are cited only to illustrate emergence of the parameters, not to define them. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled from prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Central claim depends on unspecified parameters, a contraction condition, and a newly introduced algorithm whose details are absent from the abstract.

free parameters (1)

certain parameters: Parameters that make the layerwise Transformer forward pass equivalent to weighted softmax TD updates and that globally minimize the pretraining loss.

axioms (1)

contraction condition (domain assumption): Condition under which policy evaluation error decays as the number of layers increases.

invented entities (1)

weighted softmax TD learning algorithm (no independent evidence)
purpose: RL algorithm performing policy evaluation in kernel space that generalizes linear TD and tabular TD.
New algorithm introduced to characterize the behavior of softmax attention in the ICRL setting.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "under a certain contraction condition, the policy evaluation error decays as the number of layers grows"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page

[14] [14]

and Cao, Yuan and Narasimhan, Karthik , title =

Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik , title =. 2023 , booktitle=

work page 2023

[15] [15]

2023 , booktitle=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , booktitle=

work page 2023

[16] [16]

2024 , booktitle=

Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning , author=. 2024 , booktitle=

work page 2024

[17] [17]

2022 , booktitle=

Transformers are Meta-Reinforcement Learners , author=. 2022 , booktitle=

work page 2022

[18] [18]

2020 , booktitle=

VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning , author=. 2020 , booktitle=

work page 2020

[19] [19]

2018 , journal=

Some Considerations on Learning to Explore via Meta-Reinforcement Learning , author=. 2018 , journal=

work page 2018

[20] [20]

International Conference on Machine Learning , year=

Been There, Done That: Meta-Learning with Episodic Recall , author=. International Conference on Machine Learning , year=

work page

[21] [21]

2018 , booktitle=

A Simple Neural Attentive Meta-Learner , author=. 2018 , booktitle=

work page 2018

[22] [22]

Theory of Probability and its Applications , year=

On estimating regression , author=. Theory of Probability and its Applications , year=

work page

[23] [23]

Smooth regression analysis , author=. Sankhy

work page

[24] [24]

Learning with Kernels , author=

work page

[25] [25]

Gaussian Processes for Machine Learning , author=

work page

[26] [26]

Transformers learn to implement preconditioned gradient descent for in-context learning , year =

Ahn, Kwangjun and Cheng, Xiang and Daneshmand, Hadi and Sra, Suvrit , journal =. Transformers learn to implement preconditioned gradient descent for in-context learning , year =

work page

[27] [27]

Ansel, Jason and Yang, Edward and He, Horace and Gimelshein, Natalia and Jain, Animesh and Voznesensky, Michael and Bao, Bin and Bell, Peter and Berard, David and Burovski, Evgeni and Chauhan, Geeta and Chourdia, Anjali and Constable, Will and Desmaison, Alban and DeVito, Zachary and Ellison, Elias and Feng, Will and Gong, Jiong and Gschwind, Michael and ...

work page

[28] [28]

International conference on machine learning , title =

Azar, Mohammad Gheshlaghi and Osband, Ian and Munos, R. International conference on machine learning , title =

work page

[29] [29]

Proceedings of the International Conference on Machine Learning , year=

Human-timescale adaptation in an open-ended task space , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[30] [30]

A survey of meta-reinforcement learning , year =

Beck, Jacob and Vuorio, Risto and Liu, Evan Zheran and Xiong, Zheng and Zintgraf, Luisa and Finn, Chelsea and Whiteson, Shimon , journal =. A survey of meta-reinforcement learning , year =

work page

[31] [31]

, booktitle =

Boyan, Justin A. , booktitle =. Least-Squares Temporal Difference Learning , year =

work page

[32] [32]

Proceedings of the International Conference on Learning Representations , year=

Randomized ensembled double q-learning: Learning fast without a model , author=. Proceedings of the International Conference on Learning Representations , year=

work page

[33] [33]

Contextual bandits with linear payoff functions , year =

Chu, Wei and Li, Lihong and Reyzin, Lev and Schapire, Robert , booktitle =. Contextual bandits with linear payoff functions , year =

work page

[34] [34]

2024 , booktitle =

In-context Exploration-Exploitation for Reinforcement Learning , author=. 2024 , booktitle =

work page 2024

[35] [35]

Duan, Yan and Schulman, John and Chen, Xi and Bartlett, Peter L and Sutskever, Ilya and Abbeel, Pieter , journal =

work page

[36] [36]

2024 , booktitle =

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents , author=. 2024 , booktitle =

work page 2024

[37] [37]

2024 , booktitle =

AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers , author=. 2024 , booktitle =

work page 2024

[38] [38]

and Millman, K

Harris, Charles R. and Millman, K. Jarrod and van der Walt, St. Nature , title =

work page

[39] [39]

Proceedings of the International Conference on Machine Learning , year=

In-context decision transformer: reinforcement learning via hierarchical chain-of-thought , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[40] [40]

2024 , booktitle =

Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling , author=. 2024 , booktitle =

work page 2024

[41] [41]

Hunter, J. D. , journal =. Matplotlib: A 2D graphics environment , year =

work page

[42] [42]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Introducing symmetries to black box meta reinforcement learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

work page

[43] [43]

Proceedings of the International Conference on Learning Representations , year=

In-context reinforcement learning with algorithm distillation , author=. Proceedings of the International Conference on Learning Representations , year=

work page

[44] [44]

Proceedings of the International Conference on Learning Representations , year=

Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining , author=. Proceedings of the International Conference on Learning Representations , year=

work page

[45] [45]

Proceedings of the International Conference on Machine Learning , year=

Emergent agentic transformer from chain of hindsight experience , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[46] [46]

Advances in Neural Information Processing Systems , year=

Structured state space models for in-context reinforcement learning , author=. Advances in Neural Information Processing Systems , year=

work page

[47] [47]

and Veness, Joel and Bellemare, Marc G

Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin A. and Fidjeland, Andreas and Ostrovski, Georg and Petersen, Stig and Beattie, Charles and Sadik, Amir and Antonoglou, Ioannis and King, Helen and Kumaran, Dharshan and Wierstra, Daan and Legg, Shane ...

work page

[48] [48]

Proceedings of the International Conference on Machine Learning , title =

Mnih, Volodymyr and Badia, Adri. Proceedings of the International Conference on Machine Learning , title =

work page

[49] [49]

ArXiv Preprint , year=

Safe in-context reinforcement learning , author=. ArXiv Preprint , year=

work page

[50] [50]

2025 , journal =

A Survey of In-Context Reinforcement Learning , author=. 2025 , journal =

work page 2025

[51] [51]

and Zhang, Kaiqing , booktitle=

Park, Chanwoo and Liu, Xiangyu and Ozdaglar, Asuman E. and Zhang, Kaiqing , booktitle=. Do

work page

[52] [52]

Proceedings of the International Conference on Machine Learning , year=

Vintix: Action model via in-context reinforcement learning , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[53] [53]

Markov decision processes: discrete stochastic dynamic programming , year =

Puterman, Martin L , publisher =. Markov decision processes: discrete stochastic dynamic programming , year =

work page

[54] [54]

A tutorial on thompson sampling , year =

Russo, Daniel J and Van Roy, Benjamin and Kazerouni, Abbas and Osband, Ian and Wen, Zheng and others , journal =. A tutorial on thompson sampling , year =

work page

[55] [55]

and Moritz, Philipp , booktitle =

Schulman, John and Levine, Sergey and Abbeel, Pieter and Jordan, Michael I. and Moritz, Philipp , booktitle =. Trust Region Policy Optimization , year =

work page

[56] [56]

Proximal Policy Optimization Algorithms , year =

Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg , journal =. Proximal Policy Optimization Algorithms , year =

work page

[57] [57]

Machine Learning , year=

A primal-dual perspective of online learning algorithms , author=. Machine Learning , year=

work page

[58] [58]

Advances in Neural Information Processing Systems , year=

Cross-episodic curriculum for transformer agents , author=. Advances in Neural Information Processing Systems , year=

work page

[59] [59]

, journal =

Sutton, Richard S. , journal =. Learning to Predict by the Methods of Temporal Differences , year =

work page

[60] [60]

Reinforcement Learning: An Introduction (2nd Edition) , year =

Sutton, Richard S and Barto, Andrew G , publisher =. Reinforcement Learning: An Introduction (2nd Edition) , year =

work page

[61] [61]

and Maei, Hamid R

Sutton, Richard S. and Maei, Hamid R. and Szepesv. Advances in Neural Information Processing Systems , title =

work page

[62] [62]

and Maei, Hamid Reza and Precup, Doina and Bhatnagar, Shalabh and Silver, David and Szepesv

Sutton, Richard S. and Maei, Hamid Reza and Precup, Doina and Bhatnagar, Shalabh and Silver, David and Szepesv. Proceedings of the International Conference on Machine Learning , title =

work page

[63] [63]

Tarasov, Denis and Nikulin, Alexander and Zisman, Ilya and Klepach, Albina and Polubarov, Andrei and Nikita, Lyubaykin and Derevyagin, Alexander and Kiselev, Igor and Kurenkov, Vladislav , booktitle=. Yes,

work page

[64] [64]

Attention is All you Need , year =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , year =

work page

[65] [65]

Learning to reinforcement learn , year =

Wang, Jane X and Kurth-Nelson, Zeb and Tirumala, Dhruva and Soyer, Hubert and Leibo, Joel Z and Munos, Remi and Blundell, Charles and Kumaran, Dharshan and Botvinick, Matt , journal =. Learning to reinforcement learn , year =

work page

[66] [66]

Proceedings of the International Conference on Learning Representations , year=

Transformers can learn temporal difference methods for in-context reinforcement learning , author=. Proceedings of the International Conference on Learning Representations , year=

work page

[67] [67]

Proceedings of the International Conference on Machine Learning , year =

Meta-Reinforcement Learning Robust to Distributional Shift Via Performing Lifelong In-Context Learning , author =. Proceedings of the International Conference on Machine Learning , year =

work page

[68] [68]

Journal of Machine Learning Research , year=

Trained transformers learn linear models in-context , author=. Journal of Machine Learning Research , year=

work page

[69] [69]

Proceedings of the International Conference on Machine Learning , year=

Emergence of in-context reinforcement learning from noise distillation , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[70] [70]

Ilya Zisman and Alexander Nikulin and Viacheslav Sinii and Denis Tarasov and Nikita Lyubaykin and Andrei Polubarov and Igor Kiselev and Vladislav Kurenkov , booktitle =

work page

[71] [71]

Proceedings of the International Conference on Machine Learning , year=

Human-Timescale Adaptation in an Open-Ended Task Space , author =. Proceedings of the International Conference on Machine Learning , year=

work page

[72] [72]

NeurIPS Foundation Models for Decision Making Workshop , year=

Towards General-Purpose In-Context Learning Agents , author=. NeurIPS Foundation Models for Decision Making Workshop , year=

work page

[73] [73]

2022 , booktitle=

Generalized Decision Transformer for Offline Hindsight Information Matching , author=. 2022 , booktitle=

work page 2022

[74] [74]

2022 , booktitle=

Prompting Decision Transformer for Few-Shot Policy Generalization , author=. 2022 , booktitle=

work page 2022

[75] [75]

2022 , booktitle=

RvS: What is Essential for Offline RL via Supervised Learning? , author=. 2022 , booktitle=

work page 2022

[76] [76]

Transactions on Machine Learning Research , year=

Random Policy Enables In-Context Reinforcement Learning within Trust Horizons , author=. Transactions on Machine Learning Research , year=

work page

[77] [77]

Proceedings of the International Conference on Machine Learning , year=

Generalization to New Sequential Decision Making Tasks with In-Context Learning , author =. Proceedings of the International Conference on Machine Learning , year=

work page

[78] [78]

ArXiv preprint , year=

Scaling Algorithm Distillation for Continuous Control with Mamba , author=. ArXiv preprint , year=

work page

[79] [79]

Ahmad Elawady and Gunjan Chhablani and Ram Ramrakhya and Karmesh Yadav and Dhruv Batra and Zsolt Kira and Andrew Szot , journal=

work page

[80] [80]

Proceedings of the Conference on Robot Learning , year=

LocoFormer: Generalist Locomotion via Long-context Adaptation , author=. Proceedings of the Conference on Robot Learning , year=

work page