
arxiv: 2605.05755 · v1 · submitted 2026-05-07 · stat.ML · cs.AI · cs.LG

Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

Pith reviewed 2026-05-08 05:33 UTC · model grok-4.3

classification: stat.ML · cs.AI · cs.LG
keywords: in-context reinforcement learning · transformers · policy improvement · semi-gradient SARSA · actor-critic · self-attention · gradient flow convergence

The pith

A linear self-attention transformer block can execute policy-improvement steps from reinforcement learning algorithms when supplied with trajectory data in context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transformers are capable of performing in-context reinforcement learning by implementing policy improvement directly from input examples rather than through weight updates at deployment time. It provides explicit constructions showing how a linear self-attention block can replicate the updates used in semi-gradient SARSA and actor-critic methods. A sympathetic reader would care because this supplies a concrete mechanism by which models can improve their behavior on new tasks simply by processing past trajectories. The work further demonstrates that a teacher-mimicking training procedure causes gradient flow to converge to the parameters realizing these updates, provided the training distribution of MDPs meets richness conditions.
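For reference, the semi-gradient SARSA update that the construction targets can be written in a few lines. This is the textbook form with a linear action-value estimate; the one-hot features, the step size `alpha`, and the discount `gamma` are illustrative assumptions, and nothing here reproduces the paper's attention weights.

```python
import numpy as np

def sarsa_step(w, phi_sa, reward, phi_next, alpha=0.1, gamma=0.9):
    """One semi-gradient SARSA update for q(s, a) = w . phi(s, a)."""
    # TD error: bootstrapped target minus current estimate
    td_error = reward + gamma * (w @ phi_next) - (w @ phi_sa)
    # Semi-gradient: differentiate only through the current estimate
    return w + alpha * td_error * phi_sa

def one_hot(s, a, n_actions=2, dim=4):
    """Hypothetical tabular feature map: one coordinate per (state, action)."""
    phi = np.zeros(dim)
    phi[s * n_actions + a] = 1.0
    return phi

w = np.zeros(4)
# A single transition (s=0, a=1, r=1.0, s'=1, a'=0)
w = sarsa_step(w, one_hot(0, 1), 1.0, one_hot(1, 0))
# w is now [0.0, 0.1, 0.0, 0.0]
```

In the in-context setting described above, the analogous update would be applied to quantities held in the prompt rather than to stored weights.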

Core claim

A linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, the paper analyzes a teacher-mimicking training procedure and establishes that, under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, models trained on randomly generated tabular MDPs recover the predicted parameter structure and achieve strong control performance on unseen MDPs.

What carries the argument

The linear self-attention transformer block: the paper shows that it can be parameterized to compute the policy-improvement update by combining action-value estimates drawn from the input context.
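For orientation, a linear self-attention layer drops the softmax, so its output is a low-degree polynomial in the prompt. The sketch below uses one parameterization that is common in the in-context learning literature, Z + (1/n) P Z M (Zᵀ Q Z). The matrices `P` and `Q`, the mask `M`, and the 1/n scaling are assumptions for illustration; they are not the paper's constructions, whose figures refer to P and V matrices.

```python
import numpy as np

def linear_self_attention(Z, P, Q):
    """One residual linear-attention layer: Z + (1/n) * P Z M (Z^T Q Z).

    Z has shape (d, n + 1): n context columns followed by one query column.
    """
    d, n_plus_1 = Z.shape
    n = n_plus_1 - 1
    M = np.eye(n_plus_1)
    M[-1, -1] = 0.0                 # mask: the query column contributes no value
    scores = Z.T @ Q @ Z            # (n+1, n+1) unnormalized attention scores
    return Z + (P @ Z @ M @ scores) / n

rng = np.random.default_rng(0)
d, n = 4, 6
Z = rng.normal(size=(d, n + 1))
P = np.zeros((d, d))
P[-1, -1] = 1.0                     # hypothetical choice: write only into the last row
Q = rng.normal(size=(d, d))
out = linear_self_attention(Z, P, Q)
```

With this sparse `P`, every row of the prompt except the last passes through unchanged, which is the kind of structure that lets a layer act as an update to one designated slot.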

If this is right

  • The transformer applies semi-gradient SARSA updates to new trajectories presented in its context without any parameter changes.
  • The same block can implement actor-critic style policy improvements when the appropriate parameters are used.
  • Gradient flow during teacher-mimicking training reaches the parameter manifold that realizes these RL updates.
  • Once trained, the model delivers strong in-context control performance on MDPs never seen during training.
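The actor-critic style of policy improvement mentioned in the list above can be sketched in its one-step textbook form. The tabular softmax policy, the state-value critic, and the step sizes are assumptions made for illustration; the paper's actor-critic variant may differ in its details.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, v, s, a, r, s_next,
                      alpha_pi=0.1, alpha_v=0.1, gamma=0.9):
    """One-step tabular actor-critic; returns updated copies of (theta, v)."""
    theta, v = theta.copy(), v.copy()
    delta = r + gamma * v[s_next] - v[s]        # TD error from the critic
    v[s] += alpha_v * delta                     # critic: semi-gradient TD(0)
    grad_log_pi = -softmax(theta[s])            # d log pi(a|s) / d theta[s, :]
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * delta * grad_log_pi  # actor: policy-gradient step
    return theta, v

theta0 = np.zeros((2, 2))   # policy logits, 2 states x 2 actions
v0 = np.zeros(2)            # state-value estimates
theta1, v1 = actor_critic_step(theta0, v0, s=0, a=1, r=1.0, s_next=1)
# v1 is [0.1, 0.0]; theta1[0] is [-0.05, 0.05]
```

A positive TD error raises the logit of the action that was taken and lowers the others, which is the policy-improvement step in this family of methods.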

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same construction may extend to show how transformers can internalize variants of Q-learning or other value-based methods.
  • The convergence guarantee suggests that dataset design for in-context RL should prioritize diversity across MDPs rather than depth within a single environment.
  • Observed strong few-shot behavior in large language models on sequential decision tasks could partly arise from similar internalized policy-improvement mechanisms.

Load-bearing premise

The distribution of MDPs used during training must be rich enough for gradient flow to converge locally and exponentially to the target parameter manifold.
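In symbols, a local exponential convergence statement for gradient flow typically takes the shape below. The loss L, the target manifold M⋆, and the constants δ, C, c > 0 are generic placeholders rather than the paper's notation; in analyses of this kind, a condition such as the richness assumption is usually what makes the rate c strictly positive.

```latex
\dot{\theta}(t) = -\nabla_{\theta} L\bigl(\theta(t)\bigr),
\qquad
\operatorname{dist}\bigl(\theta(0), \mathcal{M}^{\star}\bigr) \le \delta
\;\Longrightarrow\;
\operatorname{dist}\bigl(\theta(t), \mathcal{M}^{\star}\bigr)
\le C\, e^{-c t}\, \operatorname{dist}\bigl(\theta(0), \mathcal{M}^{\star}\bigr)
\quad \text{for all } t \ge 0 .
```

The guarantee is local: it says nothing about initializations farther than δ from the manifold.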

What would settle it

Train the transformer on a narrow distribution of similar MDPs and observe whether the resulting weights fail to match the explicit constructions or produce weak performance on new MDPs.
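One way to set up such a test is a generator with an explicit diversity knob. The sketch below is hypothetical throughout: `sample_mdp`, the `spread` parameter, the Dirichlet transitions, and the uniform rewards are illustrative choices, not the paper's generator.

```python
import numpy as np

def sample_mdp(rng, n_states=5, n_actions=3, base=None, spread=1.0):
    """Sample a tabular MDP as (P, R).

    P has shape (S, A, S) with rows summing to 1; R has shape (S, A).
    spread=1 gives a fresh random MDP; spread near 0 stays close to `base`.
    """
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(size=(n_states, n_actions))
    if base is not None:
        P0, R0 = base
        # A convex mix of stochastic rows is still a valid transition kernel
        P = (1.0 - spread) * P0 + spread * P
        R = (1.0 - spread) * R0 + spread * R
    return P, R

rng = np.random.default_rng(0)
base = sample_mdp(rng)
narrow = [sample_mdp(rng, base=base, spread=0.05) for _ in range(100)]
rich = [sample_mdp(rng) for _ in range(100)]
```

Training one model on `narrow` and another on `rich`, then comparing learned weights and held-out returns, would be one concrete form of the experiment described above.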

Figures

Figures reproduced from arXiv: 2605.05755 by Haodong Liang, Lifeng Lai.

Figure 1: A simplified pipeline of transformer’s ICRL implementation. The agent samples a trajectory …

Figure 2: Final learned P and V matrices after training, for the SARSA transformer (top row) and the Actor-Critic transformer (bottom row). In both settings, the non-zero structure matches the theoretical constructions (P⋆, V⋆) in (5) and (14). … entries (gain 0.1). Training is performed over K = 10,000 independently sampled MDPs using Adam optimizer [Kingma and Ba, 2015] with initial learning rate η = 10⁻³ and expo…

Figure 3: Closed-loop control performance on 100 held-out random MDPs versus in-context update step t. Curves show the per-MDP return averaged across MDPs (shaded: 25–75% interquartile band). The transformer curve closely tracks the teacher curve in both settings, indicating that the trained model approximately implements the analytical update rule at deployment. We compare against analytical teacher updates (SARSA …

Figure 4: Training loss L̂(θ^(t)) over training iterations for the SARSA transformer (left) and the Actor-Critic transformer (right). In both settings, the loss decays approximately exponentially to near zero, consistent with the local convergence guarantee in Theorem 3.2. … I.2 Experimental Detail: The trainings and evaluations of the transformers used in our experiment were conducted on a Windows 11 machine with the…

Original abstract

We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a linear self-attention transformer block can provably implement in-context policy-improvement algorithms (semi-gradient SARSA and actor-critic) via explicit parameter constructions. It further shows that a teacher-mimicking training loss, when optimized by gradient flow, converges locally and exponentially to the target parameter manifold under suitable richness conditions on the distribution of training MDPs. Experiments on randomly generated tabular MDPs confirm that trained models recover the constructed parameter structure and achieve strong in-context control on unseen MDPs.

Significance. If the constructions and convergence result hold, the work supplies the first explicit mechanistic account of how transformers can internalize classical RL updates in context, together with the first convergence guarantee in the ICRL literature. The explicit constructions and the gradient-flow analysis are concrete strengths that could guide both architecture design and training procedures for in-context learners.

major comments (3)
  1. [Theorem 4.3] Theorem 4.3 (gradient-flow convergence): the local exponential convergence guarantee is conditioned on 'suitable richness conditions' on the MDP distribution, yet the paper does not provide an explicit, checkable statement of these conditions nor verify that they hold for the randomly generated tabular MDPs used in §6. Without this verification the exponential rate does not necessarily apply to the reported experiments.
  2. [§5.2] §5.2 (basin of attraction): the analysis establishes local convergence to the target manifold but does not characterize the size of the basin or show that typical random initializations lie inside it. This leaves open whether the training procedure reliably reaches the desired RL update in practice.
  3. [§6.1] §6.1 (empirical validation): the experiments are restricted to tabular MDPs with no error bars, no ablation of the richness assumption, and no comparison against non-transformer baselines that could isolate the effect of the attention mechanism.
minor comments (2)
  1. [Eq. (3)] Notation for the linear self-attention block (Eq. (3)) should explicitly distinguish the query/key/value projections from the subsequent value mixing that realizes the SARSA update.
  2. [Figure 3] Figure 3 caption should state the exact number of random seeds and MDPs used to generate the reported curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our theoretical and empirical results. We address each major point below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Theorem 4.3] Theorem 4.3 (gradient-flow convergence): the local exponential convergence guarantee is conditioned on 'suitable richness conditions' on the MDP distribution, yet the paper does not provide an explicit, checkable statement of these conditions nor verify that they hold for the randomly generated tabular MDPs used in §6. Without this verification the exponential rate does not necessarily apply to the reported experiments.

    Authors: We agree that the richness conditions should be stated explicitly to allow verification. In the revised manuscript, we will provide a precise, checkable formulation: the training distribution must ensure that the Gram matrix of the concatenated state-action-next-state features across sampled trajectories has full rank equal to the dimension of the target parameter space. We will also add a short appendix verifying that the random tabular MDP generator (with uniform sampling over states, actions, and transitions) satisfies this condition with high probability for the dimensions used in §6, thereby confirming applicability of the exponential rate to the experiments. revision: yes

  2. Referee: [§5.2] §5.2 (basin of attraction): the analysis establishes local convergence to the target manifold but does not characterize the size of the basin or show that typical random initializations lie inside it. This leaves open whether the training procedure reliably reaches the desired RL update in practice.

    Authors: The result in §5.2 is deliberately local, as is standard for gradient-flow analyses of non-convex losses; a full global characterization of the basin would require substantially stronger assumptions on the loss landscape that lie outside the paper's scope. To address practical reliability, we will expand §5.2 with a brief discussion of sufficient conditions for a large basin (e.g., when the teacher loss is strongly convex near the manifold) and will report, in the experiments, the fraction of random initializations that successfully converge to the target structure, providing empirical evidence that typical initializations lie inside the basin for the considered settings. revision: partial

  3. Referee: [§6.1] §6.1 (empirical validation): the experiments are restricted to tabular MDPs with no error bars, no ablation of the richness assumption, and no comparison against non-transformer baselines that could isolate the effect of the attention mechanism.

    Authors: We acknowledge these limitations in the current experimental section. In the revision we will (i) add error bars by repeating all runs over at least 10 random seeds, (ii) include an ablation that varies MDP richness (e.g., by restricting transition diversity) and reports the resulting convergence behavior, and (iii) add comparisons against non-transformer baselines (MLPs and LSTMs) trained with the identical teacher-mimicking objective, thereby isolating the contribution of the linear self-attention mechanism to in-context policy improvement. revision: yes
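The rank condition proposed in the first response is straightforward to check numerically once a feature map is fixed. The sketch below is an illustration under stated assumptions: `gram_is_full_rank`, the Gaussian stand-in features, and the tolerance are hypothetical, and the paper's actual feature construction may differ.

```python
import numpy as np

def gram_is_full_rank(features, tol=1e-8):
    """features: (n_samples, dim) array, one row per sampled transition.

    Returns (is_full_rank, rank) for the empirical Gram matrix X^T X / n.
    """
    n, dim = features.shape
    gram = features.T @ features / n
    rank = int(np.linalg.matrix_rank(gram, tol=tol))
    return rank == dim, rank

rng = np.random.default_rng(0)
# Stand-in for a diverse training distribution: 200 generic feature rows
rich = rng.normal(size=(200, 6))
# Stand-in for a degenerate distribution: the same feature row repeated
narrow = np.tile(rng.normal(size=(1, 6)), (200, 1))

print(gram_is_full_rank(rich))    # (True, 6)
print(gram_is_full_rank(narrow))  # (False, 1)
```

One caveat when adapting this: concatenating two one-hot blocks always yields a rank-deficient matrix, because each block sums to one on every row, so the target dimension must be counted with that redundancy in mind.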

Circularity Check

0 steps flagged

No significant circularity; explicit constructions and convergence analysis remain independent of self-referential inputs.

full rationale

The paper derives explicit parameter constructions in a linear self-attention block that realize semi-gradient SARSA and actor-critic updates by direct mapping from the external RL algorithms. The subsequent teacher-mimicking loss is defined against an independent teacher that already executes those same external updates, so the target manifold is fixed by the RL methods rather than by the training procedure itself. Gradient-flow analysis then proves local exponential convergence to that manifold under stated richness conditions on the MDP distribution. This chain relies on external benchmarks (classical RL algorithms) and does not reduce any central claim to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. The empirical confirmation on randomly generated MDPs further tests recovery of the externally defined structure rather than tautological self-matching.
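A deliberately tiny toy can make the teacher-mimicking setup concrete. Here the "student" is a single scalar step size `theta` rather than a transformer, the teacher applies a fixed step size, and the data are synthetic. All names and numbers are hypothetical, and the squared loss is an assumption, not the paper's objective.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_teacher = 0.1                # teacher's step size: the target to recover
td = rng.normal(size=256)          # synthetic TD errors, one per transition
phi = rng.normal(size=(256, 4))    # synthetic feature vectors phi(s, a)

def loss_and_grad(theta):
    """Mean squared gap between student and teacher weight updates."""
    # Student update: theta * td * phi; teacher update: alpha_teacher * td * phi
    diff = (theta - alpha_teacher) * td[:, None] * phi
    loss = np.mean(np.sum(diff ** 2, axis=1))
    # Curvature of the quadratic loss; it is zero if the data are degenerate
    curvature = np.mean(td ** 2 * np.sum(phi ** 2, axis=1))
    return loss, 2.0 * curvature * (theta - alpha_teacher)

theta, lr, losses = 1.0, 0.05, []
for _ in range(200):
    loss, grad = loss_and_grad(theta)
    losses.append(loss)
    theta -= lr * grad             # discrete-time stand-in for gradient flow
```

Because the loss is quadratic in `theta`, each step shrinks the gap to the teacher by a constant factor, the discrete analogue of exponential convergence. If every TD error or feature were zero, the curvature would vanish and `theta` would never move, which is a crude picture of why some richness in the data is needed.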

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the existence of explicit linear-attention weight matrices that realize the RL updates and on the richness assumption needed for the gradient-flow convergence theorem; no additional free parameters or invented entities are introduced beyond standard transformer and MDP formalisms.

axioms (1)
  • domain assumption: Suitable richness conditions on the training MDP distribution.
    Invoked to guarantee local exponential convergence of gradient flow to the target parameter manifold.

pith-pipeline@v0.9.0 · 5470 in / 1257 out tokens · 49777 ms · 2026-05-08T05:33:53.117110+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. A Survey on In-context Learning. 2024.
  2. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.
  3. What can transformers learn in-context? a case study of simple function classes. Proceedings of Neural Information Processing Systems.
  4. Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning. The Thirteenth International Conference on Learning Representations.
  5. Reinforcement Learning: An Introduction. 2018.
  6. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is All you Need. 2017.
  7. In-context Reinforcement Learning with Algorithm Distillation. The Eleventh International Conference on Learning Representations.
  8. Supervised Pretraining Can Learn In-Context Reinforcement Learning. Thirty-seventh Conference on Neural Information Processing Systems.
  9. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others.
  10. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and others. 2024.
  11. Transformers Learn In-Context by Gradient Descent. Proceedings of the 40th International Conference on Machine Learning.
  12. Aky. What learning algorithm is in-context learning? The Eleventh International Conference on Learning Representations.
  13. Transformers learn to implement preconditioned gradient descent for in-context learning. Thirty-seventh Conference on Neural Information Processing Systems.
  14. A Survey of In-Context Reinforcement Learning. 2025.
  15. Diederik P. Kingma, Jimmy Ba. International Conference on Learning Representations (ICLR).
  16. One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention. The Twelfth International Conference on Learning Representations.
  17. Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection. Thirty-seventh Conference on Neural Information Processing Systems.
  18. Trained Transformers Learn Linear Models In-Context. Journal of Machine Learning Research.
  19. Large Language Models can Implement Policy Iteration. Thirty-seventh Conference on Neural Information Processing Systems.
  20. Transformers Handle Endogeneity in In-Context Linear Regression. The Thirteenth International Conference on Learning Representations.
  21. Learning to Solve New sequential decision-making Tasks with In-Context Learning. NeurIPS 2023 Foundation Models for Decision Making Workshop.
  22. Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining. The Twelfth International Conference on Learning Representations.
  23. In-Context Convergence of Transformers. 2023.
  24. Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality. 2024.
  25. Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. Thirty-seventh Conference on Neural Information Processing Systems.
  26. How Transformers Learn Causal Structure with Gradient Descent. Forty-first International Conference on Machine Learning.
  27. Decision Transformer: Reinforcement Learning via Sequence Modeling. Advances in Neural Information Processing Systems, 2021.
  28. Michael Janner, Qiyang Li, Sergey Levine. Offline Reinforcement Learning as One Big Sequence Modeling Problem.
  29. Jake Grigsby, Linxi Fan, Yuke Zhu. 2024.
  30. Can large language models explore in-context? The Thirty-eighth Annual Conference on Neural Information Processing Systems.