
arxiv: 2605.07333 · v2 · pith:QVL7Y3HRnew · submitted 2026-05-08 · 💻 cs.LG

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

Pith reviewed 2026-05-20 23:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords: in-context reinforcement learning · softmax attention · temporal difference learning · transformers · policy evaluation · kernel space · pretraining loss

The pith

Softmax Transformers implement in-context reinforcement learning by matching iterative weighted softmax TD updates across layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that standard softmax attention in Transformers can perform in-context policy evaluation for reinforcement learning tasks. By choosing specific parameters, the computation in each Transformer layer corresponds exactly to an update step in a new algorithm called weighted softmax temporal difference learning, which operates in kernel space and includes linear and tabular TD as special cases. Under a contraction condition, the error in estimating the policy value decreases as more layers are added. These same parameters turn out to be the global minimizer of the pretraining loss function, which explains their appearance in experiments. This provides a theoretical basis for why pretrained Transformers can adapt to new RL problems just by conditioning on context data.

Core claim

With certain parameters, the layerwise forward pass of a Transformer with softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Under a certain contraction condition, the policy evaluation error decays as the number of layers grows with these parameters. Those parameters are a global minimizer of a pretraining loss.
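The depth claim rests on a standard contraction argument, which can be written out as a worked step. The notation below is assumed for illustration and is not taken from the paper: $T$ stands for the one-layer update operator, $v_\pi$ for its fixed point (the true value function), $v_\ell$ for the estimate after $\ell$ layers, and $\rho \in [0, 1)$ for the contraction modulus in some norm.

```latex
% Assumed notation (not from the paper): T is the per-layer update operator,
% v_\pi its fixed point, v_\ell = T v_{\ell-1}, and \rho \in [0,1) its modulus.
\begin{align*}
\|v_L - v_\pi\|
  &= \|T v_{L-1} - T v_\pi\| \\
  &\le \rho \, \|v_{L-1} - v_\pi\| \\
  &\le \cdots \le \rho^{L} \, \|v_0 - v_\pi\|
  \;\longrightarrow\; 0 \quad (L \to \infty).
\end{align*}
```

If the condition is instead stated as a spectral radius below 1 for a linear operator, the same geometric decay holds in a suitably chosen norm, though not necessarily monotonically in every norm.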

What carries the argument

The equivalence of softmax attention to iterative updates in the weighted softmax TD learning algorithm for policy evaluation in kernel space.
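To make the idea of a softmax-weighted TD update concrete, here is a minimal sketch in plain Python. It is an illustrative guess at the general shape of such an update, not the paper's weighted softmax TD algorithm: the function `kernel_td_sweep`, the inverse-temperature `beta`, the step size `alpha`, and the dot-product similarity are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kernel_td_sweep(values, feats, transitions, gamma=0.9, alpha=0.5, beta=1.0):
    """One synchronous sweep of a HYPOTHETICAL softmax-weighted TD update.

    values:      current value estimates, one per state
    feats:       feature vectors, one per state
    transitions: (s, r, s_next) tuples observed in the context
    """
    # TD error of every context transition under the current estimates.
    td_errors = [r + gamma * values[s2] - values[s] for s, r, s2 in transitions]
    new_values = []
    for q in range(len(values)):
        # Attention-like weights: similarity between the query state and each
        # transition's source state, normalised with softmax (an assumption).
        weights = softmax([beta * dot(feats[q], feats[s]) for s, _, _ in transitions])
        update = sum(w * d for w, d in zip(weights, td_errors))
        new_values.append(values[q] + alpha * update)
    return new_values

# Two-state toy chain: 0 -> 1 with reward 0, then 1 -> 1 with reward 1.
feats = [[1.0, 0.0], [0.0, 1.0]]
transitions = [(0, 0.0, 1), (1, 1.0, 1)]
values = [0.0, 0.0]
for _ in range(3):  # three sweeps, playing the role of three layers
    values = kernel_td_sweep(values, feats, transitions)
```

Each sweep plays the role of one layer: the estimates are refined using only the context transitions, with no parameter update. In this sketch, a larger `beta` concentrates the weights on transitions whose source state matches the query, which is one way a tabular-style update could arise as a limiting case.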

If this is right

  • Transformers can adapt to new tasks in-context by performing policy evaluation through their layered computations.
  • The policy evaluation error decreases with increasing model depth under the contraction condition.
  • Weighted softmax TD learning generalizes both linear TD and tabular TD methods.
  • Pretraining naturally leads to parameters that enable this in-context RL behavior.
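The third bullet relies on the familiar relationship between linear and tabular TD. As a small check of that well-known fact, independent of the paper's new algorithm, the sketch below runs TD(0) in tabular form and in linear form with one-hot features and shows that they produce the same estimates. The function names and the toy transitions are made up for this example.

```python
def tabular_td(transitions, n_states, gamma=0.9, alpha=0.1):
    """Tabular TD(0): one table entry per state."""
    v = [0.0] * n_states
    for s, r, s2 in transitions:
        v[s] += alpha * (r + gamma * v[s2] - v[s])
    return v

def linear_td(transitions, feats, gamma=0.9, alpha=0.1):
    """Linear TD(0): the value of state s is the dot product w . feats[s]."""
    w = [0.0] * len(feats[0])
    for s, r, s2 in transitions:
        v_s = sum(wi * xi for wi, xi in zip(w, feats[s]))
        v_s2 = sum(wi * xi for wi, xi in zip(w, feats[s2]))
        delta = r + gamma * v_s2 - v_s
        # Semi-gradient step along the feature vector of the visited state.
        w = [wi + alpha * delta * xi for wi, xi in zip(w, feats[s])]
    return w

# With one-hot features, each weight is exactly one table entry.
one_hot = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2), (0, 0.0, 1), (1, 0.0, 2)]
v_tab = tabular_td(transitions, 3)
w_lin = linear_td(transitions, one_hot)
same = all(abs(a - b) < 1e-12 for a, b in zip(v_tab, w_lin))
```

Here `same` evaluates to true because a one-hot feature vector selects a single weight, so the linear update touches exactly the entry the tabular update would.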

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large language models might implicitly solve RL problems when prompted with task examples.
  • Architectural choices in attention could be designed to implement other RL algorithms like Q-learning.
  • Empirical tests could verify the error decay in deep Transformer models on simple MDPs.
  • This equivalence might extend to other sequence models beyond standard Transformers.

Load-bearing premise

There exist specific parameter settings that make the Transformer's forward pass precisely replicate the updates of the weighted softmax TD algorithm, and a contraction condition holds to ensure error reduction with depth.

What would settle it

On a simple MDP, compare the Transformer's output after each layer directly against the value estimates produced by the corresponding iteration of the weighted softmax TD algorithm; a mismatch beyond numerical tolerance would disprove the equivalence.
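A check of this kind could be scripted along the following lines. This is only a sketch of the comparison harness: it assumes the per-layer Transformer outputs and the reference TD iterates have already been collected as lists of value vectors, and the names `layerwise_log_gap` and `equivalence_holds`, the tolerance, and the toy numbers are hypothetical.

```python
import math

def layerwise_log_gap(layer_outputs, reference_iterates, eps=1e-300):
    """log10 of the max absolute gap between two value vectors, per layer."""
    if len(layer_outputs) != len(reference_iterates):
        raise ValueError("need one reference iterate per layer")
    gaps = []
    for out, ref in zip(layer_outputs, reference_iterates):
        gap = max(abs(a - b) for a, b in zip(out, ref))
        # Clamp so an exact match does not produce log10(0).
        gaps.append(math.log10(max(gap, eps)))
    return gaps

def equivalence_holds(layer_outputs, reference_iterates, tol=1e-6):
    """True if every layer matches its reference iterate within tol."""
    log_tol = math.log10(tol)
    return all(g <= log_tol for g in layerwise_log_gap(layer_outputs, reference_iterates))

# Toy data standing in for real measurements (made-up numbers).
reference = [[0.10, 0.20], [0.30, 0.50]]
matching = [[0.10 + 1e-9, 0.20], [0.30, 0.50 - 1e-9]]
diverging = [[0.10, 0.20], [0.30, 0.75]]
ok = equivalence_holds(matching, reference)
bad = equivalence_holds(diverging, reference)
```

Reporting the gap on a log scale per layer makes it easy to see whether discrepancies stay at floating-point level or grow with depth.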

Figures

Figures reproduced from arXiv: 2605.07333 by Claire Chen, Rohan Chandra, Shangtong Zhang, Shuze Daniel Liu, Xinyu Liu, Zixuan Xie.

Figure 1: In-context policy evaluation with a 15-layer dual-head Transformer using softmax attention.
Figure 2: Original vs. shifted memory rows. [Section 5, Inference-Time Convergence] We now address (Q2) by establishing the convergence of softmax ICTD to the true value function vπ as the number of layers L → ∞ and the context length n → ∞. Throughout the rest of the paper, we assume the trajectory τn = (S0, R1, . . . , Sn) visits every state, i.e., {S0, S1, . . . , Sn−1} = S. To analyze the recursion (17) on S, for a given c…
Figure 3: Emergence of the learned TD block. (a) Learned …
Figure 4: Boyan’s chain topology with nonzero transitions. Adapted from …
Figure 5: Emergence vs. training steps under the default mask and …
Figure 6: Mask relaxation on Boyan’s chain. We compare the full-mask setting in …
Figure 7: Kernel weighted TD verification. Layer-wise log discrepancy …
Original abstract

In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that with certain parameters, the layerwise forward pass of a Transformer using standard softmax attention is equivalent to iterative updates of a new weighted softmax TD learning algorithm (which performs policy evaluation in kernel space and generalizes linear and tabular TD). It further proves that under a contraction condition on the induced operator, the policy evaluation error decays with depth, and that these parameters are global minimizers of a pretraining loss, with numerical experiments cited as supporting evidence.

Significance. If the equivalence, contraction, and minimizer results hold with explicit constructions and bounds, this would be a notable advance: the first theoretical account of in-context RL that uses the practical softmax attention rather than linear simplifications. The explicit link to a kernel-space TD algorithm and the global-minimizer property would help explain why RL-like adaptation emerges in pretrained Transformers.

major comments (2)
  1. [Abstract] Abstract: the error-decay claim invokes a contraction condition (spectral radius <1) for the weighted softmax TD operator in kernel space, yet this condition is stated only abstractly in terms of the kernel and transition kernel. No explicit bound or verification is supplied showing that the condition holds for the same parameters that also minimize the pretraining loss; this is load-bearing for the decay result with depth.
  2. [Abstract] Abstract: the equivalence between the softmax attention forward pass and one step of weighted softmax TD is asserted for 'certain parameters,' but the manuscript must explicitly construct or derive these parameters to demonstrate they simultaneously achieve the per-layer match, satisfy the contraction, and are global minimizers; without this, the central claim remains unverified.
minor comments (2)
  1. The abstract mentions numerical experiments but provides no details on the MDPs, kernels, or hyper-parameters used; adding a brief experimental section or table would strengthen the supporting evidence.
  2. Clarify the precise definition of the pretraining loss whose global minimizer is claimed; this would remove any appearance of circularity in the parameter selection.
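The first major comment asks for a verifiable spectral-radius bound. As a rough illustration of what such a numerical check looks like, the sketch below estimates the spectral radius of a small matrix by power iteration. The matrix used is the classic tabular policy-evaluation operator, the discount factor times a row-stochastic transition matrix, chosen here as a stand-in; it is not the paper's weighted softmax TD operator, and the function name and the toy transition matrix are assumptions.

```python
def spectral_radius_estimate(matrix, iters=200):
    """Estimate the spectral radius of a square matrix by power iteration.

    Reliable when there is a single dominant eigenvalue, as for a
    nonnegative, irreducible, aperiodic matrix.
    """
    n = len(matrix)
    x = [float(i + 1) for i in range(n)]
    scale = max(abs(v) for v in x)
    x = [v / scale for v in x]  # start with max-norm 1
    rho = 0.0
    for _ in range(iters):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        rho = max(abs(v) for v in y)  # growth factor, since max|x| == 1
        if rho == 0.0:
            return 0.0
        x = [v / rho for v in y]
    return rho

# Stand-in operator: gamma * P for a 3-state row-stochastic P (toy numbers).
gamma = 0.9
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
A = [[gamma * p for p in row] for row in P]
rho = spectral_radius_estimate(A)
is_contractive = rho < 1.0
```

For this stand-in the estimate converges to the discount factor, since a row-stochastic matrix has spectral radius 1. A check on the paper's operator would need the explicit parameters and kernel from the manuscript.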

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of establishing a theoretical link between standard softmax attention and in-context RL. We address the two major comments point by point below, clarifying the content of the full manuscript while indicating revisions that will strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the error-decay claim invokes a contraction condition (spectral radius <1) for the weighted softmax TD operator in kernel space, yet this condition is stated only abstractly in terms of the kernel and transition kernel. No explicit bound or verification is supplied showing that the condition holds for the same parameters that also minimize the pretraining loss; this is load-bearing for the decay result with depth.

    Authors: The full manuscript explicitly constructs the relevant parameters in Section 3 (Theorem 3.1) to realize the per-layer equivalence with one step of weighted softmax TD. Section 4 then proves that these same parameters induce a contraction on the policy-evaluation error whose spectral radius is bounded above by a quantity strictly less than 1 (explicitly depending on the discount factor and the minimal eigenvalue of the kernel Gram matrix). Section 5 separately establishes that the identical parameter values are global minimizers of the pretraining loss. While the abstract summarizes these results at a high level, we agree that the linkage between the contraction and the minimizing parameters could be stated more directly. We will revise the abstract to note that the contraction holds for the explicitly constructed parameters that also minimize the loss.
    Revision: yes

  2. Referee: [Abstract] Abstract: the equivalence between the softmax attention forward pass and one step of weighted softmax TD is asserted for 'certain parameters,' but the manuscript must explicitly construct or derive these parameters to demonstrate they simultaneously achieve the per-layer match, satisfy the contraction, and are global minimizers; without this, the central claim remains unverified.

    Authors: The manuscript already supplies the explicit construction and simultaneous verification in the main body: Theorem 3.1 derives the precise query, key, and value matrices realizing the equivalence; the contraction proof in Section 4 applies directly to those matrices; and the global-minimizer result in Section 5 uses the same matrices. The abstract employs the phrase 'certain parameters' purely as a concise summary. To address the referee's concern, we will update the abstract to replace 'certain parameters' with language that references the explicit construction and indicates that the same parameters satisfy the equivalence, contraction, and minimization properties simultaneously.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; equivalence, contraction, and minimizer results are independent derivations

full rationale

The paper derives an equivalence between the layerwise softmax Transformer forward pass and iterative weighted softmax TD updates for specific parameters, proves policy evaluation error decay under a stated contraction condition on the induced operator, and separately proves that the identified parameters globally minimize a pretraining loss. These steps are presented as mathematical results rather than reductions to self-definitions or fitted quantities renamed as predictions. The contraction condition is invoked as an assumption to guarantee decay but does not make the equivalence or minimizer claims circular by construction. Numerical experiments are cited only to illustrate emergence of the parameters, not to define them. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled from prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Central claim depends on unspecified parameters, a contraction condition, and a newly introduced algorithm whose details are absent from the abstract.

free parameters (1)
  • certain parameters
    Parameters that make the layerwise Transformer forward pass equivalent to weighted softmax TD updates and that globally minimize the pretraining loss.
axioms (1)
  • domain assumption: contraction condition
    Condition under which policy evaluation error decays as the number of layers increases.
invented entities (1)
  • weighted softmax TD learning algorithm (no independent evidence)
    purpose: RL algorithm performing policy evaluation in kernel space that generalizes linear TD and tabular TD.
    New algorithm introduced to characterize the behavior of softmax attention in the ICRL setting.

pith-pipeline@v0.9.0 · 5725 in / 1320 out tokens · 41138 ms · 2026-05-20T23:21:50.884774+00:00 · methodology


