arXiv preprint arXiv:2307.11046
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
citing papers explorer
- Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
  Softmax Transformers implement in-context RL through equivalence to weighted softmax TD updates, with error decay under contraction and parameters as global minimizers of pretraining loss.
- Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought
  With specific linear Transformer parameters, CoT generation equals iterative TD updates, yielding geometric error decay with CoT length until a context-length statistical floor, and those parameters globally minimize the pretraining loss.
- Optimal control of the future via prospective learning with control
  Prospective Learning with Control proves ERM asymptotically achieves the Bayes optimal policy in non-stationary reset-free settings and outperforms time-aware RL on a 1D foraging benchmark.
- LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems
  LIFE is a proposed agentic framework that combines four components to enable incremental, flexible, and energy-efficient continual learning for HPC operations such as latency spike mitigation.