Decision Transformer: Reinforcement Learning via Sequence Modeling
Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3
The pith
By conditioning a Transformer on a desired return along with past states and actions, Decision Transformer generates future actions that achieve the target return in reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decision Transformer casts the problem of RL as conditional sequence modeling. Given offline trajectories, a causally masked Transformer is trained to autoregressively output actions conditioned on the desired return, past states, and past actions, thereby generating behavior that achieves the specified return without explicit value estimation or policy gradients.
What carries the argument
The Decision Transformer: a causally masked autoregressive Transformer that predicts actions to realize a specified target return when conditioned on returns-to-go, states, and actions from prior timesteps.
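To make the conditioning concrete, the sketch below computes returns-to-go for a single trajectory and interleaves them with states and actions into one token stream. It assumes an undiscounted sum of future rewards and a (return, state, action) ordering per timestep; the helper names `returns_to_go` and `interleave` are illustrative and not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """For each timestep t, sum the rewards from t to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    # Walk backwards so each entry reuses the suffix sum already computed.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[object], actions: Sequence[object]
) -> List[Tuple[str, object]]:
    """Flatten per-timestep triples into one sequence (assumed ordering)."""
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens.append(("return", g))
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


# Toy trajectory with made-up values, for illustration only.
rewards = [0.0, 1.0, 0.0, 2.0]
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
print(rtg)         # [3.0, 3.0, 2.0, 2.0]
print(tokens[:3])  # [('return', 3.0), ('state', 's0'), ('action', 'a0')]
```

During training, the action token at each timestep would be the prediction target, with everything earlier in the stream serving as context.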
If this is right
- The model matches or exceeds the performance of prior model-free offline RL algorithms on Atari games, OpenAI Gym continuous control tasks, and the Key-to-Door environment.
- No online environment interaction or credit assignment is required during training; all learning occurs via standard sequence prediction on fixed datasets.
- Higher target returns can be requested at inference time to elicit stronger performance without retraining the model.
- Long-horizon tasks become more approachable because self-attention can link a reward directly to the distant states and actions that produced it, rather than propagating credit step by step through bootstrapped value estimates.
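Requesting a target return at inference time can be sketched as a rollout loop: start from the requested return and subtract each observed reward, so the model is always conditioned on the return that remains to be earned. Everything in this sketch is a toy assumption: `ToyEnv`, `toy_policy`, and `rollout` are hypothetical names, and `toy_policy` is a hand-written rule standing in for a trained Transformer.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical chain environment: action 1 pays 1.0, action 0 pays 0.0."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        reward = 1.0 if action == 1 else 0.0
        self.t += 1
        return self.t, reward, self.t >= self.horizon


def toy_policy(rtg: List[float], states: List[int], actions: List[int]) -> int:
    """Stand-in for the trained model: collect reward only while some is owed."""
    return 1 if rtg[-1] > 0 else 0


def rollout(
    env: ToyEnv,
    policy: Callable[[List[float], List[int], List[int]], int],
    target_return: float,
    context_len: int = 3,
) -> float:
    """Run one episode, conditioning each step on the remaining target return."""
    state = env.reset()
    rtg: List[float] = [target_return]
    states: List[int] = [state]
    actions: List[int] = []
    total, done = 0.0, False
    while not done:
        # Only the most recent entries are passed, mimicking a context window.
        action = policy(
            rtg[-context_len:], states[-context_len:], actions[-context_len:]
        )
        state, reward, done = env.step(action)
        total += reward
        actions.append(action)
        states.append(state)
        # Decrement the conditioning signal by the reward just received.
        rtg.append(rtg[-1] - reward)
    return total


print(rollout(ToyEnv(horizon=5), toy_policy, target_return=3.0))   # 3.0
print(rollout(ToyEnv(horizon=5), toy_policy, target_return=10.0))  # 5.0
```

The second call shows the flip side: asking for a return the environment cannot deliver does not produce it, so the requested value is a conditioning signal rather than a guarantee.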
Where Pith is reading between the lines
- If sufficiently large and diverse offline datasets become available, the same scaling trends observed in language models could appear in learned control policies.
- The conditioning mechanism could be extended to include goal images or language instructions, allowing the same architecture to handle visual or language-conditioned tasks.
- A natural test would be to measure whether the generated actions discover strategies absent from the training trajectories or merely replay high-return fragments.
- Hybrid use with limited online fine-tuning could address domains where purely offline data leaves gaps in coverage.
Load-bearing premise
The offline trajectory data already contains near-optimal behavior sequences that the model can recover simply by conditioning on a high target return value.
What would settle it
The central claim would be falsified by an experiment showing that, on an environment where offline data contains only suboptimal trajectories, conditioning on the highest possible return still yields actions whose realized cumulative reward falls far short of the target.
Original abstract
We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Decision Transformer, which abstracts reinforcement learning as a conditional sequence modeling problem. An autoregressive Transformer is conditioned on a desired return (reward), past states, and actions to generate future actions that achieve the target return. Unlike value-function or policy-gradient methods, it leverages causal masking and standard Transformer components. Experiments show it matches or exceeds model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Significance. If the results hold, the work demonstrates that sequence-modeling advances can be directly applied to offline RL, yielding a simpler architecture without explicit credit assignment or online exploration. The approach is grounded in reproducible benchmarks and offers a parameter-light way to recover high-return behavior from trajectory data when such behavior is present.
major comments (1)
- [Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.
minor comments (2)
- [Model Architecture] Provide the precise tokenization and embedding scheme for returns, states, and actions in the input sequence (e.g., how continuous returns are discretized or normalized).
- [Experimental Results] Include statistical significance tests or multiple random seeds with error bars for the reported performance comparisons against baselines.
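The multi-seed reporting requested in the second minor comment could be aggregated along the following lines. This is a minimal sketch using only the Python standard library; `summarize` is a hypothetical helper, and the scores are placeholder numbers for illustration, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Placeholder per-seed scores (made up for this example).
placeholder_scores = [70.0, 74.0, 72.0]
mean, sem = summarize(placeholder_scores)
print(f"{mean:.1f} +/- {sem:.2f}")  # 72.0 +/- 1.15
```

With only a handful of seeds the standard error is itself noisy, so a formal significance test would need more runs than this toy example uses.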
Simulated Author's Rebuttal
We thank the referee for their positive summary and recommendation for minor revision. We address the single major comment below by clarifying the scope of our offline RL approach and propose textual revisions to make the data-quality assumption explicit.
Point-by-point responses
- Referee: [Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.
  Authors: We agree that Decision Transformer, like other offline RL methods, relies on the presence of high-return trajectories in the dataset to achieve superior performance when conditioning on high target returns. The approach is explicitly designed to recover the best available behavior from offline data rather than to synthesize optimality from purely suboptimal trajectories. This is consistent with the standard offline RL setting and is already implicit in our evaluations on benchmarks containing expert or mixed data (e.g., Atari expert demonstrations and Gym datasets). We do not claim generality to arbitrary low-quality datasets, as the method cannot exceed the maximum return present in the data. To address the referee's concern, we will revise the introduction, method, and discussion sections to explicitly state this assumption and compare it to related offline RL algorithms such as BCQ and CQL. We believe this textual clarification sufficiently strengthens the paper without necessitating new experiments on artificially degraded datasets.
  Revision: yes
Circularity Check
No significant circularity; model is a data-driven sequence modeling proposal tested on external benchmarks
Full rationale
The paper frames RL as conditional sequence modeling and introduces Decision Transformer as an autoregressive architecture that conditions on target return, states, and actions to output future actions. This is a modeling proposal trained on offline trajectories and evaluated against standard benchmarks (Atari, Gym, Key-to-Door). No equations or claims reduce the output performance to a fitted parameter or self-citation by construction. The central claim depends on the empirical presence of high-return sequences in the data, which is an external assumption tested via benchmarks rather than a definitional tautology. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in a way that collapses the result to prior author work. The derivation chain remains self-contained against external data and evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A causally masked Transformer can model long-range dependencies in state-action-reward sequences sufficiently well to recover near-optimal behavior.
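For readers unfamiliar with causal masking, the sketch below builds the lower-triangular mask that lets position i attend only to positions j <= i, then applies it to one row of attention scores. It is a plain-Python illustration with hypothetical helper names (`causal_mask`, `masked_softmax`), not the paper's implementation.

```python
import math
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores: List[float], mask_row: List[bool]) -> List[float]:
    """Softmax over allowed positions only; masked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Position 1 can see positions 0 and 1, but not the future position 2.
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
print(weights)  # [0.5, 0.5, 0.0]
```

Because every row of the mask includes its own diagonal entry, each position always has at least one token to attend to, so the normalization never divides by zero.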
Forward citations
Cited by 18 Pith papers
- Offline Reinforcement Learning with Implicit Q-Learning
  IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
- ASH: Agents that Self-Hone via Embodied Learning
  ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
- Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
  Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.
- Latent State Design for World Models under Sufficiency Constraints
  World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- Gradient Boosting within a Single Attention Layer
  Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
- Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
  SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
- RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
  RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
  MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
- When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
  Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.
- Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
  ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...
- DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
  DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
  VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
- A Roadmap to Pluralistic Alignment
  The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
  A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models
  Large vision-language models applied to multi-scale remote sensing imagery can generate recommendations on built environment design, constructability, land use, and risks for smart city decision-making.
A Experimental Details
Code for experiments can be found in the supplementary material.
A.1 Atari
We build our Decision Transformer implementation for Atari games off of minGPT (https://github.com/karpathy/minGPT), a publicly available re-i...