
arxiv: 2106.01345 · v2 · pith:PV7ZDPVRnew · submitted 2021-06-02 · 💻 cs.LG · cs.AI

Decision Transformer: Reinforcement Learning via Sequence Modeling

Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · sequence modeling · transformer · offline RL · autoregressive models · decision transformer · return conditioning

The pith

By conditioning a Transformer on a desired return along with past states and actions, Decision Transformer generates future actions intended to achieve that return in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes reinforcement learning as conditional sequence modeling instead of fitting value functions or optimizing policies with gradients. It trains an autoregressive Transformer to predict the next action given a target return-to-go, historical states, and actions, so that executing the sequence yields the specified cumulative reward. This draws on the scaling properties of models like those used in language tasks to handle control problems directly from offline trajectory data. A sympathetic reader cares because the method suggests that standard supervised learning on sequences could replace much of the machinery in offline RL, provided the data contains suitable high-return examples.

Core claim

Decision Transformer casts the problem of RL as conditional sequence modeling. Given offline trajectories, a causally masked Transformer is trained to autoregressively output actions conditioned on the desired return, past states, and past actions, thereby generating behavior that achieves the specified return without explicit value estimation or policy gradients.

What carries the argument

The Decision Transformer: a causally masked autoregressive Transformer that predicts actions to realize a specified target return when conditioned on returns-to-go, states, and actions from prior timesteps.
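As a minimal sketch of the conditioning input described above, the snippet below computes undiscounted returns-to-go from a reward sequence and interleaves them with states and actions into one token stream. The helper names (`returns_to_go`, `interleave`) and the toy trajectory are illustrative assumptions, not code or data from the paper.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[object], actions: Sequence[object]
) -> List[Tuple[str, object]]:
    """Order tokens per timestep as (return-to-go, state, action)."""
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens


# Hypothetical three-step trajectory, used only for illustration.
rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
```

In a real model each token would be embedded before entering the Transformer; the sketch only shows the ordering and the return-to-go bookkeeping.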

If this is right

  • The model matches or exceeds the performance of prior model-free offline RL algorithms on Atari games, OpenAI Gym continuous control tasks, and the Key-to-Door environment.
  • No online environment interaction or bootstrapped credit assignment is required during training; all learning occurs via standard sequence prediction on fixed datasets.
  • Higher target returns can be requested at inference time to elicit stronger performance without retraining the model.
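  • Long-horizon tasks become more approachable because self-attention can link an action directly to a distant reward, rather than propagating credit step by step through bootstrapped value estimates; actions are still generated one timestep at a time.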
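The inference-time use of a target return can be sketched as a rollout loop: start from the requested return, act, then subtract each observed reward from the remaining return-to-go. Everything below is a toy stand-in under stated assumptions: `toy_policy` replaces a trained model, and `toy_step` is a hypothetical four-step environment whose reward equals the action.

```python
from typing import Callable, List, Tuple

Policy = Callable[[List[float], List[int], List[int]], int]
Step = Callable[[int, int], Tuple[int, float, bool]]


def rollout(
    policy: Policy, step: Step, initial_state: int,
    target_return: float, max_steps: int,
) -> Tuple[float, List[float]]:
    """Run one episode, decrementing the return-to-go by each reward."""
    rtg, states, actions = [target_return], [initial_state], []
    total = 0.0
    for _ in range(max_steps):
        action = policy(rtg, states, actions)
        next_state, reward, done = step(states[-1], action)
        actions.append(action)
        total += reward
        if done:
            break
        rtg.append(rtg[-1] - reward)  # remaining return still to collect
        states.append(next_state)
    return total, rtg


def toy_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical environment: reward equals the action, episode ends after 4 steps."""
    next_state = state + 1
    return next_state, float(action), next_state >= 4


def toy_policy(rtg: List[float], states: List[int], actions: List[int]) -> int:
    """Stand-in for a trained model: collect reward while some return is still owed."""
    return 1 if rtg[-1] > 0 else 0


modest_total, modest_rtg = rollout(toy_policy, toy_step, 0, 2.0, 10)
greedy_total, _ = rollout(toy_policy, toy_step, 0, 10.0, 10)
```

In this toy setting a target of 2.0 is met exactly, while a target of 10.0 yields only 4.0, the most the environment allows. Asking for more return changes behavior only as far as the environment and the learned model permit.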

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If sufficiently large and diverse offline datasets become available, the same scaling trends observed in language models could appear in learned control policies.
  • The conditioning mechanism could be extended to include goal images or language instructions, allowing the same architecture to handle visual or language-conditioned tasks.
  • A natural test would be to measure whether the generated actions discover strategies absent from the training trajectories or merely replay high-return fragments.
  • Hybrid use with limited online fine-tuning could address domains where purely offline data leaves gaps in coverage.

Load-bearing premise

The offline trajectory data already contains near-optimal behavior sequences that the model can recover simply by conditioning on a high target return value.

What would settle it

The central claim would be falsified by an experiment showing that, on an environment where offline data contains only suboptimal trajectories, conditioning on the highest possible return still yields actions whose realized cumulative reward falls far short of the target.

Original abstract

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Decision Transformer, which abstracts reinforcement learning as a conditional sequence modeling problem. An autoregressive Transformer is conditioned on a desired return (reward), past states, and actions to generate future actions that achieve the target return. Unlike value-function or policy-gradient methods, it leverages causal masking and standard Transformer components. Experiments show it matches or exceeds model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Significance. If the results hold, the work demonstrates that sequence-modeling advances can be directly applied to offline RL, yielding a simpler architecture without explicit credit assignment or online exploration. The approach is grounded in reproducible benchmarks and offers a parameter-light way to recover high-return behavior from trajectory data when such behavior is present.

major comments (1)
  1. [Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.
minor comments (2)
  1. [Model Architecture] Provide the precise tokenization and embedding scheme for returns, states, and actions in the input sequence (e.g., how continuous returns are discretized or normalized).
  2. [Experimental Results] Include statistical significance tests or multiple random seeds with error bars for the reported performance comparisons against baselines.
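The reporting requested in the second minor comment could look like the sketch below, which summarizes per-seed scores as a mean with a standard error. The function name `summarize` and the scores are placeholders chosen for illustration; they are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) across random seeds.

    Uses the sample standard deviation, so at least two seeds are required.
    """
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Placeholder per-seed scores, NOT taken from the paper.
placeholder_scores = [10.0, 12.0, 14.0]
avg, stderr = summarize(placeholder_scores)
print(f"{avg:.1f} +/- {stderr:.2f} over {len(placeholder_scores)} seeds")
```

With only a handful of seeds the standard error is itself noisy, so reporting the number of seeds alongside the interval matters.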

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive summary and recommendation for minor revision. We address the single major comment below by clarifying the scope of our offline RL approach and propose textual revisions to make the data-quality assumption explicit.

point-by-point responses
  1. Referee: [Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.

    Authors: We agree that Decision Transformer, like other offline RL methods, relies on the presence of high-return trajectories in the dataset to achieve superior performance when conditioning on high target returns. The approach is explicitly designed to recover the best available behavior from offline data rather than to synthesize optimality from purely suboptimal trajectories. This is consistent with the standard offline RL setting and is already implicit in our evaluations on benchmarks containing expert or mixed data (e.g., Atari expert demonstrations and Gym datasets). We do not claim generality to arbitrary low-quality datasets, as the method is not guaranteed to exceed the maximum return present in the data. To address the referee's concern, we will revise the introduction, method, and discussion sections to state this assumption explicitly and compare it to related offline RL algorithms such as BCQ and CQL. We believe this textual clarification sufficiently strengthens the paper without necessitating new experiments on artificially degraded datasets.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; model is a data-driven sequence modeling proposal tested on external benchmarks

full rationale

The paper frames RL as conditional sequence modeling and introduces Decision Transformer as an autoregressive architecture that conditions on target return, states, and actions to output future actions. This is a modeling proposal trained on offline trajectories and evaluated against standard benchmarks (Atari, Gym, Key-to-Door). No equations or claims reduce the output performance to a fitted parameter or self-citation by construction. The central claim depends on the empirical presence of high-return sequences in the data, which is an external assumption tested via benchmarks rather than a definitional tautology. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in a way that collapses the result to prior author work. The derivation chain remains self-contained against external data and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard Transformer causal-masking assumption and the offline RL premise that high-return trajectories exist in the dataset; no new free parameters or invented entities are introduced beyond those inherited from the Transformer and RL literature.

axioms (1)
  • domain assumption: A causally masked Transformer can model long-range dependencies in state-action-reward sequences sufficiently well to recover near-optimal behavior.
    Invoked when the model is trained to predict actions conditioned on past tokens and target return.
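The causal-masking assumption can be illustrated with a small, dependency-free sketch: each position's attention weights are a softmax over itself and earlier positions only, so later tokens receive zero weight. The function name and the uniform scores are illustrative assumptions; real implementations work on batched tensors.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax restricted to positions j <= i (causal mask)."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]           # token i may not see the future
        peak = max(visible)                    # subtract max for numerical stability
        exps = [math.exp(x - peak) for x in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Uniform raw scores for a three-token sequence, for illustration only.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
```

Because every row only mixes information from the current and earlier tokens, an action prediction at step t cannot depend on states, actions, or rewards that come after it.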

pith-pipeline@v0.9.0 · 5690 in / 1135 out tokens · 34608 ms · 2026-05-18T15:06:03.963840+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Reinforcement Learning with Implicit Q-Learning

    cs.LG 2021-10 unverdicted novelty 8.0

    IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

  2. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  3. Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

    cs.NI 2026-05 conditional novelty 7.0

    Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

  4. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  5. Gradient Boosting within a Single Attention Layer

    cs.LG 2026-04 conditional novelty 7.0

    Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...

  6. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    cs.RO 2023-10 conditional novelty 7.0

    SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

  7. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  8. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  9. When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

    cs.AI 2026-04 unverdicted novelty 6.0

    Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

  10. Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...

  11. DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

    cs.LG 2025-09 unverdicted novelty 6.0

    DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.

  12. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    cs.AI 2025-07 conditional novelty 6.0

    Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

  13. VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    cs.RO 2025-05 conditional novelty 6.0

    VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

  14. A Roadmap to Pluralistic Alignment

    cs.AI 2024-02 unverdicted novelty 6.0

    The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

  15. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  16. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    cs.RO 2021-08 accept novelty 6.0

    A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

  17. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  18. Built Environment Reasoning from Remote Sensing Imagery Using Large Vision--Language Models

    cs.CL 2026-05 unverdicted novelty 3.0

    Large vision-language models applied to multi-scale remote sensing imagery can generate recommendations on built environment design, constructability, land use, and risks for smart city decision-making.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 18 Pith papers · 21 internal anchors

  1. [1]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa- tion Processing Systems, 2017

  2. [2]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  3. [3]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021

  4. [4]

    Stabilizing transformers for reinforcement learning

    Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayaku- mar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International Conference on Machine Learning, 2020

  5. [5]

    Deep reinforcement learning with relational inductive biases

    Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations , 2018

  6. [6]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018

  7. [7]

    Optimizing agent behavior over long time scales by transporting value

    Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. Nature communications, 10(1):1–12, 2019

  8. [8]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

  9. [9]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  10. [10]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  11. [11]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

  12. [12]

    Counterfactual credit assignment in model-free reinforcement learning

    Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, et al. Counterfactual credit assignment in model-free reinforcement learning. arXiv preprint arXiv:2011.09464, 2020

  13. [13]

    An optimistic perspective on offline reinforcement learning

    Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020

  14. [14]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020

  15. [15]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  16. [16]

    Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990

  17. [17]

    When to trust your model: Model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems , pages 12498–12509, 2019. 14

  18. [18]

    Stabilizing off-policy q-learning via bootstrapping error reduction.CoRR, abs/1906.00949, 2019

    Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019

  19. [19]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019

  20. [20]

    Human-level control through deep reinforcement learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015

  21. [21]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  22. [22]

    Distributional reinforcement learning with quantile regression

    Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Conference on Artificial Intelligence, 2018

  23. [23]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  24. [24]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  25. [25]

    Synthetic returns for long-term credit assignment

    David Raposo, Sam Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt Botvinick, Hado van Hasselt, and Francis Song. Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425, 2021

  26. [26]

    Reinforcement Learning from Imperfect Demonstrations

    Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018

  27. [27]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online rein- forcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

  28. [28]

    arXiv preprint arXiv:1901.10995 , year=

    Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019

  29. [29]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019

  30. [30]

    Stabilizing off-policy q-learning via bootstrapping error reduction

    Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019

  31. [31]

    Keep doing what worked: Behavioral modelling priors for offline reinforcement learning

    Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020

  32. [32]

    Morel: Model-based offline reinforcement learning

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020

  33. [33]

    Mopo: Model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020

  34. [34]

    Opal: Of- fline primitive discovery for accelerating offline reinforcement learning

    Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. Opal: Of- fline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611, 2020

  35. [35]

    Explore, discover and learn: Unsupervised discovery of state-covering skills

    Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In International Conference on Machine Learning, 2020

  36. [36]

    Accelerating reinforcement learning with learned skill priors

    Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill priors. arXiv preprint arXiv:2010.11944, 2020. 15

  37. [37]

    Parrot: Data-driven behavioral priors for reinforcement learning

    Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021

  38. [38]

    Diversity is all you need: Learning skills without a reward function

    Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019

  39. [39]

    Reset-free lifelong learning with skill-space planning

    Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020

  40. [40]

    Dynamics- aware unsupervised discovery of skills

    Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics- aware unsupervised discovery of skills. In International Conference on Learning Representa- tions, 2020

  41. [41]

    Learning from delayed rewards

    Christopher Watkins. Learning from delayed rewards. 01 1989

  42. [42]

    Playing Atari with Deep Reinforcement Learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

  43. [43]

    Training agents using upside-down reinforcement learning

    Rupesh Kumar Srivastava, Pranav Shyam, Filipe Mutz, Wojciech Ja ´skowski, and Jürgen Schmidhuber. Training agents using upside-down reinforcement learning. arXiv preprint arXiv:1912.02877, 2019

  44. [44]

    Reward-conditioned policies

    Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. arXiv preprint arXiv:1912.13465, 2019

  45. [45]

    Acting without rewards. 2019. URL https://ogma.ai/2019/08/ acting-without-rewards/

  46. [46]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020

  47. [47]

    Learning to reach goals without reinforcement learning

    Dibya Ghosh, Abhishek Gupta, Justin Fu, Ashwin Reddy, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019

  48. [48]

    Planning from pixels using inverse dynamics models

    Keiran Paster, Sheila A McIlraith, and Jimmy Ba. Planning from pixels using inverse dynamics models. arXiv preprint arXiv:2012.02419, 2020

  49. [49]

    Reinforcement learning as one big sequence modeling problem

    Michael Janner, Qiyang Li, and Sergey Levine. Reinforcement learning as one big sequence modeling problem. arXiv preprint arXiv:2106.02039, 2021

  50. [50]

    Self-attentional credit assignment for transfer in reinforcement learning

    Johan Ferret, Raphaël Marinier, Matthieu Geist, and Olivier Pietquin. Self-attentional credit assignment for transfer in reinforcement learning. arXiv preprint arXiv:1907.08027, 2019

  51. [51]

    Hindsight credit assignment

    Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, et al. Hindsight credit assignment. arXiv preprint arXiv:1912.02503, 2019

  52. [52]

    Rudder: Return decomposition for delayed rewards

    Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018

  53. [53]

    Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

    Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, and Jian Peng. Sequence mod- eling of temporal credit assignment for episodic reinforcement learning. arXiv preprint arXiv:1905.13420, 2019

  54. [54]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition, 2019

  55. [55]

    Hafez: an interactive poetry generation system

    Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry generation system. In Proceedings of ACL, System Demonstrations, 2017. 16

  56. [56]

    Controllable neural text generation

    Lilian Weng. Controllable neural text generation. lilianweng.github.io/lil- log, 2021. URL https://lilianweng.github.io/lil-log/2021/01/02/ controllable-neural-text-generation.html

  57. [57]

    Controlling Linguistic Style Aspects in Neural Language Generation

    Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633, 2017

  58. [58]

    Toward controlled generation of text

    Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In International Conference on Machine Learning, 2017

  59. [59]

    Explain Yourself! Leveraging Language Models for Commonsense Reasoning

    Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019

  60. [60]

    Seqgan: Sequence generative adversarial nets with policy gradient

    Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI conference on artificial intelligence, 2017

  61. [61]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  62. [62]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019

  63. [63]

    Plug and play language models: A simple approach to controlled text generation

    Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019

  64. [64]

    Learning to Write with Cooperative Discriminators

    Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087, 2018

  65. [65]

    Gedi: Generative discriminator guided sequence generation

    Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020

  66. [66]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  67. [67]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020

  68. [68]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  69. [69]

    Rapid task-solving in novel environments

    Sam Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matt Botvinick, and David Raposo. Rapid task-solving in novel environments. arXiv preprint arXiv:2006.03662, 2020

  70. [70]

    Transformers for one-shot visual imitation

    Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. arXiv preprint arXiv:2011.05970, 2020

  71. [71]

    Imitating interactive intelligence

    Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672, 2020

  72. [72]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020

  73. [73]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

A Experimental Details

Code for experiments can be found in the supplementary material.

A.1 Atari

We build our Decision Transformer implementation for Atari games off of minGPT (https://github.com/karpathy/minGPT), a publicly available re-i...