Decision Transformer: Reinforcement Learning via Sequence Modeling

Aditya Grover; Aravind Rajeswaran; Aravind Srinivas; Igor Mordatch; Kevin Lu; Kimin Lee; Lili Chen; Michael Laskin; Pieter Abbeel

arxiv: 2106.01345 · v2 · pith:PV7ZDPVRnew · submitted 2021-06-02 · 💻 cs.LG · cs.AI

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch

Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: reinforcement learning, sequence modeling, transformer, offline RL, autoregressive models, decision transformer, return conditioning

The pith

By conditioning a Transformer on a desired return along with past states and actions, Decision Transformer generates future actions intended to achieve that target return.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes reinforcement learning as conditional sequence modeling instead of fitting value functions or optimizing policies with gradients. It trains an autoregressive Transformer to predict the next action given the target return-to-go and the history of states and actions, so that executing the predicted actions is intended to yield the specified cumulative reward. This lets control problems borrow the scaling properties of language-model architectures and learn directly from offline trajectory data. A sympathetic reader cares because the method suggests that standard supervised learning on sequences could replace much of the machinery in offline RL, provided the data contains suitable high-return examples.

Core claim

Decision Transformer casts the problem of RL as conditional sequence modeling. Given offline trajectories, a causally masked Transformer is trained to autoregressively output actions conditioned on the desired return, past states, and past actions, thereby generating behavior that achieves the specified return without explicit value estimation or policy gradients.
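
As a rough illustration of the conditioning signal described above, the sketch below computes returns-to-go (the sum of rewards from each timestep to the end of a trajectory) and interleaves them with states and actions into a single token sequence. The helper names `returns_to_go` and `interleave`, the undiscounted sum, and the toy trajectory are assumptions made for illustration; this is not the authors' code.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return-to-go at step t: sum of rewards from t to the end.

    Undiscounted by assumption; hypothetical helper for illustration.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the future sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: Sequence[float],
               states: Sequence[object],
               actions: Sequence[object]) -> List[Tuple[str, object]]:
    """Lay out one trajectory as (return-to-go, state, action) triples."""
    tokens: List[Tuple[str, object]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", r), ("state", s), ("action", a)])
    return tokens


# Toy trajectory (made-up values).
rewards = [0.0, 1.0, 0.0, 2.0]
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
print(rtg)         # [3.0, 3.0, 2.0, 2.0]
print(tokens[:3])  # [('rtg', 3.0), ('state', 's0'), ('action', 'a0')]
```

During training, the model would be asked to predict each action token from the tokens that precede it; at test time the first return-to-go is replaced by a user-chosen target.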

What carries the argument

The Decision Transformer: a causally masked autoregressive Transformer that predicts actions to realize a specified target return when conditioned on returns-to-go, states, and actions from prior timesteps.
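
To make "causally masked" concrete, here is a minimal sketch of a lower-triangular attention mask over the interleaved tokens, plus a helper that lists which tokens are visible when an action is predicted. The three-tokens-per-timestep layout and the choice to predict the action from the state token's position are assumptions for illustration, and `causal_mask` and `visible_tokens` are hypothetical names.

```python
from typing import List


def causal_mask(num_tokens: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def visible_tokens(timestep: int) -> List[str]:
    """Tokens visible when predicting the action at `timestep` (0-indexed).

    Assumes each timestep contributes (return-to-go, state, action) in that
    order and that the action is predicted from the state token's position.
    """
    names: List[str] = []
    for t in range(timestep + 1):
        names += [f"R{t}", f"s{t}", f"a{t}"]
    state_pos = 3 * timestep + 1  # index of the state token for this timestep
    mask = causal_mask(len(names))
    return [name for name, ok in zip(names, mask[state_pos]) if ok]


print(visible_tokens(1))  # ['R0', 's0', 'a0', 'R1', 's1']
```

The action at a given timestep is never among its own inputs, which is what lets the same sequence serve as both input and prediction target.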

If this is right

The model matches or exceeds the performance of prior model-free offline RL algorithms on Atari games, OpenAI Gym continuous control tasks, and the Key-to-Door environment.
No online environment interaction or explicit credit-assignment machinery (such as bootstrapped value targets) is required during training; all learning occurs via standard sequence prediction on fixed datasets.
Higher target returns can be requested at inference time to elicit stronger performance without retraining the model.
Long-horizon tasks become more approachable because self-attention can relate a reward directly to distant earlier states and actions within the context window, while actions are still generated one step at a time.
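
Requesting a target return at inference time can be sketched as a rollout loop in which the remaining return-to-go is reduced by each observed reward before the next action is predicted. Everything below is a toy under stated assumptions: `rollout`, `toy_step`, and `stub_policy` are hypothetical names, the environment is a made-up counter, and the stub policy stands in for a trained model.

```python
from typing import Callable, List, Tuple

Policy = Callable[[List[float], List[int], List[int]], int]
Step = Callable[[int, int], Tuple[int, float, bool]]


def rollout(policy: Policy, step: Step, initial_state: int,
            target_return: float, max_steps: int) -> Tuple[float, List[float]]:
    """Run one episode, conditioning on a shrinking return-to-go."""
    rtgs, states, actions = [target_return], [initial_state], []
    total = 0.0
    for _ in range(max_steps):
        action = policy(rtgs, states, actions)
        next_state, reward, done = step(states[-1], action)
        actions.append(action)
        total += reward
        if done:
            break
        # Subtract the reward just received from the remaining target.
        rtgs.append(rtgs[-1] - reward)
        states.append(next_state)
    return total, rtgs


def toy_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Made-up environment: action 1 pays 1.0; the episode lasts 5 steps."""
    next_state = state + 1
    return next_state, float(action), next_state >= 5


def stub_policy(rtgs: List[float], states: List[int], actions: List[int]) -> int:
    """Stand-in for a trained model: seek reward while return is still owed."""
    return 1 if rtgs[-1] > 0 else 0


total, rtgs = rollout(stub_policy, toy_step, initial_state=0,
                      target_return=3.0, max_steps=10)
print(total)  # 3.0
print(rtgs)   # [3.0, 2.0, 1.0, 0.0, 0.0]
```

Changing `target_return` changes the behavior without touching the policy, which mirrors the claim that different returns can be requested without retraining.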

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If sufficiently large and diverse offline datasets become available, the same scaling trends observed in language models could appear in learned control policies.
The conditioning mechanism could be extended to include goal images or language instructions, allowing the same architecture to handle visual or language-conditioned tasks.
A natural test would be to measure whether the generated actions discover strategies absent from the training trajectories or merely replay high-return fragments.
Hybrid use with limited online fine-tuning could address domains where purely offline data leaves gaps in coverage.

Load-bearing premise

The offline trajectory data already contains near-optimal behavior sequences that the model can recover simply by conditioning on a high target return value.
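
One hedged way to probe this premise before training is to check how a chosen target return compares with the returns actually present in the dataset. The helper `return_coverage`, its output keys, and the toy reward lists are hypothetical and only illustrate the idea; the paper does not prescribe this check.

```python
from typing import Dict, List, Sequence, Union


def return_coverage(trajectory_rewards: Sequence[Sequence[float]],
                    target_return: float) -> Dict[str, Union[float, bool]]:
    """Summarize how well a dataset supports a requested target return."""
    returns: List[float] = [float(sum(rewards)) for rewards in trajectory_rewards]
    best = max(returns)
    at_or_above = sum(1 for g in returns if g >= target_return)
    return {
        "max_return": best,
        "fraction_at_or_above_target": at_or_above / len(returns),
        "target_within_data": target_return <= best,
    }


# Toy dataset: per-step rewards for four trajectories (made-up values).
dataset = [[1, 0, 1], [0, 0, 0], [2, 1, 1], [1, 1, 0]]

print(return_coverage(dataset, target_return=3.0))
# {'max_return': 4.0, 'fraction_at_or_above_target': 0.25, 'target_within_data': True}
print(return_coverage(dataset, target_return=5.0))
# {'max_return': 4.0, 'fraction_at_or_above_target': 0.0, 'target_within_data': False}
```

A target that lies above every return in the data asks the model to extrapolate, which is exactly the regime the premise does not cover.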

What would settle it

The central claim would be falsified by an experiment showing that, on an environment where offline data contains only suboptimal trajectories, conditioning on the highest possible return still yields actions whose realized cumulative reward falls far short of the target.

original abstract

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Decision Transformer, which abstracts reinforcement learning as a conditional sequence modeling problem. An autoregressive Transformer is conditioned on a desired return (reward), past states, and actions to generate future actions that achieve the target return. Unlike value-function or policy-gradient methods, it leverages causal masking and standard Transformer components. Experiments show it matches or exceeds model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Significance. If the results hold, the work demonstrates that sequence-modeling advances can be directly applied to offline RL, yielding a simpler architecture without explicit credit assignment or online exploration. The approach is grounded in reproducible benchmarks and offers a parameter-light way to recover high-return behavior from trajectory data when such behavior is present.

major comments (1)

[Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.

minor comments (2)

[Model Architecture] Provide the precise tokenization and embedding scheme for returns, states, and actions in the input sequence (e.g., how continuous returns are discretized or normalized).
[Experimental Results] Include statistical significance tests or multiple random seeds with error bars for the reported performance comparisons against baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive summary and recommendation for minor revision. We address the single major comment below by clarifying the scope of our offline RL approach and propose textual revisions to make the data-quality assumption explicit.

point-by-point responses

Referee: [Experiments (Atari/Gym/Key-to-Door sections)] The central claim that conditioning on high target returns recovers superior actions presupposes that the offline dataset contains near-optimal trajectories. While the paper evaluates on standard benchmarks that include expert or mixed data, it does not report results on deliberately suboptimal datasets; this assumption is load-bearing for the generality of the method beyond the tested suites.

Authors: We agree that Decision Transformer, like other offline RL methods, relies on the presence of high-return trajectories in the dataset to achieve superior performance when conditioning on high target returns. The approach is explicitly designed to recover the best available behavior from offline data rather than to synthesize optimality from purely suboptimal trajectories. This is consistent with the standard offline RL setting and is already implicit in our evaluations on benchmarks containing expert or mixed data (e.g., Atari expert demonstrations and Gym datasets). We do not claim generality to arbitrary low-quality datasets, as the method cannot exceed the maximum return present in the data. To address the referee's concern, we will revise the introduction, method, and discussion sections to explicitly state this assumption and compare it to related offline RL algorithms such as BCQ and CQL. We believe this textual clarification sufficiently strengthens the paper without necessitating new experiments on artificially degraded datasets.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; model is a data-driven sequence modeling proposal tested on external benchmarks

full rationale

The paper frames RL as conditional sequence modeling and introduces Decision Transformer as an autoregressive architecture that conditions on target return, states, and actions to output future actions. This is a modeling proposal trained on offline trajectories and evaluated against standard benchmarks (Atari, Gym, Key-to-Door). No equations or claims reduce the output performance to a fitted parameter or self-citation by construction. The central claim depends on the empirical presence of high-return sequences in the data, which is an external assumption tested via benchmarks rather than a definitional tautology. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in a way that collapses the result to prior author work. The derivation chain remains self-contained against external data and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard Transformer causal-masking assumption and the offline RL premise that high-return trajectories exist in the dataset; no new free parameters or invented entities are introduced beyond those inherited from the Transformer and RL literature.

axioms (1)

domain assumption: A causally masked Transformer can model long-range dependencies in state-action-reward sequences sufficiently well to recover near-optimal behavior.
Invoked when the model is trained to predict actions conditioned on past tokens and target return.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Offline Reinforcement Learning with Implicit Q-Learning
cs.LG · 2021-10 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

ASH: Agents that Self-Hone via Embodied Learning
cs.AI · 2026-05 · unverdicted · novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
cs.NI · 2026-05 · conditional · novelty 7.0

Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

Latent State Design for World Models under Sufficiency Constraints
cs.AI · 2026-05 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Gradient Boosting within a Single Attention Layer
cs.LG · 2026-04 · conditional · novelty 7.0

Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
cs.RO · 2023-10 · conditional · novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
cs.AI · 2026-05 · unverdicted · novelty 6.0

RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG · 2026-05 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
cs.AI · 2026-04 · unverdicted · novelty 6.0

Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
cs.LG · 2026-04 · unverdicted · novelty 6.0

ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
cs.LG · 2025-09 · unverdicted · novelty 6.0

DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI · 2025-07 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
cs.RO · 2025-05 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

A Roadmap to Pluralistic Alignment
cs.AI · 2024-02 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
cs.RO · 2021-08 · accept · novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

Galactica: A Large Language Model for Science
cs.CL · 2022-11 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Built Environment Reasoning from Remote Sensing Imagery Using Large Vision--Language Models
cs.CL · 2026-05 · unverdicted · novelty 3.0

Large vision-language models applied to multi-scale remote sensing imagery can generate recommendations on built environment design, constructability, land use, and risks for smart city decision-making.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 18 Pith papers · 21 internal anchors

[1]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa- tion Processing Systems, 2017

work page 2017
[2]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[3]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Stabilizing transformers for reinforcement learning

Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayaku- mar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International Conference on Machine Learning, 2020

work page 2020
[5]

Deep reinforcement learning with relational inductive biases

Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations , 2018

work page 2018
[6]

Reinforcement learning: An introduction

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018

work page 2018
[7]

Optimizing agent behavior over long time scales by transporting value

Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. Nature communications, 10(1):1–12, 2019

work page 2019
[8]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Ofﬂine reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[9]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

work page 2018
[10]

The arcade learning environment: An evaluation platform for general agents

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artiﬁcial Intelligence Research, 47:253–279, 2013

work page 2013
[11]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

Counterfactual credit assignment in model-free reinforcement learning

Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, et al. Counterfactual credit assignment in model-free reinforcement learning. arXiv preprint arXiv:2011.09464, 2020

work page arXiv 2011
[13]

An optimistic perspective on ofﬂine reinforcement learning

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on ofﬂine reinforcement learning. In International Conference on Machine Learning, 2020

work page 2020
[14]

Conservative q-learning for ofﬂine reinforcement learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for ofﬂine reinforcement learning. In Advances in Neural Information Processing Systems, 2020

work page 2020
[15]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990

work page 1990
[17]

When to trust your model: Model-based policy optimization

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems , pages 12498–12509, 2019. 14

work page 2019
[18]

Stabilizing off-policy q-learning via bootstrapping error reduction.CoRR, abs/1906.00949, 2019

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019

work page arXiv 1906
[19]

Behavior Regularized Offline Reinforcement Learning

Yifan Wu, George Tucker, and Oﬁr Nachum. Behavior regularized ofﬂine reinforcement learning. arXiv preprint arXiv:1911.11361, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[20]

Human-level control through deep reinforcement learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015

work page 2015
[21]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[22]

Distributional reinforcement learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Conference on Artiﬁcial Intelligence, 2018

work page 2018
[23]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Oﬁr Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[24]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[25]

Synthetic returns for long-term credit assignment

David Raposo, Sam Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt Botvinick, Hado van Hasselt, and Francis Song. Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425, 2021

work page arXiv 2021
[26]

Reinforcement Learning from Imperfect Demonstrations

Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online rein- forcement learning with ofﬂine datasets. arXiv preprint arXiv:2006.09359, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[28]

arXiv preprint arXiv:1901.10995 , year=

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019

work page arXiv 1901
[29]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019

work page 2019
[30]

Stabilizing off-policy q-learning via bootstrapping error reduction

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019

work page 2019
[31]

Keep doing what worked: Behavioral modelling priors for ofﬂine reinforcement learning

Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for ofﬂine reinforcement learning. In International Conference on Learning Representations, 2020

work page 2020
[32]

Morel: Model-based ofﬂine reinforcement learning

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based ofﬂine reinforcement learning. In Advances in Neural Information Processing Systems, 2020

work page 2020
[33]

Mopo: Model-based ofﬂine policy optimization

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based ofﬂine policy optimization. In Advances in Neural Information Processing Systems, 2020

work page 2020
[34]

Opal: Of- ﬂine primitive discovery for accelerating ofﬂine reinforcement learning

Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Oﬁr Nachum. Opal: Of- ﬂine primitive discovery for accelerating ofﬂine reinforcement learning. arXiv preprint arXiv:2010.13611, 2020

work page arXiv 2010
[35]

Explore, discover and learn: Unsupervised discovery of state-covering skills

Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In International Conference on Machine Learning, 2020

work page 2020
[36]

Accelerating reinforcement learning with learned skill priors

Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill priors. arXiv preprint arXiv:2010.11944, 2020. 15

work page arXiv 2010
[37]

Parrot: Data-driven behavioral priors for reinforcement learning

Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021

work page 2021
[38]

Diversity is all you need: Learning skills without a reward function

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019

work page 2019
[39]

Reset-free lifelong learning with skill-space planning

Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020

work page arXiv 2012
[40]

Dynamics- aware unsupervised discovery of skills

Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics- aware unsupervised discovery of skills. In International Conference on Learning Representa- tions, 2020

work page 2020
[41]

Learning from delayed rewards

Christopher Watkins. Learning from delayed rewards. 01 1989

work page 1989
[42]

Playing Atari with Deep Reinforcement Learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[43]

Training agents using upside-down reinforcement learning

Rupesh Kumar Srivastava, Pranav Shyam, Filipe Mutz, Wojciech Ja ´skowski, and Jürgen Schmidhuber. Training agents using upside-down reinforcement learning. arXiv preprint arXiv:1912.02877, 2019

work page arXiv 1912
[44]

Reward-conditioned policies

Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. arXiv preprint arXiv:1912.13465, 2019

work page arXiv 1912
[45]

Acting without rewards. 2019. URL https://ogma.ai/2019/08/ acting-without-rewards/

work page 2019
[46]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020

work page 2020
[47]

Learning to reach goals without reinforcement learning

Dibya Ghosh, Abhishek Gupta, Justin Fu, Ashwin Reddy, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019

work page arXiv 1912
[48]

Planning from pixels using inverse dynamics models

Keiran Paster, Sheila A McIlraith, and Jimmy Ba. Planning from pixels using inverse dynamics models. arXiv preprint arXiv:2012.02419, 2020

work page arXiv 2012
[49]

Reinforcement learning as one big sequence modeling problem

Michael Janner, Qiyang Li, and Sergey Levine. Reinforcement learning as one big sequence modeling problem. arXiv preprint arXiv:2106.02039, 2021

work page arXiv 2021
[50]

Self-attentional credit assignment for transfer in reinforcement learning

Johan Ferret, Raphaël Marinier, Matthieu Geist, and Olivier Pietquin. Self-attentional credit assignment for transfer in reinforcement learning. arXiv preprint arXiv:1907.08027, 2019

work page arXiv 1907
[51]

Hindsight credit assignment

Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, et al. Hindsight credit assignment. arXiv preprint arXiv:1912.02503, 2019

work page arXiv 1912
[52]

Rudder: Return decomposition for delayed rewards

Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018

work page arXiv 2018
[53]

Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, and Jian Peng. Sequence mod- eling of temporal credit assignment for episodic reinforcement learning. arXiv preprint arXiv:1905.13420, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[54]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition, 2019

work page 2019
[55]

Hafez: an interactive poetry generation system

Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry generation system. In Proceedings of ACL, System Demonstrations, 2017. 16

work page 2017
[56]

Controllable neural text generation

Lilian Weng. Controllable neural text generation. lilianweng.github.io/lil- log, 2021. URL https://lilianweng.github.io/lil-log/2021/01/02/ controllable-neural-text-generation.html

work page 2021
[57]

Controlling Linguistic Style Aspects in Neural Language Generation

Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633, 2017

[58]

Toward controlled generation of text

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In International Conference on Machine Learning, 2017

[59]

Explain Yourself! Leveraging Language Models for Commonsense Reasoning

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019

[60]

Seqgan: Sequence generative adversarial nets with policy gradient

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI conference on artificial intelligence, 2017

[61]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

[62]

CTRL: A Conditional Transformer Language Model for Controllable Generation

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019

[63]

Plug and play language models: A simple approach to controlled text generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019

[64]

Learning to Write with Cooperative Discriminators

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087, 2018

[65]

Gedi: Generative discriminator guided sequence generation

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020

[66]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[67]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020

[68]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

[69]

Rapid task-solving in novel environments

Sam Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matt Botvinick, and David Raposo. Rapid task-solving in novel environments. arXiv preprint arXiv:2006.03662, 2020

[70]

Transformers for one-shot visual imitation

Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. arXiv preprint arXiv:2011.05970, 2020

[71]

Imitating interactive intelligence

Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672, 2020

[72]

Transformers: State-of-the-art natural language processing

Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020

[73]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

A Experimental Details

Code for experiments can be found in the supplementary material.

A.1 Atari

We build our Decision Transformer implementation for Atari games off of minGPT (https://github.com/karpathy/minGPT), a publicly available re-i...

