Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning

Dominik Żurek; Kamil Faber; Marcin Pietroń; Paweł Gajewski

arxiv: 2410.06347 · v2 · submitted 2024-10-08 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-23 19:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords: goal-conditioned reinforcement learning, offline RL, decision transformer, multi-goal robotics, Franka Panda, sparse rewards, sequence modeling

The pith

A goal-conditioned Decision Transformer learns multi-goal robotics policies from offline data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the Decision Transformer to handle goal-conditioned policies in an offline reinforcement learning setting for robotics. It adds goal states directly into the input sequence so the model can address many different tasks from the same pre-collected dataset without further environment interaction. The method is tested on a newly released offline dataset collected on the Franka Emika Panda arm and is shown to exceed online reinforcement learning baselines on harder tasks while remaining stable under sparse rewards and small amounts of expert data. A reader would care because the approach removes the usual requirement for costly online trial-and-error when a robot must achieve multiple goals.

Core claim

By explicitly incorporating goal states into the sequence modeling framework, the Goal-Conditioned Decision Transformer solves varying tasks using only pre-collected data. On the Franka Emika Panda offline dataset the approach outperforms state-of-the-art online baselines in complex tasks and maintains robustness in sparse-reward settings even with limited expert demonstrations.

What carries the argument

The Goal-Conditioned Decision Transformer, which inserts goal states into the transformer sequence to produce actions conditioned on different target goals.

If this is right

Multi-goal robotic policies can be obtained without any online environment interaction after the initial data collection.
The same trained model handles both dense and sparse reward conditions across multiple target goals.
Performance remains high even when the offline dataset contains only limited expert trajectories.
The method extends standard Decision Transformer training by adding an explicit goal token to the sequence.
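
The goal-token idea in the last point can be made concrete with a small sketch. The summary here does not specify the paper's exact token layout, so the ordering used below (goal, return-to-go, state, action at every timestep), the function names `returns_to_go` and `build_sequence`, and the `context_len` parameter are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# A token is a (kind, values) pair; a real model would embed each kind
# with its own linear layer before feeding the transformer.
Token = Tuple[str, Tuple[float, ...]]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def build_sequence(
    goal: Sequence[float],
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    rewards: Sequence[float],
    context_len: int,
) -> List[Token]:
    """Interleave goal, return-to-go, state and action tokens.

    Only the last `context_len` timesteps are kept. Repeating the goal at
    every timestep is one possible layout (an assumption of this sketch).
    """
    if not (len(states) == len(actions) == len(rewards)):
        raise ValueError("states, actions and rewards must have equal length")
    if context_len < 1:
        raise ValueError("context_len must be at least 1")

    rtg = returns_to_go(rewards)
    start = max(0, len(states) - context_len)
    tokens: List[Token] = []
    for t in range(start, len(states)):
        tokens.append(("goal", tuple(goal)))
        tokens.append(("rtg", (rtg[t],)))
        tokens.append(("state", tuple(states[t])))
        tokens.append(("action", tuple(actions[t])))
    return tokens
```

Under an assumed sparse-reward convention of -1 per unsuccessful step and 0 on success, `returns_to_go([-1.0, -1.0, 0.0])` yields `[-2.0, -1.0, 0.0]`, and a three-step trajectory with `context_len=2` produces eight tokens, four per retained timestep.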

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar goal-conditioning could be added to other sequence-based offline RL algorithms beyond the Decision Transformer.
If the dataset coverage assumption holds, the approach may transfer to additional robot platforms once comparable offline multi-goal datasets are collected.
The offline multi-goal setting reduces the usual sample-efficiency barrier that prevents transformer models from being used directly on physical robots.

Load-bearing premise

The new offline dataset for the Franka Emika Panda contains enough coverage of different goals and task distributions for the learned policy to generalize.

What would settle it

Evaluating the trained policy on a set of goals that are sparsely or never visited in the offline dataset and observing consistent failure to reach them would falsify the generalization claim.
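
The test described above needs a rule for deciding which evaluation goals count as sparsely or never visited. A minimal sketch, assuming goals are Cartesian points and that "visited" means lying within a hypothetical `radius` of some goal in the offline dataset. The helper names and the threshold are illustrative assumptions, not part of the paper.

```python
import math
from typing import List, Sequence, Tuple

Goal = Tuple[float, ...]


def split_goals_by_coverage(
    dataset_goals: Sequence[Goal],
    eval_goals: Sequence[Goal],
    radius: float,
) -> Tuple[List[Goal], List[Goal]]:
    """Partition evaluation goals into (covered, held_out).

    A goal is 'covered' if its nearest dataset goal is within `radius`
    (Euclidean distance); otherwise it is treated as held out.
    """
    if not dataset_goals:
        raise ValueError("dataset_goals must not be empty")
    covered: List[Goal] = []
    held_out: List[Goal] = []
    for g in eval_goals:
        nearest = min(math.dist(g, d) for d in dataset_goals)
        if nearest <= radius:
            covered.append(g)
        else:
            held_out.append(g)
    return covered, held_out


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes; NaN when no episodes were run."""
    if not outcomes:
        return float("nan")
    return sum(outcomes) / len(outcomes)
```

Comparing `success_rate` on the covered and held-out groups would show whether performance depends on dataset coverage. A large gap between the two would support the concern, while similar rates would count against it.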

Figures

Figures reproduced from arXiv: 2410.06347 by Dominik Żurek, Kamil Faber, Marcin Pietroń, Paweł Gajewski.

**Figure 3.** A plot of the influence of the data size vs. success rate.

**Figure 4.** A plot of the influence of the expert percentages vs. success rate.

Original abstract

Reinforcement learning (RL) in robotics faces significant hurdles regarding sample efficiency and generalization across varying goals. While Offline RL mitigates the need for costly online interactions, its integration with goal-conditioned policies and transformer-based architectures remains underexplored. We introduce a Goal-Conditioned Decision Transformer adapted for offline multi-goal robotics. By explicitly incorporating goal states into the sequence modeling framework, our approach efficiently solves varying tasks using only pre-collected data. We validate this method on a newly released offline dataset for the Franka Emika Panda platform. Experimental results demonstrate that our approach outperforms state-of-the-art online baselines in complex tasks and maintains robustness in sparse-reward settings, even with limited expert demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A straightforward goal-conditioned Decision Transformer for offline multi-goal robot RL on a new Panda dataset, but the outperformance claims lack any numbers or dataset details to back them up.

The paper takes the Decision Transformer and adds explicit goal conditioning so it can handle multiple goals from offline data only. They release a new dataset collected on the Franka Emika Panda and test the method there, claiming it beats online baselines on complex tasks and stays robust even with sparse rewards and few expert trajectories. Releasing the dataset is the most concrete contribution if the trajectories actually span the goal space they claim. The framing around sample efficiency and generalization in robotics is reasonable and matches known pain points. The method itself is a direct extension rather than a deep architectural change, which keeps it simple to implement.

The main weakness is that nothing in the abstract or stress-test note shows quantitative results, baseline comparisons, ablations, or basic dataset statistics such as number of distinct goals, trajectories per goal, or state coverage. Without those, the robustness and outperformance statements cannot be evaluated and could easily be driven by dataset properties rather than the goal conditioning. The coverage concern in the stress-test note lands directly because offline goal-conditioned methods stand or fall on whether the data actually visits the relevant goal regions.

This is the kind of work that might interest people already running Decision Transformer variants on robot arms and looking for a multi-goal offline variant plus a new Panda dataset. It is not broad enough or grounded enough for a general reading group. I would send it to peer review because the idea is clear, the dataset release has potential value, and the experiments can be fixed with proper reporting, but it would need heavy revision on the results section.

Referee Report

2 major / 0 minor

Summary. The paper introduces a Goal-Conditioned Decision Transformer (GCDT) that extends the Decision Transformer architecture by explicitly conditioning on goal states within the sequence modeling framework for offline multi-goal reinforcement learning. It claims to solve varying robotic tasks using only pre-collected data, with validation on a newly released offline dataset for the Franka Emika Panda platform, reporting outperformance over state-of-the-art online baselines in complex tasks and robustness in sparse-reward settings even with limited expert demonstrations.

Significance. If the empirical claims hold with adequate controls, the work would provide a practical demonstration of combining transformer-based sequence modeling with goal conditioning in offline RL for robotics, potentially improving generalization across goals without online interaction. The release of a new Panda dataset could be a useful contribution if it demonstrates sufficient diversity.

major comments (2)

[Dataset Description / Experiments] Dataset section: The central claim of robustness in sparse-reward multi-goal settings with limited expert demonstrations requires that the newly released Franka Emika Panda offline dataset supplies adequate goal variation and task distribution coverage. No quantitative metrics (e.g., number of unique goals, trajectories per goal, or state-space coverage statistics) are referenced to establish this precondition for generalization.
[Experiments] Experimental results: The abstract and results claim outperformance over online baselines and robustness, yet the provided description supplies no quantitative metrics, baseline implementation details, number of evaluation runs, statistical tests, or ablation studies on the goal-conditioning component, making it impossible to verify whether the data supports the stated claims.
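
The coverage statistics requested in the first comment are cheap to compute once each trajectory's goal is known. The sketch below is one hedged way to do it: goals are bucketed into grid cells of a hypothetical size `cell`, and the function name, the cell size, and the choice of population variance as a spread measure are all assumptions made for illustration rather than anything reported by the paper.

```python
import math
from collections import Counter
from statistics import pvariance
from typing import Dict, List, Sequence, Tuple, Union

Goal = Tuple[float, ...]


def goal_coverage_stats(
    trajectory_goals: Sequence[Goal],
    cell: float = 0.05,
) -> Dict[str, Union[int, List[float]]]:
    """Summarize how trajectory goals are spread over the goal space.

    `trajectory_goals` holds one goal per trajectory. Goals falling in the
    same grid cell of side `cell` are counted as the same distinct goal.
    """
    if not trajectory_goals:
        raise ValueError("trajectory_goals must not be empty")
    if cell <= 0:
        raise ValueError("cell must be positive")

    # Map each goal to the integer index of its grid cell and count them.
    cells = Counter(
        tuple(math.floor(x / cell) for x in g) for g in trajectory_goals
    )
    dims = len(trajectory_goals[0])
    return {
        "num_trajectories": len(trajectory_goals),
        "num_goal_cells": len(cells),
        "min_traj_per_cell": min(cells.values()),
        "max_traj_per_cell": max(cells.values()),
        # Spread of goals along each axis (population variance).
        "per_dim_variance": [
            pvariance([g[d] for g in trajectory_goals]) for d in range(dims)
        ],
    }
```

A small `num_goal_cells` relative to the reachable workspace, or a large gap between `min_traj_per_cell` and `max_traj_per_cell`, would indicate the uneven coverage the referee is worried about.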

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the presentation of our dataset and experimental results. We address each major comment below and will revise the manuscript to include the requested details.

Referee: [Dataset Description / Experiments] Dataset section: The central claim of robustness in sparse-reward multi-goal settings with limited expert demonstrations requires that the newly released Franka Emika Panda offline dataset supplies adequate goal variation and task distribution coverage. No quantitative metrics (e.g., number of unique goals, trajectories per goal, or state-space coverage statistics) are referenced to establish this precondition for generalization.

Authors: We agree that quantitative metrics are necessary to substantiate claims about dataset diversity and coverage. In the revised manuscript, we will expand the dataset section with a table and accompanying text reporting the number of unique goals, trajectories per goal, total trajectories, and state-space coverage statistics (e.g., via histograms or variance measures across dimensions). This will directly support the robustness claims in sparse-reward multi-goal settings. revision: yes

Referee: [Experiments] Experimental results: The abstract and results claim outperformance over online baselines and robustness, yet the provided description supplies no quantitative metrics, baseline implementation details, number of evaluation runs, statistical tests, or ablation studies on the goal-conditioning component, making it impossible to verify whether the data supports the stated claims.

Authors: We acknowledge the absence of these details in the current version. The revised experiments section will include: concrete performance metrics with numerical values and comparisons, full baseline implementation details (hyperparameters, code references if available), number of evaluation runs with seeds, statistical significance tests (e.g., mean ± std over runs), and an ablation study on the goal-conditioning component. These additions will enable verification of the outperformance and robustness claims. revision: yes
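
The per-seed reporting promised in the second response could look like the sketch below. The function names and output format are hypothetical. Note that mean ± standard deviation is a descriptive summary of run-to-run variation, so a separate significance test would still be needed to compare methods.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(success_rates: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent seeds."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return mean(success_rates), stdev(success_rates)


def format_result(name: str, success_rates: Sequence[float]) -> str:
    """Render one results-table row, including the number of runs."""
    m, s = summarize_runs(success_rates)
    return f"{name}: {m:.3f} ± {s:.3f} (n={len(success_rates)})"
```

Reporting the number of seeds alongside each figure lets a reader judge how much weight a given gap between methods can bear.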

Circularity Check

0 steps flagged

No derivation chain present; empirical method validated on external dataset

The paper introduces a Goal-Conditioned Decision Transformer as an empirical architecture for offline multi-goal RL and reports experimental outperformance on a newly released Franka Emika Panda dataset. No equations, derivations, or parameter-fitting steps are described in the abstract or framing that could reduce to self-definition or fitted-input predictions. The central claims rest on external dataset properties and baseline comparisons rather than any internal reduction by construction. This matches the default expectation of no significant circularity for an empirical robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all modeling choices remain implicit.

[1] [1]

Continuous improvement of self-driving cars using dynamic confidence-aware reinforcement learning,

Z. W. Cao Z, Jiang K, “Continuous improvement of self-driving cars using dynamic confidence-aware reinforcement learning,” Nature Machine Intelligence, 2023

work page 2023

[2] [2]

Efficient reinforcement learning for autonomous driving with parameterized skills and priors,

L. Wang, J. Liu, H. Shao, W. Wang, R. Chen, Y . Liu, and S. L. Waslander, “Efficient reinforcement learning for autonomous driving with parameterized skills and priors,” 2023

work page 2023

[3] [3]

Accelerating reinforcement learning for autonomous driving using task-agnostic and ego-centric motion skills,

T. Zhou, L. Wang, R. Chen, W. Wang, and Y . Liu, “Accelerating reinforcement learning for autonomous driving using task-agnostic and ego-centric motion skills,” arXiv preprint arXiv:2209.12072 , 2022

work page arXiv 2022

[4] [4]

A review paper on implementing reinforcement learning technique in optimising games performance,

M. A. Samsuden, N. M. Diah, and N. A. Rahman, “A review paper on implementing reinforcement learning technique in optimising games performance,” in 2019 IEEE 9th International Conference on System Engineering and Technology (ICSET) , 2019, pp. 258–263

work page 2019

[5] [5]

Modeling decisions in games using reinforcement learning,

H. Singal, P. Aggarwal, and V . Dutt, “Modeling decisions in games using reinforcement learning,” in 2017 International Conference on Machine Learning and Data Science (MLDS) , 2017, pp. 98–105

work page 2017

[6] [6]

A unified game-theoretic approach to multiagent reinforcement learning,

M. Lanctot, V . Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. P ´erolat, D. Silver, and T. Graepel, “A unified game-theoretic approach to multiagent reinforcement learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 4193–4206

work page 2017

[7] [7]

Accelerating robotic rein- forcement learning via parameterized action primitives,

M. Dalal, D. Pathak, and R. Salakhutdinov, “Accelerating robotic rein- forcement learning via parameterized action primitives,” in NeurIPS, 2021

work page 2021

[8] [8]

Review of deep reinforcement learning for robot manipulation,

H. Nguyen and H. La, “Review of deep reinforcement learning for robot manipulation,” in 2019 Third IEEE International Conference on Robotic Computing (IRC) , 2019, pp. 590–595

work page 2019

[9] [9]

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduc- tion. Cambridge, MA, USA: A Bradford Book, 2018

work page 2018

[10] [10]

OpenAI Gym

G. Brockman, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

panda-gym: Open-Source Goal-Conditioned Environments for Robotic Learning,

Q. Gallou ´edec, N. Cazin, E. Dellandr ´ea, and L. Chen, “panda-gym: Open-Source Goal-Conditioned Environments for Robotic Learning,” 4th Robot Learning Workshop: Self-Supervised and Lifelong Learning at NeurIPS, 2021

work page 2021

[12] [12]

Sim-to- real transfer of robotic control with dynamics randomization,

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to- real transfer of robotic control with dynamics randomization,” in 2018 IEEE international conference on robotics and automation (ICRA) . IEEE, 2018, pp. 3803–3810

work page 2018

[13] [13]

Imitation learning: A survey of learning methods,

A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Comput. Surv. , vol. 50, no. 2, apr 2017. [Online]. Available: https://doi.org/10.1145/3054912

work page doi:10.1145/3054912 2017

[14] [14]

An optimistic perspective on offline reinforcement learning,

R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in Proceedings of the 37th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol

work page

[15] [15]

PMLR, 13–18 Jul 2020, pp. 104–114. [Online]. Available: https://proceedings.mlr.press/v119/agarwal20c.html

work page 2020

[16] [16]

Latte: Language trajectory transformer,

A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, S. Vemprala, and R. Bonatti, “Latte: Language trajectory transformer,” in2023 IEEE International Conference on Robotics and Automation (ICRA) , 2023, pp. 7287–7294

work page 2023

[17] [17]

Perceiver-actor: A multi- task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi- task transformer for robotic manipulation,” in Proceedings of the 6th Conference on Robot Learning (CoRL) , 2022

work page 2022

[18] [18]

On transforming re- inforcement learning with transformers: The development trajectory,

S. Hu, L. Shen, Y . Zhang, Y . Chen, and D. Tao, “On transforming re- inforcement learning with transformers: The development trajectory,” IEEE Transactions on Pattern Analysis and Machine Intelligence , pp. 1–20, 2024

work page 2024

[19] [19]

A survey on transformers in reinforcement learning,

W. Li, H. Luo, Z. Lin, C. Zhang, Z. Lu, and D. Ye, “A survey on transformers in reinforcement learning,” Transactions on Machine Learning Research , 2023, survey Certification. [Online]. Available: https://openreview.net/forum?id=r30yuDPvf2

work page 2023

[20] [20]

Decision transformer: Reinforcement learning via sequence modeling,

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” 2022

work page 2022

[21] [21]

Transformers are adaptable task planners,

V . Jain, Y . Lin, E. Undersander, Y . Bisk, and A. Rai, “Transformers are adaptable task planners,” in Proceedings of The 6th Conference on Robot Learning , ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec 2023, pp. 1011–1037. [Online]. Available: https: //proceedings.mlr.press/v205/jain23a.html

work page 2023

[22] [22]

Prompting decision transformer for few-shot policy generalization,

M. Xu, Y. Shen, S. Zhang, Y. Lu, D. Zhao, J. Tenenbaum, and C. Gan, “Prompting decision transformer for few-shot policy generalization,” in International Conference on Machine Learning. PMLR, 2022, pp. 24631–24645

work page 2022

[23] [23]

Multi-game decision transformers,

K.-H. Lee, O. Nachum, S. Yang, L. Lee, C. D. Freeman, S. Guadarrama, I. Fischer, W. Xu, E. Jang, H. Michalewski, and I. Mordatch, “Multi-game decision transformers,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [Online]. Available: https://openreview.net/forum?id=0gouO5saq6K

work page 2022

[24] [24]

Reinforcement learning: An introduction,

R. Sutton and A. Barto, “Reinforcement learning: An introduction,” 1979

work page 1979

[25] [25]

Self-learned intelligence for integrated decision and control of automated vehicles at signalized intersections,

Y. Ren, J. Jiang, G. Zhan, S. E. Li, C. Chen, K. Li, and J. Duan, “Self-learned intelligence for integrated decision and control of automated vehicles at signalized intersections,” IEEE Transactions on Intelligent Transportation Systems, vol. 23 (12), pp. 24145–24156, 2022

work page 2022

[26] [26]

Community energy storage operation via reinforcement learning with eligibility traces,

“Community energy storage operation via reinforcement learning with eligibility traces,” 2022

work page 2022

[27] [27]

Hindsight experience replay,

M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems, vol. 30, 2017

work page 2017

[28] [28]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” ICML, pp. 1856–1865, 2018

work page 2018

[29] [29]

Controlling overestimation bias with truncated mixture of continuous distributional quantile critics,

A. Kuznetsov, P. Shvechikov, A. Grishin, and D. Vetrov, “Controlling overestimation bias with truncated mixture of continuous distributional quantile critics,” ICML’20: Proceedings of the 37th International Conference on Machine Learning, pp. 5556–5566

work page

[30] [30]

Playing atari with deep reinforcement learning,

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” NIPS, 2013

work page 2013

[31] [31]

Reinforcement learning and markov decision processes,

M. van Otterlo and M. Wiering, “Reinforcement learning and markov decision processes,” Reinforcement Learning. Adaptation, Learning, and Optimization, vol. 12, pp. 3–42, 2012

work page 2012

[32] [32]

Deep reinforcement learning with double q-learning,

H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” AAAI, 2016

work page 2016

[33] [33]

Deep reinforcement learning policies learn shared adversarial features across mdps,

E. Korkmaz, “Deep reinforcement learning policies learn shared adversarial features across mdps,” Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), vol. 36 (7), pp. 7229–7238, 2022

work page 2022

[34] [34]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” https://arxiv.org/abs/1707.06347, 2017

work page 2017

[35] [35]

Implementation matters in deep rl: A case study on ppo and trpo,

L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry, “Implementation matters in deep rl: A case study on ppo and trpo,” ICLR, 2019

work page 2019

[36] [36]

Trust Region Policy Optimization

J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, “Trust region policy optimization,” https://arxiv.org/pdf/1502.05477, 2017

work page 2017

[37] [37]

Asynchronous methods for deep reinforcement learning,

V. Mnih, A. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” ICML, pp. 1928–1937, 2016

work page 2016

[38] [38]

Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,

J. Duan, Y. Guan, and S. Li, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33 (11), pp. 6584–6598, 2021

work page 2021

[39] [39]

An introduction to deep reinforcement learning,

V. Francois-Lavet, P. Henderson, R. Islam, M. Bellemare, and J. Pineau, “An introduction to deep reinforcement learning,” Foundations and Trends in Machine Learning, 2018

work page 2018

[40] [40]

Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research

M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” https://arxiv.org/pdf/1802.09464

work page

[41] [41]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision, 2020

work page 2020

[42] [42]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” NAACL-HLT, pp. 4171–4186, 2019

work page 2019

[43] [43]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021

work page 2021

[44] [44]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017

work page 2017

[45] [45]

Rapid task-solving in novel environments,

S. Ritter, R. Faulkner, L. Sartran, A. Santoro, M. Botvinick, and D. Raposo, “Rapid task-solving in novel environments,” arXiv preprint arXiv:2006.03662, 2020

work page arXiv 2020

[46] [46]

Multi-objective decision transformers for offline reinforcement learning,

A. Ghanem, P. Ciblat, and M. Ghogho, “Multi-objective decision transformers for offline reinforcement learning,” 2023

work page 2023

[47] [47]

Rl baselines3 zoo,

A. Raffin, “Rl baselines3 zoo,” https://github.com/DLR-RM/rl-baselines3-zoo, 2020

work page 2020