pith. machine review for the scientific record.

arxiv: 2605.00159 · v1 · submitted 2026-04-30 · 💻 cs.RO

Recognition: unknown

E²DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:11 UTC · model grok-4.3

classification 💻 cs.RO
keywords reinforcement learning · decision transformer · robotic manipulation · experience sampling · determinantal point process · sample efficiency · long-horizon tasks · quality-diversity selection

The pith

A Decision Transformer for robotic manipulation learns more efficiently by actively selecting its own training experiences through guided sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents E²DT as a way to fix the limitations of standard Decision Transformers, which rely on uniform replay of experiences and thus suffer from poor sample efficiency and limited exploration in long-horizon robotic tasks. It replaces uniform sampling with a process that selects experiences based on both quality and diversity, using the model's latent embeddings to measure diversity across trajectory segments and a composite score to rank quality. This allows the system to focus on high-return, high-uncertainty, and underrepresented trajectories while preserving variety. A sympathetic reader would care because robotic manipulation often requires many trials to succeed on complex sequences, and better use of collected data could reduce the total number needed without sacrificing final performance.

Core claim

E²DT is a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. The framework is experience-aware, allowing it to be efficient by prioritizing high-return, high-uncertainty, and underrepresented trajectories, and effective by ensuring diversity across trajectory windows to preserve policy optimality. DT's internal latent embeddings measure diversity, while quality is quantified through a composite metric integrating return-to-go quantiles, predictive uncertainty, and inverse-frequency stage coverage; these are combined in a quality-diversity joint kernel that prioritizes the most informative experiences.
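The abstract names the three quality signals and the DPP structure but not the functional form. A plausible reading, assuming the standard quality-diversity decomposition of a DPP kernel (the weights w₁..w₃, the normalisations, and the similarity S below are placeholders, not values from the paper), is:

```latex
q_i = w_1\,\widehat{Q}_{\mathrm{RTG}}(\tau_i) + w_2\,u(\tau_i) + w_3\,f^{-1}(\tau_i),
\qquad
L_{ij} = q_i\, S(z_i, z_j)\, q_j
```

Here τᵢ is a trajectory window, zᵢ its DT latent embedding, u(·) the model's predictive uncertainty, f(·) the visit frequency of the task stage the window covers, and S a similarity kernel over the embeddings; a k-DPP over L then draws the k windows used for the next update.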

What carries the argument

The quality-diversity joint kernel inside a k-Determinantal Point Process, with diversity from DT latent embeddings across trajectory windows and quality from the composite metric of return-to-go quantiles, predictive uncertainty, and inverse-frequency stage coverage.
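To make the selection loop concrete, a minimal Python sketch of such a kernel and a greedy k-DPP selection step follows. The cosine similarity, the exponential quality weighting, the ridge term, and the greedy MAP approximation are assumptions for illustration; the paper's exact construction may differ.

```python
# Illustrative sketch only: quality-diversity kernel plus a greedy k-DPP (MAP)
# selection step. Not the paper's implementation.
import numpy as np

def quality_scores(rtg_quantile, uncertainty, inv_stage_freq, w=(1.0, 1.0, 1.0)):
    """Composite quality from RTG quantile, predictive uncertainty, and
    inverse-frequency stage coverage (all assumed normalised to [0, 1])."""
    return w[0] * rtg_quantile + w[1] * uncertainty + w[2] * inv_stage_freq

def joint_kernel(embeddings, quality):
    """L_ij = q_i * S_ij * q_j, with S a cosine-similarity Gram matrix over the
    DT latent embeddings of trajectory windows; a small ridge keeps L well
    conditioned for the determinant computations below."""
    z = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    S = z @ z.T                        # diversity: similar windows -> high S_ij
    q = np.exp(quality)                # quality as a positive multiplicative weight
    return q[:, None] * S * q[None, :] + 1e-6 * np.eye(len(q))

def greedy_kdpp(L, k):
    """Greedy MAP approximation of k-DPP sampling: at each step add the item
    that maximises log det of the selected principal submatrix, which favours
    high-quality items that are dissimilar to those already chosen."""
    selected = []
    for _ in range(min(k, len(L))):
        best_i, best_val = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected

# Hypothetical usage: select 64 windows from a buffer with latent embeddings Z
# (n x d) and per-window signals rtg_q, unc, inv_freq (each length n).
# batch_idx = greedy_kdpp(joint_kernel(Z, quality_scores(rtg_q, unc, inv_freq)), k=64)
```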

If this is right

  • Prioritizing quality and diversity in replay leads to higher sample efficiency than uniform sampling in long-horizon tasks.
  • Ensuring trajectory diversity through the joint kernel helps maintain policy optimality during learning.
  • The method produces consistent outperformance over prior approaches on both simulated and real-robot manipulation benchmarks.
  • Coupling policy learning directly with experience-aware sampling yields a more robust path for scaling to complex robotic sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling logic could be tested on other sequence models in reinforcement learning to see if gains transfer beyond Decision Transformers.
  • In physical robot settings, focusing on uncertain and diverse experiences might reduce wear on hardware by cutting the total number of trials required.
  • Adapting the kernel weights online during training could further improve results when task difficulty changes over time.

Load-bearing premise

The composite quality metric and latent-embedding diversity measure together produce an unbiased ranking of informative experiences.

What would settle it

Running E²DT on a new long-horizon robotic manipulation benchmark where it shows no improvement over standard Decision Transformer with uniform replay, or where removing either the quality metric or the diversity kernel leaves performance unchanged.
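Stated as an explicit condition grid (names are illustrative placeholders, not experiments reported in the paper), the settling test would look like:

```python
# Hypothetical experiment grid for the decisive test described above.
SETTLING_EXPERIMENT = {
    "benchmark": "a new long-horizon manipulation suite (unspecified)",
    "conditions": [
        "DT + uniform replay",        # baseline
        "E2DT (full)",                # quality + diversity joint kernel
        "E2DT w/o quality metric",    # diversity-only kernel
        "E2DT w/o diversity kernel",  # quality-only (top-k) sampling
    ],
    # The claim fails if the full method does not beat the baseline, or if
    # either ablated variant matches the full method.
}
```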

Figures

Figures reproduced from arXiv: 2605.00159 by Borong Zhang, Kaiyan Zhao, Xiaoguang Niu, Xingyu Liu, Xuetao Li, Yiming Wang, Yuyang Chen.

Figure 1. Our method prioritizes trajectories that jointly achieve …
Figure 2. Trajectory of a long-horizon manipulation task (Tar…
Figure 3. The E²DT framework diagram illustrates a closed-loop system that deeply integrates the decision model with active …
Figure 4. Manipulation Tasks: [Block Stacking, Nut Assembly, …
Figure 5. Mean success rates (%) comparison in ManiSkill2.
Figure 6. Real-world tasks on Elephant Robotics 280: (a) High…
read the original abstract

In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E$^2$DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E$^2$DT to be both efficient, by prioritizing sampling quality, such as high-return, high-uncertainty, and underrepresented trajectories, and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage based on inverse frequency. These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E$^2$DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces E²DT, an extension of the Decision Transformer (DT) for long-horizon robotic manipulation. It replaces uniform replay with a DT-guided k-Determinantal Point Process (k-DPP) that selects experiences via a novel quality-diversity joint kernel. Quality is defined as a composite of RTG quantiles, DT predictive uncertainty, and inverse-frequency stage coverage; diversity is measured from the DT's latent embeddings. The authors claim this yields better sample efficiency and consistent outperformance versus standard DT and prior methods on both simulated and real-robot benchmarks.

Significance. If the empirical gains survive proper controls and the composite signals prove well-calibrated, the work would supply a concrete mechanism for coupling policy learning with experience selection in offline RL. The reuse of DT internals for both control and sampling is a clean architectural choice that could transfer to other transformer-based RL agents. However, the absence of quantitative results in the abstract, together with the free kernel weights and the free choice of k in the k-DPP, leaves open the possibility that the reported advantages partly reflect hyper-parameter search rather than the proposed framework.

major comments (3)
  1. [Method section (quality-diversity joint kernel)] The composite quality score is formed from RTG quantiles, predictive uncertainty, and stage coverage, with the kernel weights listed as free parameters. The central claim that this ranking selects experiences that genuinely drive policy improvement rests on the untested axiom that the composite is calibrated to true informativeness; without an external validation set or held-out benchmark, the reported efficiency gains could be partly circular.
  2. [Experiments section] The abstract asserts consistent outperformance on simulation and real-robot tasks, yet the provided description supplies no quantitative tables, no ablations isolating each quality component, and no statistical tests (e.g., paired t-tests across seeds). Without these, it is impossible to determine whether the gains survive removal of any single signal or are sensitive to the choice of k in the k-DPP.
  3. [§3.3 (k-DPP sampling)] The diversity term relies on latent embeddings from the same DT that is being trained; if these embeddings become overconfident or collapse during training, the k-DPP may fail to enforce genuine trajectory-window diversity, undermining the claim that the method simultaneously achieves efficiency and effectiveness.
minor comments (2)
  1. [Method] Notation for the joint kernel should be introduced once with a single equation rather than piecemeal; the current presentation makes it hard to verify how the three quality terms and the diversity kernel are combined.
  2. [Experiments] Figure captions for the real-robot setup should explicitly state the number of trials, success criterion, and whether the same random seeds were used across compared methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of calibration, empirical rigor, and training stability that we will address through targeted revisions and additions to the manuscript.

read point-by-point responses
  1. Referee: Method section (quality-diversity joint kernel): the composite quality score is formed from RTG quantiles, predictive uncertainty, and stage coverage with explicit kernel weights listed as free parameters. The central claim that this ranking selects experiences that genuinely drive policy improvement rests on the untested axiom that the composite is calibrated to true informativeness; without an external validation set or held-out benchmark, the reported efficiency gains could be partly circular.

    Authors: We appreciate the referee's point on the need to validate the composite quality score. The kernel weights are hyperparameters selected via grid search to maximize end-to-end task performance on the reported benchmarks. To directly test calibration and rule out circularity, we will add a new analysis in the revised manuscript using a held-out set of trajectories: we will compute the correlation between the composite scores and the measured policy improvement (in return) obtained when those trajectories are added to training. This will provide external evidence that the score prioritizes genuinely informative experiences. revision: yes

  2. Referee: Experiments section, ablation and statistical reporting: the abstract asserts consistent outperformance on simulation and real-robot tasks, yet the provided description supplies no quantitative tables, ablation results isolating each quality component, or statistical tests (e.g., paired t-tests across seeds). Without these, it is impossible to determine whether the gains survive removal of any single signal or are sensitive to the choice of k in k-DPP.

    Authors: We agree that quantitative tables, component-wise ablations, and statistical tests are necessary to substantiate the claims. The revised manuscript will include: (1) tables reporting mean and standard deviation of success rates and returns across five random seeds for all baselines and variants; (2) ablation studies that systematically remove each quality term (RTG quantiles, predictive uncertainty, stage coverage) to isolate contributions; (3) paired t-tests assessing statistical significance of improvements; and (4) a sensitivity study varying k in the k-DPP to demonstrate robustness of the reported gains. revision: yes

  3. Referee: §3.3 (k-DPP sampling): the diversity term relies on latent embeddings from the same DT being trained; if these embeddings become overconfident or collapse during training, the k-DPP may fail to enforce genuine trajectory-window diversity, undermining the claim that the method simultaneously achieves efficiency and effectiveness.

    Authors: This concern about embedding stability is well-taken. Because the DT is trained with an action-prediction objective, its embeddings are incentivized to distinguish different state-action sequences rather than collapse. The explicit use of predictive uncertainty within the quality kernel further biases sampling toward less confident (hence more diverse) regions. In the revision we will augment §3.3 with a discussion of this mechanism and add an empirical check (average pairwise embedding distance of selected trajectories over training epochs) to confirm that diversity is preserved throughout learning. revision: yes
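For concreteness, minimal sketches of the three checks promised above: score calibration on held-out trajectories, seed-paired significance testing, and the embedding-diversity diagnostic. All names are hypothetical and none of this code comes from the paper.

```python
# Illustrative sketches of the promised analyses; inputs must be logged by the
# training pipeline and are assumed here.
import numpy as np
from scipy.stats import spearmanr, ttest_rel

def calibration_check(composite_scores, return_deltas):
    """Response 1: rank correlation between each held-out trajectory's composite
    quality score and the change in return observed when it is added to
    training; a clearly positive rho supports the informativeness claim."""
    rho, p = spearmanr(composite_scores, return_deltas)
    return {"spearman_rho": float(rho), "p_value": float(p)}

def seed_paired_comparison(e2dt_success, baseline_success):
    """Response 2: mean/std over matched random seeds plus a paired t-test of
    E2DT against a baseline evaluated on the same seeds."""
    t, p = ttest_rel(e2dt_success, baseline_success)
    return {
        "e2dt": (float(np.mean(e2dt_success)), float(np.std(e2dt_success, ddof=1))),
        "baseline": (float(np.mean(baseline_success)), float(np.std(baseline_success, ddof=1))),
        "paired_t": float(t),
        "p_value": float(p),
    }

def mean_pairwise_embedding_distance(selected_embeddings):
    """Response 3: average pairwise distance among DT latents of the windows
    selected in an epoch; a value collapsing toward zero over training would
    signal embedding collapse and loss of enforced diversity."""
    z = np.asarray(selected_embeddings)
    n = len(z)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(z[i] - z[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```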

Circularity Check

0 steps flagged

No significant circularity detected.

full rationale

The paper proposes E²DT, a new DT-guided k-DPP framework whose quality-diversity kernel is constructed from RTG quantiles, predictive uncertainty, inverse-frequency coverage, and latent embeddings computed on the replay buffer. These quantities are used to select training experiences for the policy, after which performance is measured empirically on standard robotic manipulation benchmarks. No equation or claim reduces the reported outperformance to a tautology or self-fit by construction; the gains are presented as an empirical outcome rather than a mathematical identity. No load-bearing self-citations, uniqueness theorems, or renamed known results appear in the derivation. The method's claims are therefore checked against external benchmarks rather than established by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on three untested modeling choices: that DT latent embeddings faithfully capture trajectory diversity, that the three-term composite quality score is a reliable proxy for informativeness, and that the k-DPP sampler preserves policy optimality when the kernel is non-stationary.

free parameters (2)
  • kernel weights for RTG quantile, uncertainty, and stage coverage
    The relative importance of the three quality signals must be chosen or fitted; the abstract does not state whether they are fixed or tuned on the same data used for policy training.
  • k in k-DPP
    The number of trajectories sampled per batch is a free parameter that directly controls the diversity-quality trade-off.
axioms (2)
  • domain assumption: DT internal embeddings provide a meaningful distance for trajectory diversity
    Invoked when the paper states that latent embeddings measure diversity across trajectory windows.
  • ad hoc to paper: The composite quality metric ranks experiences by true informativeness for policy improvement
    The paper defines quality via RTG quantiles, uncertainty, and inverse frequency without external validation that these signals correlate with downstream task success.
invented entities (1)
  • quality-diversity joint kernel: no independent evidence
    purpose: To combine return, uncertainty, and coverage signals into a single DPP kernel for experience selection
    Newly defined in the paper; no independent evidence is supplied that the kernel generalizes beyond the reported benchmarks.

pith-pipeline@v0.9.0 · 5606 in / 1675 out tokens · 57018 ms · 2026-05-09T20:11:38.270661+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 11 canonical work pages · 3 internal anchors
