pith. machine review for the scientific record.

arxiv: 2605.00159 · v1 · submitted 2026-04-30 · 💻 cs.RO

Recognition: unknown

E²DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:11 UTC · model grok-4.3

classification 💻 cs.RO
keywords reinforcement learning · decision transformer · robotic manipulation · experience sampling · determinantal point process · sample efficiency · long-horizon tasks · quality-diversity selection

The pith

A Decision Transformer for robotic manipulation learns more efficiently by actively selecting its own training experiences through guided sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents E²DT as a way to fix the limitations of standard Decision Transformers, which rely on uniform replay of experiences and thus suffer from poor sample efficiency and limited exploration in long-horizon robotic tasks. It replaces uniform sampling with a process that selects experiences based on both quality and diversity, using the model's latent embeddings to measure diversity across trajectory segments and a composite score to rank quality. This allows the system to focus on high-return, high-uncertainty, and underrepresented trajectories while preserving variety. A sympathetic reader would care because robotic manipulation often requires many trials to succeed on complex sequences, and better use of collected data could reduce the total number needed without sacrificing final performance.

Core claim

E²DT is a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. The framework is experience-aware, allowing it to be efficient by prioritizing high-return, high-uncertainty, and underrepresented trajectories, and effective by ensuring diversity across trajectory windows to preserve policy optimality. DT's internal latent embeddings measure diversity, while quality is quantified through a composite metric integrating return-to-go quantiles, predictive uncertainty, and inverse-frequency stage coverage; these are combined in a quality-diversity joint kernel that prioritizes the most informative experiences.
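The abstract names the three quality signals and the DPP structure but not the functional form. A plausible reading, assuming the standard quality-diversity decomposition of a DPP kernel (the weights w₁..w₃, the normalisations, and the similarity S below are placeholders, not values from the paper), is:

```latex
q_i = w_1\,\widehat{Q}_{\mathrm{RTG}}(\tau_i) + w_2\,u(\tau_i) + w_3\,f^{-1}(\tau_i),
\qquad
L_{ij} = q_i\, S(z_i, z_j)\, q_j
```

Here τᵢ is a trajectory window, zᵢ its DT latent embedding, u(·) the model's predictive uncertainty, f(·) the visit frequency of the task stage the window covers, and S a similarity kernel over the embeddings; a k-DPP over L then draws the k windows used for the next update.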

What carries the argument

The quality-diversity joint kernel inside a k-Determinantal Point Process, with diversity from DT latent embeddings across trajectory windows and quality from the composite metric of return-to-go quantiles, predictive uncertainty, and inverse-frequency stage coverage.
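To make the selection loop concrete, a minimal Python sketch of such a kernel and a greedy k-DPP selection step follows. The cosine similarity, the exponential quality weighting, the ridge term, and the greedy MAP approximation are assumptions for illustration; the paper's exact construction may differ.

```python
# Illustrative sketch only: quality-diversity kernel plus a greedy k-DPP (MAP)
# selection step. Not the paper's implementation.
import numpy as np

def quality_scores(rtg_quantile, uncertainty, inv_stage_freq, w=(1.0, 1.0, 1.0)):
    """Composite quality from RTG quantile, predictive uncertainty, and
    inverse-frequency stage coverage (all assumed normalised to [0, 1])."""
    return w[0] * rtg_quantile + w[1] * uncertainty + w[2] * inv_stage_freq

def joint_kernel(embeddings, quality):
    """L_ij = q_i * S_ij * q_j, with S a cosine-similarity Gram matrix over the
    DT latent embeddings of trajectory windows; a small ridge keeps L well
    conditioned for the determinant computations below."""
    z = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    S = z @ z.T                        # diversity: similar windows -> high S_ij
    q = np.exp(quality)                # quality as a positive multiplicative weight
    return q[:, None] * S * q[None, :] + 1e-6 * np.eye(len(q))

def greedy_kdpp(L, k):
    """Greedy MAP approximation of k-DPP sampling: at each step add the item
    that maximises log det of the selected principal submatrix, which favours
    high-quality items that are dissimilar to those already chosen."""
    selected = []
    for _ in range(min(k, len(L))):
        best_i, best_val = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected

# Hypothetical usage: select 64 windows from a buffer with latent embeddings Z
# (n x d) and per-window signals rtg_q, unc, inv_freq (each length n).
# batch_idx = greedy_kdpp(joint_kernel(Z, quality_scores(rtg_q, unc, inv_freq)), k=64)
```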

If this is right

  • Prioritizing quality and diversity in replay leads to higher sample efficiency than uniform sampling in long-horizon tasks.
  • Ensuring trajectory diversity through the joint kernel helps maintain policy optimality during learning.
  • The method produces consistent outperformance over prior approaches on both simulated and real-robot manipulation benchmarks.
  • Coupling policy learning directly with experience-aware sampling yields a more robust path for scaling to complex robotic sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling logic could be tested on other sequence models in reinforcement learning to see if gains transfer beyond Decision Transformers.
  • In physical robot settings, focusing on uncertain and diverse experiences might reduce wear on hardware by cutting the total number of trials required.
  • Adapting the kernel weights online during training could further improve results when task difficulty changes over time.

Load-bearing premise

The composite quality metric and latent-embedding diversity measure together produce an unbiased ranking of informative experiences.

What would settle it

Running E²DT on a new long-horizon robotic manipulation benchmark where it shows no improvement over standard Decision Transformer with uniform replay, or where removing either the quality metric or the diversity kernel leaves performance unchanged.
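Stated as an explicit condition grid (names are illustrative placeholders, not experiments reported in the paper), the settling test would look like:

```python
# Hypothetical experiment grid for the decisive test described above.
SETTLING_EXPERIMENT = {
    "benchmark": "a new long-horizon manipulation suite (unspecified)",
    "conditions": [
        "DT + uniform replay",        # baseline
        "E2DT (full)",                # quality + diversity joint kernel
        "E2DT w/o quality metric",    # diversity-only kernel
        "E2DT w/o diversity kernel",  # quality-only (top-k) sampling
    ],
    # The claim fails if the full method does not beat the baseline, or if
    # either ablated variant matches the full method.
}
```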

Figures

Figures reproduced from arXiv: 2605.00159 by Borong Zhang, Kaiyan Zhao, Xiaoguang Niu, Xingyu Liu, Xuetao Li, Yiming Wang, Yuyang Chen.

Figure 1. Our method prioritizes trajectories that jointly achieve …
Figure 2. Trajectory of a long-horizon manipulation task (Tar…
Figure 3. The E²DT framework diagram illustrates a closed-loop system that deeply integrates the decision model with active …
Figure 4. Manipulation Tasks: [Block Stacking, Nut Assembly, …
Figure 5. Mean success rates (%) comparison in ManiSkill2.
Figure 6. Real-world tasks on Elephant Robotics 280: (a) High…
read the original abstract

In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E$^2$DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E$^2$DT to be both efficient, by prioritizing sampling quality, such as high-return, high-uncertainty, and underrepresented trajectories, and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage based on inverse frequency. These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E$^2$DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces E²DT, an extension of the Decision Transformer (DT) for long-horizon robotic manipulation. It replaces uniform replay with a DT-guided k-Determinantal Point Process (k-DPP) that selects experiences via a novel quality-diversity joint kernel. Quality is defined as a composite of RTG quantiles, DT predictive uncertainty, and inverse-frequency stage coverage; diversity is measured from the DT's latent embeddings. The authors claim this yields better sample efficiency and consistent outperformance versus standard DT and prior methods on both simulated and real-robot benchmarks.

Significance. If the empirical gains survive proper controls and the composite signals prove well-calibrated, the work would supply a concrete mechanism for coupling policy learning with experience selection in offline RL. The reuse of DT internals for both control and sampling is a clean architectural choice that could transfer to other transformer-based RL agents. However, the absence of quantitative results in the abstract, together with the free kernel weights and the free choice of k in the k-DPP, leaves open the possibility that the reported advantages partly reflect hyper-parameter search rather than the proposed framework.

major comments (3)
  1. [Method section (quality-diversity joint kernel)] The composite quality score is formed from RTG quantiles, predictive uncertainty, and stage coverage, with the kernel weights listed as free parameters. The central claim that this ranking selects experiences that genuinely drive policy improvement rests on the untested axiom that the composite is calibrated to true informativeness; without an external validation set or held-out benchmark, the reported efficiency gains could be partly circular.
  2. [Experiments section] The abstract asserts consistent outperformance on simulation and real-robot tasks, yet the provided description supplies no quantitative tables, no ablations isolating each quality component, and no statistical tests (e.g., paired t-tests across seeds). Without these, it is impossible to determine whether the gains survive removal of any single signal or are sensitive to the choice of k in the k-DPP.
  3. [§3.3 (k-DPP sampling)] The diversity term relies on latent embeddings from the same DT that is being trained; if these embeddings become overconfident or collapse during training, the k-DPP may fail to enforce genuine trajectory-window diversity, undermining the claim that the method simultaneously achieves efficiency and effectiveness.
minor comments (2)
  1. [Method] Notation for the joint kernel should be introduced once with a single equation rather than piecemeal; the current presentation makes it hard to verify how the three quality terms and the diversity kernel are combined.
  2. [Experiments] Figure captions for the real-robot setup should explicitly state the number of trials, success criterion, and whether the same random seeds were used across compared methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of calibration, empirical rigor, and training stability that we will address through targeted revisions and additions to the manuscript.

read point-by-point responses
  1. Referee: Method section (quality-diversity joint kernel): the composite quality score is formed from RTG quantiles, predictive uncertainty, and stage coverage with explicit kernel weights listed as free parameters. The central claim that this ranking selects experiences that genuinely drive policy improvement rests on the untested axiom that the composite is calibrated to true informativeness; without an external validation set or held-out benchmark, the reported efficiency gains could be partly circular.

    Authors: We appreciate the referee's point on the need to validate the composite quality score. The kernel weights are hyperparameters selected via grid search to maximize end-to-end task performance on the reported benchmarks. To directly test calibration and rule out circularity, we will add a new analysis in the revised manuscript using a held-out set of trajectories: we will compute the correlation between the composite scores and the measured policy improvement (in return) obtained when those trajectories are added to training. This will provide external evidence that the score prioritizes genuinely informative experiences. revision: yes

  2. Referee: Experiments section, ablation and statistical reporting: the abstract asserts consistent outperformance on simulation and real-robot tasks, yet the provided description supplies no quantitative tables, ablation results isolating each quality component, or statistical tests (e.g., paired t-tests across seeds). Without these, it is impossible to determine whether the gains survive removal of any single signal or are sensitive to the choice of k in k-DPP.

    Authors: We agree that quantitative tables, component-wise ablations, and statistical tests are necessary to substantiate the claims. The revised manuscript will include: (1) tables reporting mean and standard deviation of success rates and returns across five random seeds for all baselines and variants; (2) ablation studies that systematically remove each quality term (RTG quantiles, predictive uncertainty, stage coverage) to isolate contributions; (3) paired t-tests assessing statistical significance of improvements; and (4) a sensitivity study varying k in the k-DPP to demonstrate robustness of the reported gains. revision: yes

  3. Referee: §3.3 (k-DPP sampling): the diversity term relies on latent embeddings from the same DT being trained; if these embeddings become overconfident or collapse during training, the k-DPP may fail to enforce genuine trajectory-window diversity, undermining the claim that the method simultaneously achieves efficiency and effectiveness.

    Authors: This concern about embedding stability is well-taken. Because the DT is trained with an action-prediction objective, its embeddings are incentivized to distinguish different state-action sequences rather than collapse. The explicit use of predictive uncertainty within the quality kernel further biases sampling toward less confident (hence more diverse) regions. In the revision we will augment §3.3 with a discussion of this mechanism and add an empirical check (average pairwise embedding distance of selected trajectories over training epochs) to confirm that diversity is preserved throughout learning. revision: yes
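For concreteness, minimal sketches of the three checks promised above: score calibration on held-out trajectories, seed-paired significance testing, and the embedding-diversity diagnostic. All names are hypothetical and none of this code comes from the paper.

```python
# Illustrative sketches of the promised analyses; inputs must be logged by the
# training pipeline and are assumed here.
import numpy as np
from scipy.stats import spearmanr, ttest_rel

def calibration_check(composite_scores, return_deltas):
    """Response 1: rank correlation between each held-out trajectory's composite
    quality score and the change in return observed when it is added to
    training; a clearly positive rho supports the informativeness claim."""
    rho, p = spearmanr(composite_scores, return_deltas)
    return {"spearman_rho": float(rho), "p_value": float(p)}

def seed_paired_comparison(e2dt_success, baseline_success):
    """Response 2: mean/std over matched random seeds plus a paired t-test of
    E2DT against a baseline evaluated on the same seeds."""
    t, p = ttest_rel(e2dt_success, baseline_success)
    return {
        "e2dt": (float(np.mean(e2dt_success)), float(np.std(e2dt_success, ddof=1))),
        "baseline": (float(np.mean(baseline_success)), float(np.std(baseline_success, ddof=1))),
        "paired_t": float(t),
        "p_value": float(p),
    }

def mean_pairwise_embedding_distance(selected_embeddings):
    """Response 3: average pairwise distance among DT latents of the windows
    selected in an epoch; a value collapsing toward zero over training would
    signal embedding collapse and loss of enforced diversity."""
    z = np.asarray(selected_embeddings)
    n = len(z)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(z[i] - z[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```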

Circularity Check

0 steps flagged

No significant circularity detected.

full rationale

The paper proposes E²DT, a new DT-guided k-DPP framework whose quality-diversity kernel is constructed from RTG quantiles, predictive uncertainty, inverse-frequency coverage, and latent embeddings computed on the replay buffer. These quantities are used to select training experiences for the policy, after which performance is measured empirically on standard robotic manipulation benchmarks. No equation or claim reduces the reported outperformance to a tautology or self-fit by construction; the gains are presented as an empirical outcome rather than a mathematical identity. No load-bearing self-citations, uniqueness theorems, or renamed known results appear in the derivation. The method's claims are therefore checked against external benchmarks rather than established by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on three untested modeling choices: that DT latent embeddings faithfully capture trajectory diversity, that the three-term composite quality score is a reliable proxy for informativeness, and that the k-DPP sampler preserves policy optimality when the kernel is non-stationary.

free parameters (2)
  • kernel weights for RTG quantile, uncertainty, and stage coverage
    The relative importance of the three quality signals must be chosen or fitted; the abstract does not state whether they are fixed or tuned on the same data used for policy training.
  • k in k-DPP
    The number of trajectories sampled per batch is a free parameter that directly controls the diversity-quality trade-off.
axioms (2)
  • domain assumption: DT internal embeddings provide a meaningful distance for trajectory diversity
    Invoked when the paper states that latent embeddings measure diversity across trajectory windows.
  • ad hoc to paper: The composite quality metric ranks experiences by true informativeness for policy improvement
    The paper defines quality via RTG quantiles, uncertainty, and inverse frequency without external validation that these signals correlate with downstream task success.
invented entities (1)
  • quality-diversity joint kernel: no independent evidence
    purpose: To combine return, uncertainty, and coverage signals into a single DPP kernel for experience selection
    Newly defined in the paper; no independent evidence is supplied that the kernel generalizes beyond the reported benchmarks.

pith-pipeline@v0.9.0 · 5606 in / 1675 out tokens · 57018 ms · 2026-05-09T20:11:38.270661+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 11 canonical work pages · 3 internal anchors
