GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization

Junjie Lai; Qiwei Wu; Renjing Xu; Rui Zhang; Tao Li; Weihua Zhang; Yunrong Guo; Zhengyu Zhang

arxiv: 2606.03335 · v1 · pith:RRAV4LVInew · submitted 2026-06-02 · 💻 cs.RO

Pith reviewed 2026-06-28 09:44 UTC · model grok-4.3

classification 💻 cs.RO

keywords multi-task reinforcement learning · demonstration guided policy optimization · GPU-parallel training · robot manipulation · behavior cloning · on-policy RL · MT-Libero benchmark

The pith

DGPO combines PPO with adaptive behavior cloning to let multi-task robot policies prefer demonstrated task distributions to a tunable degree.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a construction method that turns families of structured manipulation tasks into GPU-parallel multi-task RL benchmarks, instantiated as MT-Libero in Isaac Lab. It then introduces DGPO, an on-policy algorithm that mixes importance-weighted PPO updates with adaptive behavior cloning on matched demonstration actions. This setup supports simultaneous training across heterogeneous tasks with parallel rendering and physics randomization, using either state or visual inputs. A sympathetic reader would care because sparse success signals make pure RL difficult in such settings, and the method aims to retain PPO stability and online improvement while adding tunable guidance from limited prior data. The claim is that this outperforms both demonstration-free RL and existing demo-based approaches.

Core claim

DGPO is an on-policy, demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. It enables a tunable preference toward demonstrated task distributions and is reported to outperform both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO. The supporting benchmark, MT-Libero, is built with a construction methodology that converts structured manipulation task families into GPU-parallel environments supporting simultaneous heterogeneous training.

What carries the argument

DGPO (Demonstration Guided Policy Optimization), which integrates importance-weighted PPO with adaptive behavior cloning on matched demonstration actions to control the strength of preference for demonstrated behaviors.
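
The summary above does not give DGPO's exact objective, so the following is only a minimal sketch of how a PPO term and a behavior cloning term might be mixed. It assumes that "importance-weighted" means a per-task weight on the PPO loss and that the adaptive BC strength is also a per-task weight (both readings are suggested by the Figure 3 panel titles, not confirmed here). The function names, dictionary keys, and the squared-error BC loss are hypothetical.

```python
import math


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def combined_loss(samples, ppo_task_weight, bc_task_weight, clip_eps=0.2):
    """Hypothetical DGPO-style loss averaged over samples (to minimize).

    Each sample is a dict with keys: task, logp_new, logp_old, advantage,
    policy_action, demo_action. All names are illustrative assumptions.
    """
    total = 0.0
    for s in samples:
        # Probability ratio between the new and old policy for the taken action.
        ratio = math.exp(s["logp_new"] - s["logp_old"])
        ppo_loss = -clipped_surrogate(ratio, s["advantage"], clip_eps)
        # Assumed BC term: squared error to the matched demonstration action.
        bc_loss = sum(
            (p - d) ** 2 for p, d in zip(s["policy_action"], s["demo_action"])
        )
        task = s["task"]
        total += ppo_task_weight[task] * ppo_loss + bc_task_weight[task] * bc_loss
    return total / len(samples)
```

In this sketch, setting every entry of `bc_task_weight` to zero reduces the loss to task-weighted PPO, while larger entries pull the policy toward the demonstrations for that task.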

If this is right

Simultaneous reinforcement learning over heterogeneous task suites becomes practical with parallel rendering, physics randomization, and support for state-input or visual-input policies.
Policies gain a controllable bias toward demonstrated task distributions without giving up the ability to improve online through PPO updates.
Sparse success signals in multi-task robot manipulation can be addressed by blending on-policy RL with adaptive cloning on available demonstration actions.
The same training run can handle multiple tasks at once while still allowing per-task specialization through the tunable preference mechanism.
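
To make "multiple tasks at once" concrete, here is a small sketch of one way to split a flat batch of parallel environments into per-task groups and read back per-task success rates. It is not the MT-Libero implementation and uses no Isaac Lab API; the contiguous-block layout and both function names are assumptions for illustration.

```python
def assign_task_groups(num_envs, task_names):
    """Split env indices 0..num_envs-1 into contiguous, near-equal task groups."""
    base, extra = divmod(num_envs, len(task_names))
    groups, start = {}, 0
    for i, name in enumerate(task_names):
        # The first `extra` tasks receive one additional environment each.
        size = base + (1 if i < extra else 0)
        groups[name] = range(start, start + size)
        start += size
    return groups


def per_task_success(groups, success_flags):
    """Average a flat list of 0/1 success flags within each task group."""
    rates = {}
    for name, indices in groups.items():
        if len(indices) == 0:
            rates[name] = 0.0
        else:
            rates[name] = sum(success_flags[i] for i in indices) / len(indices)
    return rates
```

Keeping each task in a contiguous index block means per-task rewards, resets, and statistics can be computed by slicing the same flat batch that the shared policy acts on.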

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The benchmark construction approach could be applied to other simulation platforms or task domains beyond manipulation to create additional large-scale parallel training suites.
Tunable demonstration preference might reduce the sample complexity needed for real-robot fine-tuning after simulation training.
If the importance weighting in DGPO can be adjusted dynamically per task, it could enable automatic balancing across task difficulties within one training run.
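
As a purely illustrative take on the last extension, per-task weights could be recomputed from running success rates so that harder tasks receive more weight. The softmax-over-failure-rate rule below is an assumption chosen for simplicity, not something the paper or this review describes; `difficulty_weights` and `temperature` are hypothetical names.

```python
import math


def difficulty_weights(success_rates, temperature=0.5):
    """Map per-task success rates to weights with mean 1; harder tasks weigh more."""
    # Score each task by its failure rate, scaled by a temperature.
    scores = {t: (1.0 - s) / temperature for t, s in success_rates.items()}
    top = max(scores.values())
    # Subtract the max before exponentiating for numerical stability.
    exps = {t: math.exp(v - top) for t, v in scores.items()}
    norm = sum(exps.values())
    n = len(exps)
    return {t: n * e / norm for t, e in exps.items()}
```

With equal success rates every task gets weight 1. A lower `temperature` concentrates weight on the tasks that are currently failing most often.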

Load-bearing premise

The construction methodology can turn arbitrary structured manipulation task families into GPU-parallel benchmarks that support simultaneous heterogeneous training with matched demonstration actions for the adaptive behavior cloning component.

What would settle it

A direct comparison in MT-Libero where DGPO either loses PPO-style stability during training or fails to exceed the success rates of prior-free PPO and existing demonstration methods across the task suite.
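
The success-rate half of that comparison could be scripted roughly as follows. This is a sketch under stated assumptions: per-task success rates are already available as dictionaries, "exceeds" is taken to mean a strictly higher suite mean and a strictly higher mean over the weakest 20% of tasks, and the function names are hypothetical. It does not measure training stability, which would need the learning curves.

```python
def summarize_suite(success_by_task, tail_fraction=0.2):
    """Return the suite mean and the mean over the lowest-scoring tasks."""
    rates = sorted(success_by_task.values())
    # At least one task always counts toward the tail.
    k = max(1, round(len(rates) * tail_fraction))
    return {
        "mean": sum(rates) / len(rates),
        "tail_mean": sum(rates[:k]) / k,
    }


def beats_all(candidate, baselines):
    """True if the candidate is strictly better than every baseline on both metrics."""
    c = summarize_suite(candidate)
    for rates in baselines.values():
        b = summarize_suite(rates)
        if c["mean"] <= b["mean"] or c["tail_mean"] <= b["tail_mean"]:
            return False
    return True
```

Reporting the tail mean alongside the suite mean guards against a method that wins on average while leaving the hardest tasks unsolved.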

Figures

Figures reproduced from arXiv: 2606.03335 by Junjie Lai, Qiwei Wu, Renjing Xu, Rui Zhang, Tao Li, Weihua Zhang, Yunrong Guo, Zhengyu Zhang.

**Figure 1.** Overview of MT-Libero and DGPO. Left: MT-Libero executes heterogeneous LIBERO manipulation tasks as parallel Isaac Lab task groups within one vectorized training loop. Right: DGPO trains a shared multi-task policy by combining importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. Benchmark scale alone does not solve sparse exploration or multi-task optimization. Demonstr…

**Figure 2.** Core elements for GPU-parallel multi-task RL environments. The figure summarizes how scene and task definitions, offline asset conversion to USD, heterogeneous GPU vectorized task groups, and task-specific rewards and resets are converted into reusable MT-RL environment components. The examples also illustrate that the same construction can organize parallel RL environments across different embodiments.

**Figure 3.** Experimental overview. (a) Success rate distributions across tasks and methods; (b) IW-PPO task weight heatmap over the first 200M steps; (c) adaptive BC task weight evolution; (d) ablation learning curves; (e) ablation performance on the 20th percentile tasks; and (f) the tradeoff between specialization and generalization under perturbations inside the demonstration range.

**Figure 4.** RoboTwin multi-task benchmark extension. We additionally reproduced a multi-task RoboTwin benchmark instantiation to test whether the same descriptor-based construction recipe extends beyond LIBERO to dual-arm manipulation scenes with different assets, task geometry, and coordination structure.

**Figure 5.** Comparison of state-input and visual-input MT-RFCL and MT-RLPD under the single-L20 budget. Mean episode reward (left) and per-suite success rate (right) over training; both plateau near zero throughout, consistent with the buffer- and batch-size constraints in Table 10.

**Figure 6.** Qualitative real robot rollouts. The figure shows rollout snapshots from the preliminary real-world transfer check across three tabletop scenes using one simulation-trained state-input policy.

Original abstract

Large-scale GPU-parallel reinforcement learning has changed what can be trained in robot simulation, yet most systems still optimize one specialist policy per task. We propose a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, and instantiate it as MT-Libero using LIBERO assets and task predicates in Isaac Lab. The resulting benchmark supports simultaneous reinforcement learning over heterogeneous task suites with parallel rendering, physics randomization, and state-input or visual-input policies. To make such training practical under sparse success signals and limited prior data, we further propose DGPO, an on-policy demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. DGPO enables a tunable preference toward demonstrated task distributions, outperforming both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces MT-Libero as a GPU-parallel multi-task benchmark and DGPO as an on-policy hybrid of weighted PPO plus adaptive BC, which looks like a practical engineering step but rests on unshown experiments.

The main takeaway is a construction method that turns structured manipulation task families into GPU-parallel multi-task RL benchmarks, shown as MT-Libero built from LIBERO assets in Isaac Lab, plus DGPO, which blends importance-weighted PPO with adaptive behavior cloning on matched demonstration actions.

This combination is new relative to the single-task or offline methods referenced. The benchmark part is useful because it enables simultaneous training across heterogeneous tasks with parallel rendering, physics randomization, and either state or visual policies. DGPO adds a tunable preference for demonstrated distributions while keeping the stability and online improvement of on-policy PPO, which addresses sparse success signals without going fully offline.

The approach is coherent on its own terms and builds directly on established components, so there is no visible circularity or internal contradiction. The construction and algorithm description are the parts that could be reused by others working in simulation-based robotics.

The soft spot is that the abstract states outperformance over prior-free RL and existing demonstration methods but supplies no metrics, baselines, statistical details, or ablation results. Without those, the data-to-claim link cannot be checked. The assumption that the construction works for arbitrary task families with matched actions is plausible but remains to be verified in practice.

This is for researchers in multi-task robot RL who already use or can adopt Isaac Lab and have access to some demonstrations. A reader focused on scaling simulation training or hybrid on-policy methods would get concrete value from the benchmark recipe and the weighting scheme.

It deserves a serious referee to examine the experiments and reproducibility.

Referee Report

1 major / 0 minor

Summary. The paper claims to provide a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, instantiated as MT-Libero using LIBERO assets and task predicates in Isaac Lab. This benchmark supports simultaneous heterogeneous training with parallel rendering, physics randomization, and state or visual policies. It further proposes DGPO, an on-policy method combining importance-weighted PPO with adaptive behavior cloning on matched demonstration actions, which enables a tunable preference toward demonstrated task distributions and is claimed to outperform both prior-free RL and existing demonstration-based methods while preserving PPO stability and online improvement benefits.

Significance. If the empirical claims hold, the work could meaningfully advance scalable multi-task robot learning by enabling efficient GPU-parallel training across heterogeneous tasks with tunable demonstration guidance under sparse rewards. The on-policy formulation and preservation of PPO's stability properties address a practical need in robotics where demonstrations are available but must be balanced with continued online improvement.

major comments (1)

Abstract: The abstract asserts outperformance over prior-free RL and existing demonstration-based methods, but supplies no experimental details, metrics, baselines, or statistical evidence, so the data-to-claim link cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need to strengthen the connection between claims and evidence in the abstract. We address this point below.

Referee: Abstract: The abstract asserts outperformance over prior-free RL and existing demonstration-based methods, but supplies no experimental details, metrics, baselines, or statistical evidence, so the data-to-claim link cannot be evaluated.

Authors: We agree that the abstract, in its current form, states the performance claims at a high level without referencing specific metrics or baselines. While abstracts are necessarily concise, we acknowledge that this can make it difficult for readers to immediately assess the strength of the empirical support. In the revised manuscript we will update the abstract to include a brief mention of the primary evaluation metrics (success rate under sparse rewards), the main baselines (prior-free PPO and standard behavior cloning variants), and the key result that DGPO achieves higher average success across the MT-Libero task suite while retaining on-policy stability. Full tables, statistical details, and ablation studies remain in Sections 4 and 5.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method combines established PPO and BC components with empirical validation

The paper proposes a benchmark construction (MT-Libero) and DGPO as an on-policy combination of importance-weighted PPO with adaptive behavior cloning. No equations or claims reduce a prediction or result to its own fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present in the provided abstract or description. The central claims rest on empirical outperformance rather than definitional equivalence. This is the expected honest non-finding for a methods paper that extends standard RL primitives without internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger reflects high-level elements stated there. The tunable preference is treated as a free parameter; standard RL assumptions about on-policy stability are treated as domain assumptions.

free parameters (1)

tunable preference weight
The method description states that DGPO enables a tunable preference toward demonstrated distributions, implying a controllable hyperparameter that balances the PPO and behavior cloning terms.

axioms (1)

domain assumption: Demonstration actions can be reliably matched to the current policy's action space across heterogeneous tasks
Required for the adaptive behavior cloning component of DGPO to function as described.

invented entities (2)

MT-Libero benchmark (no independent evidence)
purpose: GPU-parallel multi-task RL benchmark supporting state or visual policies with physics randomization
Newly constructed from LIBERO assets and task predicates inside Isaac Lab.

DGPO algorithm (no independent evidence)
purpose: On-policy demonstration-guided policy optimization combining importance-weighted PPO with adaptive behavior cloning
Newly proposed method.
