pith. machine review for the scientific record.

arxiv: 2603.15956 · v2 · submitted 2026-03-16 · 💻 cs.RO · cs.AI

Recognition: no theorem link

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Pith reviewed 2026-05-15 09:32 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords diffusion policy · sim-to-real transfer · expert policy learning · behavior cloning · reinforcement learning · robotic manipulation · sparse rewards · imitation learning

The pith

ExpertGen steers frozen diffusion policies from imperfect demonstrations to high task success by optimizing only their initial noise in simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the need for large volumes of expensive real-world expert data when training robust robot policies. It starts by training a diffusion model on imperfect demonstrations that can come from language models or limited human input. Reinforcement learning then tunes only the starting noise fed into this frozen policy to drive up task completion using sparse rewards. The frozen weights keep generated motions inside safe human-like ranges. The resulting state-based experts are distilled into vision-based controllers that run on physical robots.

Core claim

ExpertGen initializes a diffusion policy on imperfect demonstrations and uses RL to optimize its initial noise input while keeping the policy weights frozen, thereby steering the generated behaviors toward high task success rates within safe manifolds. On industrial assembly tasks this yields 90.5% success and on long-horizon manipulation 85% success, outperforming baselines, with the resulting state-based policies distilled via DAgger into visuomotor policies that deploy successfully on real hardware.
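
Read as an optimization problem, the claim can be written roughly as follows. The notation is an assumption made here for illustration and is not taken from the paper: $\pi_\theta$ is the diffusion policy with frozen weights $\theta$, $\pi_\phi$ is the trainable noise policy, $w_t$ is the initial noise, $\gamma$ is a discount factor, and $r \in \{0, 1\}$ is the sparse success reward. Writing $a_t = \pi_\theta(s_t, w_t)$ also treats the denoising chain as a deterministic map from noise to action, which is a further simplification.

```latex
\max_{\phi}\; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
w_t \sim \pi_\phi(\cdot \mid s_t), \qquad
a_t = \pi_\theta(s_t, w_t), \qquad
\theta \ \text{held fixed}.
```

Only $\phi$ is updated, so every action remains something the pretrained prior could already produce for some noise input.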

What carries the argument

Reinforcement learning optimization of only the initial noise vector supplied to a frozen diffusion policy, which regularizes exploration to human-like behavior manifolds while using sparse rewards.
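
A toy sketch of this mechanism, under loud assumptions: `frozen_policy` is a hypothetical stand-in for a diffusion policy (a bounded `tanh` map, not a real denoiser), the task is a 2-D reaching problem invented for illustration, and the cross-entropy method replaces the paper's RL algorithm purely to keep the example dependency-free. What carries over is the structure: the optimizer only ever changes the noise that is fed in, and the reward is a single 0/1 success signal.

```python
import math
import random


def frozen_policy(obs, noise):
    """Stand-in for a frozen diffusion policy (hypothetical, not the paper's model).

    Nothing inside this function is ever updated; only `noise` changes the output.
    tanh keeps every action inside [-0.2, 0.2], mimicking a bounded behavior prior.
    A real diffusion policy would also condition on `obs`; this toy ignores it.
    """
    return [0.2 * math.tanh(n) for n in noise]


def rollout(gains, rng, horizon=20, tol=0.1):
    """One episode of a 2-D reaching task with a sparse 0/1 reward at the end."""
    pos = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    goal = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    for _ in range(horizon):
        obs = [g - p for g, p in zip(goal, pos)]
        noise = [k * o for k, o in zip(gains, obs)]  # trainable noise policy
        action = frozen_policy(obs, noise)           # frozen prior decodes the noise
        pos = [p + a for p, a in zip(pos, action)]
    return 1.0 if all(abs(g - p) < tol for g, p in zip(goal, pos)) else 0.0


def success_rate(gains, rng, episodes=20):
    """Monte Carlo estimate of the success probability for one noise policy."""
    return sum(rollout(gains, rng) for _ in range(episodes)) / episodes


def cem(iterations=10, population=30, elite=6, seed=0):
    """Cross-entropy method over the two gains of the linear noise policy."""
    rng = random.Random(seed)
    mean, std = [0.0, 0.0], [1.0, 1.0]
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        scores = [success_rate(s, rng) for s in samples]
        ranked = sorted(zip(scores, samples), key=lambda pair: pair[0], reverse=True)
        elites = [s for _, s in ranked[:elite]]
        mean = [sum(e[i] for e in elites) / elite for i in range(2)]
        std = [max(0.05, math.sqrt(sum((e[i] - mean[i]) ** 2 for e in elites) / elite))
               for i in range(2)]
    return mean


if __name__ == "__main__":
    eval_rng = random.Random(1)
    tuned = cem()
    print("tuned gains:", tuned)
    print("success before:", success_rate([0.0, 0.0], eval_rng, episodes=200))
    print("success after: ", success_rate(tuned, eval_rng, episodes=200))
```

However the gains are tuned, no action can leave the range the frozen map allows, which is the toy analogue of the regularization described above.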

If this is right

  • The method produces dexterous policies that remain robust across varied initial configurations and recovery from failures.
  • State-based experts learned in simulation transfer to real robots after distillation into visuomotor policies.
  • Effective learning occurs with only sparse rewards because the frozen prior limits unsafe exploration.
  • Imperfect sources such as LLM-generated demonstrations become sufficient starting points for high-performing policies.
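
The distillation step in the second bullet can be sketched as a DAgger loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the state is a single scalar, `render` is a hypothetical stand-in for a camera (the state plus sensor noise), `expert_action` stands in for the state-based expert, and the student is a one-parameter linear model fit by least squares.

```python
import math
import random


def expert_action(state):
    """Privileged state-based expert (hypothetical stand-in for the steered policy)."""
    return -0.5 * math.tanh(2.0 * state)


def render(state, rng):
    """Stand-in for a visual observation: the true state corrupted by sensor noise."""
    return state + rng.gauss(0.0, 0.05)


def fit_student(dataset):
    """Least-squares fit of a linear student: action = gain * observation."""
    num = sum(obs * act for obs, act in dataset)
    den = sum(obs * obs for obs, _ in dataset)
    return num / den if den > 0 else 0.0


def dagger(rounds=5, episodes=20, horizon=15, seed=0):
    """Roll out, relabel visited states with the expert, aggregate, refit."""
    rng = random.Random(seed)
    dataset, gain = [], 0.0
    for round_idx in range(rounds):
        # The expert drives the first round; afterwards the student drives,
        # so later data comes from the states the student actually visits.
        beta = 1.0 if round_idx == 0 else 0.0
        for _ in range(episodes):
            state = rng.uniform(-2.0, 2.0)
            for _ in range(horizon):
                obs = render(state, rng)
                label = expert_action(state)   # expert labels every visited state
                dataset.append((obs, label))   # aggregate, never discard
                action = label if rng.random() < beta else gain * obs
                state += action
        gain = fit_student(dataset)            # retrain on the aggregated dataset
    return gain, len(dataset)


if __name__ == "__main__":
    gain, size = dagger()
    print("student gain:", gain, "dataset size:", size)
```

The difference from plain behavior cloning is the relabeling of states reached under the student's own actions; cloning alone would stop after the first round.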

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If noise optimization alone keeps behaviors inside safe manifolds, the same idea could apply to other generative models used for policy generation.
  • Strong performance on long-horizon tasks suggests the approach could support composition of sequential robotic skills without additional dense supervision.
  • Avoiding weight updates during the RL phase may lower compute needs compared with full policy fine-tuning, allowing quicker simulation iterations.
  • Successful sim-to-real transfer via simple distillation implies that behavior regularization can substitute for extensive domain randomization in many cases.

Load-bearing premise

That optimizing only the initial noise of a frozen diffusion policy will reliably steer behavior toward high task success while remaining inside safe, human-like manifolds without reward engineering or policy updates.
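
One hedged way to probe this premise is a best-of-N coverage check: sample several initial noises per start state and ask whether any of them succeeds. The sketch below is a toy under stated assumptions. `frozen_prior`, `succeeds`, and the single-step task are all hypothetical and chosen so that part of the state space lies outside what the prior can reach.

```python
import math
import random


def frozen_prior(obs, noise):
    """Toy frozen prior: an imperfect mean action plus bounded noise-driven variation."""
    return 0.5 * obs + 0.3 * math.tanh(noise)


def succeeds(obs, action, tol=0.05):
    """Sparse success check against the optimal action of this toy task (0.8 * obs)."""
    return abs(action - 0.8 * obs) < tol


def coverage_at_n(n_samples, n_states=500, seed=0):
    """Fraction of start states for which at least one of n noise samples succeeds."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_states):
        obs = rng.uniform(-2.0, 2.0)
        if any(succeeds(obs, frozen_prior(obs, rng.gauss(0.0, 1.0)))
               for _ in range(n_samples)):
            covered += 1
    return covered / n_states


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, coverage_at_n(n))
```

In this toy, states with |obs| above roughly 1.17 can never be covered, because the bounded noise term cannot close the gap between the prior mean and the optimal action. A coverage curve that saturates below 1 as `n_samples` grows would indicate the same problem for a real prior: noise optimization alone cannot recover behavior the prior does not contain.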

What would settle it

An experiment showing that noise-optimized policies achieve lower success rates than strong baselines on the same assembly and long-horizon tasks or fail to transfer after DAgger distillation to real hardware.

Figures

Figures reproduced from arXiv: 2603.15956 by Ahmet Salih Gundogdu, Di Deng, Eric Rosen, Karl Schmeckpeper, Kausik Sivakumar, Maria Vittoria Minniti, Peter Stone, Ran Gong, Riedana Yan, Xiaohan Zhang, Zifan Xu, Zixing Wang.

Figure 1
EXPERTGEN pipeline: (left) generative modeling of imperfect behavior priors; (middle) steering prior model in massively parallel simulation using reinforcement learning; and (right) visual distillation in simulation with DAgger for zero-shot sim-to-real transfer.
Figure 2
ExpertGen training pipeline: generative modeling of…
Figure 4
Illustrations of the tasks in our experiments. (A) Real…
Figure 3
The success rates (%) of the evaluated approaches on selected assets from AutoMate benchmark. E…
Figure 5
Ablation of DAgger vs. BC in simulation. We plot the…
Figure 6
Success rate (%) distribution of imperfect prior diffu…
Figure 7
Success rates of base diffusion policies (top) and the…
Figure 8
The success rates (%) of the evaluated state-based…
Figure 9
An overview of the selected eight tasks from A…
Figure 10
Visualization of the selected nine assets from Au…
Figure 11
The success rates (%) of the evaluated approaches on selected assets from AutoMate benchmark. E…
Figure 12
Overall smoothness and feasibility evaluation of state…
Figure 13
Scripted policy for Lift Banana
Figure 14
Scripted policy for Lift Brick
Listing: Scripted policy - Lift Peach
    def scripted_policy(env):
        import torch
        # Object id for peach
        peach_id = 1
        # Step 1: Pick up the peach. This skill will grasp the peach and lift it with an internally defined distance.
        pick(env, peach_id)
        # Step 2: Compute the target position for lifting.
        # We want to ensure the peach is lifted upward by at least 20 cm relative to its initial posi…
Figure 15
Scripted policy for Lift Peach
Figure 16
Scripted policy for Open Drawer
Listing: Scripted policy - Push Pear to Center
    def scripted_policy(env):
        import torch
        # Push the pear (object_id=1) to target xy (0.4, -0.5) using iterative, state-based pushes
        pear_id = 1
        target_xy = make_a_tensor(env, [0.4, -0.44])
        push_or_pull_object_to_xy(env, pear_id, target_xy)
Figure 17
Scripted policy for Push Pear to Center
Listing: Scripted policy - Stack Banana on Can
    def scripted_policy(env):
        import torch
        banana_id = 2
        can_id = 1
        # 1) Pick the banana
        pick(env, banana_id)
        # 2) Lift to 25 cm above initial height
        init_banana = get_object_initial_position(env, banana_id)  # (N,3)
        lift_offset = torch.tensor([0.0, 0.0, 0.18], device=init_banana.device).reshape(1, 3)
        lifted_pos = init_banana + lift_…
Figure 18
Scripted policy for Stack Banana on Can
Figure 19
Scripted policy for Put Object in Closed Drawer
Listing: Scripted policy - Place Strawberry in Bowl
    def scripted_policy(env):
        import torch
        strawberry_id = 2
        bowl_id = 1
        # 1) Pick the strawberry
        pick(env, strawberry_id)
        # 2) Lift to 25 cm above initial height
        init_strawberry = get_object_initial_position(env, strawberry_id)  # (N,3)
        lift_offset = torch.tensor([0.0, 0.0, 0.18], device=init_strawberry.device).reshape(…
Figure 20
Scripted policy for Put Object in Closed Drawer
Figure 21
Scripted policy for AutoMate disassembly tasks.
Original abstract

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ExpertGen, a framework that trains a diffusion policy on imperfect demonstrations (synthesized by LLMs or from humans) as a behavior prior, then applies RL to optimize only the initial noise input while keeping the diffusion policy frozen. This steers the policy toward high task success using sparse rewards without reward engineering. It reports 90.5% overall success on industrial assembly tasks and 85% on long-horizon manipulation tasks, outperforming baselines, with policies that are dexterous and robust; the state-based experts are then distilled via DAgger into visuomotor policies for real-robot deployment.

Significance. If the coverage assumption holds and the empirical gains are reproducible, ExpertGen could meaningfully advance scalable expert policy acquisition in robotics by reducing reliance on high-quality teleoperated data and manual reward design, while using frozen diffusion priors for safe, manifold-constrained exploration. The sim-to-real distillation step adds practical value for hardware transfer.

major comments (2)
  1. [Abstract] Abstract: The central empirical claims (90.5% assembly success, 85% long-horizon success, outperformance of all baselines) are presented without any mention of evaluation protocol, number of trials, random seeds, error bars, or statistical significance tests. This information is load-bearing for assessing whether the reported rates substantiate the method's advantage over baselines.
  2. [Method description] Method description (as summarized): The claim that optimizing only the initial noise of a frozen diffusion prior reliably recovers high-success behaviors rests on the untested assumption that the prior manifold (trained on imperfect demos) already contains near-optimal trajectories. No ablation isolating noise optimization, measuring manifold coverage, or comparing against unfrozen policy updates is described, leaving open the possibility that successes arise from implicit policy capacity rather than the proposed mechanism.
minor comments (2)
  1. [Abstract] Abstract: Specify the exact baselines (e.g., names, implementations) and how imperfect demonstrations are generated (e.g., LLM prompt details or human demo quality metrics) to allow readers to contextualize the outperformance.
  2. [Abstract] The phrase 'no reward engineering' is used while claiming sparse rewards; clarify whether any task-specific reward shaping or success detection is still required for the RL stage.
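
To make the first major comment concrete: a success rate reported without a trial count says little about its uncertainty. The sketch below computes a Wilson score interval for a binomial success rate. The function name is hypothetical and the counts in the usage line are made up for illustration; they are not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Illustrative counts only: 90 successes out of 100 trials.
    low, high = wilson_interval(90, 100)
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

With only 100 trials the interval spans more than ten percentage points, so two methods a few points apart could not be separated on that evidence alone.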

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims (90.5% assembly success, 85% long-horizon success, outperformance of all baselines) are presented without any mention of evaluation protocol, number of trials, random seeds, error bars, or statistical significance tests. This information is load-bearing for assessing whether the reported rates substantiate the method's advantage over baselines.

    Authors: We agree that the abstract would benefit from a concise reference to the evaluation protocol. In the revised version we will add a short clause noting that the reported success rates are averaged over 100 trials per task across five random seeds, with standard deviations and statistical comparisons provided in the experimental section. This change will be limited to one sentence to preserve abstract length while addressing the concern.
    revision: yes

  2. Referee: [Method description] Method description (as summarized): The claim that optimizing only the initial noise of a frozen diffusion prior reliably recovers high-success behaviors rests on the untested assumption that the prior manifold (trained on imperfect demos) already contains near-optimal trajectories. No ablation isolating noise optimization, measuring manifold coverage, or comparing against unfrozen policy updates is described, leaving open the possibility that successes arise from implicit policy capacity rather than the proposed mechanism.

    Authors: The current manuscript does not contain an explicit ablation that isolates noise optimization from full policy updates or directly measures manifold coverage. The existing baseline comparisons provide indirect support for the value of the frozen prior, but we acknowledge that a more targeted analysis would strengthen the mechanistic claim. We will therefore add a new ablation subsection that (i) compares frozen noise optimization against unfrozen fine-tuning of the diffusion policy and (ii) reports a simple coverage metric on the imperfect demonstration set. A brief discussion of the coverage assumption will also be inserted in the method section.
    revision: yes
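
The protocol described in the first response (a fixed number of trials per seed, then a mean and standard deviation across seeds) amounts to the aggregation sketched below. The function name and the per-seed counts are hypothetical placeholders, not results from the paper.

```python
import statistics


def summarize(successes_per_seed, trials_per_seed):
    """Mean and sample standard deviation of per-seed success rates, in percent."""
    rates = [100.0 * s / trials_per_seed for s in successes_per_seed]
    mean = statistics.mean(rates)
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    # Placeholder counts for five seeds with 100 trials each.
    mean, std = summarize([90, 92, 88, 91, 89], trials_per_seed=100)
    print(f"{mean:.1f} +/- {std:.1f} %")
```

The standard deviation here captures variation between training seeds only; it does not replace a per-run confidence interval on the trial outcomes.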

Circularity Check

0 steps flagged

No circularity: method relies on standard diffusion pretraining plus RL noise optimization with independent empirical validation

full rationale

The paper presents a two-stage pipeline (diffusion policy pretrained on imperfect demonstrations, followed by RL optimization of initial noise only while keeping the policy frozen) whose success rates are measured empirically on assembly and manipulation benchmarks. No equations, derivations, or self-citations reduce the reported 90.5% or 85% success rates to quantities defined by construction from the fitted parameters or prior data. The central claim rests on the coverage assumption of the behavior manifold, which is an empirical hypothesis rather than a definitional loop. Standard RL and diffusion components are used without self-referential fitting or uniqueness theorems imported from the authors' prior work. This is a normal non-circular empirical method paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard assumptions from diffusion models and RL without introducing new free parameters or invented entities beyond typical hyperparameter choices.

free parameters (1)
  • RL optimization hyperparameters
    Learning rates, noise schedules, and reward scaling for the initial-noise optimization step are expected to be tuned on the target tasks.
axioms (1)
  • domain assumption: A diffusion policy trained on imperfect demonstrations can serve as a safe behavior prior that RL can refine without policy updates.
    Invoked in the initialization and freezing step of the pipeline.

pith-pipeline@v0.9.0 · 5593 in / 1210 out tokens · 44373 ms · 2026-05-15T09:32:40.474611+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks, with relative gains of up to 58%.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  2. [2]

    Openmathinstruct-2: Accelerating ai for math with mas- sive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024

    Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with mas- sive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024

  3. [3]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278– 25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278– 25294, 2022

  4. [4]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  5. [5]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 techni- cal report.arXiv preprint arXiv:2412.19437, 2024

  6. [6]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Bap- tiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  9. [9]

    Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

    Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

  10. [10]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  11. [11]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision- language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  12. [12]

    Learning transferable visual models from natural lan- guage supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on ma- chine learning, pages 8748–8763. PmLR, 2021

  13. [13]

    Flamingo: a visual language model for few-shot learn- ing.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learn- ing.Advances in neural information processing systems, 35:23716–23736, 2022

  14. [14]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  15. [15]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  16. [16]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  17. [17]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reason- ing models that can reason in space.arXiv preprint arXiv:2508.07917, 2025

  18. [18]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  19. [19]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipula- tion with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipula- tion with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

  20. [20]

    Anytask: an automated task and data generation framework for advancing sim-to-real policy learning.arXiv preprint arXiv:2512.17853, 2025

    Ran Gong*, Xiaohan Zhang*, Jinghuan Shang*, Maria Vittoria Minniti*, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, and Karl Schmeckpeper. Anytask: an automated task and data generation framework for advancing sim-to-real policy learning.arXiv preprint arXiv:2512.17853, 2025

  21. [21]

    Self-improving vision-language-action models with data generation via residual rl.arXiv preprint arXiv:2511.00091, 2025

    Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual rl.arXiv preprint arXiv:2511.00091, 2025

  22. [22]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. InConference on robot learning, pages 91–100. PMLR, 2022

  23. [23]

    Viral: Visual sim-to-real at scale for humanoid loco-manipulation.arXiv preprint arXiv:2511.15200, 2025

    Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo, Wenli Xiao, Ye Yuan, Xingye Da, Fernando Casta˜neda, Shankar Sastry, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation.arXiv preprint arXiv:2511.15200, 2025

  24. [24]

    Opening the sim- to-real door for humanoid pixel-to-action policy transfer

    Haoru Xue, Tairan He, Zi Wang, Qingwei Ben, Wenli Xiao, Zhengyi Luo, Xingye Da, Fernando Casta ˜neda, Guanya Shi, Shankar Sastry, et al. Opening the sim- to-real door for humanoid pixel-to-action policy transfer. arXiv preprint arXiv:2512.01061, 2025

  25. [25]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Mu ˜noz, Xinjie Yao, Ren ´e Zurbr ¨ugg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Hei- den, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Ani- mesh Garg, Renato Gasoto, Lionel Gulich, Yijie...

  26. [26]

    Steering your diffusion policy with latent space reinforcement learning

    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Naga- bandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799, 2025

  27. [27]

    Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

    Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

  28. [28]

    A reduction of imitation learning and structured prediction to no-regret online learning

    St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelli- gence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  29. [29]

    Residual Policy Learning

    Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

  30. [30]

    From imitation to refinement- residual rl for precise assembly

    Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement- residual rl for precise assembly. In2025 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 01–08. IEEE, 2025

  31. [31]

    Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

    Siheng Zhao, Yanjie Ze, Yue Wang, C Karen Liu, Pieter Abbeel, Guanya Shi, and Rocky Duan. Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

  32. [32]

    Policy decorator: Model- agnostic online refinement for large policy model.arXiv preprint arXiv:2412.13630, 2024

    Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model- agnostic online refinement for large policy model.arXiv preprint arXiv:2412.13630, 2024

  33. [33]

    Accelerating residual reinforcement learning with uncertainty estimation.arXiv preprint arXiv:2506.17564, 2025

    Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, and Stefanie Tellex. Accelerating residual reinforcement learning with uncertainty estimation.arXiv preprint arXiv:2506.17564, 2025

  34. [34]

    Efficient online reinforcement learning fine-tuning need not retain offline data,

    Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, and Aviral Kumar. Efficient online reinforcement learn- ing fine-tuning need not retain offline data.arXiv preprint arXiv:2412.07762, 2024

  35. [35]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  36. [36]

    Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

  37. [37]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Ire- tiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023

  38. [38]

    Skillmimicgen: Automated demonstration genera- tion for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024

    Caelan Garrett, Ajay Mandlekar, Bowen Wen, and Dieter Fox. Skillmimicgen: Automated demonstration genera- tion for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024

  39. [39]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 16923–16930. IEEE, 2025

  40. [40]

    Demogen: Syn- thetic demonstration generation for data-efficient visuo- motor policy learning.arXiv preprint arXiv:2502.16932, 2025

    Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, Zhecheng Yuan, and Huazhe Xu. Demogen: Syn- thetic demonstration generation for data-efficient visuo- motor policy learning.arXiv preprint arXiv:2502.16932, 2025

  41. [41]

    Reinforcegen: Hybrid skill policies with automated data generation and reinforcement learning

    Zihan Zhou, Animesh Garg, Ajay Mandlekar, and Caelan Garrett. Reinforcegen: Hybrid skill policies with automated data generation and reinforcement learning. arXiv preprint arXiv:2512.16861, 2025

  42. [42]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, and Tse kai Chan et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. Robotics: Science and Systems, 2025

  43. [43]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023

  44. [44]

    Robotwin: Dual-arm robot benchmark with generative digital twins

    Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, et al. Robotwin: Dual-arm robot benchmark with generative digital twins. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

  45. [45]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  46. [46]

    Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy

    Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025

  47. [47]

    Gensim2: Scaling robot data generation with multi-modal and reasoning llms

    Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, and Lirui Wang. Gensim2: Scaling robot data generation with multi-modal and reasoning llms. In 8th Annual Conference on Robot Learning, 2024

  48. [48]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  49. [49]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  50. [50]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  51. [51]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018

  52. [52]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017

  53. [53]

    Code as Policies: Language Model Programs for Embodied Control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022

  54. [54]

    Automate: Specialist and generalist assembly policies over diverse geometries

    Bingjie Tang, Iretiayo Akinola, Jie Xu, Bowen Wen, Ankur Handa, Karl Van Wyk, Dieter Fox, Gaurav S Sukhatme, Fabio Ramos, and Yashraj Narang. Automate: Specialist and generalist assembly policies over diverse geometries. arXiv preprint arXiv:2407.08028, 1(2), 2024

  55. [55]

    SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

    Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura, Guy Tevet, Chuan Guo, Michael Taylor, Chang Shu, Pengcheng Xi, et al. Smp: Reusable score-matching motion priors for physics-based character control. arXiv preprint arXiv:2512.03028, 2025

  56. [56]

    Binomial proportion confidence interval — Wikipedia, the free encyclopedia

    Wikipedia contributors. Binomial proportion confidence interval — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Binomial_proportion_confidence_interval&oldid=1333177192 [Online; accessed 31-January-2026]

  58. [58]

    Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation

    Paolo Tormene, Toni Giorgino, Silvana Quaglini, and Mario Stefanelli. Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artificial intelligence in medicine, 45(1):11–34, 2009

  59. [59]

    Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands

    Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, and Karl Van Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands, 2025

  60. [60]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

APPENDIX A
HYPER-PARAMETERS

A. Diffusion Policy

This paper considers diffusion policy for all imitation learning train...