WorldSample: Closed-loop Real-robot RL with World Modelling

Bofang Jia; Le Xu; Xinyang Song; Yuquan Xue; Zeyi Liu; Zhengyi Gu; Zhenyu Wu; Ziwei Wang

arxiv: 2607.02431 · v1 · pith:FULZGDGUnew · submitted 2026-07-02 · 💻 cs.RO · cs.AI

WorldSample: Closed-loop Real-robot RL with World Modelling

Yuquan Xue , Le Xu , Zeyi Liu , Zhenyu Wu , Zhengyi Gu , Xinyang Song , Bofang Jia , Ziwei Wang This is my paper

Pith reviewed 2026-07-03 10:55 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords reinforcement learningworld modelsrobot manipulationdata augmentationpolicy improvementclosed-loop learningsynthetic transitionsvisual fidelity

0 comments

The pith

WorldSample closes a real-synthetic loop with a post-trained world model and Policy-Paced Learning to raise real-robot RL success rates by 28 percent while cutting training steps by 59 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that real-robot reinforcement learning can move past the limits of imitation learning by using trial-and-error data that goes beyond demonstration states. It does this by grounding a world model on actual robot rollouts, then generating synthetic transitions that Policy-Paced Learning selects and schedules to add useful variety without letting model errors inflate value estimates. If the loop holds, robots learn contact-rich and precise manipulation skills with substantially fewer physical interactions, and the world model itself improves in visual accuracy through the same cycle. The reported experiments measure these gains directly against baselines that lack the closed loop or the pacing step.

Core claim

WorldSample establishes a closed loop in which physical robot rollouts are used to post-train a world model that produces high-fidelity synthetic transitions. These transitions are not added indiscriminately; Policy-Paced Learning instead applies sample selection and scheduling to balance the augmentation benefit against risks of value overestimation and hallucination noise. The result is higher policy success and lower training cost on contact-rich tasks, together with measurable gains in the world model's own visual prediction quality over training that uses only demonstrations.

What carries the argument

Policy-Paced Learning, which regulates the training process through sample selection and scheduling of synthetic transitions generated by the real-grounded world model.

If this is right

Fewer physical robot interactions are required to reach a given policy performance level on manipulation tasks.
The world model receives iterative visual-fidelity gains from the same real rollouts that improve the policy.
Contact-rich and precise tasks become more tractable for real-robot RL because synthetic data fills coverage gaps without extra hardware cost.
Training schedules can be shortened while maintaining or increasing final success rates compared with demonstration-only or unpaced methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding-plus-pacing pattern could be tested on tasks outside manipulation, such as locomotion or navigation, to check whether the efficiency gains transfer.
Multiple iterations of the real-synthetic loop might compound improvements if each cycle further reduces hallucination before the next policy update.
Replacing the current world model architecture with one that already has lower baseline hallucination could amplify the measured gains without changing the pacing logic.

Load-bearing premise

The post-trained world model must produce synthetic transitions with low enough visual hallucination that Policy-Paced Learning can reliably separate useful augmentation from noise and overestimation.

What would settle it

An experiment that measures policy performance when the world model is post-trained on real rollouts yet still generates high visual hallucination, checking whether success rates and training efficiency then fail to exceed or fall below the no-augmentation baseline.

read the original abstract

Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly lowers the visual hallucination. Specifically, rather than simply using these transitions as real-world experience, WorldSample introduces Policy-Paced Learning (PPL) to regulate the training process through sample selection and scheduling, balancing useful augmentation against value overestimation and mitigating the hallucination-induced noise. Experiments on robot manipulation tasks involving contact-rich and precise tasks show that WorldSample improves policy success rate by 28% while reducing training steps by 59% compared with baselines. Furthermore, WorldSample improves world model visual fidelity by 19.4dB in PSNR and 0.47 in SSIM over demonstration-only post-training, validating the effectiveness of the real-synthetic loop for both policy and world model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WorldSample adds a real-data post-trained world model plus Policy-Paced Learning to close the loop in real-robot RL, but the reported gains rest on visual metrics that do not address dynamics error in contact tasks.

read the letter

The main takeaway is a closed real-synthetic loop: real rollouts train a world model, which then generates extra transitions that Policy-Paced Learning selects and schedules for the policy. The authors claim this cuts real steps by 59% and raises success rate by 28% on contact-rich manipulation while also lifting the world model’s PSNR and SSIM.

The approach is straightforward and directly targets the interaction-cost problem. Grounding the world model on actual robot data instead of demonstrations alone is a reasonable step, and PPL looks like a practical filter against hallucinated transitions. If the full paper shows clean ablations and reproducible controls, the framework could be useful to people already running real-robot RL.

The soft spot is exactly the one flagged in the stress test. The paper only reports image quality (19.4 dB PSNR, 0.47 SSIM). It does not measure next-state or contact-force prediction error. In tasks with physical contact, a visually convincing but dynamically wrong rollout can still produce value overestimation that sample selection may not fully remove. Without those dynamics checks, the 28% and 59% numbers are hard to trust.

The abstract also gives no baseline details, no statistical tests, and no protocol for data exclusion or random seeds. That makes the quantitative claims provisional at best.

This is for robotics RL groups already working on world models and sample efficiency. A reader who needs a concrete method to reduce real-robot time might extract something from the PPL scheduling, but only after seeing the full experimental section.

I would send it to review if the methods and dynamics metrics hold up; the topic matters and the loop idea is coherent. Based on the abstract alone, the central assumption about transition accuracy remains untested.

Referee Report

3 major / 2 minor

Summary. The paper proposes WorldSample, a closed-loop real-robot RL framework that grounds a post-trained world model on physical rollouts to generate synthetic transitions and uses Policy-Paced Learning (PPL) for sample selection and scheduling to balance augmentation benefits against hallucination-induced noise and value overestimation. Experiments on contact-rich and precise robot manipulation tasks are reported to yield a 28% higher policy success rate and 59% fewer training steps versus baselines, together with world-model gains of 19.4 dB PSNR and 0.47 SSIM over demonstration-only post-training.

Significance. If the central claims hold after proper validation, the work would offer a practical route to reducing expensive real-robot interactions in RL by closing a real-synthetic loop; the explicit grounding of the world model and the PPL mechanism constitute concrete, testable contributions to sample-efficient robotics RL.

major comments (3)

[Abstract] Abstract (experiments paragraph): the reported 28 % success-rate and 59 % step reductions are stated without any reference to the experimental protocol, number of random seeds, baseline implementations, statistical tests, or data-exclusion criteria, rendering the quantitative claims unverifiable from the supplied text.
[Abstract] Abstract (world-model paragraph): visual fidelity is quantified solely by PSNR (+19.4 dB) and SSIM (+0.47) relative to demonstration-only training; these image-level metrics do not bound next-state or contact-force prediction error, which is load-bearing for the claim that PPL can reliably filter hallucination-induced noise in contact-rich tasks.
[Abstract] Abstract (PPL description): the assertion that PPL “balances useful augmentation against value overestimation” is presented without any equation, algorithm box, or ablation that shows how sample selection and scheduling achieve this balance, leaving the mechanism’s effectiveness ungrounded.

minor comments (2)

[Abstract] The phrase “19.4dB” should be written with a space (“19.4 dB”) for standard scientific notation.
[Abstract] The abstract refers to “baselines” without naming them; a parenthetical list of the compared methods would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve clarity and verifiability of the claims.

read point-by-point responses

Referee: [Abstract] Abstract (experiments paragraph): the reported 28 % success-rate and 59 % step reductions are stated without any reference to the experimental protocol, number of random seeds, baseline implementations, statistical tests, or data-exclusion criteria, rendering the quantitative claims unverifiable from the supplied text.

Authors: We agree that the abstract would benefit from additional context. In the revised version we will append a concise qualifier such as 'evaluated over 5 random seeds on contact-rich tasks (full protocol in Section 4)' while preserving length constraints. The body already reports the seed count, baseline code, and statistical details; the abstract revision will make these claims traceable without expanding into a methods summary. revision: yes
Referee: [Abstract] Abstract (world-model paragraph): visual fidelity is quantified solely by PSNR (+19.4 dB) and SSIM (+0.47) relative to demonstration-only training; these image-level metrics do not bound next-state or contact-force prediction error, which is load-bearing for the claim that PPL can reliably filter hallucination-induced noise in contact-rich tasks.

Authors: The observation is correct: PSNR/SSIM are image-level proxies and do not directly bound dynamics or force errors. The manuscript relies on downstream policy success rates as the primary evidence that the grounded world model reduces harmful hallucinations in contact-rich settings. We will revise the abstract to state that the reported visual gains are 'complemented by task-level policy improvements' and will add a short clarifying sentence in the world-model evaluation section acknowledging the metric limitation while pointing to the RL results as the relevant validation. revision: partial
Referee: [Abstract] Abstract (PPL description): the assertion that PPL “balances useful augmentation against value overestimation” is presented without any equation, algorithm box, or ablation that shows how sample selection and scheduling achieve this balance, leaving the mechanism’s effectiveness ungrounded.

Authors: The PPL formulation, including the selection criterion and pacing schedule, is defined with equations in Section 3.2 and presented as Algorithm 1; its effect on value overestimation is quantified via ablations in Section 5.3. We will update the abstract phrasing to 'via the Policy-Paced Learning mechanism (Sec. 3) that regulates sample selection and scheduling' so the claim is explicitly anchored to the detailed exposition rather than standing alone. revision: yes

Circularity Check

0 steps flagged

No circularity; experimental claims only

full rationale

The paper introduces WorldSample and Policy-Paced Learning as a framework, then reports empirical gains (28% success rate, 59% fewer steps, 19.4 dB PSNR, 0.47 SSIM) from real-robot experiments. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on measured outcomes rather than any reduction of a result to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method name WorldSample and the PPL component are procedural descriptions rather than new postulated entities.

pith-pipeline@v0.9.1-grok · 5794 in / 1094 out tokens · 48282 ms · 2026-07-03T10:55:09.894758+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 28 canonical work pages · 16 internal anchors

[1]

Efficientonlinereinforcementlearning with offline data

PhilipJBall,LauraSmith,IlyaKostrikov,andSergeyLevine. Efficientonlinereinforcementlearning with offline data. InICML, pages 1577–1594. PMLR, 2023

2023
[2]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.𝜋0: A visio...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Kr¨ahenb¨uhl

KevinChen,MarcoCusumano-Towner,BrodyHuval,AlekseiPetrenko,JacksonHamburger,Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025

work page arXiv 2025
[4]

Conrft: A reinforcedfine-tuningmethodforvlamodelsviaconsistencypolicy.arXiv preprint arXiv:2502.05450, 2025

Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforcedfine-tuningmethodforvlamodelsviaconsistencypolicy.arXiv preprint arXiv:2502.05450, 2025

work page arXiv 2025
[5]

Wmpo: World model-based policy optimization for vision-language-action models.ArXiv preprint arXiv:2511.09515, 2025

Zhu Fangqi, Yan Zhengyang, Hong Zicong, Shou Quanxin, Ma Xiao, and Guo Song. Wmpo: World model-based policy optimization for vision-language-action models.ArXiv preprint arXiv:2511.09515, 2025

work page arXiv 2025
[6]

Self-improving embodied foundation models.arXiv preprint arXiv:2509.15155, 2025

Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, and Igor Mordatch. Self-improving embodied foundation models.arXiv preprint arXiv:2509.15155, 2025

work page arXiv 2025
[7]

Vlaw: Iterative co-improvement of vision-language-action policy and world model.CoRR, abs/2602.12063, 2026

Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

work page arXiv 2026
[8]

Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

work page arXiv 2025
[9]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.𝜋0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. Wovr: World models as reliable simulators for post-training vla policies with rl.arXiv preprint arXiv:2602.13977, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Sime: Enhancing policy self-improvement with modal-level exploration.arXiv preprint arXiv:2505.01396, 2025

Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, and Cewu Lu. Sime: Enhancing policy self-improvement with modal-level exploration.arXiv preprint arXiv:2505.01396, 2025

work page arXiv 2025
[13]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Vla-rft: Vision-language-actionreinforcement fine-tuning with verified rewards in world simulators.arXiv preprint arXiv:2510.00406, 2025

Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, MingyangSun, HongyinZhang,DonglinWang,etal. Vla-rft: Vision-language-actionreinforcement fine-tuning with verified rewards in world simulators.arXiv preprint arXiv:2510.00406, 2025

work page arXiv 2025
[16]

Towards generalist robot policies: What matters in building vision-language-action models.arXiv e-prints, pages arXiv–2412, 2024

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models.arXiv e-prints, pages arXiv–2412, 2024

2024
[17]

Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization, 2026

Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, and Jun Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization, 2026

2026
[18]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-vla-loop: Closed-loop learning of video world model and vla policy.arXiv preprint arXiv:2602.06508, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

2025
[22]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, BenjaminBurchfiel,HongkaiDai,andMaxSimchowitz. Diffusionpolicypolicyoptimization.arXiv preprint arXiv:2409.00588, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

What matters in learning from large-scale datasets for robot manipulation

Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Chul Shin, Soroush Nasiriany, Ajay Mandlekar, and Danfei Xu. What matters in learning from large-scale datasets for robot manipulation. InICLR, 2025

2025
[25]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

AndrewWagenmaker, MitsuhikoNakamoto, YunchuZhang, SeohongPark, WaleedYagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, and Ziwei Wang. Resample: Arobustdataaugmentationframeworkviaexploratorysamplingforroboticmanipulation. arXiv preprint arXiv:2510.17640, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning.arXiv preprint arXiv:2502.16932, 2025

Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, Zhecheng Yuan, and Huazhe Xu. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning.arXiv preprint arXiv:2502.16932, 2025. 11

work page arXiv 2025
[30]

RISE: Self-Improving Robot Policy with Compositional World Model

Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, et al. Rise: Self-improving robot policy with compositional world model.arXiv preprint arXiv:2602.11075, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Rlinf-vla: A unified and efficient framework for vla+ rl training

Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training. arXiv preprint arXiv:2510.06710, 2025

work page arXiv 2025
[32]

Reinbot: Amplifyingrobotvisual-languagemanipulationwithreinforcementlearning.arXiv preprint arXiv:2505.07395, 2025

Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifyingrobotvisual-languagemanipulationwithreinforcementlearning.arXiv preprint arXiv:2505.07395, 2025

work page arXiv 2025
[33]

Reinforcing action policies by prophesying.arXiv preprint arXiv:2511.20633, 2025

Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. Reinforcing action policies by prophesying.arXiv preprint arXiv:2511.20633, 2025. 12 A Experiment Details A.1 Real-World Experimental Setup Figure 6. Real-world task settings.We evaluate WorldSample on diverse manipulation tasks covering contact-rich insertion, object pushing, sorting, pick & place,...

work page arXiv 2025

[1] [1]

Efficientonlinereinforcementlearning with offline data

PhilipJBall,LauraSmith,IlyaKostrikov,andSergeyLevine. Efficientonlinereinforcementlearning with offline data. InICML, pages 1577–1594. PMLR, 2023

2023

[2] [2]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.𝜋0: A visio...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Kr¨ahenb¨uhl

KevinChen,MarcoCusumano-Towner,BrodyHuval,AlekseiPetrenko,JacksonHamburger,Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025

work page arXiv 2025

[4] [4]

Conrft: A reinforcedfine-tuningmethodforvlamodelsviaconsistencypolicy.arXiv preprint arXiv:2502.05450, 2025

Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforcedfine-tuningmethodforvlamodelsviaconsistencypolicy.arXiv preprint arXiv:2502.05450, 2025

work page arXiv 2025

[5] [5]

Wmpo: World model-based policy optimization for vision-language-action models.ArXiv preprint arXiv:2511.09515, 2025

Zhu Fangqi, Yan Zhengyang, Hong Zicong, Shou Quanxin, Ma Xiao, and Guo Song. Wmpo: World model-based policy optimization for vision-language-action models.ArXiv preprint arXiv:2511.09515, 2025

work page arXiv 2025

[6] [6]

Self-improving embodied foundation models.arXiv preprint arXiv:2509.15155, 2025

Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, and Igor Mordatch. Self-improving embodied foundation models.arXiv preprint arXiv:2509.15155, 2025

work page arXiv 2025

[7] [7]

Vlaw: Iterative co-improvement of vision-language-action policy and world model.CoRR, abs/2602.12063, 2026

Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

work page arXiv 2026

[8] [8]

Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

work page arXiv 2025

[9] [9]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.𝜋0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. Wovr: World models as reliable simulators for post-training vla policies with rl.arXiv preprint arXiv:2602.13977, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Sime: Enhancing policy self-improvement with modal-level exploration.arXiv preprint arXiv:2505.01396, 2025

Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, and Cewu Lu. Sime: Enhancing policy self-improvement with modal-level exploration.arXiv preprint arXiv:2505.01396, 2025

work page arXiv 2025

[13] [13]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Vla-rft: Vision-language-actionreinforcement fine-tuning with verified rewards in world simulators.arXiv preprint arXiv:2510.00406, 2025

Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, MingyangSun, HongyinZhang,DonglinWang,etal. Vla-rft: Vision-language-actionreinforcement fine-tuning with verified rewards in world simulators.arXiv preprint arXiv:2510.00406, 2025

work page arXiv 2025

[16] [16]

Towards generalist robot policies: What matters in building vision-language-action models.arXiv e-prints, pages arXiv–2412, 2024

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models.arXiv e-prints, pages arXiv–2412, 2024

2024

[17] [17]

Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization, 2026

Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, and Jun Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization, 2026

2026

[18] [18]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-vla-loop: Closed-loop learning of video world model and vla policy.arXiv preprint arXiv:2602.06508, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

2025

[22] [22]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, BenjaminBurchfiel,HongkaiDai,andMaxSimchowitz. Diffusionpolicypolicyoptimization.arXiv preprint arXiv:2409.00588, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

What matters in learning from large-scale datasets for robot manipulation

Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Chul Shin, Soroush Nasiriany, Ajay Mandlekar, and Danfei Xu. What matters in learning from large-scale datasets for robot manipulation. InICLR, 2025

2025

[25] [25]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

AndrewWagenmaker, MitsuhikoNakamoto, YunchuZhang, SeohongPark, WaleedYagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, and Ziwei Wang. Resample: Arobustdataaugmentationframeworkviaexploratorysamplingforroboticmanipulation. arXiv preprint arXiv:2510.17640, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning.arXiv preprint arXiv:2502.16932, 2025

Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, Zhecheng Yuan, and Huazhe Xu. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning.arXiv preprint arXiv:2502.16932, 2025. 11

work page arXiv 2025

[30] [30]

RISE: Self-Improving Robot Policy with Compositional World Model

Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, et al. Rise: Self-improving robot policy with compositional world model.arXiv preprint arXiv:2602.11075, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[31] [31]

Rlinf-vla: A unified and efficient framework for vla+ rl training

Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training. arXiv preprint arXiv:2510.06710, 2025

work page arXiv 2025

[32] [32]

Reinbot: Amplifyingrobotvisual-languagemanipulationwithreinforcementlearning.arXiv preprint arXiv:2505.07395, 2025

Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifyingrobotvisual-languagemanipulationwithreinforcementlearning.arXiv preprint arXiv:2505.07395, 2025

work page arXiv 2025

[33] [33]

Reinforcing action policies by prophesying.arXiv preprint arXiv:2511.20633, 2025

Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. Reinforcing action policies by prophesying.arXiv preprint arXiv:2511.20633, 2025. 12 A Experiment Details A.1 Real-World Experimental Setup Figure 6. Real-world task settings.We evaluate WorldSample on diverse manipulation tasks covering contact-rich insertion, object pushing, sorting, pick & place,...

work page arXiv 2025