
arxiv: 2602.09023 · v4 · pith:DCPO2QGRnew · submitted 2026-02-09 · 💻 cs.RO

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Pith reviewed 2026-05-21 13:08 UTC · model grok-4.3

classification 💻 cs.RO
keywords digital twin · reinforcement learning · robotic manipulation · vision-language-action · exploration space · real-world training · supervised fine-tuning

The pith

A smartphone-reconstructed digital twin expands the exploration space for online reinforcement learning on vision-language-action models, enabling near-100 percent success in robotic manipulation tasks with only 20 minutes of real-world on-robot interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the trajectory distribution from supervised fine-tuning limits how effectively online reinforcement learning can explore in real-world robotic settings. TwinRL counters this by first building a high-fidelity digital twin from ordinary smartphone images of the scene. It then runs a dedicated twin RL warm-up phase that generates interactive trajectories in parallel to populate the replay buffer and flag failure-prone states before any substantial real-world training begins. The real-world RL stage that follows benefits from this guided buffer and targeted human interventions, producing faster convergence and high success on both familiar and novel task variations. A sympathetic reader cares because the approach addresses the core barrier of costly and slow real-world interaction for bringing vision-language-action models into practical robot use.

Core claim

TwinRL reconstructs a high-fidelity digital twin from smartphone-captured scenes and applies it across three stages: supervised fine-tuning with an explicit exploration-space expansion step, a twin RL warm-up that treats the twin as an active exploration guide rather than simple data augmentation, and final real-world RL. The twin RL phase runs efficient parallel learning to fill the replay buffer, stabilize training, and identify informative yet failure-prone configurations for targeted human-in-the-loop rollouts. Across four tasks, this yields near-100 percent success in both in-distribution and out-of-distribution regions, together with more than 30 percent faster convergence than prior real-world RL methods.

What carries the argument

The smartphone-reconstructed digital twin acting as an exploration guide: it first broadens the support of the supervised fine-tuning trajectory distribution and then executes parallel reinforcement learning to generate trajectories that populate the replay buffer and stabilize the subsequent real-world learning phase.
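
To make the buffer-population idea concrete, here is a minimal Python sketch of a replay buffer that is seeded with twin transitions before real-world transitions are appended. It is an illustration under stated assumptions, not the paper's implementation: the `Transition` fields, the `source` tag, the FIFO eviction policy, the buffer sizes, and the `fake_rollout` stand-in for a policy rollout are all hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    obs: tuple
    action: tuple
    reward: float
    next_obs: tuple
    done: bool
    source: str  # "twin" or "real"; a hypothetical tag, not from the paper


class ReplayBuffer:
    """Fixed-capacity FIFO buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity: int):
        self._data = deque(maxlen=capacity)

    def add(self, t: Transition) -> None:
        self._data.append(t)

    def extend(self, transitions) -> None:
        for t in transitions:
            self.add(t)

    def sample(self, batch_size: int, rng: random.Random):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def count(self, source: str) -> int:
        return sum(1 for t in self._data if t.source == source)

    def __len__(self) -> int:
        return len(self._data)


def fake_rollout(source: str, n: int, rng: random.Random):
    """Placeholder for a policy rollout; emits random transitions."""
    return [
        Transition((rng.random(),), (rng.random(),), float(rng.random() > 0.5),
                   (rng.random(),), False, source)
        for _ in range(n)
    ]


rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
buffer.extend(fake_rollout("twin", 600, rng))  # twin RL warm-up seeds the buffer
buffer.extend(fake_rollout("real", 100, rng))  # real-world RL appends on top
batch = buffer.sample(32, rng)                 # early batches are mostly twin data
```

With FIFO eviction, twin transitions are gradually displaced as real-world data accumulates. Whether the paper keeps, reweights, or discards twin data over time is not stated in this summary.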

If this is right

  • The approach delivers near-100 percent task success on both seen and unseen configurations.
  • Convergence occurs more than 30 percent faster than earlier real-world reinforcement learning baselines.
  • Effective training requires only about 20 minutes of physical robot interaction time.
  • Failure-prone states identified inside the twin allow focused human rollouts that further raise sample efficiency.
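
The last point above, selecting failure-prone states for human rollouts, can be sketched as a simple ranking over twin rollout outcomes. This is a hedged illustration only: the configuration identifiers, the failure-rate band `[low, high]`, and the choice to skip configurations that always fail are assumptions made here, not details reported by the paper.

```python
from collections import defaultdict


def failure_rates(rollouts):
    """rollouts: iterable of (config_id, success) pairs collected in the twin."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for config_id, success in rollouts:
        totals[config_id] += 1
        if not success:
            failures[config_id] += 1
    return {c: failures[c] / totals[c] for c in totals}


def pick_for_human_rollouts(rollouts, low=0.3, high=0.9, k=2):
    """Return up to k configs with failure rate in [low, high], hardest first.

    The band is an assumption: configs that rarely fail need no help, and
    configs that always fail in the twin may be uninformative.
    """
    rates = failure_rates(rollouts)
    candidates = [(rate, c) for c, rate in rates.items() if low <= rate <= high]
    candidates.sort(key=lambda rc: (-rc[0], rc[1]))
    return [c for _, c in candidates[:k]]


# Made-up outcomes for four hypothetical object placements.
twin_rollouts = (
    [("A", True)] * 4
    + [("B", True)] + [("B", False)] * 3
    + [("C", True)] * 2 + [("C", False)] * 2
    + [("D", False)] * 4
)
selected = pick_for_human_rollouts(twin_rollouts)
```

In this toy data, placement A never fails and D always fails, so only B and C fall inside the band and are passed on for human-in-the-loop rollouts.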

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Updating the digital twin on the fly from new phone images could support ongoing adaptation when the physical scene changes.
  • The same twin-guided buffer population step might transfer to other real-world control domains such as mobile navigation or assembly.
  • Lowering the amount of real-robot time needed could make reinforcement learning on vision-language-action models practical in domestic or small-factory settings where safety and hardware wear matter.

Load-bearing premise

The digital twin built from smartphone images matches the real physical environment closely enough that trajectories generated inside it remain useful guides rather than sources of misleading information for real-world reinforcement learning.

What would settle it

A head-to-head experiment in which real-world reinforcement learning trained directly from the same supervised fine-tuning checkpoint, without any twin RL warm-up or twin-generated trajectories, reaches comparable success rates and convergence speed.
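
One way such a head-to-head comparison could be scored is by the number of training steps each run needs to reach a fixed success threshold. The sketch below assumes that definition of "faster convergence"; the paper's exact metric is not given in this summary, and the curves are made-up placeholders rather than reported results.

```python
def steps_to_threshold(curve, threshold):
    """curve: list of (step, success_rate); first step at or above threshold, else None."""
    for step, rate in curve:
        if rate >= threshold:
            return step
    return None


def relative_speedup(baseline_curve, method_curve, threshold=0.9):
    """Fraction of baseline steps saved by the method; None if either never converges."""
    b = steps_to_threshold(baseline_curve, threshold)
    m = steps_to_threshold(method_curve, threshold)
    if b is None or m is None:
        return None
    return (b - m) / b


# Placeholder learning curves, not data from the paper.
baseline_curve = [(1000, 0.2), (2000, 0.4), (3000, 0.6), (4000, 0.8), (5000, 0.9)]
method_curve = [(1000, 0.5), (2000, 0.8), (3000, 0.95), (4000, 1.0)]
speedup = relative_speedup(baseline_curve, method_curve)
```

If the no-twin baseline produced a speedup near zero under this measure, the twin warm-up would not be carrying the claimed gain.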

Figures

Figures reproduced from arXiv: 2602.09023 by Chenyang Gu, Gao Huang, Jiaming Liu, Nuowei Han, Peng Jia, Qinwen Xu, Rui Zhou, Shanghang Zhang, Shaojun Shi, Shuo Gu, Sirui Han, Wenzhao Zheng, Yang Yue, Zhuoyang Liu.

Figure 1: Overview. (a) We propose TwinRL, a digital twin–real-world collaborative RL framework that expands the exploration space from in-distribution teleoperation data to out-of-distribution regions. TwinRL then performs efficient, parallel online RL in the digital twin to enable sim-to-real guided exploration, improving the convergence speed of real-world RL. (b) Across four tasks, TwinRL converges faster in onli…
Figure 2: Exploration Bottlenecks. (a) We split the workspace into an in-distribution region (A) and an OOD region (B). Each region is defined by the manipulated object’s center location at task completion. (b) Heatmaps visualize the performance of different policies. (c) Learning curves show the online RL training dynamics of the A-only policy in both regions.
Figure 3: TwinRL. Stage I: Starting from human teleoperation, we introduce an exploration-space expansion strategy that synthesizes diverse digital-twin demonstrations to broaden SFT coverage. Stage II: The SFT-initialized policy is then trained with scalable, parallel online RL in the digital twin to harvest RL-style rollouts, which are transferred to initialize the real-world replay buffer and stabilize online lea…
Figure 4: Real-world experimental setup. We consider four tasks, namely Pick-and-Place, Insert-Hexagon-Block, Insert-Triple-Column-Block, and Erase-Whiteboard, covering multi-step, precise, and contact-rich manipulation. The red and blue areas denote the in-distribution (ID) and out-of-distribution (OOD) evaluation regions, respectively.
Figure 5: Real-world Experiments. We report success-rate curves for online RL across four manipulation tasks under both ID and OOD settings. The y-axis shows success rate, and the x-axis reports both online training time and our model training steps.
TABLE I: Ablation on exploration space expansion. We vary the number of twin-generated trajectories added during SFT warm-up and measure the resulting success rate (SR)…
Figure 6: Ablation on Sim-to-Real-Guided HiL. The guidance markedly accelerates RL learning, reaching 100% success at around 4k steps (∼14 min), while training without guidance improves more slowly and attains a lower success rate.
Figure 8: Motivation. (a) We split the workspace into an in-distribution region (A) and an out-of-distribution region (B), defined by the manipulated object’s center location at task completion. (b) Heatmaps visualize the success rate of different policies. (c) Learning curves show the online RL training dynamics of the A-only policy in both regions.
Figure 9: Real-World Robot Setups and Experimental Assets.
Figure 10: Qualitative comparison between real-world scenarios and their corresponding digital-twin renderings.
Figure 12: Additional ablations. We evaluate how augmenting SFT with digital-twin data affects performance, comparing (a) adding twin data only in the OOD region and (b) adding twin data in both the ID and OOD regions. Each grid cell is evaluated with five rollout trials.
Figure 10: D. ADDITIONAL RESULTS. Episode Length: We provide additional analysis using episode length to characterize the efficiency of real-world online reinforcement learning. All experiments are conducted on the Insert-Triple-Column-Block task under both in-distribution (ID) and out-of-distribution (OOD) settings, following the same real-world training protocol and baselines as in Section IV-B of the main paper.…
Figure 14: Additional Robustness Evaluation. We additionally compare the SFT policy and the TwinRL-guided online RL policy on the Pick-and-Place task under previously unseen environmental perturbations.
Figure 15: Visualization of complete real-world task execution processes (left to right).
Figure 16: Visualization of failure cases on different tasks; the red box highlights the failure positions.
Original abstract

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes TwinRL, a three-stage digital twin-real-world collaborative post-training framework for Vision-Language-Action (VLA) models in robotic manipulation. After SFT warm-up with an exploration-space-expansion strategy, a smartphone-reconstructed digital twin performs parallel RL to generate trajectories that populate the replay buffer, identify failure-prone states, and guide subsequent real-world RL. Across four tasks the method is reported to reach near-100% success in both in- and out-of-distribution regimes while achieving >30% faster convergence than prior real-world RL baselines with only 20 minutes of on-robot interaction.

Significance. If the digital-twin trajectories transfer reliably, the framework offers a practical route to reduce the real-world sample complexity of online RL for contact-rich manipulation. The explicit separation of SFT-induced exploration limits from the twin-guided warm-up stage is a clear conceptual contribution. The significance hinges on whether the smartphone reconstruction supplies sufficiently accurate inertial, frictional, and compliance parameters; absent that validation the efficiency and generalization claims rest on an unquantified sim-to-real assumption.

major comments (3)
  1. [Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.
  2. [Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.
  3. [SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.
minor comments (2)
  1. [Figures / Methods] Figure captions and method diagrams should explicitly label the replay-buffer population step that occurs between twin RL and real-world RL to avoid ambiguity in the data-flow description.
  2. [Digital Twin Reconstruction] The manuscript would benefit from a short table summarizing the smartphone-capture protocol (number of views, lighting conditions, object textures) so that readers can gauge reconstruction repeatability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions planned for the resubmitted manuscript.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.

    Authors: We agree that the presentation of headline results would be strengthened by explicit statistical details. The original experiments consisted of five independent trials per task and method; we will revise the abstract and results section to report means with standard-error bars, state the trial count explicitly, and include paired statistical comparisons (t-tests) against baselines. An expanded ablation table will also be added to the main experiments section. revision: yes

  2. Referee: [Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.

    Authors: The reconstruction pipeline combines visual geometry from smartphone video with a small number of real-world probing interactions to estimate friction and inertial parameters. While direct force-torque comparisons were outside the scope of the initial hardware setup, the observed transfer performance (near-100% real-world success after twin warm-up and strong out-of-distribution generalization) supplies indirect evidence that the dynamics are sufficiently accurate for the tasks studied. We will add a new subsection that reports trajectory-distribution similarity metrics and the parameter-estimation procedure; a full contact-force validation would require additional instrumentation and is noted as future work. revision: partial

  3. Referee: [SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.

    Authors: We concur that an explicit decomposition clarifies the source of the reported gains. The revised manuscript will include a four-way ablation (SFT baseline, SFT+expansion only, SFT+twin warm-up without expansion, and full TwinRL). The new results attribute the majority of the convergence speedup to the twin RL warm-up while showing that the expansion strategy further improves exploration coverage; both components are required to reach the final performance level. revision: yes
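
The statistical reporting promised in the first response (means with standard errors over five trials, plus paired t-tests against baselines) could look roughly like the sketch below. It uses only the Python standard library; the per-trial success rates are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_and_sem(xs):
    """Sample mean and standard error of the mean (uses the n-1 sample stdev)."""
    return mean(xs), stdev(xs) / math.sqrt(len(xs))


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched trials a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    m, sem = mean_and_sem(diffs)
    if sem == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return m / sem, len(diffs) - 1


# Placeholder success rates for five trials per method (illustrative only).
twin_sr = [0.9, 1.0, 1.0, 0.9, 1.0]
base_sr = [0.7, 0.8, 0.9, 0.7, 0.8]

twin_mean, twin_sem = mean_and_sem(twin_sr)
t_stat, dof = paired_t(twin_sr, base_sr)
```

The t statistic would then be compared against a t distribution with `dof` degrees of freedom. With only five trials per condition, the resulting intervals are wide, which is part of why the referee asks for trial counts to be stated.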

Circularity Check

0 steps flagged

TwinRL framework relies on empirical validation without circular derivations

full rationale

The paper's core contribution is an empirically motivated three-stage framework (SFT warm-up, twin RL warm-up, real-world RL) whose performance claims are supported by direct experimental results across four tasks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the key observation about SFT-constrained exploration is presented as an experimental finding used to motivate the design, and the reported success rates and convergence improvements are measured outcomes, not quantities forced by construction from the same inputs. The digital-twin reconstruction and parallel RL steps are described as practical engineering choices whose validity is tested externally via on-robot interaction data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the unverified premise that a smartphone-derived digital twin is sufficiently accurate to expand exploration and stabilize real-world RL without introducing misleading trajectories.

axioms (1)
  • domain assumption: A smartphone-captured scene can be reconstructed into a high-fidelity digital twin that faithfully supports RL exploration and failure identification.
    Invoked when the abstract states 'TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes' and then uses it to guide real-world RL.

pith-pipeline@v0.9.0 · 5870 in / 1364 out tokens · 50171 ms · 2026-05-21T13:08:34.863014+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  2. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  3. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 2 Pith papers · 20 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    Sensor fusion iv: control paradigms and data structures. International Society for Optics and Photonics, 1611:586–607, 1992

    Paul J Besl and Neil D McKay. Sensor fusion iv: control paradigms and data structures. International Society for Optics and Photonics, 1611:586–607, 1992

  3. [3]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818, 2023

  6. [6]

    Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning

    Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025

  7. [7]

    SAM 3D: 3Dfy Anything in Images

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025

  8. [8]

    Conrft: A reinforced fine-tuning method for vla models via consistency policy

    Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025

  9. [9]

    Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization

    Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, and Zipei Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440, 2025

  10. [10]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  11. [11]

    Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023

    Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023

  12. [12]

    Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

  13. [13]

    Manualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation

    Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Zhuoyang Liu, Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, and Shanghang Zhang. Manualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation. URL https://arxiv.org/abs/2512.02013

  15. [15]

    Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

    Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

  16. [16]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

  17. [17]

    Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219, 2025

    Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219, 2025

  18. [18]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, et al. π0.5: a vision-language-action model with open-world generalization. URL https://arxiv.org/abs/2504.16054

  20. [20]

    Video2act: A dual-system video diffusion policy with robotic spatio-motional modeling. arXiv preprint arXiv:2512.03044, 2025

    Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu, Yuyang Yan, Xiaowei Chi, Yandong Guo, Boxin Shi, and Shanghang Zhang. Video2act: A dual-system video diffusion policy with robotic spatio-motional modeling. arXiv preprint arXiv:2512.03044, 2025

  21. [21]

    Gsworld: Closed-loop photo-realistic simulation suite for robotic manipulation. arXiv preprint arXiv:2510.20813, 2025

    Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, and Xiaolong Wang. Gsworld: Closed-loop photo-realistic simulation suite for robotic manipulation. arXiv preprint arXiv:2510.20813, 2025

  22. [22]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025

  23. [23]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning, 2024

  24. [24]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  25. [25]

    KIRI Engine: 3D Scanner App for iPhone, Android, and Web

    KIRI Innovations. KIRI Engine: 3D Scanner App for iPhone, Android, and Web. https://www.kiriengine.app/

  26. [27]

    Rl-100: Performant robotic manipulation with real-world reinforcement learning, 2025

    Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning, 2025. URL https://arxiv.org/abs/2510.14830

  27. [28]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025

  28. [29]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  29. [30]

    Robogsim: A real2sim2real robotic gaussian splatting simulator, 2025

    Xinhai Li, Jialin Li, Ziheng Zhang, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Kuo-Kun Tseng, and Ruiping Wang. Robogsim: A real2sim2real robotic gaussian splatting simulator, 2025. URL https://arxiv.org/abs/2411.11839

  30. [31]

    Onetwovla: A unified vision-language-action model with adaptive reasoning

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

  31. [32]

    Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024

  32. [33]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  33. [34]

    What can RL bring to VLA generalization? An empirical study

    Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can rl bring to vla generalization? an empirical study. arXiv preprint arXiv:2505.19789, 2025

  34. [35]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  35. [36]

    Last 0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model

    Zhuoyang Liu, Jiaming Liu, Hao Chen, Ziyu Guo, Chengkai Hou, Chenyang Gu, Jiale Yu, Xiangju Mi, Renrui Zhang, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last 0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model, 2026. URL https://arxiv.org/abs/2601.05248

  36. [37]

    Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation

    Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen Feng, Lu Shi, Liyi Luo, and Yongliang Shi. Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation, 2024. URL https://arxiv.org/abs/2408.14873

  37. [38]

    VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025

  38. [40]

    Rlif: Interactive imitation learning as reinforcement learning

    Jianlan Luo, Perry Dong, Yuexiang Zhai, Yi Ma, and Sergey Levine. Rlif: Interactive imitation learning as reinforcement learning, 2024. URL https://arxiv.org/abs/2311.12996

  39. [41]

    Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning

    Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 10(105):eads5033, 2025

  40. [42]

    Continuously improving mobile manipulation with autonomous real-world rl

    Russell Mendonca, Emmanuel Panov, Bernadette Bucher, Jiuguang Wang, and Deepak Pathak. Continuously improving mobile manipulation with autonomous real-world rl. arXiv preprint arXiv:2409.20568, 2024

  41. [43]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

  42. [44]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  43. [45]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  44. [46]

    franky: High-level control library for franka robots

    Tim Schneider. franky: High-level control library for franka robots. https://github.com/TimSchneider42/franky, 2023. Software library. Python and C++ high-level control library for Franka robots, derived from the frankx project by Pantor

  45. [47]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  46. [48]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  47. [49]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022

  48. [50]

    Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency

    Yifei Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, et al. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. arXiv preprint arXiv:2506.08822, 2025

  49. [51]

    The Open Motion Planning Library

    Ioan A. Șucan et al. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA.2012.2205651. https://ompl.kavrakilab.org

  50. [52]

    Interactive Post-Training for Vision-Language-Action Models

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025

  51. [53]

    Robot learning with super-linear scaling

    Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, and Abhishek Gupta. Robot learning with super-linear scaling. arXiv preprint arXiv:2412.01770, 2024

  52. [54]

    Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation

    Marcel Torne, Anthony Simeonov, Zechu Li, April Chan, Tao Chen, Abhishek Gupta, and Pulkit Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949, 2024

  53. [55]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  54. [56]

    Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression

    Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024

  55. [57]

    Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

  56. [58]

    Simlauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training

    Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, et al. Simlauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7933–7940. IEEE, 2025

  57. [59]

    R2rgen: Real-to-real 3d data generation for spatially generalized manipulation

    Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, and Jiwen Lu. R2rgen: Real-to-real 3d data generation for spatially generalized manipulation, 2025. URL https://arxiv.org/abs/2510.08547

  59. [61]

    Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning

    Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, et al. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning. arXiv preprint arXiv:2502.16932, 2025

  60. [62]

    Novel demonstration generation with gaussian splatting enables robust one-shot manipulation

    Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, and Jiangmiao Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation, 2025. URL https://arxiv.org/abs/2504.13175

  61. [63]

    Video2policy: Scaling up manipulation tasks in simulation through internet videos

    Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. arXiv preprint arXiv:2502.09886, 2025

  62. [64]

    Real2render2real: Scaling robot data without dynamics simulation or robot hardware

    Justin Yu, Letian Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, and Ken Goldberg. Real2render2real: Scaling robot data without dynamics simulation or robot hardware. arXiv preprint arXiv:2505.09601, 2025

  63. [65]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  64. [66]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024

  65. [67]

    Reinbot: Amplifying robot visual-language manipulation with reinforcement learning

    Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, et al. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395, 2025

  66. [68]

    Grape: Generalizing robot policy via preference alignment

    Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024

    [Figure: panels (a) Experiment Setting, (b) Policy Performance, (c) HIL Online RL Efficiency; legend: A-Only Policy, A+B Policy; regions: Region A, Region B.]