TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
Pith reviewed 2026-05-21 13:08 UTC · model grok-4.3
The pith
A smartphone-reconstructed digital twin expands the exploration space for online reinforcement learning on vision-language-action models, enabling near-100 percent success in robotic manipulation tasks with only 20 minutes of real-world on-robot interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TwinRL reconstructs a high-fidelity digital twin from smartphone-captured scenes and applies it across three stages: supervised fine-tuning with an explicit exploration-space expansion step, a twin RL warm-up that treats the twin as an active exploration guide rather than simple data augmentation, and final real-world RL. The twin RL phase runs efficient parallel learning to fill the replay buffer, stabilize training, and identify informative yet failure-prone configurations for targeted human-in-the-loop rollouts. Across four tasks this yields near-100 percent success in both in-distribution and out-of-distribution regions together with more than 30 percent faster convergence than prior real-world RL methods.
What carries the argument
The smartphone-reconstructed digital twin acting as an exploration guide: it first broadens the support of the supervised fine-tuning trajectory distribution and then executes parallel reinforcement learning to generate trajectories that populate the replay buffer and stabilize the subsequent real-world learning phase.
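The buffer-population step can be pictured with a minimal Python sketch. It assumes a single FIFO buffer shared by twin and real transitions; the paper's actual buffer design and sampling scheme are not described here, so `MixedReplayBuffer`, `fake_transition`, the capacity, and the transition counts are all hypothetical.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# A transition is (observation, action, reward, next_observation, done, source).
Transition = Tuple[list, list, float, list, bool, str]


class MixedReplayBuffer:
    """Hypothetical FIFO buffer holding twin and real transitions together."""

    def __init__(self, capacity: int) -> None:
        self.storage: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        # When the buffer is full, the oldest transition is dropped.
        self.storage.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Uniform sampling without replacement, capped by the current size.
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)

    def fraction_from(self, source: str) -> float:
        if not self.storage:
            return 0.0
        return sum(t[5] == source for t in self.storage) / len(self.storage)


def fake_transition(source: str, rng: random.Random) -> Transition:
    # Placeholder data standing in for twin or on-robot rollouts.
    obs = [rng.random() for _ in range(3)]
    act = [rng.random() for _ in range(2)]
    return (obs, act, rng.random(), obs, False, source)


rng = random.Random(0)
buffer = MixedReplayBuffer(capacity=1000)

# Twin RL warm-up: fill the buffer before any real-world RL step.
for _ in range(800):
    buffer.add(fake_transition("twin", rng))

# Real-world RL: real transitions are appended and gradually displace twin data.
for _ in range(400):
    buffer.add(fake_transition("real", rng))

batch = buffer.sample(64, rng)
```

With these illustrative numbers the buffer ends up 60 percent twin and 40 percent real data. A FIFO policy is only one possible choice; keeping twin and real data in separate buffers with a fixed sampling ratio would be an equally plausible reading.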
If this is right
- The approach delivers near-100 percent task success on both seen and unseen configurations.
- Convergence occurs more than 30 percent faster than earlier real-world reinforcement learning baselines.
- Effective training requires only about 20 minutes of physical robot interaction time.
- Failure-prone states identified inside the twin allow focused human rollouts that further raise sample efficiency.
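The last point, selecting failure-prone yet informative configurations, can be sketched in Python as below. The scoring rule is an assumption made for illustration: it ranks configurations by the Bernoulli variance of their twin success rate, which the source does not confirm as the paper's criterion. The `Config` layout, `rank_informative_configs`, and the outcome data are hypothetical.

```python
from typing import Dict, List, Tuple

Config = Tuple[float, float]  # hypothetical (x, y) object placement in metres


def rank_informative_configs(
    outcomes: Dict[Config, List[bool]], top_k: int
) -> List[Tuple[Config, float]]:
    """Rank configurations by Bernoulli variance p * (1 - p) of twin success.

    Variance peaks at p = 0.5, so configurations the policy solves only
    sometimes rank above those it always solves or always fails.
    Each outcome list is assumed to be non-empty.
    """
    scored = []
    for config, results in outcomes.items():
        p = sum(results) / len(results)
        scored.append((config, p * (1.0 - p)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up twin rollout outcomes, ten rollouts per configuration.
twin_outcomes = {
    (0.10, 0.00): [True] * 10,               # always succeeds
    (0.25, 0.05): [True] * 5 + [False] * 5,  # succeeds half the time
    (0.30, 0.10): [True] * 2 + [False] * 8,  # mostly fails
    (0.40, 0.20): [False] * 10,              # always fails
}

targets = rank_informative_configs(twin_outcomes, top_k=2)
```

Here the two selected configurations are the half-success and mostly-failing placements, which would then be candidates for human-in-the-loop rollouts on the robot. A rule that also favors always-failing configurations would be a different design choice.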
Where Pith is reading between the lines
- Updating the digital twin on the fly from new phone images could support ongoing adaptation when the physical scene changes.
- The same twin-guided buffer population step might transfer to other real-world control domains such as mobile navigation or assembly.
- Lowering the amount of real-robot time needed could make reinforcement learning on vision-language-action models practical in domestic or small-factory settings where safety and hardware wear matter.
Load-bearing premise
The digital twin built from smartphone images matches the real physical environment closely enough that trajectories generated inside it remain useful guides rather than sources of misleading information for real-world reinforcement learning.
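One way to probe this premise, offered as a sketch rather than the paper's procedure, is to replay the same action sequence on the robot and in the twin and compare the resulting end-effector paths. The Python below assumes time-aligned 3D positions in metres; `trajectory_rmse`, the sample points, and the 5 mm tolerance are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def trajectory_rmse(real: Sequence[Point], twin: Sequence[Point]) -> float:
    """Root-mean-square Euclidean distance between time-aligned waypoints."""
    if len(real) != len(twin) or not real:
        raise ValueError("trajectories must be non-empty and equally long")
    squared = [math.dist(p, q) ** 2 for p, q in zip(real, twin)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up end-effector positions for one replayed action sequence.
real_path = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
twin_path = [(0.0, 0.0, 0.0), (0.1, 0.003, 0.0), (0.2, 0.0, 0.004)]

rmse = trajectory_rmse(real_path, twin_path)

# Hypothetical acceptance tolerance of 5 mm; a real threshold is task-specific.
within_tolerance = rmse < 0.005
```

A position-only metric says nothing about contact forces or friction, so it would complement, not replace, the dynamics validation discussed in the referee report.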
What would settle it
A head-to-head experiment in which real-world reinforcement learning trained directly from the same supervised fine-tuning checkpoint, without any twin RL warm-up or twin-generated trajectories, reaches comparable success rates and convergence speed.
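The comparison above needs a definition of convergence speed. As a minimal sketch, assuming convergence means the first evaluation checkpoint from which the success rate stays at or above a threshold, the Python below computes the relative speedup between two runs. The curves are made up for illustration and the function names are hypothetical.

```python
from typing import Optional, Sequence


def steps_to_threshold(
    success_curve: Sequence[float], threshold: float
) -> Optional[int]:
    """Return the first index from which the curve stays at or above threshold."""
    for i in range(len(success_curve)):
        if all(v >= threshold for v in success_curve[i:]):
            return i
    return None


def relative_speedup(baseline_steps: int, method_steps: int) -> float:
    """Fractional reduction in steps-to-threshold relative to the baseline.

    Assumes baseline_steps is greater than zero.
    """
    return (baseline_steps - method_steps) / baseline_steps


# Made-up success rates per evaluation checkpoint, for illustration only.
no_twin = [0.1, 0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 1.0, 1.0]
with_twin = [0.3, 0.5, 0.8, 0.9, 0.95, 1.0, 1.0, 1.0, 1.0, 1.0]

t_base = steps_to_threshold(no_twin, 0.9)
t_twin = steps_to_threshold(with_twin, 0.9)
speedup = relative_speedup(t_base, t_twin)
```

Under this definition, a claim of "more than 30 percent faster" corresponds to `speedup > 0.3`. If the no-twin run reached a comparable value, the twin warm-up would not be carrying the result.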
Original abstract
Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TwinRL, a three-stage digital twin-real-world collaborative post-training framework for Vision-Language-Action (VLA) models in robotic manipulation. After SFT warm-up with an exploration-space-expansion strategy, a smartphone-reconstructed digital twin performs parallel RL to generate trajectories that populate the replay buffer, identify failure-prone states, and guide subsequent real-world RL. Across four tasks the method is reported to reach near-100% success in both in- and out-of-distribution regimes while achieving >30% faster convergence than prior real-world RL baselines with only 20 minutes of on-robot interaction.
Significance. If the digital-twin trajectories transfer reliably, the framework offers a practical route to reduce the real-world sample complexity of online RL for contact-rich manipulation. The explicit separation of SFT-induced exploration limits from the twin-guided warm-up stage is a clear conceptual contribution. The significance hinges on whether the smartphone reconstruction supplies sufficiently accurate inertial, frictional, and compliance parameters; absent that validation the efficiency and generalization claims rest on an unquantified sim-to-real assumption.
major comments (3)
- [Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.
- [Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.
- [SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.
minor comments (2)
- [Figures / Methods] Figure captions and method diagrams should explicitly label the replay-buffer population step that occurs between twin RL and real-world RL to avoid ambiguity in the data-flow description.
- [Digital Twin Reconstruction] The manuscript would benefit from a short table summarizing the smartphone-capture protocol (number of views, lighting conditions, object textures) so that readers can gauge reconstruction repeatability.
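The first major comment asks for error bars, trial counts, and statistical tests. A minimal Python sketch of that reporting, using only the standard library, is given below. The success rates are placeholders rather than values from the paper, and the function names are illustrative.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic for a - b and its degrees of freedom (n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff, sem_diff = mean_and_sem(diffs)
    return mean_diff / sem_diff, len(diffs) - 1


# Placeholder success rates over five paired trials; not values from the paper.
method = [0.95, 1.00, 0.90, 1.00, 0.95]
baseline = [0.70, 0.80, 0.75, 0.85, 0.65]

method_mean, method_sem = mean_and_sem(method)
t_stat, dof = paired_t_statistic(method, baseline)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

With only five trials the test has four degrees of freedom, so the critical value is large and small differences will not reach significance. Pairing is only valid if method and baseline trials share conditions such as the same initial configurations.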
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions planned for the resubmitted manuscript.
- Referee: [Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.
  Authors: We agree that the presentation of headline results would be strengthened by explicit statistical details. The original experiments consisted of five independent trials per task and method; we will revise the abstract and results section to report means with standard-error bars, state the trial count explicitly, and include paired statistical comparisons (t-tests) against baselines. An expanded ablation table will also be added to the main experiments section.
  Revision: yes
- Referee: [Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.
  Authors: The reconstruction pipeline combines visual geometry from smartphone video with a small number of real-world probing interactions to estimate friction and inertial parameters. While direct force-torque comparisons were outside the scope of the initial hardware setup, the observed transfer performance (near-100% real-world success after twin warm-up and strong out-of-distribution generalization) supplies indirect evidence that the dynamics are sufficiently accurate for the tasks studied. We will add a new subsection that reports trajectory-distribution similarity metrics and the parameter-estimation procedure; a full contact-force validation would require additional instrumentation and is noted as future work.
  Revision: partial
- Referee: [SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.
  Authors: We concur that an explicit decomposition clarifies the source of the reported gains. The revised manuscript will include a four-way ablation (SFT baseline, SFT+expansion only, SFT+twin warm-up without expansion, and full TwinRL). The new results attribute the majority of the convergence speedup to the twin RL warm-up while showing that the expansion strategy further improves exploration coverage; both components are required to reach the final performance level.
  Revision: yes
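The four-way ablation promised in the last response is a 2x2 factorial design over two switches. The Python sketch below enumerates the four arms and computes a simple main effect per factor. The metric values are placeholders in arbitrary units, not results from the paper, and `AblationArm`, `build_arms`, and `main_effect` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class AblationArm:
    use_expansion: bool    # exploration-space expansion during SFT
    use_twin_warmup: bool  # twin RL warm-up before real-world RL

    @property
    def label(self) -> str:
        if self.use_expansion and self.use_twin_warmup:
            return "full TwinRL"
        if self.use_expansion:
            return "SFT + expansion only"
        if self.use_twin_warmup:
            return "SFT + twin warm-up without expansion"
        return "SFT baseline"


def build_arms() -> List[AblationArm]:
    # Order: (F, F), (F, T), (T, F), (T, T) for (expansion, twin warm-up).
    return [AblationArm(e, t) for e, t in product([False, True], repeat=2)]


def main_effect(metric: Dict[AblationArm, float], factor: str) -> float:
    """Average change in the metric when one factor is switched on."""
    on = [v for arm, v in metric.items() if getattr(arm, factor)]
    off = [v for arm, v in metric.items() if not getattr(arm, factor)]
    return sum(on) / len(on) - sum(off) / len(off)


arms = build_arms()
labels = [arm.label for arm in arms]

# Placeholder time-to-convergence per arm in arbitrary units (lower is better).
time_to_converge = dict(zip(arms, [100.0, 60.0, 90.0, 50.0]))
twin_effect = main_effect(time_to_converge, "use_twin_warmup")
expansion_effect = main_effect(time_to_converge, "use_expansion")
```

In this made-up example the twin warm-up accounts for a larger reduction than the expansion step, which is the pattern the authors say their new results show. Main effects ignore interaction between the two factors, so a full analysis would also compare the combined arm against the sum of the individual gains.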
Circularity Check
TwinRL framework relies on empirical validation without circular derivations
The paper's core contribution is an empirically motivated three-stage framework (SFT warm-up, twin RL warm-up, real-world RL) whose performance claims are supported by direct experimental results across four tasks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the key observation about SFT-constrained exploration is presented as an experimental finding used to motivate the design, and the reported success rates and convergence improvements are measured outcomes, not quantities forced by construction from the same inputs. The digital-twin reconstruction and parallel RL steps are described as practical engineering choices whose validity is tested externally via on-robot interaction data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A smartphone-captured scene can be reconstructed into a high-fidelity digital twin that faithfully supports RL exploration and failure identification.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: embed_strictMono_of_one_lt
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: exploration space expansion strategy that expands the support of the trajectory distribution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
  MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.