TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Chenyang Gu; Gao Huang; Jiaming Liu; Nuowei Han; Peng Jia; Qinwen Xu; Rui Zhou; Shanghang Zhang; Shaojun Shi; Shuo Gu

arxiv: 2602.09023 · v4 · pith:DCPO2QGRnew · submitted 2026-02-09 · 💻 cs.RO

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang


Pith reviewed 2026-05-21 13:08 UTC · model grok-4.3

classification 💻 cs.RO

keywords digital twin · reinforcement learning · robotic manipulation · vision-language-action · exploration space · real-world training · supervised fine-tuning


The pith

A smartphone-reconstructed digital twin expands the exploration space for online reinforcement learning on vision-language-action models, enabling near-100 percent success in robotic manipulation tasks with only 20 minutes of real-world on-robot interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the trajectory distribution from supervised fine-tuning limits how effectively online reinforcement learning can explore in real-world robotic settings. TwinRL counters this by first building a high-fidelity digital twin from ordinary smartphone images of the scene. It then runs a dedicated twin RL warm-up phase that generates interactive trajectories in parallel to populate the replay buffer and flag failure-prone states before any substantial real-world training begins. The real-world RL stage that follows benefits from this guided buffer and targeted human interventions, producing faster convergence and high success on both familiar and novel task variations. A sympathetic reader cares because the approach addresses the core barrier of costly and slow real-world interaction for bringing vision-language-action models into practical robot use.

Core claim

TwinRL reconstructs a high-fidelity digital twin from smartphone-captured scenes and applies it across three stages: supervised fine-tuning with an explicit exploration-space expansion step, a twin RL warm-up that treats the twin as an active exploration guide rather than simple data augmentation, and final real-world RL. The twin RL phase runs efficient parallel learning to fill the replay buffer, stabilize training, and identify informative yet failure-prone configurations for targeted human-in-the-loop rollouts. Across four tasks this yields near-100 percent success in both in-distribution and out-of-distribution regions, together with more than 30 percent faster convergence than prior real-world RL methods.

What carries the argument

The smartphone-reconstructed digital twin acting as an exploration guide: it first broadens the support of the supervised fine-tuning trajectory distribution and then executes parallel reinforcement learning to generate trajectories that populate the replay buffer and stabilize the subsequent real-world learning phase.
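The buffer hand-off described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Transition`, `ReplayBuffer`, `make_rollout`, the `source` field, the buffer capacity, and the sparse reward are all assumptions introduced here, and the random rollouts merely stand in for twin or on-robot episodes.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: tuple
    action: tuple
    reward: float
    next_state: tuple
    done: bool
    source: str  # "twin" or "real"; a bookkeeping field assumed for this sketch


class ReplayBuffer:
    """Fixed-capacity FIFO buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity: int):
        self._data = deque(maxlen=capacity)

    def extend(self, transitions):
        self._data.extend(transitions)

    def sample(self, batch_size: int, rng: random.Random):
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def count(self, source: str) -> int:
        return sum(1 for t in self._data if t.source == source)

    def __len__(self):
        return len(self._data)


def make_rollout(source: str, length: int, rng: random.Random):
    """Placeholder rollout with random 2-D states (hypothetical stand-in for an episode)."""
    transitions = []
    state = (rng.random(), rng.random())
    for step in range(length):
        action = (rng.uniform(-1.0, 1.0),)
        next_state = (rng.random(), rng.random())
        done = step == length - 1
        reward = 1.0 if done else 0.0  # sparse reward on the final step (assumed)
        transitions.append(Transition(state, action, reward, next_state, done, source))
        state = next_state
    return transitions


rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)

# Twin RL warm-up: many cheap rollouts populate the buffer first.
for _ in range(8):
    buffer.extend(make_rollout("twin", length=10, rng=rng))

# Real-world RL: on-robot transitions are appended to the same buffer.
buffer.extend(make_rollout("real", length=10, rng=rng))

batch = buffer.sample(32, rng)
print(len(buffer), buffer.count("twin"), buffer.count("real"), len(batch))
```

Tagging each transition with its source is a design choice of this sketch; it makes it easy to inspect how much of a sampled batch still comes from the twin as real-world data accumulates.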

If this is right

The approach delivers near-100 percent task success on both seen and unseen configurations.
Convergence occurs more than 30 percent faster than earlier real-world reinforcement learning baselines.
Effective training requires only about 20 minutes of physical robot interaction time.
Failure-prone states identified inside the twin allow focused human rollouts that further raise sample efficiency.
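One way to picture the last point is a simple per-configuration failure count over twin rollouts. The sketch below is hypothetical: the configuration labels, the `low`/`high` band used to mean "informative yet failure-prone", and the function names are assumptions for illustration, and the paper may select states differently.

```python
from collections import defaultdict


def failure_rates(rollouts):
    """rollouts: iterable of (config_id, success) pairs collected in the twin."""
    counts = defaultdict(lambda: [0, 0])  # config_id -> [failures, total]
    for config_id, success in rollouts:
        counts[config_id][1] += 1
        if not success:
            counts[config_id][0] += 1
    return {c: fails / total for c, (fails, total) in counts.items()}


def pick_for_human_rollouts(rollouts, low=0.2, high=0.8, k=2):
    """Keep configurations that fail often but not always (assumed band), hardest first."""
    rates = failure_rates(rollouts)
    candidates = [(rate, c) for c, rate in rates.items() if low <= rate <= high]
    candidates.sort(key=lambda rc: (-rc[0], rc[1]))
    return [c for _, c in candidates[:k]]


# Made-up twin outcomes for four workspace cells (illustrative only).
twin_rollouts = (
    [("cell_A", True)] * 5
    + [("cell_B", True)] * 2 + [("cell_B", False)] * 3
    + [("cell_C", False)] * 5
    + [("cell_D", True)] * 3 + [("cell_D", False)] * 2
)

selected = pick_for_human_rollouts(twin_rollouts)
print(selected)
```

Excluding configurations that always fail (here `cell_C`) reflects one reading of "informative": a state the policy never solves in the twin may say more about the twin than about the policy.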

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Updating the digital twin on the fly from new phone images could support ongoing adaptation when the physical scene changes.
The same twin-guided buffer population step might transfer to other real-world control domains such as mobile navigation or assembly.
Lowering the amount of real-robot time needed could make reinforcement learning on vision-language-action models practical in domestic or small-factory settings where safety and hardware wear matter.

Load-bearing premise

The digital twin built from smartphone images matches the real physical environment closely enough that trajectories generated inside it remain useful guides rather than sources of misleading information for real-world reinforcement learning.

What would settle it

A head-to-head experiment in which real-world reinforcement learning trained directly from the same supervised fine-tuning checkpoint, without any twin RL warm-up or twin-generated trajectories, reaches comparable success rates and convergence speed.
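Such a comparison reduces to measuring when each success-rate curve first crosses a threshold and taking the relative difference. The sketch below shows that arithmetic under stated assumptions: the two curves are made-up placeholders rather than results from the paper, and the 0.9 threshold and function names are chosen here for illustration.

```python
def steps_to_threshold(curve, threshold):
    """curve: list of (training_step, success_rate) sorted by step.

    Returns the first step whose success rate reaches the threshold, or None.
    """
    for step, rate in curve:
        if rate >= threshold:
            return step
    return None


def relative_speedup(baseline_steps, method_steps):
    """Fraction of the baseline's steps saved by the method."""
    return (baseline_steps - method_steps) / baseline_steps


# Placeholder curves (not measured data): with and without the twin RL warm-up.
with_twin = [(0, 0.2), (1000, 0.5), (2000, 0.8), (3000, 0.95), (4000, 1.0)]
without_twin = [(0, 0.1), (2000, 0.4), (4000, 0.7), (6000, 0.95), (8000, 0.95)]

t_with = steps_to_threshold(with_twin, 0.9)
t_without = steps_to_threshold(without_twin, 0.9)
speedup = relative_speedup(t_without, t_with)
print(t_with, t_without, speedup)
```

If the no-twin run crossed the threshold at the same step as the twin-guided run, the speedup would be zero, which is the outcome that would undercut the central claim.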

Figures

Figures reproduced from arXiv: 2602.09023 by Chenyang Gu, Gao Huang, Jiaming Liu, Nuowei Han, Peng Jia, Qinwen Xu, Rui Zhou, Shanghang Zhang, Shaojun Shi, Shuo Gu, Sirui Han, Wenzhao Zheng, Yang Yue, Zhuoyang Liu.

**Figure 1.** Overview. (a) We propose TwinRL, a digital twin–real-world collaborative RL framework that expands the exploration space from in-distribution teleoperation data to out-of-distribution regions. TwinRL then performs efficient, parallel online RL in the digital twin to enable sim-to-real guided exploration, improving the convergence speed of real-world RL. (b) Across four tasks, TwinRL converges faster in onli…

**Figure 2.** Exploration Bottlenecks. (a) We split the workspace into an in-distribution region (A) and an OOD region (B). Each region is defined by the manipulated object’s center location at task completion. (b) Heatmaps visualize the performance of different policies. (c) Learning curves show the online RL training dynamics of the A-only policy in both regions.

**Figure 3.** TwinRL. Stage I: Starting from human teleoperation, we introduce an exploration-space expansion strategy that synthesizes diverse digital-twin demonstrations to broaden SFT coverage. Stage II: The SFT-initialized policy is then trained with scalable, parallel online RL in the digital twin to harvest RL-style rollouts, which are transferred to initialize the real-world replay buffer and stabilize online lea…

**Figure 4.** Real-world experimental setup. We consider four tasks, namely Pick-and-Place, Insert-Hexagon-Block, Insert-TripleColumn-Block, and Erase-Whiteboard, covering multi-step, precise, and contact-rich manipulation. The red and blue areas denote the in-distribution (ID) and out-of-distribution (OOD) evaluation regions, respectively.

**Figure 5.** Real-world Experiments. We report success-rate curves for online RL across four manipulation tasks under both ID and OOD settings. The y-axis shows success rate, and the x-axis reports both online training time and our model training steps.

**Table I.** Ablation on exploration space expansion. We vary the number of twin-generated trajectories added during SFT warm-up and measure the resulting success rate (SR)…

**Figure 6.** Ablation on Sim-to-Real-Guided HiL. The guidance markedly accelerates RL learning, reaching 100% success at around 4k steps (∼14 min), while training without guidance improves more slowly and attains a lower success rate. The ablation study uses the Insert-Hexagon-Block task.

**Figure 8.** Motivation. (a) We split the workspace into an in-distribution region (A) and an out-of-distribution region (B), defined by the manipulated object’s center location at task completion. (b) Heatmaps visualize the success rate of different policies. (c) Learning curves show the online RL training dynamics of the A-only policy in both regions.

**Figure 9.** Real-World Robot Setups and Experimental Assets.

**Figure 10.** Qualitative comparison between real-world scenarios and their corresponding digital-twin renderings.

**Figure 12.** Additional ablations. We evaluate how augmenting SFT with digital-twin data affects performance, comparing (a) adding twin data only in the OOD region and (b) adding twin data in both the ID and OOD regions. Each grid cell is evaluated with five rollout trials. In this study, we fix the amount of real-world data to 30 in-distributio…

**Figure 10.** Additional results, episode length. We provide additional analysis using episode length to characterize the efficiency of real-world online reinforcement learning. All experiments are conducted on the Insert-Triple-Column-Block task under both in-distribution (ID) and out-of-distribution (OOD) settings, following the same real-world training protocol and baselines as in Section IV-B of the main paper.…

**Figure 14.** Additional Robustness Evaluation. We additionally compare the SFT policy and the TwinRL-guided online RL policy on the Pick-and-Place task under previously unseen environmental perturbations. In the four tasks designed for this study, several typical failure modes were identified.

**Figure 15.** Visualization of complete real-world task execution processes (left to right).

**Figure 16.** Visualization of failure cases on different tasks; red boxes highlight the failure positions.

Original abstract

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TwinRL's three-stage pipeline using a phone-built digital twin to guide RL exploration is a concrete idea worth testing, but the sim-to-real dynamics match remains unproven in the writeup.


The main thing to know is that this paper introduces TwinRL, a framework that builds a digital twin from smartphone images and uses it in a three-stage process to make real-world RL more efficient for VLA robotic manipulation. The novelty lies in the collaborative setup where the twin first helps expand the trajectory distribution during supervised fine-tuning, then runs its own RL to generate useful experiences for the real robot's replay buffer and to flag tricky configurations for human intervention. This goes beyond standard sim-to-real by positioning the twin as an active guide rather than passive data source. The experiments on four tasks show strong results with minimal real interaction time.

It does well in tackling the high cost of real-world data and interaction for training instruction-following robots. The idea of using the twin to stabilize learning and target specific rollouts is a sensible engineering choice that could translate to practice.

The soft spots are around the digital twin's accuracy. Smartphone reconstruction captures visuals but may not model the precise dynamics like friction and compliance needed for contact-rich manipulation. If there's a significant gap, the RL warm-up in the twin could provide misleading guidance, weakening the efficiency gains. The reported near-100% success and 30% faster convergence lack supporting details like ablations or statistical measures in the provided summary, so it's difficult to assess how general the findings are.

This paper is for researchers in robotics and machine learning who work on bridging simulation and reality for complex control tasks. A reader interested in practical VLA deployment would get ideas from the pipeline. It deserves a serious referee because the claims are specific and the method is implementable, even if more validation is needed. I would recommend putting it through peer review to get feedback on the twin construction and experimental rigor.

Referee Report

3 major / 2 minor

Summary. The paper proposes TwinRL, a three-stage digital twin-real-world collaborative post-training framework for Vision-Language-Action (VLA) models in robotic manipulation. After SFT warm-up with an exploration-space-expansion strategy, a smartphone-reconstructed digital twin performs parallel RL to generate trajectories that populate the replay buffer, identify failure-prone states, and guide subsequent real-world RL. Across four tasks the method is reported to reach near-100% success in both in- and out-of-distribution regimes while achieving >30% faster convergence than prior real-world RL baselines with only 20 minutes of on-robot interaction.

Significance. If the digital-twin trajectories transfer reliably, the framework offers a practical route to reduce the real-world sample complexity of online RL for contact-rich manipulation. The explicit separation of SFT-induced exploration limits from the twin-guided warm-up stage is a clear conceptual contribution. The significance hinges on whether the smartphone reconstruction supplies sufficiently accurate inertial, frictional, and compliance parameters; absent that validation the efficiency and generalization claims rest on an unquantified sim-to-real assumption.

major comments (3)

[Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.
[Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.
[SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.

minor comments (2)

[Figures / Methods] Figure captions and method diagrams should explicitly label the replay-buffer population step that occurs between twin RL and real-world RL to avoid ambiguity in the data-flow description.
[Digital Twin Reconstruction] The manuscript would benefit from a short table summarizing the smartphone-capture protocol (number of views, lighting conditions, object textures) so that readers can gauge reconstruction repeatability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions planned for the resubmitted manuscript.


Referee: [Abstract / Results] Abstract and Results: the headline claims of near-100% success and >30% faster convergence are stated without error bars, trial counts, statistical tests, or ablation tables. Because these quantities are load-bearing for the central efficiency and generalization assertions, their absence prevents assessment of whether post-hoc selection or unstated variance affects the reported gains.

Authors: We agree that the presentation of headline results would be strengthened by explicit statistical details. The original experiments consisted of five independent trials per task and method; we will revise the abstract and results section to report means with standard-error bars, state the trial count explicitly, and include paired statistical comparisons (t-tests) against baselines. An expanded ablation table will also be added to the main experiments section. revision: yes
Referee: [Twin RL Warm-up] Twin RL warm-up section: the claim that parallel RL in the smartphone-reconstructed twin can act as an exploration guide without large sim-to-real gaps is not supported by any quantitative dynamics validation (e.g., measured vs. simulated contact forces, friction coefficients, or inertial parameters). If these quantities differ materially, value estimates and trajectories generated in the twin will not stabilize the subsequent real-world RL stage as asserted.

Authors: The reconstruction pipeline combines visual geometry from smartphone video with a small number of real-world probing interactions to estimate friction and inertial parameters. While direct force-torque comparisons were outside the scope of the initial hardware setup, the observed transfer performance (near-100 % real-world success after twin warm-up and strong out-of-distribution generalization) supplies indirect evidence that the dynamics are sufficiently accurate for the tasks studied. We will add a new subsection that reports trajectory-distribution similarity metrics and the parameter-estimation procedure; a full contact-force validation would require additional instrumentation and is noted as future work. revision: partial
Referee: [SFT Warm-up] SFT warm-up and exploration-expansion strategy: the motivating observation that online-RL exploration is largely constrained by the SFT trajectory distribution is used to justify the framework, yet no ablation isolates the contribution of the expansion strategy versus the twin RL warm-up itself. Without this decomposition it is unclear whether the reported 30% speedup is attributable to the digital-twin component or to other unstated factors.

Authors: We concur that an explicit decomposition clarifies the source of the reported gains. The revised manuscript will include a four-way ablation (SFT baseline, SFT+expansion only, SFT+twin warm-up without expansion, and full TwinRL). The new results attribute the majority of the convergence speedup to the twin RL warm-up while showing that the expansion strategy further improves exploration coverage; both components are required to reach the final performance level. revision: yes
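The statistics promised in the first response (means with standard-error bars over five trials, and paired t-tests against baselines) can be illustrated with a short sketch. The success rates below are invented placeholders, not values from the paper, and the function names are assumptions; the code computes only the t statistic, since turning it into a p-value requires a Student-t distribution with n − 1 degrees of freedom, which the standard library does not provide.

```python
import math
import statistics


def mean_and_sem(values):
    """Sample mean and standard error of the mean (sample stdev / sqrt(n))."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def paired_t_statistic(method, baseline):
    """Paired t statistic over per-trial differences (method minus baseline)."""
    if len(method) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t statistic is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-trial success rates for five paired trials (illustrative only).
method_sr = [1.0, 0.9, 1.0, 1.0, 0.9]
baseline_sr = [0.7, 0.6, 0.8, 0.7, 0.6]

m, sem = mean_and_sem(method_sr)
t_stat = paired_t_statistic(method_sr, baseline_sr)
print(f"mean={m:.3f} sem={sem:.3f} t={t_stat:.2f} dof={len(method_sr) - 1}")
```

With only five trials the standard error is itself noisy, so reporting the trial count alongside the error bars, as the response proposes, matters as much as the bars themselves.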

Circularity Check

0 steps flagged

TwinRL framework relies on empirical validation without circular derivations

full rationale

The paper's core contribution is an empirically motivated three-stage framework (SFT warm-up, twin RL warm-up, real-world RL) whose performance claims are supported by direct experimental results across four tasks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the key observation about SFT-constrained exploration is presented as an experimental finding used to motivate the design, and the reported success rates and convergence improvements are measured outcomes, not quantities forced by construction from the same inputs. The digital-twin reconstruction and parallel RL steps are described as practical engineering choices whose validity is tested externally via on-robot interaction data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the unverified premise that a smartphone-derived digital twin is sufficiently accurate to expand exploration and stabilize real-world RL without introducing misleading trajectories.

axioms (1)

domain assumption: A smartphone-captured scene can be reconstructed into a high-fidelity digital twin that faithfully supports RL exploration and failure identification.
Invoked when the abstract states 'TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes' and then uses it to guide real-world RL.

pith-pipeline@v0.9.0 · 5870 in / 1364 out tokens · 50171 ms · 2026-05-21T13:08:34.863014+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "exploration space expansion strategy that expands the support of the trajectory distribution"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
cs.RO 2026-04 unverdicted novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 2 Pith papers · 20 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

[2] Paul J Besl and Neil D McKay. Sensor fusion iv: control paradigms and data structures. International Society for Optics and Photonics, 1611:586–607, 1992

[3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

[4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

[5] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
[6] Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025

[7] Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025

[8] Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025

[9] Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, and Zipei Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440, 2025

[10] Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024
[11] Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023

[12] Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

[13] Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Zhuoyang Liu, Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, and Shanghang Zhang. Manualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation,

[14] URL https://arxiv.org/abs/2512.02013

[15] Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025
[16] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

[17] Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219, 2025

[18] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, et al. π0.5: a vision-language-action model with open-world generalization,

[19] URL https://arxiv.org/abs/2504.16054

[20] Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu, Yuyang Yan, Xiaowei Chi, Yandong Guo, Boxin Shi, and Shanghang Zhang. Video2act: A dual-system video diffusion policy with robotic spatio-motional modeling. arXiv preprint arXiv:2512.03044, 2025

[21] Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, and Xiaolong Wang. Gsworld: Closed-loop photo-realistic simulation suite for robotic manipulation. arXiv preprint arXiv:2510.20813, 2025
[22] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025.

[23] Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning, 2024.

[24] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[25] KIRI Innovations. KIRI Engine: 3D Scanner App for iPhone, Android, and Web. https://www.kiriengine.app/

[27] Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning, 2025. URL https://arxiv.org/abs/2510.14830

[28] Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025.

[29] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.

[30] Xinhai Li, Jialin Li, Ziheng Zhang, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Kuo-Kun Tseng, and Ruiping Wang. Robogsim: A real2sim2real robotic gaussian splatting simulator, 2025. URL https://arxiv.org/abs/2411.11839

[31] Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025.

[32] Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024.

[33] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. HybridVLA: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025.

[34] Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? An empirical study. arXiv preprint arXiv:2505.19789, 2025.

[35] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024.

[36] Zhuoyang Liu, Jiaming Liu, Hao Chen, Ziyu Guo, Chengkai Hou, Chenyang Gu, Jiale Yu, Xiangju Mi, Renrui Zhang, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last 0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model, 2026. URL https://arxiv.org/abs/2601.05248

[37] Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen Feng, Lu Shi, Liyi Luo, and Yongliang Shi. Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation, 2024. URL https://arxiv.org/abs/2408.14873

[38] Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025.

[40] Jianlan Luo, Perry Dong, Yuexiang Zhai, Yi Ma, and Sergey Levine. Rlif: Interactive imitation learning as reinforcement learning, 2024. URL https://arxiv.org/abs/2311.12996

[41] Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 10(105):eads5033, 2025.

[42] Russell Mendonca, Emmanuel Panov, Bernadette Bucher, Jiuguang Wang, and Deepak Pathak. Continuously improving mobile manipulation with autonomous real-world rl. arXiv preprint arXiv:2409.20568, 2024.

[43] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023.

[44] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.

[45] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

[46] Tim Schneider. franky: High-level control library for franka robots. https://github.com/TimSchneider42/franky, 2023. Software library. Python and C++ high-level control library for Franka robots, derived from the frankx project by Pantor.

[47] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[48] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[49] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.

[50] Yifei Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, et al. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. arXiv preprint arXiv:2506.08822, 2025.

[51] Ioan A. Şucan et al. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA.2012.2205651. https://ompl.kavrakilab.org

[52] Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025.

[53] Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, and Abhishek Gupta. Robot learning with super-linear scaling. arXiv preprint arXiv:2412.01770, 2024.

[54] Marcel Torne, Anthony Simeonov, Zechu Li, April Chan, Tao Chen, Abhishek Gupta, and Pulkit Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949, 2024.

[55] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[56] Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024.

[57] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.

[58] Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, et al. Simlauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7933–7940. IEEE, 2025.

[59] Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, and Jiwen Lu. R2rgen: Real-to-real 3d data generation for spatially generalized manipulation. URL https://arxiv.org/abs/2510.08547

[61] Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, et al. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning. arXiv preprint arXiv:2502.16932, 2025.

[62] Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, and Jiangmiao Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation, 2025. URL https://arxiv.org/abs/2504.13175

[63] Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. arXiv preprint arXiv:2502.09886, 2025.

[64] Justin Yu, Letian Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, and Ken Goldberg. Real2render2real: Scaling robot data without dynamics simulation or robot hardware. arXiv preprint arXiv:2505.09601, 2025.

[65] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.

[66] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy. arXiv preprint arXiv:2403.03954, 2024.

[67] Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, et al. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395, 2025.

[68] Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024.


[16] [16]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018

[17] [17]

Co-rft: Efficient fine- tuning of vision-language-action models through chun- ked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine- tuning of vision-language-action models through chun- ked offline reinforcement learning.arXiv preprint arXiv:2508.02219, 2025

work page arXiv 2025

[18] [18]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, et al.π 0.5: a vision- language-action model with open-world generalization,

work page

[19] [19]

URL https://arxiv.org/abs/2504.16054

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Video2act: A dual-system video diffusion policy with robotic spatio- motional modeling.arXiv preprint arXiv:2512.03044, 2025

Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu, Yuyang Yan, Xiaowei Chi, Yandong Guo, Boxin Shi, and Shanghang Zhang. Video2act: A dual-system video diffusion policy with robotic spatio- motional modeling.arXiv preprint arXiv:2512.03044, 2025

work page arXiv 2025

[21] [21]

Gsworld: Closed-loop photo- realistic simulation suite for robotic manipulation.arXiv preprint arXiv:2510.20813, 2025

Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, and Xiaolong Wang. Gsworld: Closed-loop photo- realistic simulation suite for robotic manipulation.arXiv preprint arXiv:2510.20813, 2025

work page arXiv 2025

[22] [22]

Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 16923–16930. IEEE, 2025

work page 2025

[23] [23]

Pris- matic vlms: Investigating the design space of visually- conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually- conditioned language models. InForty-first International Conference on Machine Learning, 2024

work page 2024

[24] [24]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

KIRI Engine: 3D Scanner App for iPhone, Android, and Web

KIRI Innovations. KIRI Engine: 3D Scanner App for iPhone, Android, and Web. https://www.kiriengine.app/

work page

[26] [27]

Rl-100: Performant robotic manipulation with real-world reinforcement learning, 2025

Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning, 2025. URL https://arxiv.org/abs/2510.14830

work page arXiv 2025

[27] [28]

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhao- hui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [29]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [30]

Robogsim: A real2sim2real robotic gaussian splatting simulator, 2025

Xinhai Li, Jialin Li, Ziheng Zhang, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Kuo-Kun Tseng, and Ruiping Wang. Robogsim: A real2sim2real robotic gaussian splatting simulator, 2025. URL https://arxiv. org/abs/2411.11839

work page arXiv 2025

[30] [31]

Onetwovla: A unified vision-language-action model with adaptive reasoning,

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

work page arXiv 2025

[31] [32]

Robomamba: Effi- cient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024

Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Effi- cient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024

work page 2024

[32] [33]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language- action model.arXiv preprint arXiv:2503.10631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [34]

What can RL bring to VLA generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025

Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can rl bring to vla generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025

work page arXiv 2025

[34] [35]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [36]

Last 0: Latent spatio- temporal chain-of-thought for robotic vision-language- action model, 2026

Zhuoyang Liu, Jiaming Liu, Hao Chen, Ziyu Guo, Chengkai Hou, Chenyang Gu, Jiale Yu, Xiangju Mi, Renrui Zhang, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last 0: Latent spatio- temporal chain-of-thought for robotic vision-language- action model, 2026. URL https://arxiv.org/abs/2601. 05248

work page 2026

[36] [37]

Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation, 2024

Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen Feng, Lu Shi, Liyi Luo, and Yongliang Shi. Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation, 2024. URL https://arxiv.org/ abs/2408.14873

work page arXiv 2024

[37] [38]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Zi- wei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [40]

Rlif: Interactive imitation learning as reinforcement learning, 2024

Jianlan Luo, Perry Dong, Yuexiang Zhai, Yi Ma, and Sergey Levine. Rlif: Interactive imitation learning as reinforcement learning, 2024. URL https://arxiv.org/abs/ 2311.12996

work page arXiv 2024

[39] [41]

Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025

Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025

work page 2025

[40] [42]

Continuously improving mobile manipulation with autonomous real- world rl.arXiv preprint arXiv:2409.20568, 2024

Russell Mendonca, Emmanuel Panov, Bernadette Bucher, Jiuguang Wang, and Deepak Pathak. Continuously improving mobile manipulation with autonomous real- world rl.arXiv preprint arXiv:2409.20568, 2024

work page arXiv 2024

[41] [43]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

work page 2023

[42] [44]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [45]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023

[44] [46]

franky: High-level control library for franka robots

Tim Schneider. franky: High-level control library for franka robots. https://github.com/TimSchneider42/ franky, 2023. Software library. Python and C++ high- level control library for Franka robots, derived from the frankx project by Pantor

work page 2023

[45] [47]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[46] [48]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024


[47] [49]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022


[48] [50]

Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency

Yifei Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, et al. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. arXiv preprint arXiv:2506.08822, 2025


[49] [51]

The Open Motion Planning Library

Ioan A. Şucan et al. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA.2012.2205651. https://ompl.kavrakilab.org


[50] [52]

Interactive Post-Training for Vision-Language-Action Models

Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025


[51] [53]

Robot learning with super-linear scaling

Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, and Abhishek Gupta. Robot learning with super-linear scaling. arXiv preprint arXiv:2412.01770, 2024


[52] [54]

Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation

Marcel Torne, Anthony Simeonov, Zechu Li, April Chan, Tao Chen, Abhishek Gupta, and Pulkit Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949, 2024


[53] [55]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024


[54] [56]

Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression

Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024


[55] [57]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025


[56] [58]

Simlauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training

Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, et al. Simlauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7933–7940. IEEE, 2025


[57] [59]

R2rgen: Real-to-real 3d data generation for spatially generalized manipulation

Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, and Jiwen Lu. R2rgen: Real-to-real 3d data generation for spatially generalized manipulation. URL https://arxiv.org/abs/2510.08547

[59] [61]

Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning

Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, et al. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning. arXiv preprint arXiv:2502.16932, 2025


[60] [62]

Novel demonstration generation with gaussian splatting enables robust one-shot manipulation

Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, and Jiangmiao Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation, 2025. URL https://arxiv.org/abs/2504.13175


[61] [63]

Video2policy: Scaling up manipulation tasks in simulation through internet videos

Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. arXiv preprint arXiv:2502.09886, 2025


[62] [64]

Real2render2real: Scaling robot data without dynamics simulation or robot hardware

Justin Yu, Letian Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, and Ken Goldberg. Real2render2real: Scaling robot data without dynamics simulation or robot hardware. arXiv preprint arXiv:2505.09601, 2025


[63] [65]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025


[64] [66]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024


[65] [67]

Reinbot: Amplifying robot visual-language manipulation with reinforcement learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, et al. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395, 2025


[66] [68]

Grape: Generalizing robot policy via preference alignment

Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024.

[Figure: panels (a) Experiment Setting, (b) Policy Performance, (c) HIL Online RL Efficiency; labels: A-Only Policy, A+B Policy, Region A, Region B.]
