pith. machine review for the scientific record.

arxiv: 2502.05855 · v3 · submitted 2025-02-09 · 💻 cs.RO · cs.CV

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Pith reviewed 2026-05-14 19:43 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language-action · diffusion expert · cross-embodiment learning · robot manipulation · dexterous control · long-horizon tasks · plug-in architecture · generalization

The pith

DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a diffusion-based action expert can be pre-trained separately on data from multiple robot types and then plugged into a vision-language model to produce better action sequences. A three-stage curriculum first builds general action knowledge in the expert, then aligns the language component to a given robot body, and finally adapts quickly to specific tasks. This setup lets the combined system handle long sequences of actions on single-arm, two-handed, and dexterous-hand robots using only ordinary language instructions, without per-task retraining of the action part. The approach is shown to exceed the performance of existing models on these varied platforms.

Core claim

DexVLA introduces a diffusion-based action expert scaled to one billion parameters that is pre-trained on cross-embodiment data and remains separable from the vision-language component. A curriculum of pre-training the expert on mixed robot data, aligning the VLA to the target embodiment, and post-training for new tasks produces a system that completes complex, long-horizon behaviors on single-arm, bimanual, and dexterous-hand robots using only direct language prompts and without embodiment-specific action fine-tuning.
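
The three stages can be written down as a small training schedule. The sketch below is illustrative only: the stage names, the module names `vlm` and `diffusion_expert`, and above all the choice of which modules receive gradient updates in each stage are assumptions, since this summary does not say whether the expert is frozen after pre-training.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                    # hypothetical stage label
    data: str                    # kind of data the stage consumes
    trainable: Tuple[str, ...]   # modules updated in this stage (assumed)


# Assumed schedule: stage 1 trains only the expert; stages 2 and 3 train both.
# The real freezing policy is not stated in the summary above.
CURRICULUM = (
    Stage("pretrain_expert", "cross-embodiment trajectories", ("diffusion_expert",)),
    Stage("embodiment_alignment", "target-embodiment data", ("vlm", "diffusion_expert")),
    Stage("task_post_training", "new-task demonstrations", ("vlm", "diffusion_expert")),
)


def trainable_flags(stage: Stage, modules=("vlm", "diffusion_expert")) -> Dict[str, bool]:
    """Map each module name to whether it is updated during the given stage."""
    return {m: m in stage.trainable for m in modules}


for stage in CURRICULUM:
    print(stage.name, "|", stage.data, "|", trainable_flags(stage))
```

Keeping the freezing policy in one table makes it easy to test alternatives, such as freezing the expert in stage 2, without touching the training loop.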

What carries the argument

The plug-in diffusion expert: a one-billion-parameter model pre-trained on cross-embodiment robot trajectories that generates actions when inserted into a vision-language backbone.
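
One way to picture "plug-in" is as an interface boundary: the backbone produces a conditioning vector, and any object that turns that vector into an action chunk can be attached. The following is a minimal sketch under that assumption. `ActionExpert`, `PluggableVLA`, `ZeroExpert`, and `toy_encode` are hypothetical names, and the placeholder expert returns zeros rather than running a diffusion process.

```python
from typing import Callable, List, Protocol, Sequence


class ActionExpert(Protocol):
    """Anything that maps a conditioning vector to a chunk of actions."""

    def sample_actions(self, condition: Sequence[float], horizon: int) -> List[List[float]]:
        ...


class ZeroExpert:
    """Placeholder expert: emits zero actions of a fixed dimension."""

    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def sample_actions(self, condition: Sequence[float], horizon: int) -> List[List[float]]:
        return [[0.0] * self.action_dim for _ in range(horizon)]


class PluggableVLA:
    """Backbone encoder plus a swappable action expert."""

    def __init__(self, encode: Callable[[str], Sequence[float]], expert: ActionExpert) -> None:
        self.encode = encode
        self.expert = expert

    def act(self, instruction: str, horizon: int = 4) -> List[List[float]]:
        return self.expert.sample_actions(self.encode(instruction), horizon)


def toy_encode(text: str) -> List[float]:
    # Stand-in for a vision-language backbone: one feature, the prompt length.
    return [float(len(text))]


policy = PluggableVLA(toy_encode, ZeroExpert(action_dim=7))
single_arm_actions = policy.act("fold the shirt")

# Swapping the expert (e.g. for a 14-DoF bimanual setup) leaves the encoder untouched.
policy.expert = ZeroExpert(action_dim=14)
bimanual_actions = policy.act("fold the shirt")
print(len(single_arm_actions[0]), len(bimanual_actions[0]))
```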

If this is right

  • The system controls single-arm, bimanual, and dexterous-hand robots without task-specific adaptation.
  • Dexterous skills can be acquired on novel embodiments with only limited data.
  • Complex long-horizon tasks such as laundry folding are completed using only direct language prompting.
  • Performance exceeds that of Octo, OpenVLA, and Diffusion Policy across the tested embodiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separable expert design could let developers swap in new action modules when hardware changes without retraining the language-understanding layers.
  • Rapid post-training adaptation implies that household robots might acquire new multi-step chores from short verbal descriptions rather than lengthy demonstrations.
  • If the cross-embodiment pre-training generalizes further, the same expert might support robots whose kinematics differ substantially from the training set.

Load-bearing premise

Pre-training the diffusion expert on cross-embodiment data produces action representations that transfer effectively when plugged into a new VLA without requiring embodiment-specific action fine-tuning.

What would settle it

A controlled test on a previously unseen robot embodiment in which the model requires substantial embodiment-specific action fine-tuning to reach the reported success rate on a long-horizon task such as laundry folding would falsify the transfer claim.
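
Such a test comes down to comparing two success rates on the unseen embodiment: the plug-in expert as is, and the same system after embodiment-specific action fine-tuning. A minimal sketch using a pooled two-proportion z-test follows. The trial counts are hypothetical, and the choice of test is an assumption rather than anything the paper prescribes.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0.0:
        # Both arms are all-success or all-failure: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 6/20 successes without action fine-tuning, 16/20 with it.
z, p = two_proportion_z(6, 20, 16, 20)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large, significant gap in favour of the fine-tuned arm would count against the transfer claim; a small or insignificant one would be consistent with it.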

Original abstract

Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DexVLA, a vision-language-action model featuring a separable 1B-parameter diffusion-based action expert pre-trained on cross-embodiment data. It proposes a three-stage curriculum—(1) pre-training the diffusion expert, (2) aligning the VLA to target embodiments, and (3) post-training for task adaptation—to enable superior performance on complex, long-horizon tasks (e.g., laundry folding) across single-arm, bimanual, and dexterous-hand embodiments using only direct language prompts, outperforming baselines such as Octo, OpenVLA, and Diffusion Policy.

Significance. If the central claims hold after proper isolation of components, the separable diffusion expert could meaningfully advance scalable robot learning by decoupling high-capacity action representation from the VLM backbone, potentially improving data efficiency and cross-embodiment transfer for long-horizon tasks.

major comments (2)
  1. [embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.
  2. [Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., success rate delta) to ground the superiority claim.
  2. [curriculum learning strategy] Clarify whether the 1B-parameter diffusion expert remains frozen during VLA alignment or receives any gradient updates in stages 2–3.
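
The ablation requested in major comment 1 can be stated as an experiment grid in which only the stage-1 treatment varies. The sketch below enumerates such a grid and checks that the conditions differ in exactly one factor. The variant names and configuration keys are hypothetical, not taken from the paper.

```python
# Hypothetical stage-1 treatments: keep, drop, or randomize cross-embodiment pre-training.
STAGE1_VARIANTS = ("cross_embodiment_pretrained", "no_pretraining", "randomized_pretraining")

# Stages 2 and 3 are held identical across all conditions.
FIXED = {"stage2_alignment": "identical", "stage3_post_training": "identical"}


def ablation_conditions():
    """One configuration per stage-1 variant, with everything else held fixed."""
    return [{"stage1_expert": variant, **FIXED} for variant in STAGE1_VARIANTS]


def varying_keys(conditions):
    """Return the set of keys whose values differ across conditions."""
    keys = conditions[0].keys()
    return {k for k in keys if len({c[k] for c in conditions}) > 1}


conditions = ablation_conditions()
for c in conditions:
    print(c)
print("varying factors:", varying_keys(conditions))
```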

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the attribution of results and the clarity of claims.

Point-by-point responses
  1. Referee: [embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.

    Authors: We agree that an explicit ablation isolating the cross-embodiment pre-training stage—while keeping VLA alignment and post-training fixed—would provide stronger causal evidence for the separable expert's contribution. Our current comparisons to baselines (Octo, OpenVLA, Diffusion Policy) that lack this pre-training offer indirect support, but we acknowledge the referee's point. We will add a dedicated ablation study in the revised manuscript that directly removes or randomizes the cross-embodiment pre-training phase under otherwise identical conditions. revision: yes

  2. Referee: [Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.

    Authors: We accept this criticism. The current abstract is qualitative and does not convey the scale of improvements. We will revise the abstract to include key quantitative results (success rates with standard deviations) for the main long-horizon tasks across embodiments, along with explicit pointers to the full ablation tables and error-bar plots already present in the experimental section. revision: yes
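
Reporting success rates with standard deviations, as promised in the second response, needs only a per-embodiment mean and sample standard deviation over repeated evaluation runs. The following is a small sketch. The embodiment labels follow the paper's three settings, but every number is a hypothetical placeholder and not a result from the paper.

```python
import statistics


def summarize(success_rates):
    """Return (mean, sample standard deviation) of per-run success rates."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates) if len(success_rates) > 1 else 0.0
    return mean, std


# Hypothetical per-run success rates, one list per embodiment.
runs = {
    "single_arm": [0.8, 0.9, 0.7],
    "bimanual": [0.6, 0.7, 0.65],
    "dexterous_hand": [0.5, 0.55, 0.6],
}

for embodiment, rates in runs.items():
    mean, std = summarize(rates)
    print(f"{embodiment}: {mean:.2f} ± {std:.2f} (n={len(rates)})")
```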

Circularity Check

0 steps flagged

No circularity: empirical claims rest on training and benchmarks, not self-referential derivation

Full rationale

The manuscript describes an empirical training curriculum (pre-train separable diffusion expert on cross-embodiment data, then align VLA, then post-train) and reports performance deltas versus Octo/OpenVLA/Diffusion Policy on long-horizon tasks. No equations, uniqueness theorems, or fitted parameters are presented as predictions; the central claims are benchmark results, not derivations that reduce to their own inputs by construction. No self-citations of prior author work are invoked as load-bearing mathematical facts. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on standard assumptions of diffusion models for continuous action spaces and the transferability of pre-trained action priors across robot morphologies.

free parameters (1)
  • diffusion expert parameter count = 1 billion
    Scaled to one billion parameters to increase capacity for cross-embodiment action modeling.
axioms (1)
  • domain assumption Diffusion models can represent complex robot action distributions from cross-embodiment data
    Invoked to justify pre-training the separable action expert.
invented entities (1)
  • plug-in diffusion action expert no independent evidence
    purpose: Separate high-capacity action generator that can be pre-trained independently and inserted into VLA models
    New architectural component introduced to address action representation bottlenecks.
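
For orientation, the domain assumption above can be made concrete with the standard denoising-diffusion formulation applied to an action chunk. This is the generic textbook form, shown as an assumption; the summary does not state which diffusion objective DexVLA actually uses. Here $a_0$ is a clean action chunk, $a_t$ its noised version at step $t$, $\bar\alpha_t$ the cumulative noise schedule, and $c$ a hypothetical conditioning vector supplied by the vision-language backbone.

```latex
a_t = \sqrt{\bar\alpha_t}\,a_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

q(a_t \mid a_0) = \mathcal{N}\!\left(a_t;\ \sqrt{\bar\alpha_t}\,a_0,\ (1-\bar\alpha_t)\,I\right)

\mathcal{L}(\theta) = \mathbb{E}_{a_0,\,t,\,\epsilon}
  \left[\,\lVert \epsilon - \epsilon_\theta(a_t, t, c) \rVert^2\,\right]
```

Under this reading, pre-training on cross-embodiment data amounts to fitting $\epsilon_\theta$ on action chunks from many robots, and the axiom is the claim that one network can model all of those distributions.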

pith-pipeline@v0.9.0 · 5587 in / 1197 out tokens · 47248 ms · 2026-05-14T19:43:39.917656+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-time Sparsity for Extreme Fast Action Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

  2. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  5. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  6. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  7. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  8. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  9. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  10. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  11. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  12. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  13. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  14. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  15. SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

    cs.RO 2026-04 unverdicted novelty 6.0

    SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.

  16. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  17. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  18. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  19. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  20. Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

    cs.LG 2026-05 unverdicted novelty 5.0

    Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...

  21. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  22. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  23. Towards Robotic Dexterous Hand Intelligence: A Survey

    cs.RO 2026-05 unverdicted novelty 4.0

    A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 22 Pith papers · 18 internal anchors

  1. [1]

    T. Lin, Y . Zhang, Q. Li, H. Qi, B. Yi, S. Levine, and J. Malik. Learning visuotactile skills with two multifingered hands. arXiv preprint arXiv:2404.16823, 2024

  2. [2]

    H. Shi, H. Xu, S. Clarke, Y . Li, and J. Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. arXiv preprint arXiv:2306.14447, 2023

  3. [4]

    Zhang, Z.-H

    K. Zhang, Z.-H. Yin, W. Ye, and Y . Gao. Learning manipulation skills through robot chain-of- thought with sparse failure guidance. arXiv preprint arXiv:2405.13573, 2024

  4. [5]

    A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasp- 9 ing and cross-domain image matching. The International Journal of Robotics Research, 41(7): 690–705, 2022

  5. [6]

    Qin, Y .-H

    Y . Qin, Y .-H. Wu, S. Liu, H. Jiang, R. Yang, Y . Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022

  6. [7]

    Reuss, ¨O

    M. Reuss, ¨O. E. Ya˘gmurlu, F. Wenzel, and R. Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. 2024

  7. [8]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv. ...

  8. [9]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  9. [10]

    M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  10. [11]

    Y . Hu, F. Lin, T. Zhang, L. Yi, and Y . Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023

  11. [12]

    J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  12. [13]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  13. [14]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  14. [15]

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  15. [16]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  16. [17]

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025

  17. [18]

    W. Zhao, P. Ding, M. Zhang, Z. Gong, S. Bai, H. Zhao, and D. Wang. Vlas: Vision-language- action model with speech instructions for customized robot manipulation. arXiv preprint arXiv:2502.13508, 2025

  18. [19]

    H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision- language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

  19. [20]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. 10

  20. [21]

    Ghosh, H

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  21. [22]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

  22. [23]

    Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024

  23. [24]

    H. Ha, Y . Gao, Z. Fu, J. Tan, and S. Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024

  24. [25]

    J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: Personalized robot assistance with large language models. Au- tonomous Robots, 47(8):1087–1102, 2023

  25. [26]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

  26. [27]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  27. [28]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

  28. [29]

    M. Zhu, Y . Zhu, J. Li, J. Wen, Z. Xu, N. Liu, R. Cheng, C. Shen, Y . Peng, F. Feng, et al. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation.arXiv preprint arXiv:2409.14411, 2024

  29. [30]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  30. [31]

    V . Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. arxiv 2019. arXiv preprint arXiv:1910.01108, 2019

  31. [32]

    L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo, S. Levine, and C. Finn. Yell at your robot: Improving on-the-fly from language corrections.arXiv preprint arXiv:2403.12910, 2024

  32. [33]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  33. [34]

    H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  34. [35]

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  35. [36]

    H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020. 11

  36. [37]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  37. [38]

    H. Ha, P. Florence, and S. Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023

  38. [39]

    Radosavovic, T

    I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023

  39. [40]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  40. [41]

    Geiger, P

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  41. [42]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  42. [43]

    J. Wen, Y . Zhu, J. Li, M. Zhu, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, Y . Peng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024

  43. [44]

    Grape: Generalizing robot policy via preference alignment.arXiv preprint arXiv:2411.19309,

    Z. Zhang, K. Zheng, Z. Chen, J. Jang, Y . Li, C. Wang, M. Ding, D. Fox, and H. Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024

  44. [45]

    Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen. Improving vision-language- action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

  45. [46]

    Belkhale, T

    S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y . Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  46. [47]

    Yen-Chen, A

    L. Yen-Chen, A. Zeng, S. Song, P. Isola, and T.-Y . Lin. Learning to see before learning to act: Visual pre-training for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7293. IEEE, 2020

  47. [48]

    Y . Du, M. Simchowitz, R. Tedrake, V . Sitzmann, B. Chen, and D. M. Monso. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 3, 2024

  48. [49]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  49. [50]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  50. [51]

    T. Z. Zhao, J. Tompson, D. Driess, P. Florence, S. K. S. Ghasemipour, C. Finn, and A. Wahid. Aloha unleashed: A simple recipe for robot dexterity. In 8th Annual Conference on Robot Learning

  51. [52]

    Y . Wang, Y . Zhang, M. Huo, R. Tian, X. Zhang, Y . Xie, C. Xu, P. Ji, W. Zhan, M. Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024

  52. [53]

    Prasad, K

    A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. arXiv preprint arXiv:2405.07503, 2024. 12

  53. [54]

    Fine-tuning of continuous- time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194,

    M. Uehara, Y . Zhao, K. Black, E. Hajiramezanali, G. Scalia, N. L. Diamant, A. M. Tseng, T. Biancalani, and S. Levine. Fine-tuning of continuous-time diffusion models as entropy- regularized control. arXiv preprint arXiv:2402.15194, 2024

  54. [55]

    M. Uehara, Y. Zhao, K. Black, E. Hajiramezanali, G. Scalia, N. L. Diamant, A. M. Tseng, S. Levine, and T. Biancalani. Feedback efficient online fine-tuning of diffusion models. arXiv preprint arXiv:2402.16359, 2024

  55. [56]

    K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  56. [57]

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  57. [58]

    S. Dasari, O. Mees, S. Zhao, M. K. Srirama, and S. Levine. The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088, 2024

  58. [59]

    F. Lin, Y. Hu, P. Sheng, C. Wen, J. You, and Y. Gao. Data scaling laws in imitation learning for robotic manipulation, 2024. URL https://arxiv.org/abs/2410.18647

  59. [60]

    A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

  60. [61]

    Y. Wang, L. Wang, Y. Du, B. Sundaralingam, X. Yang, Y.-W. Chao, C. Perez-D’Arpino, D. Fox, and J. Shah. Inference-time policy steering through human interactions. arXiv preprint arXiv:2411.16627, 2024

  61. [62]

    N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022

  62. [63]

    Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  63. [64]

    T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024

  64. [65]

    Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024

  65. [66]

    Y. Ze, Z. Chen, W. Wang, T. Chen, X. He, Y. Yuan, X. B. Peng, and J. Wu. Generalizable humanoid manipulation with improved 3d diffusion policies. arXiv preprint arXiv:2410.10803, 2024

  66. [67]

    G. Yan, Y.-H. Wu, and X. Wang. Dnact: Diffusion guided multi-task 3d policy learning. arXiv preprint arXiv:2403.04115, 2024

  67. [68]

    X. Jia, Q. Wang, A. Donat, B. Xing, G. Li, H. Zhou, O. Celik, D. Blessing, R. Lioutikov, and G. Neumann. Mail: Improving imitation learning with selective state space models. In 8th Annual Conference on Robot Learning

  68. [69]

    J. Wen, M. Zhu, Y. Zhu, Z. Tang, J. Li, Z. Zhou, C. Li, X. Liu, Y. Peng, C. Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024

  69. [70]

    K. Wu, Y. Zhu, J. Li, J. Wen, N. Liu, Z. Xu, Q. Qiu, and J. Tang. Discrete policy: Learning disentangled action space for multi-task robotic manipulation. arXiv preprint arXiv:2409.18707, 2024

  70. [71]

    L. Wang, K. Zhang, A. Zhou, M. Simchowitz, and R. Tedrake. Fleet policy learning via weight merging and an application to robotic tool-use. arXiv preprint arXiv:2310.01362, 2023

  71. [72]

    L. Wang, J. Zhao, Y. Du, E. H. Adelson, and R. Tedrake. Poco: Policy composition from and for heterogeneous robot learning. arXiv preprint arXiv:2402.02511, 2024

  72. [73]

    L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. arXiv preprint arXiv:2409.20537, 2024

Figure 10: Example of visual generalization. Panels: Unseen Drink and Unseen Cup; Unseen Scene and Unseen Cup; Unseen White T-shirt and Unseen Scene; Unseen Scene. Here lists some visual generalization set...