arXiv: 2502.05855 · v3 · submitted 2025-02-09 · cs.RO · cs.CV

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen , Yichen Zhu , Jinming Li , Zhibin Tang , Chaomin Shen , Feifei Feng

Pith reviewed 2026-05-14 19:43 UTC · model grok-4.3

keywords: vision-language-action · diffusion expert · cross-embodiment learning · robot manipulation · dexterous control · long-horizon tasks · plug-in architecture · generalization

The pith

DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a diffusion-based action expert can be pre-trained separately on data from multiple robot types and then plugged into a vision-language model to produce better action sequences. A three-stage curriculum first builds general action knowledge in the expert, then aligns the language component to a given robot body, and finally adapts quickly to specific tasks. This setup lets the combined system handle long sequences of actions on single-arm, two-handed, and dexterous-hand robots using only ordinary language instructions, without per-task retraining of the action part. The approach is shown to exceed the performance of existing models on these varied platforms.

Core claim

DexVLA introduces a diffusion-based action expert scaled to one billion parameters that is pre-trained on cross-embodiment data and remains separable from the vision-language component. A curriculum of pre-training the expert on mixed robot data, aligning the VLA to the target embodiment, and post-training for new tasks produces a system that completes complex, long-horizon behaviors on single-arm, bimanual, and dexterous-hand robots using only direct language prompts and without embodiment-specific action fine-tuning.
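The three stages can be written down as a small data structure. This is a minimal sketch, not the authors' training code: the stage names and fields are hypothetical, and only the stage order and the trained component come from the abstract. Whether the expert stays frozen in stages 2 and 3 is not stated in the material summarized here, so those entries are left as `None` instead of guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str                      # hypothetical identifier for the stage
    trained_component: str         # what the abstract says is being trained
    data: str                      # kind of data the stage consumes
    expert_frozen: Optional[bool]  # None means "not stated in the source"


# Order follows the abstract; everything else is an illustrative assumption.
CURRICULUM = (
    Stage("pretrain_expert", "diffusion expert", "cross-embodiment data", False),
    Stage("align_embodiment", "VLA model", "target-embodiment data", None),
    Stage("post_train_task", "VLA model (assumed)", "new-task data", None),
)


def describe(stage: Stage) -> str:
    """Render one stage, making unknowns explicit instead of hiding them."""
    if stage.expert_frozen is None:
        frozen = "unspecified"
    else:
        frozen = "yes" if stage.expert_frozen else "no"
    return (f"{stage.name}: trains {stage.trained_component} "
            f"on {stage.data}; expert frozen: {frozen}")


if __name__ == "__main__":
    for stage in CURRICULUM:
        print(describe(stage))
```

Keeping the unknowns as `None` makes the open question raised in the referee report (frozen or updated expert) visible in the structure itself.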

What carries the argument

The plug-in diffusion expert: a one-billion-parameter model pre-trained on cross-embodiment robot trajectories that generates actions when inserted into a vision-language backbone.
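To make "generates actions" concrete, here is a toy reverse-diffusion sampler in the DDPM style. It is not DexVLA's expert. The noise predictor is a hand-written oracle for a single hypothetical target action, standing in for the learned one-billion-parameter network, and the linear schedule, step count, and function names are all assumptions chosen for illustration.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products of alpha."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_noise(x_t, alpha_bar_t, target):
    """Exact noise estimate when the data distribution is a point mass at
    `target`. A trained network would replace this function."""
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * g) / noise_scale for x, g in zip(x_t, target)]


def sample_action(target, num_steps=50, seed=0):
    """Run the DDPM reverse process from Gaussian noise down to an action."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # fresh noise on all but the last step
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


if __name__ == "__main__":
    hypothetical_target = [0.1, -0.2, 0.3]  # made-up 3-D action
    print(sample_action(hypothetical_target))
```

With the oracle predictor, the final denoising step lands on the target up to floating-point error, which is a convenient sanity check for the update rule before a learned predictor is swapped in.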

If this is right

The system controls single-arm, bimanual, and dexterous-hand robots without task-specific adaptation.
Dexterous skills can be acquired on novel embodiments with only limited data.
Complex long-horizon tasks such as laundry folding are completed using only direct language prompting.
Performance exceeds that of Octo, OpenVLA, and Diffusion Policy across the tested embodiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separable expert design could let developers swap in new action modules when hardware changes without retraining the language-understanding layers.
Rapid post-training adaptation implies that household robots might acquire new multi-step chores from short verbal descriptions rather than lengthy demonstrations.
If the cross-embodiment pre-training generalizes further, the same expert might support robots whose kinematics differ substantially from the training set.
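The swap-in idea from the first extension above can be sketched as an interface boundary. Every name here (`ActionExpert`, `PluggableVLA`, `ZeroExpert`, `toy_encode`) is hypothetical and does not come from the paper or any released code; the sketch only shows what "separable" could mean at the API level.

```python
from typing import Callable, List, Protocol, Sequence


class ActionExpert(Protocol):
    """Anything that maps backbone features to a chunk of actions."""

    def act(self, features: Sequence[float], horizon: int) -> List[List[float]]:
        ...


class ZeroExpert:
    """Placeholder expert: emits all-zero actions for a robot with `dof` joints."""

    def __init__(self, dof: int) -> None:
        self.dof = dof

    def act(self, features: Sequence[float], horizon: int) -> List[List[float]]:
        return [[0.0] * self.dof for _ in range(horizon)]


class PluggableVLA:
    """Backbone plus a replaceable action expert (illustrative only)."""

    def __init__(self, encode: Callable[[str], List[float]],
                 expert: ActionExpert) -> None:
        self.encode = encode
        self.expert = expert

    def swap_expert(self, expert: ActionExpert) -> None:
        # The backbone is untouched; only the action module changes.
        self.expert = expert

    def run(self, instruction: str, horizon: int = 4) -> List[List[float]]:
        return self.expert.act(self.encode(instruction), horizon)


def toy_encode(instruction: str) -> List[float]:
    """Stand-in for a vision-language backbone: one length-based feature."""
    return [float(len(instruction))]


if __name__ == "__main__":
    vla = PluggableVLA(toy_encode, ZeroExpert(dof=7))
    print(len(vla.run("fold the shirt")[0]))   # action width for a 7-DoF arm
    vla.swap_expert(ZeroExpert(dof=14))
    print(len(vla.run("fold the shirt")[0]))   # action width after the swap
```

The design choice is that the backbone only depends on the `act` signature, so changing hardware means providing a new expert object and leaving the encoder alone.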

Load-bearing premise

Pre-training the diffusion expert on cross-embodiment data produces action representations that transfer effectively when plugged into a new VLA without requiring embodiment-specific action fine-tuning.

What would settle it

A controlled test on a previously unseen robot embodiment in which the model requires substantial embodiment-specific action fine-tuning to reach the reported success rate on a long-horizon task such as laundry folding would falsify the transfer claim.

Original abstract

Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DexVLA scales a separable 1B diffusion expert with a three-stage curriculum for cross-embodiment VLAs, but the results do not isolate whether the pre-training step actually drives the reported gains.

DexVLA adds a large diffusion-based action expert that plugs into a vision-language model and gets pre-trained separately on data from multiple robot embodiments. The three-stage curriculum (pre-train the expert, align the VLA to the embodiment, then post-train for tasks) is the main new piece, and they report it works across single-arm, bimanual, and dexterous setups for long-horizon jobs like folding laundry with plain language commands.

The paper does a decent job laying out why action representation is a bottleneck in current VLAs and how scaling the expert to a billion parameters might help with generalization. Testing on varied embodiments without heavy per-robot fine-tuning is a practical direction, and keeping the expert separable could make it easier to reuse across models.

The main weakness is that the results do not separate the pre-training of the expert from the alignment and adaptation stages. Without an ablation that keeps the later steps the same and varies only the cross-embodiment pre-training, it is difficult to know if the plug-in design is really carrying the load or if it is just the overall training recipe and model size. The abstract also leaves out the specific performance numbers and any error bars, so the claimed outperformance over Octo, OpenVLA, and Diffusion Policy is hard to assess from what is here.

This paper is for researchers focused on scaling robot policies to new hardware with limited data. A reader interested in diffusion policies or VLA architectures could pick up useful details on how to structure the training stages, but the lack of isolating experiments means it needs careful follow-up work. I think it deserves a serious referee. The core idea targets a genuine limitation in the field, and the experiments, once the numbers are in, could be worth discussing even if revisions are needed to strengthen the claims.

Referee Report

2 major / 2 minor

Summary. The paper introduces DexVLA, a vision-language-action model featuring a separable 1B-parameter diffusion-based action expert pre-trained on cross-embodiment data. It proposes a three-stage curriculum—(1) pre-training the diffusion expert, (2) aligning the VLA to target embodiments, and (3) post-training for task adaptation—to enable superior performance on complex, long-horizon tasks (e.g., laundry folding) across single-arm, bimanual, and dexterous-hand embodiments using only direct language prompts, outperforming baselines such as Octo, OpenVLA, and Diffusion Policy.

Significance. If the central claims hold after proper isolation of components, the separable diffusion expert could meaningfully advance scalable robot learning by decoupling high-capacity action representation from the VLM backbone, potentially improving data efficiency and cross-embodiment transfer for long-horizon tasks.

major comments (2)

[embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.
[Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.
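The ablation requested in the first major comment can be laid out as a run grid: stage 1 varies, stages 2 and 3 stay fixed. The sketch below is an illustration of that design, not an experiment from the paper. The variant labels mirror the referee's wording ("removing or randomizing"), and the seed count and dictionary keys are assumptions.

```python
from itertools import product

# Only stage 1 varies; labels follow the referee's wording and are hypothetical.
STAGE1_VARIANTS = ("cross_embodiment", "removed", "randomized")

# Stages 2 and 3 are held identical across every run.
FIXED_STAGES = {"stage2": "align_to_target_embodiment",
                "stage3": "post_train_on_task"}

# Embodiment families named in the abstract.
EMBODIMENTS = ("single_arm", "bimanual", "dexterous_hand")


def ablation_runs(seeds=(0, 1, 2)):
    """Enumerate every (stage-1 variant, embodiment, seed) combination."""
    runs = []
    for variant, embodiment, seed in product(STAGE1_VARIANTS, EMBODIMENTS, seeds):
        run = {"stage1": variant, "embodiment": embodiment, "seed": seed}
        run.update(FIXED_STAGES)
        runs.append(run)
    return runs


if __name__ == "__main__":
    runs = ablation_runs()
    print(len(runs), "runs")
    print(runs[0])
```

Because the only field that differs between matched runs is `stage1`, any gap in success rate within an embodiment and seed can be attributed to the pre-training stage and not to alignment, post-training, or model size.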

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., success rate delta) to ground the superiority claim.
[curriculum learning strategy] Clarify whether the 1B-parameter diffusion expert remains frozen during VLA alignment or receives any gradient updates in stages 2–3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the attribution of results and the clarity of claims.

Referee: [embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.

Authors: We agree that an explicit ablation isolating the cross-embodiment pre-training stage, while keeping VLA alignment and post-training fixed, would provide stronger causal evidence for the separable expert's contribution. Our current comparisons to baselines (Octo, OpenVLA, Diffusion Policy) that lack this pre-training offer indirect support, but we acknowledge the referee's point. We will add a dedicated ablation study in the revised manuscript that directly removes or randomizes the cross-embodiment pre-training phase under otherwise identical conditions.

revision: yes

Referee: [Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.

Authors: We accept this criticism. The current abstract is qualitative and does not convey the scale of improvements. We will revise the abstract to include key quantitative results (success rates with standard deviations) for the main long-horizon tasks across embodiments, along with explicit pointers to the full ablation tables and error-bar plots already present in the experimental section.

revision: yes
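Reporting a success rate with a measure of spread, as the second response promises, takes little code. The sketch below computes a binomial standard error and a Wilson score interval from trial counts. The function name and the counts in the usage lines are made up for illustration and are not results from the paper.

```python
import math


def success_summary(successes: int, trials: int, z: float = 1.96) -> dict:
    """Success rate, binomial standard error, and Wilson score interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    std_err = math.sqrt(p * (1.0 - p) / trials)
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return {"rate": p, "std_err": std_err,
            "wilson_low": center - half, "wilson_high": center + half}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(success_summary(successes=8, trials=10))
```

The Wilson interval is used here because robot evaluations often have few trials per task, where the plain normal approximation can produce intervals that spill outside the range from 0 to 1.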

Circularity Check

0 steps flagged

No circularity: empirical claims rest on training and benchmarks, not self-referential derivation

The manuscript describes an empirical training curriculum (pre-train separable diffusion expert on cross-embodiment data, then align VLA, then post-train) and reports performance deltas versus Octo/OpenVLA/Diffusion Policy on long-horizon tasks. No equations, uniqueness theorems, or fitted parameters are presented as predictions; the central claims are benchmark results, not derivations that reduce to their own inputs by construction. No self-citations of prior author work are invoked as load-bearing mathematical facts. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on standard assumptions of diffusion models for continuous action spaces and the transferability of pre-trained action priors across robot morphologies.

free parameters (1)

diffusion expert parameter count = 1 billion
Scaled to one billion parameters to increase capacity for cross-embodiment action modeling.

axioms (1)

domain assumption: Diffusion models can represent complex robot action distributions from cross-embodiment data
Invoked to justify pre-training the separable action expert.

invented entities (1)

plug-in diffusion action expert (no independent evidence)
purpose: Separate high-capacity action generator that can be pre-trained independently and inserted into VLA models
New architectural component introduced to address action representation bottlenecks.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-time Sparsity for Extreme Fast Action Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO 2026-04 unverdicted novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
cs.RO 2026-05 unverdicted novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
cs.CV 2026-05 unverdicted novelty 6.0

ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
cs.RO 2026-04 unverdicted novelty 6.0

SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
cs.RO 2025-06 unverdicted novelty 6.0

RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
cs.LG 2025-06 unverdicted novelty 6.0

SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
cs.LG 2026-05 unverdicted novelty 5.0

Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...
Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
cs.RO 2026-04 unverdicted novelty 5.0

Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
Towards Robotic Dexterous Hand Intelligence: A Survey
cs.RO 2026-05 unverdicted novelty 4.0

A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 22 Pith papers · 18 internal anchors

[1]

T. Lin, Y . Zhang, Q. Li, H. Qi, B. Yi, S. Levine, and J. Malik. Learning visuotactile skills with two multifingered hands. arXiv preprint arXiv:2404.16823, 2024

work page arXiv 2024
[2]

H. Shi, H. Xu, S. Clarke, Y . Li, and J. Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. arXiv preprint arXiv:2306.14447, 2023

work page arXiv 2023
[4]

Zhang, Z.-H

K. Zhang, Z.-H. Yin, W. Ye, and Y . Gao. Learning manipulation skills through robot chain-of- thought with sparse failure guidance. arXiv preprint arXiv:2405.13573, 2024

work page arXiv 2024
[5]

A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasp- 9 ing and cross-domain image matching. The International Journal of Robotics Research, 41(7): 690–705, 2022

work page 2022
[6]

Qin, Y .-H

Y . Qin, Y .-H. Wu, S. Liu, H. Jiang, R. Yang, Y . Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022

work page 2022
[7]

Reuss, ¨O

M. Reuss, ¨O. E. Ya˘gmurlu, F. Wenzel, and R. Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. 2024

work page 2024
[8]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Y . Hu, F. Lin, T. Zhang, L. Yi, and Y . Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023

work page arXiv 2023
[12]

J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

work page arXiv 2025
[13]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025

work page arXiv 2025
[18]

W. Zhao, P. Ding, M. Zhang, Z. Gong, S. Bai, H. Zhao, and D. Wang. Vlas: Vision-language- action model with speech instructions for customized robot manipulation. arXiv preprint arXiv:2502.13508, 2025

work page arXiv 2025
[19]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision- language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. 10

work page
[21]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

work page 2024
[22]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024

work page arXiv 2024
[24]

H. Ha, Y . Gao, Z. Fu, J. Tan, and S. Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024

work page arXiv 2024
[25]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: Personalized robot assistance with large language models. Au- tonomous Robots, 47(8):1087–1102, 2023

work page 2023
[26]

Xiang, Y

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

work page 2020
[27]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

work page 2024
[29]

M. Zhu, Y . Zhu, J. Li, J. Wen, Z. Xu, N. Liu, R. Cheng, C. Shen, Y . Peng, F. Feng, et al. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation.arXiv preprint arXiv:2409.14411, 2024

work page arXiv 2024
[30]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

V . Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. arxiv 2019. arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[32]

L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo, S. Levine, and C. Finn. Yell at your robot: Improving on-the-fly from language corrections.arXiv preprint arXiv:2403.12910, 2024

work page arXiv 2024
[33]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[35]

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020. 11

work page 2020
[37]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

work page 2022
[38]

H. Ha, P. Florence, and S. Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023

work page 2023
[39]

Radosavovic, T

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023

work page 2023
[40]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

work page internal anchor Pith review arXiv 2024
[41]

Geiger, P

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

work page 2013
[42]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

J. Wen, Y . Zhu, J. Li, M. Zhu, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, Y . Peng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024

[44]

Z. Zhang, K. Zheng, Z. Chen, J. Jang, Y. Li, C. Wang, M. Ding, D. Fox, and H. Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024

[45]

Y. Guo, J. Zhang, X. Chen, X. Ji, Y.-J. Wang, Y. Hu, and J. Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

[46]

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

[47]

L. Yen-Chen, A. Zeng, S. Song, P. Isola, and T.-Y. Lin. Learning to see before learning to act: Visual pre-training for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7293. IEEE, 2020

[48]

Y. Du, M. Simchowitz, R. Tedrake, V. Sitzmann, B. Chen, and D. M. Monso. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 3, 2024

[49]

W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[50]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

[51]

T. Z. Zhao, J. Tompson, D. Driess, P. Florence, S. K. S. Ghasemipour, C. Finn, and A. Wahid. Aloha unleashed: A simple recipe for robot dexterity. In 8th Annual Conference on Robot Learning

[52]

Y. Wang, Y. Zhang, M. Huo, R. Tian, X. Zhang, Y. Xie, C. Xu, P. Ji, W. Zhan, M. Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024

[53]

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. arXiv preprint arXiv:2405.07503, 2024.

[54]

M. Uehara, Y. Zhao, K. Black, E. Hajiramezanali, G. Scalia, N. L. Diamant, A. M. Tseng, T. Biancalani, and S. Levine. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194, 2024

[55]

M. Uehara, Y. Zhao, K. Black, E. Hajiramezanali, G. Scalia, N. L. Diamant, A. M. Tseng, S. Levine, and T. Biancalani. Feedback efficient online fine-tuning of diffusion models. arXiv preprint arXiv:2402.16359, 2024

[56]

K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

[57]

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

[58]

S. Dasari, O. Mees, S. Zhao, M. K. Srirama, and S. Levine. The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088, 2024

[59]

F. Lin, Y. Hu, P. Sheng, C. Wen, J. You, and Y. Gao. Data scaling laws in imitation learning for robotic manipulation, 2024. URL https://arxiv.org/abs/2410.18647

[60]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

[61]

Y. Wang, L. Wang, Y. Du, B. Sundaralingam, X. Yang, Y.-W. Chao, C. Perez-D’Arpino, D. Fox, and J. Shah. Inference-time policy steering through human interactions. arXiv preprint arXiv:2411.16627, 2024

[62]

N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022

[63]

Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

[64]

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024

[65]

Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024

[66]

Y. Ze, Z. Chen, W. Wang, T. Chen, X. He, Y. Yuan, X. B. Peng, and J. Wu. Generalizable humanoid manipulation with improved 3d diffusion policies. arXiv preprint arXiv:2410.10803, 2024

[67]

G. Yan, Y.-H. Wu, and X. Wang. Dnact: Diffusion guided multi-task 3d policy learning. arXiv preprint arXiv:2403.04115, 2024

[68]

X. Jia, Q. Wang, A. Donat, B. Xing, G. Li, H. Zhou, O. Celik, D. Blessing, R. Lioutikov, and G. Neumann. Mail: Improving imitation learning with selective state space models. In 8th Annual Conference on Robot Learning

[69]

J. Wen, M. Zhu, Y. Zhu, Z. Tang, J. Li, Z. Zhou, C. Li, X. Liu, Y. Peng, C. Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024.

[70]

K. Wu, Y. Zhu, J. Li, J. Wen, N. Liu, Z. Xu, Q. Qiu, and J. Tang. Discrete policy: Learning disentangled action space for multi-task robotic manipulation. arXiv preprint arXiv:2409.18707, 2024

[71]

L. Wang, K. Zhang, A. Zhou, M. Simchowitz, and R. Tedrake. Fleet policy learning via weight merging and an application to robotic tool-use. arXiv preprint arXiv:2310.01362, 2023

[72]

L. Wang, J. Zhao, Y. Du, E. H. Adelson, and R. Tedrake. Poco: Policy composition from and for heterogeneous robot learning. arXiv preprint arXiv:2402.02511, 2024

[73]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. arXiv preprint arXiv:2409.20537, 2024.

[Figure 10 panels: Unseen Drink and Unseen Cup; Unseen Scene and Unseen Cup; Unseen White T-shirt and Unseen Scene; Unseen Scene]
Figure 10: Example of visual generalization. Here lists some visual generalization set...
