DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Pith reviewed 2026-05-14 19:43 UTC · model grok-4.3
The pith
DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DexVLA introduces a diffusion-based action expert scaled to one billion parameters that is pre-trained on cross-embodiment data and remains separable from the vision-language component. A curriculum of pre-training the expert on mixed robot data, aligning the VLA to the target embodiment, and post-training for new tasks produces a system that completes complex, long-horizon behaviors on single-arm, bimanual, and dexterous-hand robots using only direct language prompts and without embodiment-specific action fine-tuning.
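The three-stage curriculum can be written down as a small data structure. This is only an illustrative outline: the names `Stage`, `CURRICULUM`, `run_order`, and `validate` are hypothetical, the data descriptions for stages 2 and 3 are assumptions, and the page does not say which parameters are updated in each stage, so that detail is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    order: int   # position in the curriculum
    name: str    # hypothetical identifier for the stage
    data: str    # kind of data used (stages 2 and 3 are assumptions)
    goal: str    # what the stage is meant to achieve


# Hypothetical encoding of the three stages named in the abstract.
CURRICULUM = (
    Stage(1, "expert_pretraining", "cross-embodiment robot data",
          "pre-train the separable diffusion expert"),
    Stage(2, "embodiment_alignment", "target-embodiment data",
          "align the VLA to a specific embodiment"),
    Stage(3, "task_post_training", "new-task data",
          "adapt rapidly to new tasks"),
)


def run_order(stages):
    """Return stage names sorted by their declared order."""
    return [s.name for s in sorted(stages, key=lambda s: s.order)]


def validate(stages):
    """Check that the orders are exactly 1..N with no gaps or repeats."""
    stages = list(stages)
    return sorted(s.order for s in stages) == list(range(1, len(stages) + 1))
```

Calling `run_order(CURRICULUM)` lists the stages in execution order, and `validate` guards against a stage being skipped or repeated.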
What carries the argument
The plug-in diffusion expert: a one-billion-parameter model pre-trained on cross-embodiment robot trajectories that generates actions when inserted into a vision-language backbone.
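The "plug-in" property can be illustrated with a toy interface in which the action module is replaceable while the backbone stays untouched. This sketch shows only the composition pattern and contains no diffusion model; `ToyBackbone`, `Policy`, `arm_expert`, `hand_expert`, and the action sizes 7 and 16 are all hypothetical stand-ins, not anything taken from the paper.

```python
from typing import Callable, List, Sequence

Embedding = List[float]
Action = List[float]

# An "expert" here is any callable from a conditioning embedding to actions.
ActionExpert = Callable[[Embedding], List[Action]]


class ToyBackbone:
    """Stand-in for a vision-language backbone (hypothetical)."""

    def encode(self, image: Sequence[float], prompt: str) -> Embedding:
        # Toy featurization: mean pixel value and prompt length.
        return [sum(image) / len(image), float(len(prompt))]


class Policy:
    """Composes a backbone with a replaceable action expert."""

    def __init__(self, backbone: ToyBackbone, expert: ActionExpert):
        self.backbone = backbone
        self.expert = expert

    def act(self, image: Sequence[float], prompt: str) -> List[Action]:
        return self.expert(self.backbone.encode(image, prompt))


def arm_expert(z: Embedding) -> List[Action]:
    # Toy expert emitting one 7-dimensional action (size is arbitrary).
    return [[z[0]] * 7]


def hand_expert(z: Embedding) -> List[Action]:
    # Toy expert emitting one 16-dimensional action (size is arbitrary).
    return [[z[0]] * 16]
```

Assigning `policy.expert = hand_expert` on an existing `Policy` changes the action space without touching the backbone, which is the separation the design relies on.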
If this is right
- The system controls single-arm, bimanual, and dexterous-hand robots without task-specific adaptation.
- Dexterous skills can be acquired on novel embodiments with only limited data.
- Complex long-horizon tasks such as laundry folding are completed using only direct language prompting.
- Performance exceeds that of Octo, OpenVLA, and Diffusion Policy across the tested embodiments.
Where Pith is reading between the lines
- The separable expert design could let developers swap in new action modules when hardware changes without retraining the language-understanding layers.
- Rapid post-training adaptation implies that household robots might acquire new multi-step chores from short verbal descriptions rather than lengthy demonstrations.
- If the cross-embodiment pre-training generalizes further, the same expert might support robots whose kinematics differ substantially from the training set.
Load-bearing premise
Pre-training the diffusion expert on cross-embodiment data produces action representations that transfer effectively when plugged into a new VLA without requiring embodiment-specific action fine-tuning.
What would settle it
A controlled test on a previously unseen robot embodiment in which the model requires substantial embodiment-specific action fine-tuning to reach the reported success rate on a long-horizon task such as laundry folding would falsify the transfer claim.
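The decision rule in this test can be made explicit. A minimal sketch, assuming trial outcomes are recorded as booleans and that a shortfall larger than some margin counts as "requiring substantial fine-tuning"; the function names and the default `tolerance` of 0.1 are assumptions chosen for illustration, not values from the paper.

```python
def success_rate(outcomes):
    """Fraction of successful trials; `outcomes` is a sequence of booleans."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def transfer_claim_falsified(no_finetune_outcomes, reported_rate, tolerance=0.1):
    """Return True if the success rate without embodiment-specific action
    fine-tuning falls short of the reported rate by more than `tolerance`
    (an assumed margin)."""
    return success_rate(no_finetune_outcomes) < reported_rate - tolerance
```

For example, 3 successes in 10 trials against a hypothetical reported rate of 0.8 would count against the transfer claim, while 8 in 10 would not.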
Original abstract
Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DexVLA, a vision-language-action model featuring a separable 1B-parameter diffusion-based action expert pre-trained on cross-embodiment data. It proposes a three-stage curriculum—(1) pre-training the diffusion expert, (2) aligning the VLA to target embodiments, and (3) post-training for task adaptation—to enable superior performance on complex, long-horizon tasks (e.g., laundry folding) across single-arm, bimanual, and dexterous-hand embodiments using only direct language prompts, outperforming baselines such as Octo, OpenVLA, and Diffusion Policy.
Significance. If the central claims hold after proper isolation of components, the separable diffusion expert could meaningfully advance scalable robot learning by decoupling high-capacity action representation from the VLM backbone, potentially improving data efficiency and cross-embodiment transfer for long-horizon tasks.
major comments (2)
- [embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.
- [Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.
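The ablation requested in the first major comment amounts to a small condition grid: vary only the stage-1 pre-training while holding stages 2 and 3 fixed. The sketch below enumerates such a grid; the variant labels, dictionary keys, and the function name `ablation_grid` are hypothetical, and the embodiment list is taken from the abstract.

```python
from itertools import product

# Stage-1 conditions to compare (labels are hypothetical).
STAGE1_VARIANTS = ("cross_embodiment", "none", "randomized")

# Stages held fixed across every run, as the comment requires.
FIXED = {"stage2_alignment": True, "stage3_post_training": True}

# Embodiments named in the abstract.
EMBODIMENTS = ("single_arm", "bimanual", "dexterous_hand")


def ablation_grid():
    """One run per (stage-1 variant, embodiment); stages 2 and 3 unchanged."""
    return [
        {"stage1": variant, "embodiment": embodiment, **FIXED}
        for variant, embodiment in product(STAGE1_VARIANTS, EMBODIMENTS)
    ]
```

With three variants and three embodiments this yields nine runs, and any performance gap between the `cross_embodiment` rows and the others can then be attributed to stage 1 alone.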
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., success rate delta) to ground the superiority claim.
- [curriculum learning strategy] Clarify whether the 1B-parameter diffusion expert remains frozen during VLA alignment or receives any gradient updates in stages 2–3.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the attribution of results and the clarity of claims.
Point-by-point responses
-
Referee: [embodiment curriculum learning strategy] The central claim attributes performance gains to the plug-in diffusion expert pre-trained on cross-embodiment data, yet the manuscript provides no ablation that holds VLA alignment and post-training fixed while removing or randomizing the cross-embodiment pre-training stage. This omission prevents attribution of the reported deltas versus Octo/OpenVLA/Diffusion Policy to the separable expert rather than joint training or scale.
Authors: We agree that an explicit ablation isolating the cross-embodiment pre-training stage—while keeping VLA alignment and post-training fixed—would provide stronger causal evidence for the separable expert's contribution. Our current comparisons to baselines (Octo, OpenVLA, Diffusion Policy) that lack this pre-training offer indirect support, but we acknowledge the referee's point. We will add a dedicated ablation study in the revised manuscript that directly removes or randomizes the cross-embodiment pre-training phase under otherwise identical conditions. revision: yes
-
Referee: [Abstract] Abstract and experimental claims of outperformance on multiple embodiments lack any quantitative metrics, error bars, or detailed ablation tables; without these, the magnitude and statistical reliability of improvements on long-horizon tasks cannot be assessed.
Authors: We accept this criticism. The current abstract is qualitative and does not convey the scale of improvements. We will revise the abstract to include key quantitative results (success rates with standard deviations) for the main long-horizon tasks across embodiments, along with explicit pointers to the full ablation tables and error-bar plots already present in the experimental section. revision: yes
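The reporting format promised in this response, success rates with standard deviations, can be computed with the standard library. A minimal sketch, assuming one success rate per independent run; the function names are hypothetical and any input numbers used with it here are made up rather than results from the paper.

```python
from statistics import mean, stdev


def summarize(rates):
    """Mean and sample standard deviation of per-run success rates."""
    if len(rates) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    return mean(rates), stdev(rates)


def format_result(rates):
    """Render per-run rates in [0, 1] as 'mean +/- std' in percent."""
    m, s = summarize(rates)
    return f"{100 * m:.1f} +/- {100 * s:.1f}%"
```

Reporting the sample standard deviation alongside the mean lets a reader judge whether a gap between two methods is larger than the run-to-run variation.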
Circularity Check
No circularity: empirical claims rest on training and benchmarks, not self-referential derivation
Full rationale
The manuscript describes an empirical training curriculum (pre-train separable diffusion expert on cross-embodiment data, then align VLA, then post-train) and reports performance deltas versus Octo/OpenVLA/Diffusion Policy on long-horizon tasks. No equations, uniqueness theorems, or fitted parameters are presented as predictions; the central claims are benchmark results, not derivations that reduce to their own inputs by construction. No self-citations of prior author work are invoked as load-bearing mathematical facts. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion expert parameter count = 1 billion
axioms (1)
- domain assumption: Diffusion models can represent complex robot action distributions from cross-embodiment data
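For context, one common way to model an action distribution with diffusion is the denoising formulation of Ho et al. (DDPM), conditioned on observations and language. Whether DexVLA uses exactly this variant is not stated on this page, so the following is an assumption. Here $a_0$ is a clean action chunk, $c$ the conditioning input, $\beta_s$ the noise schedule, and $\epsilon_\theta$ the denoising network.

```latex
q(a_t \mid a_0) = \mathcal{N}\!\left(a_t;\ \sqrt{\bar\alpha_t}\,a_0,\ (1-\bar\alpha_t)\,I\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)

\mathcal{L}(\theta) = \mathbb{E}_{a_0,\,c,\,t,\,\epsilon \sim \mathcal{N}(0, I)}
\left[\, \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,a_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t,\ c\right) \right\|^2 \right]
```

Under this formulation the axiom amounts to assuming that a single network $\epsilon_\theta$ can fit the noise-prediction objective across trajectories from many embodiments.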
invented entities (1)
- plug-in diffusion action expert: no independent evidence
Forward citations
Cited by 23 Pith papers
-
Test-time Sparsity for Extreme Fast Action Diffusion
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
-
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
Towards Robotic Dexterous Hand Intelligence: A Survey
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.