DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
Pith reviewed 2026-06-28 21:56 UTC · model grok-4.3
The pith
DeMaVLA shows that one VLA model can acquire generalizable folding skills for varied household objects by mixing demonstrations with corrective trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeMaVLA adopts a VLM backbone with an action expert that uses flow matching for continuous actions; the expert is built by pruning every other transformer layer while keeping alignment with the backbone. The model is first pre-trained on approximately 5,000 hours of real-world dual-arm demonstrations to learn general manipulation priors, then post-trained on mixed folding data that combines self-collected demonstrations and corrective trajectories gathered from multiple folding tasks via a human-in-the-loop DAgger pipeline. On this basis the model reaches competitive performance on RoboTwin 2.0 and strong results on a household folding benchmark involving diverse items and scenes.
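A minimal sketch of the flow-matching objective mentioned above, assuming the common straight-line path between Gaussian noise and the demonstrated action. The summary does not say which path or time convention DeMaVLA uses, so this is an illustration, not the paper's method. The function names and the `velocity_fn` callable are hypothetical stand-ins for the action expert, and plain Python lists replace tensors to keep the example self-contained.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, action, rng):
    """Single-sample mean squared error between predicted and target velocity.

    velocity_fn(x_t, t) is a hypothetical stand-in for the action expert.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    predicted = velocity_fn(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(action)


def sample_action(velocity_fn, dim, rng, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)  # velocity at the left end of the step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Under this convention, sampling starts from noise and takes a fixed number of Euler steps, so the step count is the main knob trading action quality against inference latency.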
What carries the argument
Pruned action expert aligned with the VLM backbone that generates continuous actions via flow matching, trained with a DAgger pipeline that aggregates demonstrations and corrective trajectories across multiple folding tasks.
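The aggregation step of a human-in-the-loop DAgger pipeline can be sketched as follows. This is a toy illustration under stated assumptions: `Trajectory`, `AggregatedDataset`, `rollout_fn`, and `expert_fn` are hypothetical names, and the summary gives no detail on how interventions are triggered or stored in the actual system.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str     # e.g. "shirt" or "towel" (hypothetical task labels)
    steps: list   # (observation, action) pairs
    source: str   # "demo" or "correction"


@dataclass
class AggregatedDataset:
    trajectories: list = field(default_factory=list)

    def add(self, trajectory):
        self.trajectories.append(trajectory)

    def count(self, source):
        return sum(1 for t in self.trajectories if t.source == source)


def dagger_round(policy, tasks, rollout_fn, expert_fn, dataset):
    """One aggregation round over all tasks.

    rollout_fn(policy, task) returns (observations, intervened), where
    intervened[i] is True if the human took over at step i.
    expert_fn(task, observation) returns the human's corrective action.
    Only the intervened steps are stored, labelled as corrections.
    """
    for task in tasks:
        observations, intervened = rollout_fn(policy, task)
        corrected = [
            (obs, expert_fn(task, obs))
            for obs, flag in zip(observations, intervened)
            if flag
        ]
        if corrected:
            dataset.add(Trajectory(task, corrected, "correction"))
    return dataset
```

After each round the policy would be retrained on the full aggregated set, demonstrations and corrections together, which is what distinguishes DAgger from training on the latest failures alone.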
If this is right
- A single policy replaces separate category-specific folding policies.
- Layer pruning reduces training and inference cost while preserving performance.
- Corrective trajectories from real failures improve robustness across varied initial states and materials.
- Pre-training on broad dual-arm data supplies reusable priors that transfer to deformable tasks.
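To make the pruning claim concrete, here is a small sketch of one way "every other layer" could be selected while keeping a one-to-one mapping back to backbone layers. Which parity is kept and how the alignment is realized are not stated in the summary, so the `offset` parameter and the mapping are assumptions made for illustration.

```python
def kept_layers(num_backbone_layers, stride=2, offset=0):
    """Backbone layer indices that get a matching action-expert layer."""
    return list(range(offset, num_backbone_layers, stride))


def alignment_map(num_backbone_layers, stride=2, offset=0):
    """Map each expert layer index to the backbone layer it is aligned with."""
    kept = kept_layers(num_backbone_layers, stride, offset)
    return {expert_idx: backbone_idx for expert_idx, backbone_idx in enumerate(kept)}


def depth_fraction(num_backbone_layers, stride=2, offset=0):
    """Fraction of backbone depth retained by the pruned expert."""
    return len(kept_layers(num_backbone_layers, stride, offset)) / num_backbone_layers
```

With a stride of 2 the expert keeps about half the depth, which is the source of the cost reduction; whether performance is preserved is an empirical question the paper's ablations would need to answer.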
Where Pith is reading between the lines
- The same aggregation of corrective data could be applied to other multi-task robotic domains such as unfolding or packing.
- Efficiency from the pruned expert might allow the model to run on lower-power home robots.
- Further scaling of the pre-training corpus could extend the same generalizability pattern to non-folding manipulation skills.
Load-bearing premise
Aggregating demonstrations and corrective trajectories from multiple folding tasks will avoid task interference or overfitting to the collected failure modes and scenes.
What would settle it
If the model is evaluated on a new clothing category or household scene outside the collected data and its success rate drops to the level of naive mixed training without DAgger corrections, the generalizability claim would be refuted.
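One way to operationalize this check is a one-sided two-proportion z-test on held-out success counts. The sketch below is only an illustration: the choice of test is an assumption, the function names are hypothetical, and no trial counts from the paper are used.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """z statistic for the difference in success rates (A minus B), pooled variance."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        return 0.0  # both arms all-success or all-failure: no evidence either way
    return (p_a - p_b) / se


def one_sided_p_value(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Here A would be the DAgger-trained model and B the naive mixed-training baseline, both evaluated on the same unseen category or scene. A large p-value would mean the held-out data cannot distinguish the two, which is the outcome the falsification condition describes.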
Original abstract
Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeMaVLA, a VLA foundation model for generalizable deformable manipulation. It uses a VLM backbone with a pruned action expert and flow matching for action generation, pre-trains on ~5,000 hours of real-world dual-arm data, and post-trains on mixed folding demonstrations plus corrective DAgger trajectories collected via human-in-the-loop across multiple tasks to overcome category-specific policies and task interference. The central empirical claim is competitive performance on RoboTwin 2.0 and strong real-world results on a household folding benchmark.
Significance. If the results hold with proper validation, the combination of large-scale real-world pre-training, efficient architecture, and corrective DAgger for multi-task deformable manipulation would represent a meaningful step toward reusable VLA policies that generalize across object categories, geometries, and scenes without per-category retraining.
major comments (2)
- [Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.
- [Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.
minor comments (1)
- [Abstract] The phrase 'approximately 5,000 hours' should be accompanied by exact collection statistics or ranges when the full experimental section is presented.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results and experimental details.
- Referee: [Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report specific success rates on RoboTwin 2.0 (with baseline comparisons) and the household folding benchmark. Error bars, data splits, and statistical details remain in the main experiments section, but we will add a brief reference to them in the abstract for context.
  Revision: yes.
- Referee: [Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.
  Authors: The statement on task interference in naive multi-task training reflects observations from our preliminary development runs. The DAgger pipeline mitigates this through targeted corrective data collection on real-robot failures. We acknowledge that the current version lacks explicit ablations or quantitative metrics isolating the interference reduction. In revision, we will add an ablation comparing naive mixed training against the DAgger approach, reporting multi-task success rates and interference indicators.
  Revision: yes.
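The error bars and "interference indicators" discussed in this exchange could be reported along the following lines. This is a hedged sketch: the Wilson score interval is one standard choice for binomial success rates, not something the paper or rebuttal specifies, and `interference_gap` is a hypothetical indicator defined here purely for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


def interference_gap(single_task_rates, mixed_rates):
    """Per-task drop in success rate when moving from single-task to mixed training.

    Positive values suggest the task is hurt by mixing; this definition is an
    assumption, not a metric taken from the paper.
    """
    return {task: single_task_rates[task] - mixed_rates[task] for task in single_task_rates}
```

Reporting the gap per task, with intervals on each underlying rate, would let a reader see whether the DAgger-trained model closes the gap that naive mixing opens.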
Circularity Check
No circularity: empirical benchmark claims rest on external data evaluation
The paper describes a data-driven VLA model pre-trained on real-world demonstrations and post-trained via DAgger on aggregated folding trajectories, with performance claims evaluated on RoboTwin 2.0 and a household folding benchmark. No derivation chain, equations, or predictions are presented that reduce by construction to fitted parameters or self-citations defined inside the paper; the central results are empirical and externally falsifiable via benchmark metrics rather than internally forced.