
arxiv: 2602.11236 · v2 · submitted 2026-02-11 · 💻 cs.CV · cs.CL · cs.RO

ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Pith reviewed 2026-05-16 03:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.RO
keywords robotic manipulation · action manifold learning · vision language action model · diffusion transformer · foundation model · embodied intelligence · data curation · policy stability

The pith

ABot-M0 learns continuous robot action sequences by projecting directly onto feasible low-dimensional manifolds using a DiT backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ABot-M0 as a vision-language-action foundation model that unifies data from six public datasets into over 6 million trajectories for pre-training. It introduces the Action Manifold Hypothesis asserting that robot actions reside on a low-dimensional smooth manifold shaped by physics and constraints, and develops Action Manifold Learning to predict clean actions directly with a diffusion transformer. This change in objective from denoising to manifold projection is intended to boost decoding speed and policy stability while supporting cross-platform generalization. A reader would care because it tackles fragmented data and inefficient action generation in scaling embodied agents to varied hardware.

Core claim

The central discovery is that effective robot actions lie on a low-dimensional smooth manifold, and Action Manifold Learning with a DiT backbone can predict clean continuous action sequences directly by projecting onto this manifold, shifting away from denoising processes to achieve faster decoding and greater policy stability in robotic manipulation.

What carries the argument

Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly by projecting onto feasible manifolds defined by physical laws and task constraints.
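
The intuition behind predicting clean actions rather than noise can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a flat linear subspace as a stand-in for the action manifold, and the sizes `D`, `d`, `n` and the linear interpolation used for corruption are all hypothetical choices. It only shows that, under the hypothesis, the clean-action target is low-rank while the noise target fills the whole ambient space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: ambient action dimension D, assumed manifold dimension d.
D, d, n = 32, 4, 2000

# Toy stand-in for the "action manifold": a flat d-dimensional subspace of R^D.
basis = rng.standard_normal((d, D))
clean_actions = rng.standard_normal((n, d)) @ basis

# Corrupted network input: interpolation between clean action and Gaussian noise.
noise = rng.standard_normal((n, D))
t = rng.uniform(0.0, 1.0, size=(n, 1))
noisy_actions = (1.0 - t) * clean_actions + t * noise


def effective_rank(x, tol=1e-8):
    """Count singular values above tol times the largest one, after centering."""
    s = np.linalg.svd(x - x.mean(axis=0), compute_uv=False)
    return int(np.sum(s > tol * s[0]))


# Regression target under direct clean-action prediction: confined to d dimensions.
rank_clean_target = effective_rank(clean_actions)
# Regression target under noise prediction: spans all D dimensions.
rank_noise_target = effective_rank(noise)
# The network input is full-rank in both cases.
rank_input = effective_rank(noisy_actions)

print(rank_clean_target, rank_noise_target, rank_input)
```

In this toy setting both objectives see the same full-rank input, but only the clean-action target is confined to the low-dimensional subspace. Whether real robot actions behave this way is exactly what the hypothesis asserts and what the referee report asks the authors to measure.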

Load-bearing premise

Effective robot actions lie on a low-dimensional, smooth manifold governed by physical laws and task constraints rather than occupying the full high-dimensional space.

What would settle it

An experiment where action prediction via manifold projection shows no improvement in decoding speed or policy stability compared to standard diffusion denoising on the UniACT-dataset or similar robotic tasks.

Original abstract

Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ABot-M0, a VLA foundation model for robotic manipulation. It describes a systematic data curation pipeline that cleans, standardizes, and balances samples from six public datasets to produce the UniACT-dataset (over 6 million trajectories and 9,500 hours). The central technical contribution is the Action Manifold Hypothesis—that effective robot actions occupy a low-dimensional smooth manifold governed by physical laws and task constraints—together with Action Manifold Learning (AML), which replaces denoising diffusion with direct prediction of clean action sequences via a DiT backbone. A dual-stream perception module integrates VLM semantics with geometric priors from plug-and-play 3D components (VGGT, Qwen-Image-Edit). The authors state that the components operate independently and yield additive benefits in decoding speed, policy stability, and cross-platform generalization.

Significance. If the manifold hypothesis is substantiated and AML demonstrably reduces effective action dimensionality while improving speed and stability over standard regression or diffusion baselines, the work would offer a practical route toward more efficient general-purpose embodied agents. The scale of the unified UniACT-dataset and the modular perception design address real fragmentation issues in robotics data and VLM 3D reasoning. Releasing code and pipelines would further increase impact. Significance is currently limited by the absence of quantitative validation for the manifold claim.

major comments (2)
  1. [Abstract / AML description] Abstract and AML formulation: no manifold-regularization term, intrinsic-dimension analysis, or explicit constraint enforcement appears in the AML objective. It is therefore unclear whether DiT direct prediction actually projects onto a low-dimensional manifold or simply performs high-capacity regression in the original action space; without this distinction the claimed mechanistic advantage in speed and stability does not follow.
  2. [Experiments] Experiments section: the statement that “experiments show components operate independently with additive benefits” is unsupported by any reported metrics, ablation tables, success rates, latency numbers, or stability measures. Without these data the empirical grounding of the central claims cannot be evaluated.
minor comments (1)
  1. Notation for action sequences, DiT conditioning, and the dual-stream fusion mechanism should be formalized with equations to improve reproducibility.
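
One possible way to write down the distinction raised in the major and minor comments is sketched below. The notation is an assumption for illustration, not taken from the paper: $a_0$ is a clean action sequence, $c$ the conditioning (observations and instruction), $\epsilon$ Gaussian noise, $(\alpha_t, \sigma_t)$ a generic noise schedule, and $\epsilon_\theta$, $f_\theta$ hypothetical networks for the two parameterizations.

```latex
% Corrupted action sequence at noise level t (generic schedule, assumed)
a_t = \alpha_t \, a_0 + \sigma_t \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Noise-prediction (denoising) objective: the target \epsilon is full-dimensional
\mathcal{L}_{\text{noise}}(\theta)
  = \mathbb{E}_{a_0, c, \epsilon, t}
    \left\| \epsilon - \epsilon_\theta(a_t, t, c) \right\|_2^2

% Clean-action objective: the target a_0 is assumed to lie on the action manifold
\mathcal{L}_{\text{clean}}(\theta)
  = \mathbb{E}_{a_0, c, \epsilon, t}
    \left\| a_0 - f_\theta(a_t, t, c) \right\|_2^2
```

Neither loss contains an explicit manifold term. Any projection behaviour under $\mathcal{L}_{\text{clean}}$ would have to come from the distribution of $a_0$ itself, which is the point the first major comment presses.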

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and empirical support.

Point-by-point responses
  1. Referee: [Abstract / AML description] Abstract and AML formulation: no manifold-regularization term, intrinsic-dimension analysis, or explicit constraint enforcement appears in the AML objective. It is therefore unclear whether DiT direct prediction actually projects onto a low-dimensional manifold or simply performs high-capacity regression in the original action space; without this distinction the claimed mechanistic advantage in speed and stability does not follow.

    Authors: We agree the abstract and AML description do not explicitly include a regularization term or intrinsic-dimension analysis. The AML formulation trains the DiT to directly regress clean action sequences drawn from the physically constrained UniACT-dataset; the manifold structure is therefore induced by the data distribution rather than an added loss term. To substantiate the distinction from high-capacity regression, we will add to the methods section: (i) a formal description of the implicit projection, (ii) an intrinsic-dimension estimate of the action space (via PCA and nearest-neighbor methods on held-out trajectories), and (iii) quantitative comparisons of effective dimensionality, decoding latency, and action stability versus standard regression and diffusion baselines.

    revision: yes

  2. Referee: [Experiments] Experiments section: the statement that “experiments show components operate independently with additive benefits” is unsupported by any reported metrics, ablation tables, success rates, latency numbers, or stability measures. Without these data the empirical grounding of the central claims cannot be evaluated.

    Authors: We acknowledge that the current manuscript lacks the quantitative ablation data needed to support the independence and additivity claims. We will expand the experiments section with new tables reporting: task success rates, end-to-end latency, action-sequence variance (stability), and cross-platform generalization scores for the full model, each isolated component, and all pairwise combinations. These results will be compared against diffusion and regression baselines to demonstrate additive gains.

    revision: yes
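
The additivity claim in the second response can be checked with a standard two-factor interaction term once the promised ablation tables exist. The sketch below is only an illustration: the function name `interaction_effect` is hypothetical, and the numbers in the usage are invented placeholders, not results from the paper.

```python
def interaction_effect(base, with_a, with_b, with_both):
    """Two-factor interaction on any scalar metric (e.g. task success rate).

    Returns the gain of the combined model minus the sum of the individual
    gains. A value near zero is consistent with additive, independent
    components; a clearly positive or negative value indicates synergy or
    overlap between them.
    """
    gain_a = with_a - base
    gain_b = with_b - base
    gain_both = with_both - base
    return gain_both - (gain_a + gain_b)


# Placeholder success rates, invented purely to exercise the function.
additive_case = interaction_effect(base=0.60, with_a=0.70, with_b=0.65, with_both=0.75)
overlap_case = interaction_effect(base=0.60, with_a=0.70, with_b=0.65, with_both=0.70)

print(round(additive_case, 6), round(overlap_case, 6))
```

In practice the interaction term would need confidence intervals across seeds and tasks, since a small nonzero value can arise from evaluation noise alone.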

Circularity Check

0 steps flagged

No significant circularity; Action Manifold Hypothesis is an explicit modeling premise

Full rationale

The paper states the Action Manifold Hypothesis as a proposed assumption ('effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold') and then defines AML as the direct use of a DiT backbone to predict clean sequences. No equations, fitted parameters, or self-citations are shown that reduce the claimed projection benefit or stability gain to the inputs by construction. The derivation chain consists of data curation followed by an architectural choice justified by the hypothesis; the hypothesis itself is not derived from the model's outputs or prior self-referential results. This is a standard non-circular modeling decision.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The load-bearing premise is the Action Manifold Hypothesis, treated as an ad-hoc modeling assumption rather than derived from first principles. No explicit free parameters are named in the abstract. The dual-stream perception module relies on external 3D modules (VGGT, Qwen-Image-Edit) whose outputs are assumed to provide useful geometric priors.

axioms (1)
  • [ad hoc to paper] Effective robot actions lie on a low-dimensional smooth manifold governed by physical laws and task constraints.
    This is the Action Manifold Hypothesis introduced to justify the AML approach; it is not derived within the paper.
invented entities (1)
  • Action Manifold [no independent evidence]
    purpose: Low-dimensional surface on which valid robot actions are assumed to lie, enabling direct prediction instead of denoising.
    Postulated to explain improved stability and speed; no independent falsifiable prediction (e.g., specific manifold dimension or curvature) is given in the abstract.
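
A concrete manifold dimension, of the kind the ledger notes is missing, could in principle be estimated from action data. The simulated rebuttal mentions PCA and nearest-neighbor methods. The sketch below uses a PCA variance threshold and the TwoNN estimator (a nearest-neighbor method chosen here for illustration, not named in the source), applied to synthetic data on a flat 3-dimensional subspace of a 16-dimensional space. All sizes and thresholds are assumptions.

```python
import numpy as np


def pca_dimension(x, variance_kept=0.95):
    """Smallest number of principal components explaining `variance_kept` of the variance."""
    centered = x - x.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, variance_kept) + 1)


def two_nn_dimension(x):
    """TwoNN maximum-likelihood estimate from the ratio of 2nd to 1st neighbor distance."""
    sq = np.sum(x**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    dist2 = np.maximum(dist2, 0.0)        # guard against tiny negative round-off
    np.fill_diagonal(dist2, np.inf)       # exclude each point from its own neighbors
    two_smallest = np.partition(dist2, 1, axis=1)[:, :2]
    mu = np.sqrt(two_smallest[:, 1] / two_smallest[:, 0])
    return len(mu) / np.sum(np.log(mu))


rng = np.random.default_rng(0)
n, ambient_dim, latent_dim = 800, 16, 3   # hypothetical sizes

# Synthetic "actions": uniform latents embedded through an orthonormal linear map.
latents = rng.uniform(-1.0, 1.0, size=(n, latent_dim))
q, _ = np.linalg.qr(rng.standard_normal((ambient_dim, latent_dim)))
actions = latents @ q.T

pca_estimate = pca_dimension(actions)
two_nn_estimate = two_nn_dimension(actions)
print(pca_estimate, round(two_nn_estimate, 2))
```

On curved manifolds a linear method such as PCA tends to overestimate the dimension, which is one reason to report a neighbor-based estimate alongside it. The pairwise-distance matrix here is quadratic in the number of samples, so a dataset at the scale described in the abstract would need subsampling or an approximate neighbor search.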

pith-pipeline@v0.9.0 · 5648 in / 1455 out tokens · 123687 ms · 2026-05-16T03:01:44.595250+00:00 · methodology


Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  4. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  5. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  6. Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO 2026-05 unverdicted novelty 6.0

    Pelican-Unify 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  7. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  8. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  9. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  10. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  11. Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO 2026-05 unverdicted novelty 5.0

    A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent c...

  12. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  13. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  14. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  15. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  16. ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...
