arxiv: 2606.01241 · v2 · pith:XWE3S4E4new · submitted 2026-05-31 · 💻 cs.RO

OneVLA: A Unified Framework for Embodied Tasks

Pith reviewed 2026-06-28 17:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language-Action · robot navigation · manipulation · unified framework · embodied AI · Chain-of-Thought · progressive training

The pith

OneVLA combines navigation and manipulation in one Vision-Language-Action model through a shared action head and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OneVLA to overcome the split between navigation-only and manipulation-only Vision-Language-Action models. It proposes a single architecture that produces both kinds of actions from the same head and uses progressive multi-stage training with curated data and Chain-of-Thought fine-tuning to create positive transfer between the tasks. Experiments in simulation and on real robots show the unified model beats both specialized single-task systems and prior cross-task approaches. A sympathetic reader would care because this points toward robots that can follow mixed natural-language instructions without switching between separate models. The central claim is that unification plus staged training is sufficient to achieve this without losing performance on either capability.

Core claim

OneVLA is a unified architecture that integrates navigation and manipulation into a single cohesive framework by means of a unified action head capable of generating both types of actions without task-specific variants, together with a multi-stage progressive training strategy that incorporates curated data construction and Chain-of-Thought fine-tuning to produce strong positive transfer and mutual reinforcement between the two domains, resulting in state-of-the-art performance that significantly outperforms specialized single-task and existing cross-task models in both simulated and real-world environments.

What carries the argument

The unified action head that generates both navigation and manipulation actions without requiring task-specific variants, supported by multi-stage progressive training for positive transfer between domains.

If this is right

  • A single model can execute mixed sequences of navigation and manipulation commands without architecture changes.
  • Training on one domain improves performance on the other through the shared training pipeline.
  • General-purpose embodied agents become feasible without maintaining separate specialist models.
  • The same framework supports both simulated and real-robot deployment at higher accuracy than prior cross-task baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on additional embodied tasks such as tool use or multi-robot coordination to check whether the unified head generalizes further.
  • If positive transfer holds, it suggests that future scaling laws for embodied models may favor joint training over modular specialization.
  • Real-world deployment would still require addressing safety constraints that arise when navigation and manipulation occur in the same continuous policy.

Load-bearing premise

A single action head without task-specific variants can produce both navigation and manipulation actions, and the multi-stage training strategy reliably creates strong positive transfer between the two domains.

What would settle it

A controlled ablation in which replacing the unified action head with separate navigation and manipulation heads, or removing the progressive training stages, produces equal or higher success rates on both tasks.
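One way to read "equal or higher success rates" in such an ablation is to compare per-task success rates with confidence intervals rather than point estimates. The sketch below uses the standard Wilson score interval; the function names and the episode counts are hypothetical placeholders, not figures from the paper.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical episode counts for one task; NOT results reported by the paper.
unified_head = wilson_interval(successes=80, trials=100)
separate_heads = wilson_interval(successes=70, trials=100)

print("unified head:  ", unified_head)
print("separate heads:", separate_heads)
print("intervals overlap:", intervals_overlap(unified_head, separate_heads))
```

Overlapping intervals would mean the ablation cannot distinguish the two variants at that sample size, so the same check would need to be run for both the navigation and the manipulation benchmarks before drawing a conclusion.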

Original abstract

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi-stage progressive training strategy, incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning, that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces OneVLA, a unified Vision-Language-Action architecture that combines navigation and manipulation into a single framework via a unified action head and a multi-stage progressive training procedure that includes curated data construction and Chain-of-Thought fine-tuning. It asserts that this design produces strong positive transfer between domains and delivers state-of-the-art results on both simulated and real-world benchmarks, outperforming specialized single-task and prior cross-task models.

Significance. If the empirical claims hold, the work would constitute a meaningful step toward general-purpose embodied agents by showing that navigation and manipulation can share a single action parameterization and training regime. The public release of model and code would further increase its utility. However, the absence of any quantitative metrics, baselines, ablations, or action-head parameterization in the supplied text makes it impossible to determine whether the significance claim is realized.

major comments (3)
  1. [Abstract] Abstract: the central claim that the unified action head 'generates both navigation and manipulation actions without requiring task-specific variants' is stated without any equation, output parameterization, or dimensionality description, leaving the compatibility of 2-4D navigation commands with 6-7D manipulation actions unverified and load-bearing for the unification thesis.
  2. [Abstract] Abstract: the assertion of 'strong positive transfer and mutual reinforcement' via multi-stage training and CoT fine-tuning is presented without ablation results that isolate the unified head from data curation choices or that quantify cross-domain gains versus multi-task averaging; this directly undermines the claim that the architecture itself produces the reported gains.
  3. [Abstract] Abstract: the statement that OneVLA 'achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models' is unsupported by any metrics, baselines, error bars, or table references, rendering the SOTA claim unverifiable from the provided manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback. We agree that the abstract would benefit from additional technical specificity and references to the supporting sections and results in the full manuscript. We will revise the abstract in the next version to address these points.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the unified action head 'generates both navigation and manipulation actions without requiring task-specific variants' is stated without any equation, output parameterization, or dimensionality description, leaving the compatibility of 2-4D navigation commands with 6-7D manipulation actions unverified and load-bearing for the unification thesis.

    Authors: The full manuscript (Section 3) specifies the unified action head parameterization, including how a shared output representation accommodates navigation (2-4D) and manipulation (6-7D) actions via appropriate masking and projection layers without task-specific heads. We will revise the abstract to include a concise reference to this parameterization and point to Section 3 for the equations and dimensionality details.
    Revision: yes

  2. Referee: [Abstract] Abstract: the assertion of 'strong positive transfer and mutual reinforcement' via multi-stage training and CoT fine-tuning is presented without ablation results that isolate the unified head from data curation choices or that quantify cross-domain gains versus multi-task averaging; this directly undermines the claim that the architecture itself produces the reported gains.

    Authors: The manuscript contains ablation studies in Section 4.3 that isolate the unified action head from data curation effects and quantify cross-domain transfer gains relative to multi-task baselines. We will revise the abstract to reference these ablations and the observed transfer benefits.
    Revision: yes

  3. Referee: [Abstract] Abstract: the statement that OneVLA 'achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models' is unsupported by any metrics, baselines, error bars, or table references, rendering the SOTA claim unverifiable from the provided manuscript.

    Authors: Quantitative results, baselines, error bars, and table references supporting the SOTA claims are provided in Section 4. We will revise the abstract to include direct references to the relevant tables and key performance metrics.
    Revision: yes
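
The first response describes a shared output representation that covers 2-4D navigation and 6-7D manipulation actions through masking. A minimal sketch of that general idea follows, assuming zero-padding to a fixed width and a masked regression loss. The names `SHARED_DIM`, `PaddedAction`, `pad_action`, and `masked_mse` are hypothetical, the width of 7 is an assumption, and nothing here is taken from the paper's Section 3.

```python
from __future__ import annotations

from dataclasses import dataclass

SHARED_DIM = 7  # assumption: wide enough for a 7-D manipulation action


@dataclass
class PaddedAction:
    values: list[float]  # length SHARED_DIM, zero-padded on the right
    mask: list[int]      # 1 where the dimension is valid for the task, else 0


def pad_action(action: list[float], shared_dim: int = SHARED_DIM) -> PaddedAction:
    """Embed a task-specific action into the shared fixed-width vector."""
    if len(action) > shared_dim:
        raise ValueError("action has more dimensions than the shared vector")
    pad = shared_dim - len(action)
    return PaddedAction(
        values=list(action) + [0.0] * pad,
        mask=[1] * len(action) + [0] * pad,
    )


def masked_mse(pred: list[float], target: PaddedAction) -> float:
    """Mean squared error over the valid dimensions only; padded slots are ignored."""
    if len(pred) != len(target.values):
        raise ValueError("prediction and target must have the same width")
    active = sum(target.mask)
    if active == 0:
        raise ValueError("target has no valid dimensions")
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target.values, target.mask))
    return squared / active


# Hypothetical 2-D navigation command (e.g. linear and angular velocity).
nav = pad_action([0.5, 0.1])
# Hypothetical 7-D manipulation action (e.g. end-effector delta plus gripper).
manip = pad_action([0.01, -0.02, 0.0, 0.0, 0.1, 0.0, 1.0])

# Whatever the head emits in the padded slots does not affect the navigation loss.
print(masked_mse([0.5, 0.1, 9.0, 9.0, 9.0, 9.0, 9.0], nav))
print(sum(nav.mask), sum(manip.mask))
```

Under this scheme a single output layer serves both tasks, and the referee's dimensionality concern reduces to whether the padded slots and the shared slots interfere during training, which is what the requested ablations would need to show.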

Circularity Check

0 steps flagged

No circularity: empirical architecture and training evaluated on external benchmarks

Full rationale

The paper describes a neural architecture (unified action head) and multi-stage training procedure (data curation + CoT fine-tuning) whose performance is measured on independent simulation and real-world benchmarks. No equations, first-principles derivations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The central claims are empirical and externally falsifiable; the reader's assessment of score 1.0 is consistent with the absence of any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions plus the unverified premise that positive transfer occurs via the described training pipeline; no new physical entities are postulated.

free parameters (1)
  • multi-stage training hyperparameters and data curation choices
    The progressive training strategy and curated data construction involve multiple choices of stages, data selection, and fine-tuning parameters that are not enumerated in the abstract.
axioms (2)
  • Domain assumption: A single neural network action head can produce both navigation trajectories and manipulation commands without task-specific architectural variants.
    This is the core architectural assumption stated in the abstract.
  • Domain assumption: Chain-of-Thought fine-tuning on curated data produces positive transfer between navigation and manipulation domains.
    This is the key training assumption invoked to explain mutual reinforcement.

pith-pipeline@v0.9.1-grok · 5755 in / 1362 out tokens · 31029 ms · 2026-06-28T17:02:57.518850+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

    cs.RO · 2026-06 · unverdicted · novelty 5.0

    FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.
