OneVLA: A Unified Framework for Embodied Tasks

Chenhao Zhang; Hangjun Ye; Hongsheng Li; Jinkun Liu; Lei Zhou; Lingfeng Zhang; Long Chen; Qiang Zhang; Shuyi Zhang; Wenbo Ding

arxiv: 2606.01241 · v2 · pith:XWE3S4E4new · submitted 2026-05-31 · 💻 cs.RO

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chenhao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, Wenbo Ding

Pith reviewed 2026-06-28 17:02 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language-Action · robot navigation · manipulation · unified framework · embodied AI · Chain-of-Thought · progressive training

The pith

OneVLA combines navigation and manipulation in one Vision-Language-Action model through a shared action head and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OneVLA to overcome the split between navigation-only and manipulation-only Vision-Language-Action models. It proposes a single architecture that produces both kinds of actions from the same head and uses progressive multi-stage training with curated data and Chain-of-Thought fine-tuning to create positive transfer between the tasks. Experiments in simulation and on real robots show the unified model beats both specialized single-task systems and prior cross-task approaches. A sympathetic reader would care because this points toward robots that can follow mixed natural-language instructions without switching between separate models. The central claim is that unification plus staged training is sufficient to achieve this without losing performance on either capability.

Core claim

OneVLA is a unified architecture that integrates navigation and manipulation into a single cohesive framework by means of a unified action head capable of generating both types of actions without task-specific variants, together with a multi-stage progressive training strategy that incorporates curated data construction and Chain-of-Thought fine-tuning to produce strong positive transfer and mutual reinforcement between the two domains, resulting in state-of-the-art performance that significantly outperforms specialized single-task and existing cross-task models in both simulated and real-world environments.

What carries the argument

The unified action head that generates both navigation and manipulation actions without requiring task-specific variants, supported by multi-stage progressive training for positive transfer between domains.

If this is right

A single model can execute mixed sequences of navigation and manipulation commands without architecture changes.
Training on one domain improves performance on the other through the shared training pipeline.
General-purpose embodied agents become feasible without maintaining separate specialist models.
The same framework supports both simulated and real-robot deployment at higher accuracy than prior cross-task baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on additional embodied tasks such as tool use or multi-robot coordination to check whether the unified head generalizes further.
If positive transfer holds, it suggests that future scaling laws for embodied models may favor joint training over modular specialization.
Real-world deployment would still require addressing safety constraints that arise when navigation and manipulation occur in the same continuous policy.

Load-bearing premise

A single action head without task-specific variants can produce both navigation and manipulation actions, and the multi-stage training strategy reliably creates strong positive transfer between the two domains.

What would settle it

A controlled ablation in which replacing the unified action head with separate navigation and manipulation heads, or removing the progressive training stages, produces equal or higher success rates on both tasks.
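
The criterion above can be written as a small check. This is a minimal sketch and not anything taken from the paper: the function name `ablation_undermines_claim`, the task keys, and every number in the usage lines are hypothetical placeholders, not reported results.

```python
def ablation_undermines_claim(unified, ablated):
    """Return True if the ablated variant matches or beats the unified model on every task.

    Both arguments map a task name to a success rate in [0, 1].
    """
    if unified.keys() != ablated.keys():
        raise ValueError("both variants must be evaluated on the same tasks")
    return all(ablated[task] >= unified[task] for task in unified)


# Placeholder success rates for illustration only (not results from the paper).
unified_model = {"navigation": 0.60, "manipulation": 0.70}
separate_heads = {"navigation": 0.60, "manipulation": 0.72}
no_staged_training = {"navigation": 0.55, "manipulation": 0.71}

print(ablation_undermines_claim(unified_model, separate_heads))      # True
print(ablation_undermines_claim(unified_model, no_staged_training))  # False
```

A real comparison would use interval estimates over repeated trials rather than raw point estimates, since small differences in success rate can fall within run-to-run noise.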

Original abstract

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi-stage progressive training strategy, incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning, that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OneVLA claims a single action head plus multi-stage CoT training can unify navigation and manipulation, but the abstract supplies no metrics, equations, or ablations to check whether that actually works.

The paper's main move is a unified action head that outputs both navigation commands and manipulation actions without task-specific branches, paired with a multi-stage training pipeline that adds curated data and Chain-of-Thought fine-tuning to encourage transfer between the two domains.

That combination is presented as new relative to the single-task VLAs they cite. The architecture description is straightforward and the motivation is clear: current models fragment along task lines and a shared head plus progressive training could reduce that.

The soft spot is the complete absence of evidence. The abstract says they ran extensive experiments and beat both single-task and cross-task baselines, yet it gives no numbers, no tables, no error bars, and no ablation that isolates the unified head from the data curation or the CoT stages. The stress-test concern lands: without the output parameterization or an equation showing how one head produces valid actions in two different spaces, you cannot tell whether the head is doing real unification or whether any gains come from simply training on more data.

If the full paper contains those details and the results survive scrutiny, the training strategy could be worth testing. Right now the central claim rests on assertion.

This is for people already working on VLA models or generalist robot policies who want concrete ideas for cross-domain training. A reader looking for a starting point on unified heads might extract the multi-stage recipe, but anyone needing reproducible results will have to wait for the released code and numbers.

It deserves peer review because the problem is real and the proposed fix is specific enough to evaluate once the missing experiments are added.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces OneVLA, a unified Vision-Language-Action architecture that combines navigation and manipulation into a single framework via a unified action head and a multi-stage progressive training procedure that includes curated data construction and Chain-of-Thought fine-tuning. It asserts that this design produces strong positive transfer between domains and delivers state-of-the-art results on both simulated and real-world benchmarks, outperforming specialized single-task and prior cross-task models.

Significance. If the empirical claims hold, the work would constitute a meaningful step toward general-purpose embodied agents by showing that navigation and manipulation can share a single action parameterization and training regime. The public release of model and code would further increase its utility. However, the absence of any quantitative metrics, baselines, ablations, or action-head parameterization in the supplied text makes it impossible to determine whether the significance claim is realized.

major comments (3)

[Abstract] The central claim that the unified action head 'generates both navigation and manipulation actions without requiring task-specific variants' is stated without any equation, output parameterization, or dimensionality description, leaving the compatibility of 2-4D navigation commands with 6-7D manipulation actions unverified and load-bearing for the unification thesis.
[Abstract] The assertion of 'strong positive transfer and mutual reinforcement' via multi-stage training and CoT fine-tuning is presented without ablation results that isolate the unified head from data curation choices or that quantify cross-domain gains versus multi-task averaging; this directly undermines the claim that the architecture itself produces the reported gains.
[Abstract] The statement that OneVLA 'achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models' is unsupported by any metrics, baselines, error bars, or table references, rendering the SOTA claim unverifiable from the provided manuscript.
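
The error bars requested in the last comment are commonly reported as a binomial confidence interval on the success rate. The sketch below uses the Wilson score interval, which is a standard choice for success counts but is an assumption here: the supplied text does not say how the paper computes uncertainty. The function name and the counts in the usage line are placeholders, not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate; z=1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Placeholder counts for illustration only: 50 successes in 100 episodes.
low, high = wilson_interval(50, 100)
print(f"success rate 0.50, interval [{low:.3f}, {high:.3f}]")
```

Two systems whose intervals overlap heavily cannot be confidently ranked from those trials alone, which is why a state-of-the-art claim needs the trial counts alongside the headline rates.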

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback. We agree that the abstract would benefit from additional technical specificity and references to the supporting sections and results in the full manuscript. We will revise the abstract in the next version to address these points.

Referee: [Abstract] The central claim that the unified action head 'generates both navigation and manipulation actions without requiring task-specific variants' is stated without any equation, output parameterization, or dimensionality description, leaving the compatibility of 2-4D navigation commands with 6-7D manipulation actions unverified and load-bearing for the unification thesis.

Authors: The full manuscript (Section 3) specifies the unified action head parameterization, including how a shared output representation accommodates navigation (2-4D) and manipulation (6-7D) actions via appropriate masking and projection layers without task-specific heads. We will revise the abstract to include a concise reference to this parameterization and point to Section 3 for the equations and dimensionality details.
revision: yes

Referee: [Abstract] The assertion of 'strong positive transfer and mutual reinforcement' via multi-stage training and CoT fine-tuning is presented without ablation results that isolate the unified head from data curation choices or that quantify cross-domain gains versus multi-task averaging; this directly undermines the claim that the architecture itself produces the reported gains.

Authors: The manuscript contains ablation studies in Section 4.3 that isolate the unified action head from data curation effects and quantify cross-domain transfer gains relative to multi-task baselines. We will revise the abstract to reference these ablations and the observed transfer benefits.
revision: yes

Referee: [Abstract] The statement that OneVLA 'achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models' is unsupported by any metrics, baselines, error bars, or table references, rendering the SOTA claim unverifiable from the provided manuscript.

Authors: Quantitative results, baselines, error bars, and table references supporting the SOTA claims are provided in Section 4. We will revise the abstract to include direct references to the relevant tables and key performance metrics.
revision: yes
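
The first response refers to a shared output representation with masking. The supplied text does not include the Section 3 equations, so the following is only one common way such a scheme could look, not the authors' parameterization. It assumes a 7-wide shared vector, a 3D navigation action, and zero-padding with a validity mask; `SHARED_DIM`, `pad_action`, and `masked_mse` are hypothetical names, and the numbers are placeholders.

```python
SHARED_DIM = 7  # assumed width of the shared action vector (hypothetical)


def pad_action(action, shared_dim=SHARED_DIM):
    """Zero-pad an action to the shared width and return (padded, mask)."""
    if len(action) > shared_dim:
        raise ValueError("action is wider than the shared representation")
    pad = shared_dim - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1.0] * len(action) + [0.0] * pad
    return padded, mask


def masked_mse(prediction, target, mask):
    """Mean squared error over only the dimensions the mask marks as valid."""
    valid = sum(mask)
    if valid == 0:
        raise ValueError("mask selects no dimensions")
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(prediction, target, mask))
    return squared / valid


# A 3D navigation target (placeholder values) padded into the shared space.
nav_target, nav_mask = pad_action([0.5, 0.0, 0.1])
# Whatever the head emits in the unused slots is ignored by the masked loss.
head_output = [0.4, 0.0, 0.1, 9.0, 9.0, 9.0, 9.0]
loss = masked_mse(head_output, nav_target, nav_mask)
print(nav_mask, loss)
```

Under this assumption the same head and loss serve both tasks, and only the mask changes per sample. Whether that counts as genuine unification, or as two output spaces sharing a container, is exactly the question the referee's first comment raises.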

Circularity Check

0 steps flagged

No circularity: empirical architecture and training evaluated on external benchmarks

The paper describes a neural architecture (unified action head) and multi-stage training procedure (data curation + CoT fine-tuning) whose performance is measured on independent simulation and real-world benchmarks. No equations, first-principles derivations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The central claims are empirical and externally falsifiable; the reader's assessment of score 1.0 is consistent with the absence of any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions plus the unverified premise that positive transfer occurs via the described training pipeline; no new physical entities are postulated.

free parameters (1)

multi-stage training hyperparameters and data curation choices
The progressive training strategy and curated data construction involve multiple choices of stages, data selection, and fine-tuning parameters that are not enumerated in the abstract.

axioms (2)

Domain assumption: A single neural network action head can produce both navigation trajectories and manipulation commands without task-specific architectural variants.
This is the core architectural assumption stated in the abstract.
Domain assumption: Chain-of-Thought fine-tuning on curated data produces positive transfer between navigation and manipulation domains.
This is the key training assumption invoked to explain mutual reinforcement.

pith-pipeline@v0.9.1-grok · 5755 in / 1362 out tokens · 31029 ms · 2026-06-28T17:02:57.518850+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation
cs.RO 2026-06 unverdicted novelty 5.0

FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.

Reference graph

Works this paper leans on

60 extracted references · 15 linked inside Pith · cited by 1 Pith paper

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[2]

Embodied robot manipulation in the era of foundation models: Planning and learning perspectives

Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Zhe Li, Pengxiang Ding, et al. Embodied robot manipulation in the era of foundation models: Planning and learning perspectives. arXiv preprint arXiv:2512.22983, 2025

arXiv 2025
[3]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[4]

In Annual Conference on Robot Learning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π0.5: a vision-language-action model with open-world generalization. In Annual Conference on Robot Learning, 2025

2025
[5]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[6]

Univla: Learning to act anywhere with task-centric latent actions.Proceedings of Robotics: Science and Systems, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.Proceedings of Robotics: Science and Systems, 2025

2025
[7]

Navila: Legged robot vision-language-action model for navigation.Robotics: Science and Systems, 2025

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.Robotics: Science and Systems, 2025

2025
[8]

Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. InConference on Robot Learning, pages 496–512, 2025

2025
[9]

Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map prediction

Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, and Xiaoshuai Hao. Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map prediction. arXiv preprint arXiv:2602.21589, 2026

arXiv 2026
[10]

Stairway to success: An online floor-aware zero-shot object-goal navigation framework via llm-driven coarse-to-fine exploration.arXiv preprint arXiv:2505.23019, 2025

Zeying Gong, Rong Li, Tianshuai Hu, Ronghe Qiu, Lingdong Kong, Lingfeng Zhang, Guoyang Zhao, Yiyi Ding, and Junwei Liang. Stairway to success: An online floor-aware zero-shot object-goal navigation framework via llm-driven coarse-to-fine exploration.arXiv preprint arXiv:2505.23019, 2025

arXiv 2025
[11]

Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

arXiv 2025
[12]

Tla: Tactile-language-action model for contact-rich manipulation.arXiv preprint arXiv:2503.08548, 2025

Peng Hao, Chaofan Zhang, Dingzhe Li, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. Tla: Tactile-language-action model for contact-rich manipulation.arXiv preprint arXiv:2503.08548, 2025. 11

arXiv 2025
[13]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

arXiv 2025
[14]

Mimo-embodied: X-embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, et al. Mimo-embodied: X-embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025

Pith/arXiv arXiv 2025
[15]

H2r-bm: Can leveraging human videos enhance performance and generalizability in robotic bimanual manipulation?Pattern Recognition, 179:113637, 2026

Xiaoshuai Hao, Huaihai Lyu, Lingfeng Zhang, Rui Liu, Dayan Wu, Jing Zhang, and Long Chen. H2r-bm: Can leveraging human videos enhance performance and generalizability in robotic bimanual manipulation?Pattern Recognition, 179:113637, 2026

2026
[16]

Omnivla: An omni-modal vision-language- action model for robot navigation.arXiv preprint arXiv:2509.19480, 2025

Noriaki Hirose, Catherine Glossop, Dhruv Shah, and Sergey Levine. Omnivla: An omni-modal vision-language- action model for robot navigation.arXiv preprint arXiv:2509.19480, 2025

arXiv 2025
[17]

Fine-tuning vision-language-action models: Optimizing speed and success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[18]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713, 2025

2025
[19]

The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint arXiv:2601.05014, 2026

Lingdong Kong, Shaoyuan Xie, Zeying Gong, Ye Li, Meng Chu, Ao Liang, Yuhao Dong, Tianshuai Hu, Ronghe Qiu, Rong Li, et al. The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint arXiv:2601.05014, 2026

arXiv 2026
[20]

Beyond the nav-graph: Vision-and- language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120, 2020

2020
[21]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4392–4412, 2020

2020
[22]

Weather- conditioned branch routing for robust lidar-radar 3d object detection.arXiv preprint arXiv:2604.05405, 2026

Hongsheng Li, Lingfeng Zhang, Zexian Yang, Liang Li, Rong Yin, Xiaoshuai Hao, and Wenbo Ding. Weather- conditioned branch routing for robust lidar-radar 3d object detection.arXiv preprint arXiv:2604.05405, 2026

Pith/arXiv arXiv 2026
[23]

Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024
[24]

Evaluating real-world robot manipulation policies in simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Oier Mees, Karl Pertsch, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. InConference on Robot Learning, pages 3705–3728, 2025

2025
[25]

Embodied intelligence: A synergy of morphology, action, perception and learning

Huaping Liu, Di Guo, and Angelo Cangelosi. Embodied intelligence: A synergy of morphology, action, perception and learning. ACM Computing Surveys, 57(7), 2025

2025
[26]

Towards generalist robot policies: What matters in building vision-language-action models

Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What matters in building vision-language-action models. 2025

2025
[27]

Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation

Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advancesin Neural Information Processing Systems, 37:40085–40110, 2024

2024
[28]

Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

Pith/arXiv arXiv 2025
[29]

Thinking in text and images: Interleaved vision–language reasoning traces for long-horizon robot manipulation

Jinkun Liu, Haohan Chi, Lingfeng Zhang, Yifan Xie, YuAn Wang, Long Chen, Hangjun Ye, Xiaoshuai Hao, and Wenbo Ding. Thinking in text and images: Interleaved vision–language reasoning traces for long-horizon robot manipulation. arXiv preprint arXiv:2605.00438, 2026. 12

Pith/arXiv arXiv 2026
[30]

Toponav: Topological graphs as a key enabler for advanced object navigation.arXiv preprint arXiv:2509.01364, 2025

Peiran Liu, Qiang Zhang, Daojie Peng, Lingfeng Zhang, Yihao Qin, Hang Zhou, Jun Ma, Renjing Xu, and Yiding Ji. Toponav: Topological graphs as a key enabler for advanced object navigation.arXiv preprint arXiv:2509.01364, 2025

arXiv 2025
[31]

Oa-wam: Object-addressable world action model for robust robot manipulation.arXiv preprint arXiv:2605.06481, 2026

Yushan Liu, Peibo Sun, Shoujie Li, Yifan Xie, Lingfeng Zhang, Xintao Chao, Shiyuan Dong, Fang Chen, Xiao-Ping Zhang, and Wenbo Ding. Oa-wam: Object-addressable world action model for robust robot manipulation.arXiv preprint arXiv:2605.06481, 2026

Pith/arXiv arXiv 2026
[32]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[33]

Reasoning emerges from constrained inference manifolds in large language models.arXiv preprint arXiv:2605.08142, 2026

Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang, Yinan Wu, Zhe Qian, Yang Lu, Long Chen, Zhao Cao, et al. Reasoning emerges from constrained inference manifolds in large language models.arXiv preprint arXiv:2605.08142, 2026

Pith/arXiv arXiv 2026
[34]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration0. InIEEE International Conference on Robotics and Automation, pages 6892–6903, 2024

2024
[35]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[36]

Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025
[37]

Spatialvla: Exploring spatial representations for visual-language-action model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. InRobotics: Science and Systems, 2025

2025
[38]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

2019
[39]

Reconvla: Reconstructive vision-language-action model as effective robot perceiver

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333, 2025

arXiv 2025
[40]

Roboos: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration

Huajie Tan, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Yaoxu Lyu, Mingyu Cao, Zhongyuan Wang, and Shang- hang Zhang. Roboos: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration. arXiv preprint arXiv:2505.03673, 2025

arXiv 2025
[41]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12706–12713, 2025

2025
[42]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. InRobotics: Science and Systems, 2024

2024
[43]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

arXiv 2025
[44]

Streamvln: Streaming vision-and-language navigation via slowfast context modeling

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240, 2025

arXiv 2025
[45]

Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. In Robotics: Science and Systems, 2025

2025
[46]

Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study.TechRxiv preprint techrxiv.174495686.69962588/v1, 2025

Yujie Wu, Huaihai Lyu, Yingbo Tang, Lingfeng Zhang, Zhihui Zhang, Wei Zhou, and Siqi Hao. Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study.TechRxiv preprint techrxiv.174495686.69962588/v1, 2025. 13

arXiv 2025
[47]

Erjia Xiao, Lingfeng Zhang, Yingbo Tang, Hao Cheng, Renjing Xu, Wenbo Ding, Lei Zhou, Long Chen, Hangjun Ye, and Xiaoshuai Hao. Team xiaomi ev-ad vla: Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track.arXiv preprint arXiv:2510.07871, 2025

arXiv 2025
[48]

Xiaomi onevl: One-step latent reasoning and planning with vision-language explanation

Xiaomi Embodied Intelligence Team. Xiaomi onevl: One-step latent reasoning and planning with vision-language explanation. arXiv preprint arXiv:2604.18486, 2026

Pith/arXiv arXiv 2026
[49]

Magma: A foundation model for multimodal ai agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

2025
[50] Jonathan Yang, Catherine Glossop, Arjun Bhorkar, Dhruv Shah, Quan Vuong, Chelsea Finn, Dorsa Sadigh, and Sergey Levine. Pushing the limits of cross-embodiment learning for manipulation and navigation. arXiv preprint arXiv:2402.19432, 2024
[51] Chaofan Zhang, Peng Hao, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation. arXiv preprint arXiv:2505.09577, 2025
[52] Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. In Robotics: Science and Systems, 2024
[53] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. In Robotics: Science and Systems, 2025
[54] Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, and Wenbo Ding. Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation. arXiv preprint arXiv:2511.13269, 2025
[55] Lingfeng Zhang, Xiaoshuai Hao, Xizhou Bu, Yingbo Tang, Hongsheng Li, Jinghui Lu, Xiu-shen Wei, Jiayi Ma, Yu Liu, Jing Zhang, et al. Walk with me: Long-horizon social navigation for human-centric outdoor assistance. arXiv preprint arXiv:2604.26839, 2026
[56] Lingfeng Zhang et al. Socialnav-map: Dynamic mapping with human trajectory prediction for zero-shot social navigation. arXiv preprint arXiv:2511.12232, 2025
[57] Lingfeng Zhang et al. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation. In The Annual Meeting of the Association for Computational Linguistics, pages 13032–13056, 2025
[58] Qiang Zhang, Jiahao Ma, Peiran Liu, Shuai Shi, Zeran Su, Zifan Wang, Jingkai Sun, Wei Cui, Jialin Yu, Gang Han, et al. Meshmimic: Geometry-aware humanoid motion learning through 3d scene reconstruction. arXiv preprint arXiv:2602.15733, 2026
[59] Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. arXiv preprint arXiv:2506.08817, 2025
[60] Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025
