pith. machine review for the scientific record.

arxiv: 2604.20100 · v2 · submitted 2026-04-22 · 💻 cs.RO

Recognition: unknown

JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic autonomy · vision-language-action model · foundation model · cross-embodiment learning · action space unification · multi-source pretraining · robotic manipulation · generalizable manipulation

The pith

Training on unified human and robot data bridges embodiment gaps in robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JoyAI-RA as a vision-language-action model designed to overcome data scarcity and embodiment differences in robotic autonomy. It achieves this by pretraining on a mix of web data, human videos, simulations, and real robot trajectories, with a key step of mapping all actions into a common space. This unification allows the model to transfer knowledge from human demonstrations to robot control effectively. If successful, it suggests that large-scale heterogeneous data can train foundation models capable of generalizing across different robot bodies and tasks in open environments.

Core claim

JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

What carries the argument

The multi-source multi-level pretraining framework with explicit action-space unification that maps actions from different embodiments into a shared representation.
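
A minimal sketch of what that shared representation could look like, assuming (the abstract does not specify this) that every data source reduces to end-effector pose deltas plus a binary gripper state; UnifiedAction and both converter functions are illustrative names, not the paper's interface.

from dataclasses import dataclass

import numpy as np


@dataclass
class UnifiedAction:
    delta_xyz: np.ndarray   # (3,) end-effector translation, metres
    delta_rpy: np.ndarray   # (3,) end-effector rotation, radians
    gripper: float          # 0.0 = open, 1.0 = closed


def from_robot_trajectory(ee_poses: np.ndarray, gripper: np.ndarray) -> list:
    """Map a (T, 6) robot end-effector pose sequence to unified delta actions."""
    actions = []
    for t in range(1, len(ee_poses)):
        delta = ee_poses[t] - ee_poses[t - 1]
        actions.append(UnifiedAction(delta[:3], delta[3:], float(gripper[t])))
    return actions


def from_human_hand_track(wrist_poses: np.ndarray, pinch_dist: np.ndarray,
                          pinch_threshold: float = 0.03) -> list:
    """Map an estimated (T, 6) wrist-pose track from egocentric video into the
    same action space, reading a small thumb-index pinch distance as 'closed'.
    The 3 cm threshold is an assumed value, not taken from the paper."""
    actions = []
    for t in range(1, len(wrist_poses)):
        delta = wrist_poses[t] - wrist_poses[t - 1]
        grip = 1.0 if pinch_dist[t] < pinch_threshold else 0.0
        actions.append(UnifiedAction(delta[:3], delta[3:], grip))
    return actions

Once human videos and robot trajectories land in the same delta-action space, a single policy head can be trained on both; whether that mapping preserves contact and timing information is exactly the load-bearing premise below.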

Load-bearing premise

Heterogeneous data sources can be unified via action-space mapping without introducing inconsistencies or losing critical task information.

What would settle it

A direct comparison showing that removing the human video data or the action unification step causes the model to underperform on robot tasks relative to baselines trained only on robot data.
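
One way to read that test as a protocol, sketched below under the assumption that model variants can be retrained and scored on the same held-out robot tasks; train_variant and evaluate_success_rate are hypothetical placeholders rather than functions from the paper.

# Hypothetical training/evaluation callables stand in for the paper's pipeline.
VARIANTS = {
    "full":            {"human_video": True,  "action_unification": True},
    "no_human_video":  {"human_video": False, "action_unification": True},
    "no_unification":  {"human_video": True,  "action_unification": False},
    "robot_data_only": {"human_video": False, "action_unification": False},
}


def run_ablation(train_variant, evaluate_success_rate, tasks, n_trials=50):
    """Train each variant and return its mean success rate over the same
    held-out robot tasks. The bridging claim is supported only if 'full'
    clearly beats 'robot_data_only' and each ablated variant degrades."""
    results = {}
    for name, config in VARIANTS.items():
        policy = train_variant(**config)
        rates = [evaluate_success_rate(policy, task, n_trials) for task in tasks]
        results[name] = sum(rates) / len(rates)
    return results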

read the original abstract

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces JoyAI-RA 0.1, a vision-language-action (VLA) embodied foundation model for generalizable robotic manipulation. It describes a multi-source multi-level pretraining framework integrating web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through explicit action-space unification during training on this heterogeneous data, the model is claimed to bridge embodiment gaps (especially human-to-robot) and enhance cross-embodiment behavior learning. The paper asserts that JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, particularly on diverse tasks with high generalization demands.

Significance. If the performance and generalization claims hold under rigorous validation, the work could be significant for robotic learning by demonstrating scalable use of heterogeneous data sources to address data scarcity and embodiment transfer. The explicit focus on action-space unification targets a core technical barrier in VLA models. The ambitious data integration scope is a potential strength, but the absence of supporting quantitative evidence limits assessment of its actual contribution.

major comments (2)
  1. [Abstract] The abstract asserts that 'JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks' but supplies no quantitative metrics, baselines, error bars, ablation studies, or experimental protocols. This makes it impossible to verify support for the central claim of superior performance and enhanced generalization on diverse tasks.
  2. [Abstract] The claim that 'training on heterogeneous multi-source data with explicit action-space unification' bridges embodiment gaps and enhances cross-embodiment learning rests on the unshown assumption that action mapping (particularly from egocentric human videos lacking explicit actions) preserves task-critical kinematics, contact dynamics, and temporal information without introducing inconsistencies or biases. No details on the unification procedure, retargeting method, or validation of information preservation are provided, which is load-bearing for the bridging effect.
minor comments (1)
  1. [Abstract] The version designation '0.1' implies an early release; the manuscript would benefit from explicit statements on model availability, code release, or planned updates to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the abstract and clarify key technical aspects.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts that 'JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks' but supplies no quantitative metrics, baselines, error bars, ablation studies, or experimental protocols. This makes it impossible to verify support for the central claim of superior performance and enhanced generalization on diverse tasks.

    Authors: We agree that the abstract, as a concise summary, does not include specific quantitative metrics or experimental details, which limits immediate verifiability of the performance claims. The full manuscript presents these results in Sections 4 and 5, including direct comparisons to baselines such as RT-2 and OpenVLA with success rates, generalization metrics, error bars, and ablation studies. To address this, we have revised the abstract to incorporate key quantitative highlights (e.g., relative improvements on generalization-heavy tasks) while referencing the evaluation protocols. revision: yes

  2. Referee: [Abstract] The claim that 'training on heterogeneous multi-source data with explicit action-space unification' bridges embodiment gaps and enhances cross-embodiment learning rests on the unshown assumption that action mapping (particularly from egocentric human videos lacking explicit actions) preserves task-critical kinematics, contact dynamics, and temporal information without introducing inconsistencies or biases. No details on the unification procedure, retargeting method, or validation of information preservation are provided, which is load-bearing for the bridging effect.

    Authors: We acknowledge that the abstract does not detail the action-space unification procedure or its validation. The methods section of the manuscript describes the multi-source pretraining framework, including the explicit unification steps (pseudo-action generation from egocentric videos via pose estimation and retargeting to robot embodiments, with consistency checks on kinematics and temporal alignment). We have revised the abstract to include a brief outline of this procedure and tied the bridging claim more explicitly to the cross-embodiment generalization results shown in the experiments. revision: yes
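
As a purely illustrative reading of the retargeting-and-consistency-check step mentioned in the response above (the abstract gives no details), the fragment below sketches one assumed form of a kinematic consistency filter that masks out pseudo-actions whose implied end-effector speed exceeds what the target robot could execute.

import numpy as np


def kinematic_consistency_mask(deltas: np.ndarray, timestamps: np.ndarray,
                               max_speed: float = 0.5) -> np.ndarray:
    """deltas: (T, 3) retargeted end-effector displacements in metres;
    timestamps: (T,) seconds. Keep only steps whose implied speed stays within
    the target robot's limit (0.5 m/s here is an assumed bound, not the
    paper's)."""
    dt = np.diff(timestamps, prepend=timestamps[0] - 1e-3)
    speed = np.linalg.norm(deltas, axis=1) / np.maximum(dt, 1e-6)
    return speed <= max_speed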

Circularity Check

0 steps flagged

No circularity in claimed derivation

full rationale

The manuscript presents an empirical VLA foundation model whose central claim is that multi-source pretraining with action-space unification improves cross-embodiment transfer. No equations, first-principles derivations, or 'predictions' are offered that reduce by construction to fitted inputs or self-citations. The unification step is described as an explicit training choice whose effectiveness is evaluated on external simulation and real-world benchmarks rather than being definitionally equivalent to the reported gains. No load-bearing self-citation chains or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on unstated compatibility of data sources.

pith-pipeline@v0.9.0 · 5693 in / 1095 out tokens · 46387 ms · 2026-05-10T00:59:25.675335+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 34 canonical work pages · 21 internal anchors

  1. [1]

    Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  2. [2]

    Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization

    BeingBeyond Team. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2025

  3. [3]

    Being-h0.7: A latent world-action model from egocentric videos

    BeingBeyond Team. Being-h0.7: A latent world-action model from egocentric videos. https://research.beingbeyond.com/being-h07, 2026

  4. [4]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  8. [8]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  9. [9]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  10. [10]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing, 2026. URL https://arxiv.org/abs/2604.05014

  11. [11]

    Interndata-m1. https://github.com/InternRobotics/InternManip, 2025

    InternData-M1 contributors. Interndata-m1. https://github.com/InternRobotics/InternManip, 2025

  12. [12]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

  13. [13]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  14. [14]

    Fourier-robotics. https://www.fftai.com/, 2026

    Fourier. Fourier-robotics. https://www.fftai.com/, 2026

  15. [16]

    Gemini robotics: Multimodal robotics foundation models for generalization and interaction

    Gemini Robotics Team et al. Gemini robotics: Multimodal robotics foundation models for generalization and interaction. arXiv preprint arXiv:2503.09682, 2025

  16. [17]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  17. [18]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  18. [19]

    Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025

    Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025

  19. [20]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021

  20. [21]

    Galaxea Open-World Dataset and G0 Dual-System VLA Model

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

  21. [22]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  22. [23]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, and Sergey Levine. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  23. [24]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023

  24. [25]

    Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

    Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, Tao Kong, Hanbo Zhang, and Huaping Liu. What matters in building vision-language-action models for generalist robots. arXiv preprint arXiv:2412.14058, 2024

  25. [26]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

  26. [27]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  27. [28]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abhishek Padalkar, Abhishek Pooley, Jing Lu, Yifeng Xing, Ayzaan Wahid, Abraham Stone, Stephen Tian, Rose O’Neill, Kent Rose, Kiran Rao, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

  28. [29]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  29. [30]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  30. [31]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  31. [32]

    Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026

    Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset

  32. [33]

    Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025

    Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025

  33. [34]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  34. [35]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  35. [36]

    VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025

  36. [37]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

  37. [38]

    A Pragmatic VLA Foundation Model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

  38. [39]

    ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026

  39. [40]

    Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot

    Chenghao Yin, Da Huang, Di Yang, Jichao Wang, Nanshu Zhao, Chen Xu, Wenjun Sun, Linjie Hou, Zhijun Li, Junhui Wu, et al. Genie sim 3.0: A high-fidelity comprehensive simulation platform for humanoid robot. arXiv preprint arXiv:2601.02078, 2026

  40. [41]

    TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

    Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, et al. Twinbrainvla: Unleashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers. arXiv preprint arXiv:2601.14133, 2026

  41. [42]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  42. [43]

    Universal actions for enhanced embodied foundation models

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025

  43. [44]

    Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models. arXiv preprint arXiv:2603.22280, 2026

    Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, et al. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models. arXiv preprint arXiv:2603.22280, 2026

  44. [45]

    Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308, 2025

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308, 2025