Extending Embodied Question Answering from Perception to Decision
Pith reviewed 2026-06-29 21:28 UTC · model grok-4.3
The pith
EQA-Decision supplies a unified benchmark of four million question-answer pairs spanning scene construction, spatial understanding, task dynamics, and instant decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that EQA-Decision systematically covers four complementary dimensions of embodied reasoning—static scene construction, spatial understanding, task dynamics reasoning, and instant decision—and that the accompanying RoboDecision model supplies a unified framework which jointly evaluates perception, reasoning, and action-level decision-making in embodied environments.
What carries the argument
EQA-Decision dataset of over four million hierarchical question-answer pairs organized across the four listed dimensions of embodied reasoning.
If this is right
- The dataset supplies a single large-scale test that replaces several narrower existing benchmarks.
- RoboDecision demonstrates joint training and evaluation of perception, reasoning, and action inside one model.
- Results on the benchmark directly measure progress in spatial and interaction reasoning for vision-language models.
- The hierarchical annotations allow diagnosis of failures at different levels of embodied reasoning.
Where Pith is reading between the lines
- Models improved on this benchmark may transfer more readily to real robot platforms that must choose actions from visual input.
- The four-million-pair scale could support pre-training of larger embodied agents before fine-tuning on specific hardware.
- If the dimensions prove non-redundant in practice, future work could add further axes such as multi-agent coordination without restarting the benchmark design.
Load-bearing premise
The four dimensions together form a comprehensive and non-redundant coverage of what embodied reasoning requires.
What would settle it
A vision-language model that scores highly on all four dimensions of EQA-Decision yet shows no measurable gain in success rate on physical robot tasks that require the same skills would falsify the claim that the benchmark captures the needed capabilities.
Figures
read the original abstract
Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EQA-Decision, a large-scale embodied QA dataset that covers four complementary dimensions of embodied reasoning (static scene construction, spatial understanding, task dynamics reasoning, and instant decision) and contains over four million question-answer pairs with hierarchical annotations across diverse scenarios. It also presents RoboDecision as a baseline model that jointly evaluates perception, reasoning, and action-level decision-making, claiming that the benchmark effectively enhances VLM capabilities in spatial and interaction reasoning.
Significance. A unified large-scale dataset and baseline for embodied reasoning could provide a valuable foundation for embodied intelligence research if the dataset construction, annotations, and baseline performance are rigorously validated with quantitative results.
major comments (2)
- [Abstract] Abstract: the central claim that 'Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities' is unsupported because the provided manuscript contains no experimental results, metrics, ablation studies, error analysis, or comparisons to prior EQA datasets.
- [Abstract] Abstract: no details are given on how the four dimensions were operationalized, how the >4M QA pairs were generated or validated, or what the hierarchical annotations consist of, which are load-bearing for the claim of a 'unified large-scale framework'.
minor comments (1)
- The abstract refers to 'diverse embodied scenarios' without naming the simulators, environments, or object categories used.
Simulated Author's Rebuttal
We thank the referee for their comments. We address each major point below and will revise the manuscript to ensure claims are supported by content and that key construction details are clearly presented.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities' is unsupported because the provided manuscript contains no experimental results, metrics, ablation studies, error analysis, or comparisons to prior EQA datasets.
Authors: We agree that the abstract claim is unsupported in the current manuscript. The paper introduces the EQA-Decision dataset and RoboDecision baseline but does not contain experimental results, metrics, or comparisons. We will revise the abstract to remove or qualify the sentence beginning 'Results demonstrate...' to accurately reflect the manuscript's scope as a dataset and baseline introduction. revision: yes
-
Referee: [Abstract] Abstract: no details are given on how the four dimensions were operationalized, how the >4M QA pairs were generated or validated, or what the hierarchical annotations consist of, which are load-bearing for the claim of a 'unified large-scale framework'.
Authors: The manuscript body (Section 3) describes the four dimensions, generation via simulation environments, validation steps, and hierarchical annotations. However, to address the concern that these are not evident from the abstract or sufficiently highlighted, we will add a brief overview paragraph in the introduction summarizing the operationalization, generation, validation, and annotation structure. revision: yes
Circularity Check
No significant circularity in dataset and baseline introduction
full rationale
The paper presents EQA-Decision as a new large-scale dataset spanning four dimensions of embodied reasoning and introduces RoboDecision as a baseline model. No mathematical derivations, equations, fitted parameters, or predictions appear. The work is framed as benchmark construction rather than a derived result from prior inputs. No self-citations are invoked as load-bearing premises, and the four dimensions are stated as complementary by design without reducing to self-definition or renaming of known results. The contribution is self-contained as an empirical resource.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Scanqa: 3d question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2
2022
-
[3]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Affordances from human videos as a versa- tile representation for robotics
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versa- tile representation for robotics. 2023. 4
2023
-
[5]
Qwen2.5-vl technical report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. 1
2025
-
[6]
Zoedepth: Zero-shot transfer by com- bining relative and metric depth, 2023
Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot transfer by com- bining relative and metric depth, 2023. 4
2023
-
[7]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manip- ulation platform for scalable and intelligent embodied sys- tems.arXiv preprint arXiv:2503.06669, 2025. 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R San- keti, and Ken Goldberg. Robo2vlm: Visual question answer- ing from large-scale in-the-wild robot manipulation datasets. arXiv preprint arXiv:2505.15517, 2025. 1, 2, 3
-
[11]
Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Dongbin Zhao, and He Wang. Robogpt: an intelligent agent of making embodied long-term decisions for daily instruction tasks.arXiv preprint arXiv:2311.15649,
-
[12]
Reproducible scal- ing laws for contrastive language-image learning
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023. 6
2023
-
[13]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 2, 4, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafi- ullah, and Lerrel Pinto. From play to policy: Conditional be- havior generation from uncurated robot data.arXiv preprint arXiv:2210.10047, 2022. 4
-
[15]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 3, 4
2017
-
[16]
Scaling egocentric vision: The epic-kitchens dataset
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 1
2018
-
[17]
The epic-kitchens dataset: Collection, chal- lenges and baselines.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (TPAMI), 43(11):4125–4141,
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The epic-kitchens dataset: Collection, chal- lenges and baselines.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (TPAMI), 43(11):4125–4141,
-
[18]
Embodied question answer- ing
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1–10, 2018. 1, 2
2018
-
[19]
Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. 4
2023
-
[20]
Vishnu Sashank Dorbala, Prasoon Goyal, Robinson Pira- muthu, Michael Johnston, Reza Ghanadhan, and Dinesh Manocha. Is the house ready for sleeptime? generating and evaluating situational queries for embodied question answer- ing.arXiv preprint arXiv:2405.04732, 2024. 2
-
[21]
Palm-e: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. 2023. 3
2023
-
[22]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 1, 4, 5
2022
-
[23]
Efficient multi- modal learning from data-centric perspective.arXiv preprint arXiv:2402.11530, 2024
Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multi- modal learning from data-centric perspective.arXiv preprint arXiv:2402.11530, 2024. 3, 4
-
[24]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 6
2022
-
[25]
Eqa-mx: Embodied question answering using multi- modal expression
Md Mofijul Islam, Alexi Gladstone, Riashat Islam, and Tariq Iqbal. Eqa-mx: Embodied question answering using multi- modal expression. InThe Twelfth International Conference on Learning Representations, 2023. 1
2023
-
[26]
BC-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Fred- erik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-z: Zero-shot task generalization with robotic imitation learning. In5th Annual Conference on Robot Learning,
-
[27]
Robobrain: A unified brain model for robotic manipulation from abstract to concrete
Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 3
2025
-
[28]
Context-aware planning and environment-aware memory for instruction following em- bodied agents
Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, and Jonghyun Choi. Context-aware planning and environment-aware memory for instruction following em- bodied agents. InICCV, 2023. 3
2023
-
[29]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Robot learning on the job: Human-in-the- loop autonomy and learning during deployment
Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the- loop autonomy and learning during deployment. InRobotics: Science and Systems (RSS), 2023. 4
2023
-
[33]
Visual embodied brain: Let multimodal large language models see, think, and control in spaces
Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Hao- nan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025. 3
-
[34]
Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023. 4
2023
-
[35]
Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yi- tao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 2
-
[36]
Openeqa: Embodied question answering in the era of foun- dation models
Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foun- dation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488– 16498, 2024. 1, 2
2024
-
[37]
Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity
Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In2019 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), pages 1048–105...
2019
-
[38]
Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023. 3
2023
-
[39]
Learning and retrieval from prior data for skill- based imitation learning
Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill- based imitation learning. InConference on Robot Learning (CoRL), 2022. 4
2022
-
[40]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Poo- ley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, ...
2024
-
[41]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Un- dersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[42]
Explore until confi- dent: Efficient exploration for embodied question answering
Allen Z Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, and Dorsa Sadigh. Explore until confi- dent: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941, 2024. 2
-
[43]
Robovqa: Multimodal long-horizon reasoning for robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, De- bidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE,
-
[44]
Mu- tex: Learning unified policies from multimodal task specifi- cations
Rutav Shah, Roberto Mart ´ın-Mart´ın, and Yuke Zhu. Mu- tex: Learning unified policies from multimodal task specifi- cations. In7th Annual Conference on Robot Learning, 2023. 4
2023
-
[45]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyim- ing Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision- language-action models.arXiv preprint arXiv:2502.19417,
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020. 3, 4
2020
-
[48]
Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029,
BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025. 3, 7, 8
-
[49]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are- nas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Edan: An emg-controlled daily assistant to help people with physical disabilities
J ¨orn V ogel, Annette Hagengruber, Maged Iskandar, Gabriel Quere, Ulrike Leipscher, Samuel Bustamante, Alexander Di- etrich, Hannes H ¨oppner, Daniel Leidner, and Alin Albu- Sch¨affer. Edan: An emg-controlled daily assistant to help people with physical disabilities. In2020 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), pag...
2020
-
[52]
Bridgedata v2: A dataset for robot learning at scale
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning (CoRL), 2023. 4
2023
-
[53]
Multilingual E5 Text Embeddings: A Technical Report
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Ran- gan Majumder, and Furu Wei. Multilingual e5 text embed- dings: A technical report.arXiv preprint arXiv:2402.05672,
work page internal anchor Pith review Pith/arXiv arXiv
-
[54]
Embodied question answering in photorealistic environments with point cloud perception
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Ab- hishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, and Dhruv Batra. Embodied question answering in photorealistic environments with point cloud perception. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 6659–6668, 2019. 2
2019
-
[55]
Tao Wu, Chuhao Zhou, Yen Heng Wong, Lin Gu, and Jianfei Yang. Noisyeqa: Benchmarking embodied question answer- ing against noisy queries.arXiv preprint arXiv:2412.10726,
-
[56]
Building Generalizable Agents with a Realistic and Rich 3D Environment
Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d en- vironment.arXiv preprint arXiv:1801.02209, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[57]
Multi-target embodied question answering
Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L Berg, and Dhruv Batra. Multi-target embodied question answering. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 6309–6318, 2019. 2
2019
-
[58]
Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousa- vian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024. 3, 4, 5, 8
-
[59]
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin- qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
2025
-
[61]
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, et al. Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025. 3
-
[62]
Fanuc ma- nipulation: A dataset for learning-based manipulation with fanuc mate 200id robot.https://sites.google
Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc ma- nipulation: A dataset for learning-based manipulation with fanuc mate 200id robot.https://sites.google. com/berkeley.edu/fanuc-manipulation, 2023. 4
2023
-
[63]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 3
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.