HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory
Pith reviewed 2026-06-26 08:25 UTC · model grok-4.3
The pith
HoloAgent-0 couples Embodied AgentOS, 3D spatial memory, and embodied skills into one framework that converts language instructions into closed-loop robot actions on physical hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. The framework is deployed and evaluated on real hardware for spatial memory accuracy, long-horizon navigation, and closed-loop performance across motion generation, object search, cross-robot coordination, and mobile manipulation.
What carries the argument
Three coupled layers (Embodied AgentOS for execution control, 3D spatial memory for grounding, and embodied skills for action) that together convert language into monitored, feedback-driven robot behavior.
If this is right
- Language instructions become executable skill graphs that the OS layer can schedule and monitor.
- Runtime feedback can trigger clarification requests or full re-planning without external intervention.
- The same structure supports multiple robot types and tasks including cross-robot coordination and mobile manipulation on physical hardware.
- Spatial memory supplies the grounding needed for navigation and object search over extended sequences.
Where Pith is reading between the lines
- The OS layer abstraction may make it easier to swap in new robot hardware without rewriting the reasoning components.
- Extending the skill graph representation to include probabilistic outcomes could improve handling of sensor noise.
- The framework's separation of memory from skills suggests it could be combined with existing simulation-to-real transfer methods for faster initial training.
Load-bearing premise
The three layers can be coupled to produce reliable closed-loop execution on physical robots despite continuous motion, embodiment dependence, uncertainty, and safety constraints.
What would settle it
A real-robot trial in which the system fails to detect execution errors or produce a safe re-plan during a long-horizon task that involves moving objects or changing obstacles.
Figures
read the original abstract
LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. The framework organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. The authors claim deployment on real hardware with evaluation of spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.
Significance. If the integration of the three layers produces reliable closed-loop execution on physical robots, the work could advance practical embodied AI by addressing continuous motion, uncertainty, and safety constraints in a unified manner. However, the provided text supplies no quantitative results, error bars, or detailed methods, which limits assessment of whether the claimed unification delivers measurable improvements over existing specialized modules.
major comments (2)
- [Abstract] Abstract: The abstract states that the system was deployed and evaluated but supplies no quantitative results, error bars, methods details, or data to support the central claims of successful integration and performance.
- [Abstract] Abstract: The coupling mechanism between Embodied AgentOS, 3D spatial memory, and embodied skills remains unspecified. No description is given of the data structures passed between layers, how spatial memory updates are synchronized with continuous robot motion, or the precise conditions under which feedback triggers re-planning versus clarification.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without misrepresenting its current content.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract states that the system was deployed and evaluated but supplies no quantitative results, error bars, methods details, or data to support the central claims of successful integration and performance.
Authors: We agree that the current abstract lacks quantitative results, error bars, and methods details. The provided manuscript text does not contain these elements. In the revised version we will expand the abstract to include concise quantitative highlights from the hardware evaluations (e.g., task success rates and execution metrics) together with a brief statement of the evaluation protocol. revision: yes
-
Referee: [Abstract] Abstract: The coupling mechanism between Embodied AgentOS, 3D spatial memory, and embodied skills remains unspecified. No description is given of the data structures passed between layers, how spatial memory updates are synchronized with continuous robot motion, or the precise conditions under which feedback triggers re-planning versus clarification.
Authors: We agree that the abstract does not specify the coupling mechanisms, data structures, synchronization, or re-planning conditions. The current manuscript text does not provide these details. We will revise the abstract to add a short description of the interfaces (skill graphs as the primary data structure, periodic 3D map updates for synchronization, and threshold-based triggers for re-planning versus clarification requests) and will expand the methods section accordingly. revision: yes
Circularity Check
No significant circularity; framework description contains no derivations or fitted predictions.
full rationale
The paper presents an architectural framework (Embodied AgentOS, 3D spatial memory, embodied skills) and reports real-robot deployments without any equations, parameter fitting, predictions, or first-principles derivations. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The coupling claim is descriptive rather than derived, so no circularity analysis applies. This is the expected outcome for a systems paper lacking mathematical content.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023
2023
-
[2]
Toolformer: Language models can teach them- selves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach them- selves to use tools. In Advances in Neural Information Processing Systems , 2023
2023
-
[3]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023
2023
-
[4]
White, Doug Burger, and Chi Wang
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023
Pith/arXiv arXiv 2023
-
[5]
Voyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 , 2023
Pith/arXiv arXiv 2023
-
[6]
Hierarchical task and motion planning in the now
Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE international conference on robotics and automation , pages 1470–1477. IEEE, 2011
2011
-
[7]
Integrated task and motion planning
Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kael- bling, and Tomás Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems , 4(1):265–293, 2021
2021
-
[8]
Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T. Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar...
2023
-
[9]
Inner monologue: Embodied reasoning through planning with language models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of The 6th Conference on Rob...
2023
-
[10]
pi0: A vision-language-action flow model for general robot control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024
Pith/arXiv arXiv 2024
-
[11]
Gr00t n1: An open foundation model for generalist humanoid robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 , 2025
Pith/arXiv arXiv 2025
-
[12]
Openvla: An open-source vision-language- action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model. arXiv preprint arXiv:2406.09246 , 2024. 17
Pith/arXiv arXiv 2024
-
[13]
What matters in building vision-language-action models for generalist robots
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. What matters in building vision-language-action models for generalist robots. arXiv preprint arXiv:2412.14058 , 2024
Pith/arXiv arXiv 2024
-
[14]
Spatialvla: Exploring spatial representations for visual-language-action model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 , 2025
Pith/arXiv arXiv 2025
-
[15]
Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, et al. HoloBrain-0 technical report. arXiv preprint arXiv:2602.12062 , 2026
arXiv 2026
-
[16]
Unleashing large-scale video generative pre-training for visual robot manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 , 2023
Pith/arXiv arXiv 2023
-
[17]
Causal world modeling for robot control
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 , 2026
Pith/arXiv arXiv 2026
-
[18]
Motus: A unified latent action world model
Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025
Pith/arXiv arXiv 2025
-
[19]
World action models are zero-shot policies
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 , 2026
Pith/arXiv arXiv 2026
-
[20]
ScanQA: 3D question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022
2022
-
[21]
3D-LLM: Injecting the 3D world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. In Advances in Neural Information Processing Systems, 2023
2023
-
[22]
EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI
Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, and Jiangmiao Pang. EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 19757–19767, 2024
2024
-
[23]
SpatialVLM: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. arXiv preprint arXiv:2401.12168 , 2024
arXiv 2024
-
[24]
VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction
Jianing Yang, Haoran Zhang, Yikai Wang, Tianheng Cheng, Wenyu Liu, Xinggang Wang, and Ziwei Liu. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. arXiv preprint arXiv:2505.20279 , 2025
Pith/arXiv arXiv 2025
-
[25]
Spa3R: Predictive spatial field modeling for 3D visual reasoning
Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, and Xinggang Wang. Spa3R: Predictive spatial field modeling for 3D visual reasoning. arXiv preprint arXiv:2602.21186, 2026
arXiv 2026
-
[26]
Bip3d: Bridging 2d images and 3d perception for embodied intelligence
Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, and Zhizhong Su. Bip3d: Bridging 2d images and 3d perception for embodied intelligence. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9007–9016, 2025
2025
-
[27]
Vision-and-language navigation today and tomorrow: A survey in the era of foundation models
Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, and Parisa Kordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. arXiv preprint arXiv:2407.07035 , 2024. 18
arXiv 2024
-
[28]
Mapdream: Task-driven map learning for vision-language navigation
Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Mapdream: Task-driven map learning for vision-language navigation. arXiv preprint arXiv:2602.00222 , 2026
arXiv 2026
-
[29]
Progress-think: Semantic progress reasoning for vision-language navigation
Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, and Zhaoxin Fan. Progress-think: Semantic progress reasoning for vision-language navigation. arXiv preprint arXiv:2511.17097 , 2025
Pith/arXiv arXiv 2025
-
[30]
Monodream: Monocular vision-language navi- gation with panoramic dreaming
Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, and Zhaoxin Fan. Monodream: Monocular vision-language navi- gation with panoramic dreaming. arXiv preprint arXiv:2508.02549 , 2025
arXiv 2025
-
[31]
Aux-think: Exploring reasoning strategies for data-efficient vision-language navigation
Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reasoning strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886 , 2025
arXiv 2025
-
[32]
GMT: General motion tracking for humanoid whole-body control
Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. GMT: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770 , 2025
arXiv 2025
-
[33]
Track any motions under any disturbances
Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Huaping Liu, He Wang, and Li Yi. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833 , 2025
arXiv 2025
-
[34]
SONIC: Supersizing motion tracking for natural humanoid whole-body control
Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. SONIC: Supersizing motion tracking for natura...
Pith/arXiv arXiv 2025
-
[35]
Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, and Zhizhong Su. HoloMotion-1 technical report. arXiv preprint arXiv:2605.15336 , 2026
Pith/arXiv arXiv 2026
-
[36]
π0.5: A vision-language-action model with open-world generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 , 2025
Pith/arXiv arXiv 2025
-
[37]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Ser- manet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...
2023
-
[38]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning , 2023
2023
-
[39]
Xiaolin Zhou, Tingyang Xiao, Liu Liu, Yucheng Wang, Maiyue Chen, Xinrui Meng, Xinjie Wang, Wei Feng, Wei Sui, and Zhizhong Su. Fsr-vln: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph. arXiv preprint arXiv:2509.13733 , 2025
arXiv 2025
-
[40]
F AST-LIVO: Fast and tightly-coupled sparse-direct LiDAR-inertial-visual odometry
Chunran Zheng, Qingyan Zhu, Wei Xu, Xiyuan Liu, Qizhi Guo, and Fu Zhang. F AST-LIVO: Fast and tightly-coupled sparse-direct LiDAR-inertial-visual odometry. In IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 4003–4009, 2022
2022
-
[41]
Tingyang Xiao, Xiaolin Zhou, Liu Liu, Wei Sui, Wei Feng, Jiaxiong Qiu, Xinjie Wang, and Zhizhong Su. GeoFlow-SLAM: A robust tightly-coupled RGBD-inertial and legged odometry fusion SLAM for dynamic legged robotics. arXiv preprint arXiv:2503.14247 , 2025. 19
arXiv 2025
-
[42]
Oswald, and Javier Civera
Tomas Berriel Martins, Martin R. Oswald, and Javier Civera. Open-vocabulary online semantic mapping for SLAM. IEEE Robotics and Automation Letters , 2025
2025
-
[43]
IRIS-SLAM: Unified geo-instance representations for robust semantic localization and mapping
Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, and Zhizhong Su. IRIS-SLAM: Unified geo-instance representations for robust semantic localization and mapping. arXiv preprint arXiv:2602.18709 , 2026
arXiv 2026
-
[44]
MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation
Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, and Chenglu Wen. MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 37154–37163, 2026
2026
-
[45]
VLFM: Vision- language frontier maps for zero-shot semantic navigation
Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision- language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation , pages 42–48, 2024
2024
-
[46]
SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation
Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation. In Advances in Neural Information Processing Systems , 2024
2024
-
[47]
DORAEMON: Decentralized ontology-aware reliable agent with enhanced memory oriented navigation
Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, and Xin Tan. DORAEMON: Decentralized ontology-aware reliable agent with enhanced memory oriented navigation. arXiv preprint arXiv:2505.21969 , 2025
arXiv 2025
-
[48]
WMNav: Integrating vision- language models into world models for object goal navigation
Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, and Long Chen. WMNav: Integrating vision- language models into world models for object goal navigation. arXiv preprint arXiv:2503.02247 , 2025
arXiv 2025
-
[49]
OK-Robot: What really matters in integrating open-knowledge models for robotics
Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024
arXiv 2024
-
[50]
Mobility VLA: Multimodal instruction navigation with long- context VLMs and topological graphs
Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility VLA: Multimodal instructi...
2025
-
[51]
Hier- archical open-vocabulary 3d scene graphs for language-grounded robot navigation
Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hier- archical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Robotics: Science and Systems , 2024
2024
-
[52]
O’Brien, Carrie J
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , 2023
2023
-
[53]
Robot operating system 2: Design, architecture, and uses in the wild
Steve Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot operating system 2: Design, architecture, and uses in the wild. Science Robotics, 7(66):eabm6074, 2022
2022
-
[54]
Christopher E. Mower, Yuhui Wan, Hongzhan Yu, Antoine Grosnit, Jonas Gonzalez-Billandon, Matthieu Zimmer, Jinlong Wang, Xinyu Zhang, Yao Zhao, Anbang Zhai, Puze Liu, Daniel Palenicek, Davide Tateo, Cesar Cadena, Marco Hutter, Jan Peters, Guangjian Tian, Yuzheng Zhuang, Kun Shao, Xingyue Quan, Jianye Hao, Jun Wang, and Haitham Bou-Ammar. ROS-LLM: A ROS fra...
Pith/arXiv arXiv 2024
-
[55]
RoboOS: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration
Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. RoboOS: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration. arXiv preprint arXiv:2505.03673 , 2025. 20
arXiv 2025
-
[56]
EMOS: Embodiment-aware heterogeneous multi-robot operating system with LLM agents
Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, and Lin Shao. EMOS: Embodiment-aware heterogeneous multi-robot operating system with LLM agents. arXiv preprint arXiv:2410.22662 , 2024
arXiv 2024
-
[57]
AEROS: A single-agent operating archi- tecture with embodied capability modules
Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. AEROS: A single-agent operating archi- tecture with embodied capability modules. arXiv preprint arXiv:2604.07039 , 2026
Pith/arXiv arXiv 2026
-
[58]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation , pages 9493–9500, 2023
2023
-
[59]
π0.7: A steerable generalist robotic foundation model with emergent capabilities
Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al. π0.7: A steerable generalist robotic foundation model with emergent capabilities. arXiv preprint arXiv:2604.15483 , 2026
Pith/arXiv arXiv 2026
-
[60]
Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese
Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 5664–5673, 2019
2019
-
[61]
Visual language maps for robot navigation
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In IEEE International Conference on Robotics and Automation , 2023
2023
-
[62]
Openscene: 3d scene understanding with open vocabularies
Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 815–824, 2023
2023
-
[63]
Tenenbaum, Antonio Torralba, Florian Shkurti, and Liam Paull
Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, and Liam Paull. Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In IEEE Internat...
2024
-
[64]
Sem: Enhancing spatial understanding for robust robot manipulation
Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Yiwei Jin, Keyu Li, and Zhizhong Su. Sem: Enhancing spatial understanding for robust robot manipulation. arXiv preprint arXiv:2505.16196 , 2025
arXiv 2025
-
[65]
Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 3674–3683, 2018
2018
-
[66]
Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , pages 4392–4412, 2020
2020
-
[67]
EmbodiedGen: Towards a generative 3D world engine for embodied intelligence
Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. EmbodiedGen: Towards a generative 3D world engine for embodied intelligence. arXiv preprint arXiv:2506.10600, 2025. 21
arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.