HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

Boyang Yu; Fa Fu; JiaLiang Han; Liu Liu; Tingyang Xiao; Wei Feng; Wei Sui; Xiaolin Zhou; Xinjie Wang; Xinrui Meng

arxiv: 2606.23565 · v1 · pith:RXLV23P6new · submitted 2026-06-22 · 💻 cs.RO · cs.CV

Xiaolin Zhou, Liu Liu, Tingyang Xiao, Wei Feng, Fa Fu, Xinrui Meng, Xinjie Wang, Jialiang Han, Boyang Yu, Yun Du, Wei Sui, Zhizhong Su

Pith reviewed 2026-06-26 08:25 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords embodied agents · 3D spatial memory · closed-loop execution · skill graphs · robot deployment · mobile manipulation · multi-robot coordination · language to action

The pith

HoloAgent-0 couples Embodied AgentOS, 3D spatial memory, and embodied skills into one framework that converts language instructions into closed-loop robot actions on physical hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HoloAgent-0 as a way to bring the standard LLM agent execution loop (reason, act, observe, revise) into physical robots. It achieves this by organizing robot systems into three coupled layers that handle planning, world grounding, and low-level actions. Embodied AgentOS turns instructions into skill graphs, allocates resources, watches execution, and calls for re-planning when feedback arrives. The authors test the system on real robots for navigation, object search, coordination, and manipulation. A reader would care because it offers a single structure that keeps digital reasoning connected to continuous, uncertain physical execution instead of leaving those capabilities as separate modules.

Core claim

HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. The framework is deployed and evaluated on real hardware for spatial memory accuracy, long-horizon navigation, and closed-loop performance across motion generation, object search, cross-robot coordination, and mobile manipulation.
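The closed loop described above (plan a skill graph, execute, monitor, re-plan on failure) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Skill` class, the `execute` function, the `replan` callback, and the retry limit are all hypothetical names chosen for illustration, and real skills would be robot controllers rather than Python lambdas.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


@dataclass
class Skill:
    """One node of a hypothetical skill graph."""
    name: str
    run: Callable[[], bool]                      # returns True on success
    deps: Set[str] = field(default_factory=set)  # skills that must finish first


def execute(skills: Dict[str, Skill],
            replan: Callable[[str], Callable[[], bool]],
            max_retries: int = 1) -> List[str]:
    """Run skills in dependency order; on failure, ask `replan` for a new action."""
    order = TopologicalSorter({s.name: s.deps for s in skills.values()})
    log: List[str] = []
    for name in order.static_order():
        action = skills[name].run
        for attempt in range(max_retries + 1):
            if action():                         # monitor: observe the outcome
                log.append(f"{name}:ok")
                break
            log.append(f"{name}:failed")
            if attempt == max_retries:
                raise RuntimeError(f"skill {name!r} failed after re-planning")
            action = replan(name)                # revise: swap in a re-planned action
    return log
```

With a `navigate` skill that succeeds and a dependent `grasp` skill that fails once, the log records the failure followed by the successful re-planned attempt. A fuller design would also re-plan the remaining graph rather than a single node.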

What carries the argument

Three coupled layers (Embodied AgentOS for execution control, 3D spatial memory for grounding, and embodied skills for action) that together convert language into monitored, feedback-driven robot behavior.

If this is right

Language instructions become executable skill graphs that the OS layer can schedule and monitor.
Runtime feedback can trigger clarification requests or full re-planning without external intervention.
The same structure supports multiple robot types and tasks including cross-robot coordination and mobile manipulation on physical hardware.
Spatial memory supplies the grounding needed for navigation and object search over extended sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The OS layer abstraction may make it easier to swap in new robot hardware without rewriting the reasoning components.
Extending the skill graph representation to include probabilistic outcomes could improve handling of sensor noise.
The framework's separation of memory from skills suggests it could be combined with existing simulation-to-real transfer methods for faster initial training.

Load-bearing premise

The three layers can be coupled to produce reliable closed-loop execution on physical robots despite continuous motion, embodiment dependence, uncertainty, and safety constraints.

What would settle it

A real-robot trial in which the system fails to detect execution errors or produce a safe re-plan during a long-horizon task that involves moving objects or changing obstacles.

Figures

Figures reproduced from arXiv: 2606.23565 by Boyang Yu, Fa Fu, JiaLiang Han, Liu Liu, Tingyang Xiao, Wei Feng, Wei Sui, Xiaolin Zhou, Xinjie Wang, Xinrui Meng, Yun Du, Zhizhong Su.

**Figure 1.** Overview of the HoloAgent-0 framework. HoloAgent-0 connects Embodied AgentOS, embodied memory, and embodied skills through a ROS2 command/status bus, forming a closed loop for spatial retrieval, skill-graph planning, execution monitoring, memory updates, and feedback-driven re-planning.

**Figure 2.** Representative closed-loop executions with HoloAgent-0. (a) Prompt Motion Control: execute and verify short-horizon whole-body commands. (b) Active Object Search: explore, build the map, and verify the target coffee machine. (c) Cross-Robot Coordination: route one robot while another performs a dance skill. (d) Long-Horizon Mobile Manipulation: decompose laundry folding into navigation, pick-and-place, mo…

**Figure 3.** HoloNavi object-navigation pipeline. Given a language goal, HoloNavi first performs hierarchical semantic navigation by parsing the instruction and matching it against floor-, room-, view-, and object-level memory. After reaching a candidate viewpoint, an online verification loop validates the target with open-vocabulary detection and local viewpoint adjustment. If verification fails, a frontier explorati…
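The control flow in the Figure 3 caption (try memory-retrieved viewpoints, verify on arrival, fall back to exploration when verification fails) can be sketched as follows. The caption is truncated, so the fallback behaviour here is an assumption; `navigate_to_object`, `go_to`, and `verify` are hypothetical names, and local viewpoint adjustment is omitted.

```python
from typing import Callable, Iterable, Optional, Tuple

Viewpoint = Tuple[float, float]


def navigate_to_object(
    candidates: Iterable[Viewpoint],
    frontiers: Iterable[Viewpoint],
    go_to: Callable[[Viewpoint], None],
    verify: Callable[[], bool],
) -> Optional[Viewpoint]:
    """Try memory-retrieved viewpoints first, then fall back to frontiers."""
    for source in (candidates, frontiers):
        for viewpoint in source:
            go_to(viewpoint)
            if verify():      # stand-in for open-vocabulary detection
                return viewpoint
    return None               # target not confirmed anywhere
```

Passing the verifier as a callback keeps the loop independent of any particular detector, which is one plausible way to make the navigation skill reusable across robots.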

**Figure 4.** Overview of the HoloAgent-0 semantic mapping framework. The semantic memory lifts open-vocabulary 2D features onto geometry memory, associates observations with persistent 3D instances, and provides language-queryable object and region evidence for AgentOS. Semantic memory turns metric geometry into language-grounded 3D spatial memory. On top of geometry memory, an open-vocabulary online mapping module lif…

**Figure 5.** Instance association. The semantic memory projects existing 3D instances into the current view and matches them with new 2D masks to maintain persistent object identities. 4.3 Hierarchical Multimodal Scene Graph for Structured Spatial Retrieval The hierarchical multimodal scene graph (HMSG) provides the structured retrieval interface between semantic memory and Embodied AgentOS. Following FSR-VLN [39], HM…
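The instance association step named in Figure 5 (match projected 3D instances against new 2D masks) is commonly done by mask overlap. The sketch below uses greedy intersection-over-union matching as an assumed stand-in; the source does not state which matching rule the paper uses. Masks are plain sets of pixel coordinates, and `associate` and `min_iou` are hypothetical names.

```python
from typing import Dict, Hashable, List, Set, Tuple

Pixel = Tuple[int, int]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def associate(projected: Dict[Hashable, Set[Pixel]],
              detections: List[Set[Pixel]],
              min_iou: float = 0.5) -> Dict[int, Hashable]:
    """Greedily map detection index -> existing instance id by descending IoU."""
    pairs = sorted(
        ((iou(mask, det), inst_id, j)
         for inst_id, mask in projected.items()
         for j, det in enumerate(detections)),
        key=lambda p: p[0], reverse=True,
    )
    matches: Dict[int, Hashable] = {}
    used = set()
    for score, inst_id, j in pairs:
        if score < min_iou:
            break                      # remaining pairs overlap too little
        if j in matches or inst_id in used:
            continue                   # each side is matched at most once
        matches[j] = inst_id
        used.add(inst_id)
    return matches
```

Detections left out of the returned mapping would be candidates for new instances in memory.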

**Figure 6.** Hierarchical Multimodal Scene Graph (HMSG). HMSG organizes memory into floor, room, view, and object layers. In HoloAgent-0, this hierarchy acts as the retrieval index for AgentOS, supporting coarse-to-fine target grounding, VLM-based verification, and memory updates from execution feedback. resume a long-horizon task after interruptions, bind later skill calls to earlier decisions, and avoid losing contex…
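The four-layer hierarchy in Figure 6 lends itself to a coarse-to-fine lookup. The sketch below is a simplified illustration only: `Node`, `LEVELS`, and `retrieve` are hypothetical names, and exact string matching stands in for the multimodal and VLM-based matching that the caption mentions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LEVELS = ("floor", "room", "view", "object")


@dataclass
class Node:
    level: str                       # one of LEVELS; assumed to match its depth
    label: str
    children: List["Node"] = field(default_factory=list)


def retrieve(roots: List[Node], query: dict) -> Optional[List[str]]:
    """Descend floor -> room -> view -> object, pruning by labels in `query`."""
    def descend(nodes, depth, path):
        level = LEVELS[depth]
        wanted = query.get(level)    # None means "any label at this level"
        for node in nodes:
            if wanted is not None and node.label != wanted:
                continue
            here = path + [node.label]
            if level == "object":
                return here
            found = descend(node.children, depth + 1, here)
            if found:
                return found
        return None
    return descend(roots, 0, [])
```

A query such as `{"room": "kitchen", "object": "coffee machine"}` prunes every other room before any view or object is inspected, which is the benefit of indexing memory hierarchically.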

**Figure 7.** Robot platforms used in real-world experiments. HoloAgent-0 is deployed on heterogeneous embodiments, including humanoid platforms for navigation, interaction, and whole-body motion, and a wheeled dual-arm platform for mobile manipulation. 5 Experiments 5.1 Experimental Setup Our evaluation separates repeatable quantitative measurements from broader real-robot system demonstrations. The quantitative exper…

**Figure 8.** Dynamic memory update. The memory layer localizes the robot in the existing map, refreshes changed geometry and semantic instances around the current observation, and updates only the affected scene-graph nodes and relations. The traces highlight the role of the typed skill abstraction. AgentOS can plan over spatial memory, verify targets through navigation and perception feedback, coordinate multiple rob…
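The localized update in Figure 8 (refresh only what lies around the current observation) can be sketched with instance positions alone. This is an assumption-laden simplification: `update_memory`, the spherical `radius` test, and the use of instance ids shared between memory and observation are all hypothetical, and geometry and relation updates are left out.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def update_memory(memory: Dict[str, Point],
                  observed: Dict[str, Point],
                  robot_xyz: Point,
                  radius: float) -> List[str]:
    """Refresh instances near the robot in place; return the ids that changed."""
    changed: List[str] = []
    # Drop instances that should be visible from here but were not observed.
    for inst_id, pos in list(memory.items()):
        in_view = math.dist(pos, robot_xyz) <= radius
        if in_view and inst_id not in observed:
            del memory[inst_id]
            changed.append(inst_id)
    # Add new instances and move the ones whose position differs.
    for inst_id, pos in observed.items():
        if memory.get(inst_id) != pos:
            memory[inst_id] = pos
            changed.append(inst_id)
    return changed
```

Instances outside the radius are never touched, so the returned list is exactly the set of scene-graph nodes a caller would need to re-link.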

Original abstract

LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HoloAgent-0 is a three-layer framework paper that claims real-robot closed-loop execution but supplies no quantitative results or coupling details in the abstract.

The main takeaway is that this paper introduces HoloAgent-0 as a unified framework with three layers—Embodied AgentOS for turning language into skill graphs and handling feedback, 3D spatial memory for grounding, and embodied skills for actions—and states it was deployed on real hardware for navigation, object search, coordination, and manipulation.

What stands out as new is the explicit three-layer organization and the AgentOS component that schedules resources and triggers re-planning from runtime signals. The paper does a reasonable job of pulling together ideas that often sit in separate modules and framing them as one practical stack for heterogeneous robots.

The soft spots are straightforward. The abstract contains zero numbers, no baselines, no error rates, and no description of the actual interfaces between layers—how spatial memory updates sync with continuous motion, what data structures move between components, or the exact rules for choosing clarification versus re-planning. That gap directly affects the central claim of reliable closed-loop behavior under uncertainty and safety constraints. The stress-test note on unspecified coupling is accurate based on the provided text.

This is the kind of systems paper that might interest researchers building integrated robot agents who want to see one group's way of wiring the pieces together. A reader already working on embodied LLM agents or spatial memory could extract useful high-level structure, but anyone looking for evidence that the integration works better than prior combinations will come away empty.

It deserves peer review. The deployment claim across multiple tasks is concrete enough to warrant referees asking for the missing methods, metrics, and interface specifications rather than a desk reject.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. The framework organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. The authors claim deployment on real hardware with evaluation of spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.

Significance. If the integration of the three layers produces reliable closed-loop execution on physical robots, the work could advance practical embodied AI by addressing continuous motion, uncertainty, and safety constraints in a unified manner. However, the provided text supplies no quantitative results, error bars, or detailed methods, which limits assessment of whether the claimed unification delivers measurable improvements over existing specialized modules.

major comments (2)

[Abstract] The abstract states that the system was deployed and evaluated but supplies no quantitative results, error bars, methods details, or data to support the central claims of successful integration and performance.
[Abstract] The coupling mechanism between Embodied AgentOS, 3D spatial memory, and embodied skills remains unspecified. No description is given of the data structures passed between layers, how spatial memory updates are synchronized with continuous robot motion, or the precise conditions under which feedback triggers re-planning versus clarification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without misrepresenting its current content.

Referee: [Abstract] The abstract states that the system was deployed and evaluated but supplies no quantitative results, error bars, methods details, or data to support the central claims of successful integration and performance.

Authors: We agree that the current abstract lacks quantitative results, error bars, and methods details. The provided manuscript text does not contain these elements. In the revised version we will expand the abstract to include concise quantitative highlights from the hardware evaluations (e.g., task success rates and execution metrics) together with a brief statement of the evaluation protocol.

revision: yes

Referee: [Abstract] The coupling mechanism between Embodied AgentOS, 3D spatial memory, and embodied skills remains unspecified. No description is given of the data structures passed between layers, how spatial memory updates are synchronized with continuous robot motion, or the precise conditions under which feedback triggers re-planning versus clarification.

Authors: We agree that the abstract does not specify the coupling mechanisms, data structures, synchronization, or re-planning conditions. The current manuscript text does not provide these details. We will revise the abstract to add a short description of the interfaces (skill graphs as the primary data structure, periodic 3D map updates for synchronization, and threshold-based triggers for re-planning versus clarification requests) and will expand the methods section accordingly.

revision: yes
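To make the phrase "threshold-based triggers" concrete, here is one way such a rule could look. Everything in it is hypothetical: the two signals, their names, and the default thresholds are illustrative assumptions, since neither the abstract nor the simulated rebuttal specifies them.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    CLARIFY = "clarify"   # ask the user to disambiguate the instruction
    REPLAN = "replan"     # rebuild the remaining skill graph


def decide(grounding_confidence: float, execution_error: float,
           min_confidence: float = 0.5, max_error: float = 0.2) -> Decision:
    """Pick a recovery action from two hypothetical runtime signals."""
    if grounding_confidence < min_confidence:
        return Decision.CLARIFY      # the goal itself is ambiguous
    if execution_error > max_error:
        return Decision.REPLAN       # the goal is clear but execution drifted
    return Decision.CONTINUE
```

Checking grounding before execution error reflects one possible design choice: re-planning toward a poorly grounded target would not help, so clarification takes priority.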

Circularity Check

0 steps flagged

No significant circularity; framework description contains no derivations or fitted predictions.

The paper presents an architectural framework (Embodied AgentOS, 3D spatial memory, embodied skills) and reports real-robot deployments without any equations, parameter fitting, predictions, or first-principles derivations. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The coupling claim is descriptive rather than derived, so no circularity analysis applies. This is the expected outcome for a systems paper lacking mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond naming the framework and its layers; no numerical fits or unstated background assumptions are detailed.

pith-pipeline@v0.9.1-grok · 5760 in / 1187 out tokens · 23422 ms · 2026-06-26T08:25:25.499334+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 19 linked inside Pith

[1]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

2023
[2]

Toolformer: Language models can teach them- selves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach them- selves to use tools. In Advances in Neural Information Processing Systems , 2023

2023
[3]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023

2023
[4]

White, Doug Burger, and Chi Wang

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

Pith/arXiv arXiv 2023
[5]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 , 2023

Pith/arXiv arXiv 2023
[6]

Hierarchical task and motion planning in the now

Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE international conference on robotics and automation , pages 1470–1477. IEEE, 2011

2011
[7]

Integrated task and motion planning

Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kael- bling, and Tomás Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems , 4(1):265–293, 2021

2021
[8]

Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T. Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar...

2023
[9]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of The 6th Conference on Rob...

2023
[10]

pi0: A vision-language-action flow model for general robot control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024

Pith/arXiv arXiv 2024
[11]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 , 2025

Pith/arXiv arXiv 2025
[12]

Openvla: An open-source vision-language- action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model. arXiv preprint arXiv:2406.09246 , 2024. 17

Pith/arXiv arXiv 2024
[13]

What matters in building vision-language-action models for generalist robots

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. What matters in building vision-language-action models for generalist robots. arXiv preprint arXiv:2412.14058 , 2024

Pith/arXiv arXiv 2024
[14]

Spatialvla: Exploring spatial representations for visual-language-action model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 , 2025

Pith/arXiv arXiv 2025
[15]

HoloBrain-0 technical report

Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, et al. HoloBrain-0 technical report. arXiv preprint arXiv:2602.12062 , 2026

arXiv 2026
[16]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 , 2023

Pith/arXiv arXiv 2023
[17]

Causal world modeling for robot control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 , 2026

Pith/arXiv arXiv 2026
[18]

Motus: A unified latent action world model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025
[19]

World action models are zero-shot policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 , 2026

Pith/arXiv arXiv 2026
[20]

ScanQA: 3D question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022

2022
[21]

3D-LLM: Injecting the 3D world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. In Advances in Neural Information Processing Systems, 2023

2023
[22]

EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, and Jiangmiao Pang. EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 19757–19767, 2024

2024
[23]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. arXiv preprint arXiv:2401.12168 , 2024

arXiv 2024
[24]

VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction

Jianing Yang, Haoran Zhang, Yikai Wang, Tianheng Cheng, Wenyu Liu, Xinggang Wang, and Ziwei Liu. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. arXiv preprint arXiv:2505.20279 , 2025

Pith/arXiv arXiv 2025
[25]

Spa3R: Predictive spatial field modeling for 3D visual reasoning

Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, and Xinggang Wang. Spa3R: Predictive spatial field modeling for 3D visual reasoning. arXiv preprint arXiv:2602.21186, 2026

arXiv 2026
[26]

Bip3d: Bridging 2d images and 3d perception for embodied intelligence

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, and Zhizhong Su. Bip3d: Bridging 2d images and 3d perception for embodied intelligence. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9007–9016, 2025

2025
[27]

Vision-and-language navigation today and tomorrow: A survey in the era of foundation models

Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, and Parisa Kordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. arXiv preprint arXiv:2407.07035 , 2024. 18

arXiv 2024
[28]

Mapdream: Task-driven map learning for vision-language navigation

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Mapdream: Task-driven map learning for vision-language navigation. arXiv preprint arXiv:2602.00222 , 2026

arXiv 2026
[29]

Progress-think: Semantic progress reasoning for vision-language navigation

Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, and Zhaoxin Fan. Progress-think: Semantic progress reasoning for vision-language navigation. arXiv preprint arXiv:2511.17097 , 2025

Pith/arXiv arXiv 2025
[30]

Monodream: Monocular vision-language navi- gation with panoramic dreaming

Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, and Zhaoxin Fan. Monodream: Monocular vision-language navi- gation with panoramic dreaming. arXiv preprint arXiv:2508.02549 , 2025

arXiv 2025
[31]

Aux-think: Exploring reasoning strategies for data-eﬀicient vision-language navigation

Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reasoning strategies for data-eﬀicient vision-language navigation. arXiv preprint arXiv:2505.11886 , 2025

arXiv 2025
[32]

GMT: General motion tracking for humanoid whole-body control

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. GMT: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770 , 2025

arXiv 2025
[33]

Track any motions under any disturbances

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Huaping Liu, He Wang, and Li Yi. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833 , 2025

arXiv 2025
[34]

SONIC: Supersizing motion tracking for natural humanoid whole-body control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. SONIC: Supersizing motion tracking for natura...

Pith/arXiv arXiv 2025
[35]

HoloMotion-1 technical report

Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, and Zhizhong Su. HoloMotion-1 technical report. arXiv preprint arXiv:2605.15336 , 2026

Pith/arXiv arXiv 2026
[36]

π0.5: A vision-language-action model with open-world generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 , 2025

Pith/arXiv arXiv 2025
[37]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Ser- manet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

2023
[38]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning , 2023

2023
[39]

Fsr-vln: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph

Xiaolin Zhou, Tingyang Xiao, Liu Liu, Yucheng Wang, Maiyue Chen, Xinrui Meng, Xinjie Wang, Wei Feng, Wei Sui, and Zhizhong Su. Fsr-vln: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph. arXiv preprint arXiv:2509.13733 , 2025

arXiv 2025
[40]

F AST-LIVO: Fast and tightly-coupled sparse-direct LiDAR-inertial-visual odometry

Chunran Zheng, Qingyan Zhu, Wei Xu, Xiyuan Liu, Qizhi Guo, and Fu Zhang. F AST-LIVO: Fast and tightly-coupled sparse-direct LiDAR-inertial-visual odometry. In IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 4003–4009, 2022

2022
[41]

GeoFlow-SLAM: A robust tightly-coupled RGBD-inertial and legged odometry fusion SLAM for dynamic legged robotics

Tingyang Xiao, Xiaolin Zhou, Liu Liu, Wei Sui, Wei Feng, Jiaxiong Qiu, Xinjie Wang, and Zhizhong Su. GeoFlow-SLAM: A robust tightly-coupled RGBD-inertial and legged odometry fusion SLAM for dynamic legged robotics. arXiv preprint arXiv:2503.14247 , 2025. 19

arXiv 2025
[42]

Oswald, and Javier Civera

Tomas Berriel Martins, Martin R. Oswald, and Javier Civera. Open-vocabulary online semantic mapping for SLAM. IEEE Robotics and Automation Letters , 2025

2025
[43]

IRIS-SLAM: Unified geo-instance representations for robust semantic localization and mapping

Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, and Zhizhong Su. IRIS-SLAM: Unified geo-instance representations for robust semantic localization and mapping. arXiv preprint arXiv:2602.18709 , 2026

arXiv 2026
[44]

MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, and Chenglu Wen. MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 37154–37163, 2026

2026
[45]

VLFM: Vision- language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision- language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation , pages 42–48, 2024

2024
[46]

SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation. In Advances in Neural Information Processing Systems , 2024

2024
[47]

DORAEMON: Decentralized ontology-aware reliable agent with enhanced memory oriented navigation

Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, and Xin Tan. DORAEMON: Decentralized ontology-aware reliable agent with enhanced memory oriented navigation. arXiv preprint arXiv:2505.21969 , 2025

arXiv 2025
[48]

WMNav: Integrating vision-language models into world models for object goal navigation

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, and Long Chen. WMNav: Integrating vision-language models into world models for object goal navigation. arXiv preprint arXiv:2503.02247, 2025

arXiv 2025
[49]

OK-Robot: What really matters in integrating open-knowledge models for robotics

Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024

arXiv 2024
[50]

Mobility VLA: Multimodal instruction navigation with long-context VLMs and topological graphs

Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility VLA: Multimodal instructi...

2025
[51]

Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation

Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Robotics: Science and Systems, 2024

2024
[52]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

2023
[53]

Robot operating system 2: Design, architecture, and uses in the wild

Steve Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot operating system 2: Design, architecture, and uses in the wild. Science Robotics, 7(66):eabm6074, 2022

2022
[54]

Christopher E. Mower, Yuhui Wan, Hongzhan Yu, Antoine Grosnit, Jonas Gonzalez-Billandon, Matthieu Zimmer, Jinlong Wang, Xinyu Zhang, Yao Zhao, Anbang Zhai, Puze Liu, Daniel Palenicek, Davide Tateo, Cesar Cadena, Marco Hutter, Jan Peters, Guangjian Tian, Yuzheng Zhuang, Kun Shao, Xingyue Quan, Jianye Hao, Jun Wang, and Haitham Bou-Ammar. ROS-LLM: A ROS fra...

Pith/arXiv arXiv 2024
[55]

RoboOS: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration

Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. RoboOS: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration. arXiv preprint arXiv:2505.03673, 2025

arXiv 2025
[56]

EMOS: Embodiment-aware heterogeneous multi-robot operating system with LLM agents

Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, and Lin Shao. EMOS: Embodiment-aware heterogeneous multi-robot operating system with LLM agents. arXiv preprint arXiv:2410.22662, 2024

arXiv 2024
[57]

AEROS: A single-agent operating architecture with embodied capability modules

Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. AEROS: A single-agent operating architecture with embodied capability modules. arXiv preprint arXiv:2604.07039, 2026

Pith/arXiv arXiv 2026
[58]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation, pages 9493–9500, 2023

2023
[59]

π0.7: A steerable generalist robotic foundation model with emergent capabilities

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al. π0.7: A steerable generalist robotic foundation model with emergent capabilities. arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026
[60]

3d scene graph: A structure for unified semantics, 3d space, and camera

Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5664–5673, 2019

2019
[61]

Visual language maps for robot navigation

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In IEEE International Conference on Robotics and Automation, 2023

2023
[62]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–824, 2023

2023
[63]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, and Liam Paull. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In IEEE Internat...

2024
[64]

Sem: Enhancing spatial understanding for robust robot manipulation

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Yiwei Jin, Keyu Li, and Zhizhong Su. Sem: Enhancing spatial understanding for robust robot manipulation. arXiv preprint arXiv:2505.16196, 2025

arXiv 2025
[65]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018
[66]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4392–4412, 2020

2020
[67]

EmbodiedGen: Towards a generative 3D world engine for embodied intelligence

Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. EmbodiedGen: Towards a generative 3D world engine for embodied intelligence. arXiv preprint arXiv:2506.10600, 2025

arXiv 2025
