EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Chao Wu; Xiaosong Jia; Yuchen Zhou; Yu-Gang Jiang; Zuhao Ge; Zuxuan Wu

arxiv: 2606.03509 · v1 · pith:HEETN5GBnew · submitted 2026-06-02 · 💻 cs.CV

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

Pith reviewed 2026-06-28 11:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords: embodied navigation · zero-shot navigation · memory graph · visual semantic memory · fine-grained memory · self-evolving memory · coarse-to-fine policy

The pith

EvoMemNav maintains raw views in a hierarchical memory graph and updates it through reflection to improve zero-shot embodied navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoMemNav to address limitations in memory construction for long-horizon embodied navigation tasks. Existing approaches either lose fine-grained details by compressing into sparse nodes or incur high computational costs with 3D reconstructions. EvoMemNav keeps raw views as primary memory elements organized into a room-view-object hierarchy using semantic cues and topology. It uses a budgeted coarse-to-fine policy to efficiently search this memory and applies reflection after subtasks to write back updated priors. Tests across multiple modalities on GOAT-Bench and HM3D demonstrate gains in success metrics and better handling of complex scenarios.

Core claim

EvoMemNav builds a Visual-Semantic Memory Graph that stores raw views with semantic and topological organization, applies a budgeted coarse-to-fine querying policy, and performs reflection-driven write-back to accumulate environmental knowledge, resulting in consistent improvements in success rate and SPL for zero-shot navigation without retraining.

What carries the argument

The Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, combined with the budgeted coarse-to-fine policy and reflection-driven write-back.
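
The room-view-object hierarchy described above can be pictured with a small data structure. The following is a minimal sketch, not the paper's implementation: the class names (`ViewNode`, `RoomNode`, `MemoryGraph`), the fields, and the file paths are all hypothetical, chosen only to show raw views held as the primary memory items with labels and topological links attached.

```python
from dataclasses import dataclass, field


@dataclass
class ViewNode:
    view_id: int
    image_path: str  # the raw view itself is the memory item (hypothetical field)
    object_labels: list[str] = field(default_factory=list)  # lightweight semantic cues
    neighbors: set[int] = field(default_factory=set)  # topological links to other views


@dataclass
class RoomNode:
    name: str
    view_ids: list[int] = field(default_factory=list)


class MemoryGraph:
    """Toy room -> view -> object hierarchy; not the paper's VSMGraph code."""

    def __init__(self) -> None:
        self.rooms: dict[str, RoomNode] = {}
        self.views: dict[int, ViewNode] = {}

    def add_view(self, room: str, view_id: int, image_path: str, labels: list[str]) -> None:
        # Register the view under its room and keep a flat index for lookups.
        self.rooms.setdefault(room, RoomNode(room)).view_ids.append(view_id)
        self.views[view_id] = ViewNode(view_id, image_path, list(labels))

    def link(self, a: int, b: int) -> None:
        # Undirected edge: the agent moved between these two viewpoints.
        self.views[a].neighbors.add(b)
        self.views[b].neighbors.add(a)

    def views_with(self, label: str) -> list[int]:
        # All stored views whose cues mention the label, across rooms.
        return [v.view_id for v in self.views.values() if label in v.object_labels]


graph = MemoryGraph()
graph.add_view("kitchen", 0, "views/0000.png", ["chair", "table"])
graph.add_view("bedroom", 1, "views/0001.png", ["chair", "bed"])
graph.link(0, 1)
print(graph.views_with("chair"))
```

Because each view keeps a pointer to its raw image, two "chair" views in different rooms stay distinguishable, which is the property the paper relies on for multi-instance disambiguation.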

If this is right

Improved ability to disambiguate between multiple instances of similar objects.
Fewer premature stops during navigation due to better verification.
Enhanced zero-shot generalization to unseen environments.
Scalable memory management that avoids excessive computational costs as the environment is explored.
Accumulation of knowledge through priors that refine future navigation decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This memory approach might be adaptable to other long-horizon tasks in robotics beyond navigation.
The reflection mechanism could enable handling of dynamic changes in the environment over extended periods.
Combining this with different vision-language models might yield further performance boosts in verification stages.

Load-bearing premise

The budgeted coarse-to-fine policy with reflection-driven write-back scales memory efficiently and produces performance gains without introducing errors or needing retraining.
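
To make the premise concrete, a budgeted coarse-to-fine query could look like the sketch below. It is an illustration under stated assumptions: the coarse score is plain label overlap, `verify` is a stand-in for a VLM call, and the names `coarse_to_fine`, `top_k`, and `budget` are hypothetical rather than taken from the paper.

```python
from typing import Callable, Optional


def coarse_to_fine(
    views: list[dict],
    goal_terms: set[str],
    verify: Callable[[dict], bool],
    top_k: int = 3,
    budget: int = 2,
) -> tuple[Optional[int], int]:
    """Return (matching view id or None, number of expensive calls used)."""
    # Coarse stage: cheap label overlap shrinks the search space to top_k views.
    ranked = sorted(views, key=lambda v: len(goal_terms & v["labels"]), reverse=True)
    candidates = [v for v in ranked[:top_k] if goal_terms & v["labels"]]

    # Fine stage: the expensive verifier runs at most `budget` times.
    calls = 0
    for view in candidates[:budget]:
        calls += 1
        if verify(view):
            return view["id"], calls
    return None, calls


memory = [
    {"id": 0, "labels": {"sofa", "lamp"}},
    {"id": 1, "labels": {"chair", "red"}},
    {"id": 2, "labels": {"chair"}},
    {"id": 3, "labels": {"bed"}},
]
# Stand-in for a VLM check; here only view 2 really shows the target.
result = coarse_to_fine(memory, {"chair", "red"}, verify=lambda v: v["id"] == 2)
print(result)
```

The point of the design is that the number of expensive calls is capped by `budget` no matter how large `memory` grows; whether the cheap stage keeps the true target inside the candidate set is exactly the part the premise takes on trust.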

What would settle it

Running the system on GOAT-Bench without the reflection write-back component and observing whether the reported gains in SR and SPL disappear or reverse.

Original abstract

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.
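
The abstract does not say how the graph-attached priors are updated. One simple way to picture a write-back step is an exponential moving average over subtask outcomes, sketched below. The update rule, the `(room, object)` key, the default of 0.5, and the name `write_back` are all assumptions made for illustration, not the authors' method.

```python
def write_back(priors: dict, key: tuple, success: bool, rate: float = 0.5) -> float:
    """Blend the latest subtask outcome into the stored prior for `key`."""
    old = priors.get(key, 0.5)  # 0.5 = no opinion yet (assumed default)
    new = (1.0 - rate) * old + rate * (1.0 if success else 0.0)
    priors[key] = new
    return new


priors = {}
write_back(priors, ("kitchen", "chair"), success=True)   # 0.5 -> 0.75
write_back(priors, ("kitchen", "chair"), success=False)  # 0.75 -> 0.375
print(priors)
```

No model weights change in this loop, which matches the "without retraining" claim; it also shows why a wrong reflection matters, since a single bad outcome moves the prior by a full `rate` step.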

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvoMemNav adds a raw-view memory graph plus reflection write-back for navigation, but the abstract supplies no numbers or ablations to check the scaling claims.

The core idea is a Visual-Semantic Memory Graph that stores raw views instead of compressed nodes, layered in a room-view-object hierarchy, paired with a coarse-to-fine budgeted search and reflection updates that attach new priors after each subtask.

What stands out as new is the explicit write-back step that evolves the graph without retraining, plus the decision to keep fine-grained visual evidence for disambiguation and stop decisions. The motivation section correctly flags the usual trade-offs between sparse scene graphs and heavy 3D reconstruction.

The reported gains on GOAT-Bench and HM3D across three goal modalities sound plausible on paper, especially the claims about fewer premature stops and better multi-instance handling.

The soft spots are exactly where the stress-test flagged them. The abstract states consistent SR/SPL improvements but gives no numerical values, no baseline list, no ablation isolating the reflection step, and no scaling plot against episode length. Without those, it is impossible to tell whether the budgeted policy keeps memory growth sub-linear or whether write-back errors accumulate. The central claim therefore rests on unshown evidence.

This is aimed at people building long-horizon VLM agents. A reader already working on memory structures for navigation would pick up the framework and the reflection trick as worth trying, even if the current write-up is thin.

I would send it to review so the authors can add the missing ablations and quantitative checks on error rates.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EvoMemNav, a framework for zero-shot embodied navigation that builds a Visual-Semantic Memory Graph (VSMGraph) preserving raw views in a hierarchical structure, employs a budgeted coarse-to-fine retrieval policy, and uses reflection-driven write-back to evolve memory priors without retraining. It reports consistent improvements in success rate (SR) and success weighted by path length (SPL) on GOAT-Bench and HM3D benchmarks across object, text, and image goal modalities, attributing gains to better disambiguation and fewer premature stops.
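
For readers less familiar with the two metrics, SR is the fraction of successful episodes, and SPL, in its standard definition, averages S_i * l_i / max(p_i, l_i) over episodes, where S_i is the success flag, l_i the shortest-path length, and p_i the path length the agent actually travelled. The sketch below computes both on made-up episodes; the function name `sr_and_spl` and the numbers are illustrative and are not results from the paper.

```python
def sr_and_spl(episodes):
    """episodes: iterable of (success, shortest_path_len, agent_path_len)."""
    episodes = list(episodes)
    n = len(episodes)
    # SR: share of episodes that ended in success.
    sr = sum(1 for success, _, _ in episodes if success) / n
    # SPL: successful episodes are weighted by path efficiency; failures add 0.
    spl = sum(
        shortest / max(taken, shortest)
        for success, shortest, taken in episodes
        if success
    ) / n
    return sr, spl


# Synthetic episodes: one optimal success, one success with a 2x detour, one failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 5.0, 7.0)]
sr, spl = sr_and_spl(toy_episodes)
print(sr, spl)
```

SPL can never exceed SR, so a method that raises SR by wandering longer will show a smaller SPL gain; reporting both, as the manuscript does, separates those two effects.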

Significance. If the experimental results hold under rigorous validation, the approach offers a promising balance between memory efficiency and preservation of fine-grained visual details for long-horizon planning in embodied agents. The self-evolving aspect without requiring retraining could have broad applicability in zero-shot settings, provided the scaling properties are confirmed.

major comments (2)

[Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.
[Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.
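
One way the requested scaling check could be run, assuming memory size is logged at several step counts, is to fit a slope on log-log axes: a slope clearly below 1 is consistent with sub-linear (power-law) growth. The sketch below uses synthetic curves only; `loglog_slope` and the data are hypothetical and say nothing about the paper's actual measurements.

```python
import math


def loglog_slope(steps, sizes):
    """Least-squares slope of log(size) against log(step)."""
    xs = [math.log(t) for t in steps]
    ys = [math.log(m) for m in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


steps = [10, 20, 40, 80, 160]
sqrt_growth = [t ** 0.5 for t in steps]   # synthetic sub-linear curve
linear_growth = [3 * t for t in steps]    # synthetic linear curve
sub_slope = loglog_slope(steps, sqrt_growth)
lin_slope = loglog_slope(steps, linear_growth)
print(round(sub_slope, 3), round(lin_slope, 3))
```

A fitted slope only summarizes the range that was measured, so the curve would need to cover the longest episodes in the benchmark for the claim to carry.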

minor comments (1)

[Abstract] The term 'VSMGraph' is introduced without prior definition or expansion in the abstract, which may reduce immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We clarify that the abstract is intentionally concise and that supporting details, experiments, and analyses appear in the full manuscript. We address each point below and indicate planned revisions.

Referee: [Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.

Authors: The abstract summarizes results; full implementation details (VSMGraph construction, coarse-to-fine policy, reflection mechanism), baseline comparisons (against detector-centric graphs and 3D methods), error analysis (failure cases on multi-instance scenes and premature stops), and statistical evidence (consistent SR/SPL gains across object/text/image goals on GOAT-Bench and HM3D) are provided in Sections 3–5. Qualitative examples and quantitative tables demonstrate improved disambiguation and generalization. We will revise the abstract to briefly note the benchmarks and evaluation modalities for better verifiability.
revision: partial

Referee: [Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.

Authors: Section 5.3 contains an ablation isolating the reflection-driven write-back (with/without variants showing SR/SPL impact). Section 5.4 includes scaling curves of memory size versus episode length confirming sub-linear growth from the budgeted policy. No explicit numerical bound on write-back error rate is reported because it is VLM-dependent; we instead validate via end-to-end metrics across two distinct benchmarks to address environment-specific concerns. We will add a concise reference to these analyses in the revised abstract.
revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no specific free parameters, axioms, or invented entities can be extracted without access to the full methods and equations.

pith-pipeline@v0.9.1-grok · 5766 in / 1064 out tokens · 38220 ms · 2026-06-28T11:20:05.479471+00:00 · methodology

Reference graph

Works this paper leans on

60 extracted references · 5 linked inside Pith

[1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[2]

Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill

Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5228–5234. IEEE, 2024

2024
[3]

Cl-cotnav: Closed-loop hierarchical chain-of-thought for zero-shot object-goal navigation with vision-language models.arXiv preprint arXiv:2504.09000, 2025

Yuxin Cai, Xiangkun He, Maonan Wang, Hongliang Guo, Wei-Yun Yau, and Chen Lv. Cl-cotnav: Closed-loop hierarchical chain-of-thought for zero-shot object-goal navigation with vision-language models.arXiv preprint arXiv:2504.09000, 2025

arXiv 2025
[4]

Exploitingscene-specificfeaturesforobject goal navigation

TommasoCampari,PaoloEccher,LucianoSerafini,andLambertoBallan. Exploitingscene-specificfeaturesforobject goal navigation. InEuropean Conference on Computer Vision, pages 406–421. Springer, 2020

2020
[5]

Cognav: Cognitive process modeling for object goal navigation with llms

Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navigation with llms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9550–9560, 2025

2025
[6]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. InProceedings of the 34th International Conference on Neural Information Processing Systems, pages 4247–4258, 2020

2020
[7]

Neural topological slam for visual navigation

Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12875–12884, 2020

2020
[8]

Mapgpt: Map-guided prompting for unified vision-and-language navigation.CoRR, 2024

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation.CoRR, 2024

2024
[9]

Mobility vla: Multimodal instruction navigation with long-context vlms and topological graphs.CoRR, 2024

Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, et al. Mobility vla: Multimodal instruction navigation with long-context vlms and topological graphs.CoRR, 2024

2024
[10]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023

2023
[11]

Robohop: Segment-basedtopologicalmaprepresentationforopen-worldvisualnavigation

Sourav Garg, Krishan Rana, Mehdi Hosseinzadeh, Lachlan Mares, Niko Sünderhauf, Feras Dayoub, and Ian Reid. Robohop: Segment-basedtopologicalmaprepresentationforopen-worldvisualnavigation. In2024IEEEInternational Conference on Robotics and Automation (ICRA), pages 4090–4097. IEEE, 2024

2024
[12]

End-to-end navigation with vision language models: Transforming spatial reasoning into question-answering.arXiv preprint arXiv:2411.05755, 2024

Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vision language models: Transforming spatial reasoning into question-answering.arXiv preprint arXiv:2411.05755, 2024

arXiv 2024
[13]

Stairway to success: Zero-shot floor-aware object-goal navigation via llm-driven coarse-to-fine exploration.arXiv preprint arXiv:2505.23019, 2025

Zeying Gong, Rong Li, Tianshuai Hu, Ronghe Qiu, Lingdong Kong, Lingfeng Zhang, Yiyi Ding, Leying Zhang, and Junwei Liang. Stairway to success: Zero-shot floor-aware object-goal navigation via llm-driven coarse-to-fine exploration.arXiv preprint arXiv:2505.23019, 2025

arXiv 2025
[14]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024

2024
[15]

Mem4nav: Boosting vision-and- language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, and Yong Li. Mem4nav: Boosting vision-and- language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

arXiv 2025
[16]

Gamap: Zero-shot object goal navigation with multi-scale geometric-affordance guidance.Advances in Neural Information Processing Systems, 37:39386–39408, 2024

Hao Huang, Yu Hao, Congcong Wen, Anthony Tzes, Yi Fang, et al. Gamap: Zero-shot object goal navigation with multi-scale geometric-affordance guidance.Advances in Neural Information Processing Systems, 37:39386–39408, 2024. 13

2024
[17]

Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation

Xun Huang, Shĳia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, and Chenglu Wen. Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation. arXiv preprint arXiv:2511.10376, 2025

arXiv 2025
[18]

Dynavlm: Zero-shot vision-language navigation system with dynamic viewpoints and self-refining graph memory.arXiv preprint arXiv:2506.15096, 2025

Zihe Ji, Huangxuan Lin, and Yue Gao. Dynavlm: Zero-shot vision-language navigation system with dynamic viewpoints and self-refining graph memory.arXiv preprint arXiv:2506.15096, 2025

arXiv 2025
[19]

Roboexp: Action-conditioned scene graph via interactive exploration for robotic manipulation

Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, and Yunzhu Li. Roboexp: Action-conditioned scene graph via interactive exploration for robotic manipulation. In Conference on Robot Learning, pages 3027–3052. PMLR, 2025

2025
[20]

Goat-bench: A benchmark for multi- modal lifelong navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi- modal lifelong navigation. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16373–16383. IEEE Computer Society, 2024

2024
[21]

Navigating to objects specified by images

Jacob Krantz, Theophile Gervet, Karmesh Yadav, Austin Wang, Chris Paxton, Roozbeh Mottaghi, Dhruv Batra, Jitendra Malik, Stefan Lee, and Devendra Singh Chaplot. Navigating to objects specified by images. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10916–10925, 2023

2023
[22]

Demonstrating ok-robot: What really matters in integrating open-knowledge models for robotics.Robotics: Science and Systems XX, 2024

Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Shafiullah, and Lerrel Pinto. Demonstrating ok-robot: What really matters in integrating open-knowledge models for robotics.Robotics: Science and Systems XX, 2024

2024
[23]

Bird’s-eye-view scene graph for vision-language navigation

Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023

2023
[24]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.CoRR, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.CoRR, 2024

2024
[25]

Clio: Real-timetask-drivenopen-set3dscenegraphs.IEEERoboticsandAutomation Letters, 2024

Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, LukasSchmid,andLucaCarlone. Clio: Real-timetask-drivenopen-set3dscenegraphs.IEEERoboticsandAutomation Letters, 2024

2024
[26]

Zson: Zero-shotobject-goal navigation using multimodal goal embeddings.Advances in Neural Information Processing Systems, 35:32340–32352, 2022

ArjunMajumdar, GunjanAggarwal, BhavikaDevnani, JudyHoffman, andDhruvBatra. Zson: Zero-shotobject-goal navigation using multimodal goal embeddings.Advances in Neural Information Processing Systems, 35:32340–32352, 2022

2022
[27]

Wmnav: Integrating vision-language models into world models for object goal navigation.arXiv preprint arXiv:2503.02247, 2025

Dujun Nie, Xianda Guo, Yiqun Duan, Ruĳun Zhang, and Long Chen. Wmnav: Integrating vision-language models into world models for object goal navigation.arXiv preprint arXiv:2503.02247, 2025

arXiv 2025
[28]

Tango: Traversability-aware navigation with local metric control for topological goals

Stefan Podgorski, Sourav Garg, Mehdi Hosseinzadeh, Lachlan Mares, Feras Dayoub, and Ian Reid. Tango: Traversability-aware navigation with local metric control for topological goals. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2399–2406. IEEE, 2025

2025
[29]

Habitat 3.0: A co-habitat for humans, avatars, and robots

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars, and robots. In The Twelfth International Conference on Learning Representations
[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[31]

Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wĳmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benc...
[32]

Poni: Potential functions for objectgoal navigation with interaction-free learning

Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

2022
[33]

Semanticmappinginindoorembodiedai–asurveyonadvances,challenges, and future directions.arXiv, 2025

SoniaRaychaudhuriandAngelXChang. Semanticmappinginindoorembodiedai–asurveyonadvances,challenges, and future directions.arXiv, 2025. 14

2025
[34]

Mopa: Modular object navigation with pointgoal agents

Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, and Angel X Chang. Mopa: Modular object navigation with pointgoal agents. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5763–5773, 2024

2024
[35]

Enhancing multi-robot semantic navigation throughmultimodalchain-of-thoughtscorecollaboration

Zhixuan Shen, Haonan Luo, Kexun Chen, Fengmao Lv, and Tianrui Li. Enhancing multi-robot semantic navigation throughmultimodalchain-of-thoughtscorecollaboration. InProceedingsoftheAAAIConferenceonArtificialIntelligence, volume 39, pages 14664–14672, 2025

2025
[36]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 8634–8652, 2023

2023
[37]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[38]

Ovl-map: An online visual language map approach for vision-and-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2025

Shuhuan Wen, Ziyuan Zhang, Yuxiang Sun, and Zhiwen Wang. Ovl-map: An online visual language map approach for vision-and-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2025

2025
[40]

Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation

Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024

2024
[41]

A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai.arXiv preprint arXiv:2505.01458, 2025

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai.arXiv preprint arXiv:2505.01458, 2025

Pith/arXiv arXiv 2025
[42]

Voronav: voronoi-based zero-shot object navigation with large language model

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. Voronav: voronoi-based zero-shot object navigation with large language model. InProceedings of the 41st International Conference on Machine Learning, pages 53737–53775, 2024

2024
[43]

Habitat-matterport 3d semantics dataset

Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-matterport 3d semantics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4927–4936, 2023

2023
[44]

Dynamic open-vocabulary3dscenegraphsforlong-termlanguage-guidedmobilemanipulation.IEEERoboticsandAutomation Letters, 2025

Zhĳie Yan, Shufei Li, Zuoxu Wang, Lixiu Wu, Han Wang, Jun Zhu, Lĳiang Chen, and Jihong Liu. Dynamic open-vocabulary3dscenegraphsforlong-termlanguage-guidedmobilemanipulation.IEEERoboticsandAutomation Letters, 2025

2025
[45]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[46]

3d-mem: 3d scene memory for embodied exploration and reasoning

Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17294–17303, 2025

2025
[47]

Efficientroboticobjectsearchviahiem: Hierarchicalpolicylearningwithintrinsic-extrinsic modeling.IEEE robotics and automation letters, 6(3):4425–4432, 2021

XinYeandYezhouYang. Efficientroboticobjectsearchviahiem: Hierarchicalpolicylearningwithintrinsic-extrinsic modeling.IEEE robotics and automation letters, 6(3):4425–4432, 2021

2021
[48]

Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

2024
[49]

Unigoal: Towards universal zero-shot goal-oriented navigation

Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Unigoal: Towards universal zero-shot goal-oriented navigation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19057–19066, 2025

2025
[50]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024. 15

2024
[51]

L3mvn: Leveraging large language models for visual target navigation

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560. IEEE, 2023

2023
[52]

Trihelper: Zero-shot object navigation with dynamic assistance

Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. Trihelper: Zero-shot object navigation with dynamic assistance. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10035–10042. IEEE, 2024

2024
[53]

Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion.arXiv preprint arXiv:2504.14478, 2025

Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, and Boyu Zhou. Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion.arXiv preprint arXiv:2504.14478, 2025

arXiv 2025
[54]

Imagine before go: Self-supervised generative map for object goal navigation

Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, and Shuqiang Jiang. Imagine before go: Self-supervised generative map for object goal navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16414–16425, 2024

2024
[55]

Agent-pro: Learning to evolve via policy-level reflection and optimization

Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. Agent-pro: Learning to evolve via policy-level reflection and optimization. InICLR 2024 Workshop on Large Language Model (LLM) Agents

2024
[56]

Vision-and-language navigation today and tomorrow: A survey in the era of foundation models.CoRR, 2024

YueZhang,ZiqiaoMa,JialuLi,YanyuanQiao,ZunWang,JoyceChai,QiWu,MohitBansal,andParisaKordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models.CoRR, 2024

2024
[57]

Semantic policy network for zero-shot object goal visual navigation.IEEE Robotics and Automation Letters, 8(11):7655–7662, 2023

Qianfan Zhao, Lu Zhang, Bin He, and Zhiyong Liu. Semantic policy network for zero-shot object goal visual navigation.IEEE Robotics and Automation Letters, 8(11):7655–7662, 2023

2023
[58]

Esc: Exploration with soft commonsense constraints for zero-shot object navigation

KaiwenZhou,KaizhiZheng,ConnorPryor,YilinShen,HongxiaJin,LiseGetoor,andXinEricWang. Esc: Exploration with soft commonsense constraints for zero-shot object navigation. InInternational Conference on Machine Learning, pages 42829–42842. PMLR, 2023

2023
[59]

Beliefmapnav: 3d voxel-based belief map for zero-shot object navigation.arXiv preprint arXiv:2506.06487, 2025

Zibo Zhou, Yue Hu, Lingkai Zhang, Zonglin Li, and Siheng Chen. Beliefmapnav: 3d voxel-based belief map for zero-shot object navigation.arXiv preprint arXiv:2506.06487, 2025

arXiv 2025
[60]

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

Pith/arXiv arXiv 2025
[61]

Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation

Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8120–8132, 2025

2025

[1] [1]

Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[2] [2]

Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill

Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5228–5234. IEEE, 2024

2024

[3] [3]

Cl-cotnav: Closed-loop hierarchical chain-of-thought for zero-shot object-goal navigation with vision-language models. arXiv preprint arXiv:2504.09000, 2025

Yuxin Cai, Xiangkun He, Maonan Wang, Hongliang Guo, Wei-Yun Yau, and Chen Lv. Cl-cotnav: Closed-loop hierarchical chain-of-thought for zero-shot object-goal navigation with vision-language models. arXiv preprint arXiv:2504.09000, 2025

arXiv 2025

[4] [4]

Exploiting scene-specific features for object goal navigation

Tommaso Campari, Paolo Eccher, Luciano Serafini, and Lamberto Ballan. Exploiting scene-specific features for object goal navigation. In European Conference on Computer Vision, pages 406–421. Springer, 2020

2020

[5] [5]

Cognav: Cognitive process modeling for object goal navigation with llms

Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navigation with llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9550–9560, 2025

2025

[6] [6]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 4247–4258, 2020

2020

[7] [7]

Neural topological slam for visual navigation

Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12875–12884, 2020

2020

[8] [8]

Mapgpt: Map-guided prompting for unified vision-and-language navigation. CoRR, 2024

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation. CoRR, 2024

2024

[9] [9]

Mobility vla: Multimodal instruction navigation with long-context vlms and topological graphs. CoRR, 2024

Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, et al. Mobility vla: Multimodal instruction navigation with long-context vlms and topological graphs. CoRR, 2024

2024

[10] [10]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023

2023

[11] [11]

Robohop: Segment-based topological map representation for open-world visual navigation

Sourav Garg, Krishan Rana, Mehdi Hosseinzadeh, Lachlan Mares, Niko Sünderhauf, Feras Dayoub, and Ian Reid. Robohop: Segment-based topological map representation for open-world visual navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4090–4097. IEEE, 2024

2024

[12] [12]

End-to-end navigation with vision language models: Transforming spatial reasoning into question-answering. arXiv preprint arXiv:2411.05755, 2024

Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vision language models: Transforming spatial reasoning into question-answering. arXiv preprint arXiv:2411.05755, 2024

arXiv 2024

[13] [13]

Stairway to success: Zero-shot floor-aware object-goal navigation via llm-driven coarse-to-fine exploration. arXiv preprint arXiv:2505.23019, 2025

Zeying Gong, Rong Li, Tianshuai Hu, Ronghe Qiu, Lingdong Kong, Lingfeng Zhang, Yiyi Ding, Leying Zhang, and Junwei Liang. Stairway to success: Zero-shot floor-aware object-goal navigation via llm-driven coarse-to-fine exploration. arXiv preprint arXiv:2505.23019, 2025

arXiv 2025

[14] [14]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024

2024

[15] [15]

Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system. arXiv preprint arXiv:2506.19433, 2025

Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, and Yong Li. Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system. arXiv preprint arXiv:2506.19433, 2025

arXiv 2025

[16] [16]

Gamap: Zero-shot object goal navigation with multi-scale geometric-affordance guidance. Advances in Neural Information Processing Systems, 37:39386–39408, 2024

Hao Huang, Yu Hao, Congcong Wen, Anthony Tzes, Yi Fang, et al. Gamap: Zero-shot object goal navigation with multi-scale geometric-affordance guidance. Advances in Neural Information Processing Systems, 37:39386–39408, 2024

2024

[17] [17]

Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, and Chenglu Wen. Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation. arXiv preprint arXiv:2511.10376, 2025

arXiv 2025

[18] [18]

Dynavlm: Zero-shot vision-language navigation system with dynamic viewpoints and self-refining graph memory. arXiv preprint arXiv:2506.15096, 2025

Zihe Ji, Huangxuan Lin, and Yue Gao. Dynavlm: Zero-shot vision-language navigation system with dynamic viewpoints and self-refining graph memory. arXiv preprint arXiv:2506.15096, 2025

arXiv 2025

[19] [19]

Roboexp: Action-conditioned scene graph via interactive exploration for robotic manipulation

Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, and Yunzhu Li. Roboexp: Action-conditioned scene graph via interactive exploration for robotic manipulation. In Conference on Robot Learning, pages 3027–3052. PMLR, 2025

2025

[20] [20]

Goat-bench: A benchmark for multi-modal lifelong navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi-modal lifelong navigation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16373–16383. IEEE Computer Society, 2024

2024

[21] [21]

Navigating to objects specified by images

Jacob Krantz, Theophile Gervet, Karmesh Yadav, Austin Wang, Chris Paxton, Roozbeh Mottaghi, Dhruv Batra, Jitendra Malik, Stefan Lee, and Devendra Singh Chaplot. Navigating to objects specified by images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10916–10925, 2023

2023

[22] [22]

Demonstrating ok-robot: What really matters in integrating open-knowledge models for robotics. Robotics: Science and Systems XX, 2024

Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Shafiullah, and Lerrel Pinto. Demonstrating ok-robot: What really matters in integrating open-knowledge models for robotics. Robotics: Science and Systems XX, 2024

2024

[23] [23]

Bird’s-eye-view scene graph for vision-language navigation

Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023

2023

[24] [24]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. CoRR, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. CoRR, 2024

2024

[25] [25]

Clio: Real-time task-driven open-set 3d scene graphs. IEEE Robotics and Automation Letters, 2024

Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, and Luca Carlone. Clio: Real-time task-driven open-set 3d scene graphs. IEEE Robotics and Automation Letters, 2024

2024

[26] [26]

Zson: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems, 35:32340–32352, 2022

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems, 35:32340–32352, 2022

2022

[27] [27]

Wmnav: Integrating vision-language models into world models for object goal navigation. arXiv preprint arXiv:2503.02247, 2025

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, and Long Chen. Wmnav: Integrating vision-language models into world models for object goal navigation. arXiv preprint arXiv:2503.02247, 2025

arXiv 2025

[28] [28]

Tango: Traversability-aware navigation with local metric control for topological goals

Stefan Podgorski, Sourav Garg, Mehdi Hosseinzadeh, Lachlan Mares, Feras Dayoub, and Ian Reid. Tango: Traversability-aware navigation with local metric control for topological goals. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2399–2406. IEEE, 2025

2025

[29] [29]

Habitat 3.0: A co-habitat for humans, avatars, and robots

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars, and robots. In The Twelfth International Conference on Learning Representations

[30] [30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

2021

[31] [31]

Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benc...

[32] [32]

Poni: Potential functions for objectgoal navigation with interaction-free learning

Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

2022

[33] [33]

Semantic mapping in indoor embodied ai – a survey on advances, challenges, and future directions. arXiv, 2025

Sonia Raychaudhuri and Angel X Chang. Semantic mapping in indoor embodied ai – a survey on advances, challenges, and future directions. arXiv, 2025

2025

[34] [34]

Mopa: Modular object navigation with pointgoal agents

Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, and Angel X Chang. Mopa: Modular object navigation with pointgoal agents. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5763–5773, 2024

2024

[35] [35]

Enhancing multi-robot semantic navigation through multimodal chain-of-thought score collaboration

Zhixuan Shen, Haonan Luo, Kexun Chen, Fengmao Lv, and Tianrui Li. Enhancing multi-robot semantic navigation through multimodal chain-of-thought score collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14664–14672, 2025

2025

[36] [36]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 8634–8652, 2023

2023

[37] [37]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[38] [38]

Ovl-map: An online visual language map approach for vision-and-language navigation in continuous environments. IEEE Robotics and Automation Letters, 2025

Shuhuan Wen, Ziyuan Zhang, Yuxiang Sun, and Zhiwen Wang. Ovl-map: An online visual language map approach for vision-and-language navigation in continuous environments. IEEE Robotics and Automation Letters, 2025

2025

[39] [40]

Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation

Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024

2024

[40] [41]

A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai. arXiv preprint arXiv:2505.01458, 2025

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai. arXiv preprint arXiv:2505.01458, 2025

Pith/arXiv arXiv 2025

[41] [42]

Voronav: voronoi-based zero-shot object navigation with large language model

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. Voronav: voronoi-based zero-shot object navigation with large language model. In Proceedings of the 41st International Conference on Machine Learning, pages 53737–53775, 2024

2024

[42] [43]

Habitat-matterport 3d semantics dataset

Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-matterport 3d semantics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4927–4936, 2023

2023

[43] [44]

Dynamic open-vocabulary 3d scene graphs for long-term language-guided mobile manipulation. IEEE Robotics and Automation Letters, 2025

Zhijie Yan, Shufei Li, Zuoxu Wang, Lixiu Wu, Han Wang, Jun Zhu, Lijiang Chen, and Jihong Liu. Dynamic open-vocabulary 3d scene graphs for long-term language-guided mobile manipulation. IEEE Robotics and Automation Letters, 2025

2025

[44] [45]

Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[45] [46]

3d-mem: 3d scene memory for embodied exploration and reasoning

Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17294–17303, 2025

2025

[46] [47]

Efficient robotic object search via hiem: Hierarchical policy learning with intrinsic-extrinsic modeling. IEEE robotics and automation letters, 6(3):4425–4432, 2021

Xin Ye and Yezhou Yang. Efficient robotic object search via hiem: Hierarchical policy learning with intrinsic-extrinsic modeling. IEEE robotics and automation letters, 6(3):4425–4432, 2021

2021

[47] [48]

Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. Advances in neural information processing systems, 37:5285–5307, 2024

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. Advances in neural information processing systems, 37:5285–5307, 2024

2024

[48] [49]

Unigoal: Towards universal zero-shot goal-oriented navigation

Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Unigoal: Towards universal zero-shot goal-oriented navigation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19057–19066, 2025

2025

[49] [50]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024

2024

[50] [51]

L3mvn: Leveraging large language models for visual target navigation

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560. IEEE, 2023

2023

[51] [52]

Trihelper: Zero-shot object navigation with dynamic assistance

Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. Trihelper: Zero-shot object navigation with dynamic assistance. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10035–10042. IEEE, 2024

2024

[52] [53]

Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. arXiv preprint arXiv:2504.14478, 2025

Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, and Boyu Zhou. Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. arXiv preprint arXiv:2504.14478, 2025

arXiv 2025
