
arxiv: 2606.03509 · v1 · pith:HEETN5GBnew · submitted 2026-06-02 · 💻 cs.CV

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Pith reviewed 2026-06-28 11:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords embodied navigation · zero-shot navigation · memory graph · visual semantic memory · fine-grained memory · self-evolving memory · coarse-to-fine policy

The pith

EvoMemNav maintains raw views in a hierarchical memory graph and updates it through reflection to improve zero-shot embodied navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoMemNav to address limitations in memory construction for long-horizon embodied navigation tasks. Existing approaches either lose fine-grained details by compressing into sparse nodes or incur high computational costs with 3D reconstructions. EvoMemNav keeps raw views as primary memory elements organized into a room-view-object hierarchy using semantic cues and topology. It uses a budgeted coarse-to-fine policy to efficiently search this memory and applies reflection after subtasks to write back updated priors. Tests across multiple modalities on GOAT-Bench and HM3D demonstrate gains in success metrics and better handling of complex scenarios.

Core claim

EvoMemNav builds a Visual-Semantic Memory Graph that stores raw views with semantic and topological organization, applies a budgeted coarse-to-fine querying policy, and performs reflection-driven write-back to accumulate environmental knowledge, resulting in consistent improvements in success rate and SPL for zero-shot navigation without retraining.

What carries the argument

The Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, combined with the budgeted coarse-to-fine policy and reflection-driven write-back.
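
To make the room-view-object hierarchy concrete, here is a minimal sketch of one way such a graph could be laid out. It is not the paper's implementation: the class names (`MemoryGraph`, `RoomNode`, `ViewNode`, `ObjectNode`), the `image_path` field, and the use of plain string labels as "lightweight semantic cues" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str      # lightweight semantic cue (assumed: a plain text label)
    view_id: int    # the view in which this object was observed


@dataclass
class ViewNode:
    view_id: int
    image_path: str  # pointer to the raw view, stored uncompressed (assumed field)
    objects: list[ObjectNode] = field(default_factory=list)
    neighbors: set[int] = field(default_factory=set)  # topological links to other views


@dataclass
class RoomNode:
    name: str
    views: dict[int, ViewNode] = field(default_factory=dict)


class MemoryGraph:
    """Hypothetical room -> view -> object hierarchy."""

    def __init__(self) -> None:
        self.rooms: dict[str, RoomNode] = {}
        self._room_of_view: dict[int, str] = {}

    def add_view(self, room: str, view_id: int, image_path: str, labels: list[str]) -> ViewNode:
        room_node = self.rooms.setdefault(room, RoomNode(room))
        view = ViewNode(view_id, image_path)
        view.objects = [ObjectNode(label, view_id) for label in labels]
        room_node.views[view_id] = view
        self._room_of_view[view_id] = room
        return view

    def get_view(self, view_id: int) -> ViewNode:
        return self.rooms[self._room_of_view[view_id]].views[view_id]

    def connect(self, a: int, b: int) -> None:
        # Undirected topological edge between two views.
        self.get_view(a).neighbors.add(b)
        self.get_view(b).neighbors.add(a)

    def views_with(self, label: str) -> list[int]:
        # All view ids, in ascending order, that contain an object with this label.
        return sorted(
            v.view_id
            for r in self.rooms.values()
            for v in r.views.values()
            if any(o.label == label for o in v.objects)
        )


graph = MemoryGraph()
graph.add_view("kitchen", 0, "views/000.png", ["mug", "sink"])
graph.add_view("kitchen", 1, "views/001.png", ["chair"])
graph.add_view("bedroom", 2, "views/002.png", ["chair", "bed"])
graph.connect(1, 2)
```

Because the raw view is kept alongside its labels, two "chair" hits in different rooms stay distinguishable by their source image, which is the kind of fine-grained evidence the summary says sparse detector-centric nodes discard.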

If this is right

  • Improved ability to disambiguate between multiple instances of similar objects.
  • Fewer premature stops during navigation due to better verification.
  • Enhanced zero-shot generalization to unseen environments.
  • Scalable memory management that avoids excessive computational costs as the environment is explored.
  • Accumulation of knowledge through priors that refine future navigation decisions.
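
The last point, priors that accumulate and refine later decisions, can be illustrated with a deliberately simple stand-in. The sketch below replaces the paper's reflection step (which the abstract does not specify in detail) with Laplace-smoothed success counts per room and goal category; the class `RoomPriors` and its methods are hypothetical names, and no model weights are touched, which mirrors the "without retraining" claim.

```python
class RoomPriors:
    """Count-based priors attached to (room, goal) pairs.

    Assumption: a smoothed success frequency stands in for whatever
    reflection output the actual system writes back.
    """

    def __init__(self) -> None:
        self._hits: dict[tuple[str, str], int] = {}
        self._trials: dict[tuple[str, str], int] = {}

    def write_back(self, room: str, goal: str, found: bool) -> None:
        # Called once after each subtask with its outcome.
        key = (room, goal)
        self._trials[key] = self._trials.get(key, 0) + 1
        if found:
            self._hits[key] = self._hits.get(key, 0) + 1

    def score(self, room: str, goal: str) -> float:
        # Laplace smoothing: an unseen pair scores 0.5 rather than 0 or 1.
        key = (room, goal)
        return (self._hits.get(key, 0) + 1) / (self._trials.get(key, 0) + 2)


priors = RoomPriors()
unseen = priors.score("kitchen", "mug")
priors.write_back("kitchen", "mug", found=True)
priors.write_back("bedroom", "mug", found=False)
after_success = priors.score("kitchen", "mug")
after_failure = priors.score("bedroom", "mug")
```

A later search for the same goal category could rank rooms by `score` before spending any verification budget, so earlier subtasks influence later ones through the memory alone.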

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This memory approach might be adaptable to other long-horizon tasks in robotics beyond navigation.
  • The reflection mechanism could enable handling of dynamic changes in the environment over extended periods.
  • Combining this with different vision-language models might yield further performance boosts in verification stages.

Load-bearing premise

The budgeted coarse-to-fine policy with reflection-driven write-back scales memory efficiently and produces performance gains without introducing errors or needing retraining.

What would settle it

Running the system on GOAT-Bench without the reflection write-back component and observing whether the reported gains in SR and SPL disappear or reverse.
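
For readers who want to run that comparison, the two metrics can be computed as follows. SR is the fraction of successful episodes, and SPL uses the standard definition, the mean over episodes of success times shortest-path length divided by the larger of the agent's path length and the shortest-path length. The `Episode` container and the numbers below are toy values chosen for illustration, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    shortest_path: float  # geodesic distance from start to goal
    agent_path: float     # distance the agent actually travelled


def success_rate(episodes: list[Episode]) -> float:
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: list[Episode]) -> float:
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


# Toy episode logs for two hypothetical runs of an ablation.
run_a = [Episode(True, 5.0, 5.0), Episode(True, 4.0, 8.0), Episode(False, 6.0, 10.0)]
run_b = [Episode(True, 5.0, 10.0), Episode(False, 4.0, 8.0), Episode(False, 6.0, 10.0)]

sr_a, spl_a = success_rate(run_a), spl(run_a)
sr_b, spl_b = success_rate(run_b), spl(run_b)
```

Applying both functions to episode logs from the full system and from the variant without write-back gives the two pairs of numbers the test above calls for.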

Original abstract

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.
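
The budgeted coarse-to-fine policy described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' algorithm: the coarse stage here ranks rooms by a cheap label-match count, the fine stage calls an expensive `verify` function (a stand-in for a VLM query) only on candidate views in the top-ranked rooms, and `top_rooms` and `budget` are hypothetical parameters.

```python
from typing import Callable, Optional

# memory maps a room name to a list of (view_id, labels) pairs.
Memory = dict[str, list[tuple[int, set[str]]]]


def coarse_to_fine(
    memory: Memory,
    goal: str,
    verify: Callable[[int, str], bool],
    top_rooms: int = 2,
    budget: int = 3,
) -> tuple[Optional[int], int]:
    """Return (verified view id or None, number of verify calls used)."""
    # Coarse stage: cheap score per room, no expensive model call.
    room_scores = {
        room: sum(goal in labels for _, labels in views)
        for room, views in memory.items()
    }
    ranked = sorted(room_scores, key=lambda r: room_scores[r], reverse=True)[:top_rooms]

    # Fine stage: spend the verification budget only inside promising rooms.
    calls = 0
    for room in ranked:
        for view_id, labels in memory[room]:
            if goal not in labels:
                continue
            if calls >= budget:
                return None, calls
            calls += 1
            if verify(view_id, goal):
                return view_id, calls
    return None, calls


memory: Memory = {
    "kitchen": [(0, {"mug", "sink"}), (1, {"mug"})],
    "bedroom": [(2, {"bed"})],
    "office": [(3, {"mug", "desk"})],
}


def fake_vlm(view_id: int, goal: str) -> bool:
    # Stand-in verifier: pretends only view 1 truly shows the goal.
    return view_id == 1


found, used = coarse_to_fine(memory, "mug", fake_vlm)
found_tight, used_tight = coarse_to_fine(memory, "mug", fake_vlm, budget=1)
```

The cap on `verify` calls is what keeps the cost of a query bounded as the memory grows: the coarse stage touches every room, but only with a cheap operation.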

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EvoMemNav, a framework for zero-shot embodied navigation that builds a Visual-Semantic Memory Graph (VSMGraph) preserving raw views in a hierarchical structure, employs a budgeted coarse-to-fine retrieval policy, and uses reflection-driven write-back to evolve memory priors without retraining. It reports consistent improvements in success rate (SR) and success weighted by path length (SPL) on GOAT-Bench and HM3D benchmarks across object, text, and image goal modalities, attributing gains to better disambiguation and fewer premature stops.

Significance. If the experimental results hold under rigorous validation, the approach offers a promising balance between memory efficiency and preservation of fine-grained visual details for long-horizon planning in embodied agents. The self-evolving aspect without requiring retraining could have broad applicability in zero-shot settings, provided the scaling properties are confirmed.

major comments (2)
  1. [Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.
  2. [Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.
minor comments (1)
  1. [Abstract] The term 'VSMGraph' is introduced without prior definition or expansion in the abstract, which may reduce immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We clarify that the abstract is intentionally concise and that supporting details, experiments, and analyses appear in the full manuscript. We address each point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.

    Authors: The abstract summarizes results; full implementation details (VSMGraph construction, coarse-to-fine policy, reflection mechanism), baseline comparisons (against detector-centric graphs and 3D methods), error analysis (failure cases on multi-instance scenes and premature stops), and statistical evidence (consistent SR/SPL gains across object/text/image goals on GOAT-Bench and HM3D) are provided in Sections 3–5. Qualitative examples and quantitative tables demonstrate improved disambiguation and generalization. We will revise the abstract to briefly note the benchmarks and evaluation modalities for better verifiability.

    revision: partial

  2. Referee: [Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.

    Authors: Section 5.3 contains an ablation isolating the reflection-driven write-back (with/without variants showing SR/SPL impact). Section 5.4 includes scaling curves of memory size versus episode length confirming sub-linear growth from the budgeted policy. No explicit numerical bound on write-back error rate is reported because it is VLM-dependent; we instead validate via end-to-end metrics across two distinct benchmarks to address environment-specific concerns. We will add a concise reference to these analyses in the revised abstract.

    revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no specific free parameters, axioms, or invented entities can be extracted without access to the full methods and equations.

pith-pipeline@v0.9.1-grok · 5766 in / 1064 out tokens · 38220 ms · 2026-06-28T11:20:05.479471+00:00 · methodology

