EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation
Pith reviewed 2026-06-28 11:20 UTC · model grok-4.3
The pith
EvoMemNav maintains raw views in a hierarchical memory graph and updates it through reflection to improve zero-shot embodied navigation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoMemNav builds a Visual-Semantic Memory Graph that stores raw views with semantic and topological organization, queries it with a budgeted coarse-to-fine policy, and uses reflection-driven write-back to accumulate environmental knowledge. The paper reports that this yields consistent improvements in success rate (SR) and SPL for zero-shot navigation without retraining.
What carries the argument
The Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, combined with the budgeted coarse-to-fine policy and reflection-driven write-back.
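To make the room-view-object hierarchy concrete, here is a minimal sketch of how such a graph might be laid out. Everything in it is an assumption for illustration: the class names (`MemoryGraph`, `RoomNode`, `ViewNode`, `ObjectNode`), the fields, and the idea of storing a file path per raw view are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Lightweight semantic cue attached to a view (hypothetical fields)."""
    label: str
    confidence: float


@dataclass
class ViewNode:
    """A raw view kept as first-class memory, plus cheap cues for lookup."""
    view_id: int
    image_path: str  # assumed storage scheme: a reference to the raw frame
    objects: list[ObjectNode] = field(default_factory=list)
    neighbors: set[int] = field(default_factory=set)  # topological links


@dataclass
class RoomNode:
    """Top level of the room-view-object hierarchy."""
    name: str
    views: dict[int, ViewNode] = field(default_factory=dict)


class MemoryGraph:
    def __init__(self) -> None:
        self.rooms: dict[str, RoomNode] = {}

    def add_view(self, room: str, view: ViewNode) -> None:
        """Attach a view to a room, creating the room on first use."""
        self.rooms.setdefault(room, RoomNode(room)).views[view.view_id] = view

    def _find(self, view_id: int) -> ViewNode:
        for room in self.rooms.values():
            if view_id in room.views:
                return room.views[view_id]
        raise KeyError(view_id)

    def connect(self, a: int, b: int) -> None:
        """Record an undirected topological edge between two views."""
        self._find(a).neighbors.add(b)
        self._find(b).neighbors.add(a)

    def views_with(self, label: str) -> list[tuple[str, int]]:
        """Return (room, view_id) pairs whose views contain the label."""
        return [
            (room.name, view.view_id)
            for room in self.rooms.values()
            for view in room.views.values()
            if any(obj.label == label for obj in view.objects)
        ]


graph = MemoryGraph()
graph.add_view("kitchen", ViewNode(0, "views/0.png", [ObjectNode("chair", 0.9)]))
graph.add_view("bedroom", ViewNode(1, "views/1.png", [ObjectNode("chair", 0.8)]))
graph.connect(0, 1)
print(graph.views_with("chair"))  # [('kitchen', 0), ('bedroom', 1)]
```

Because both chair instances stay linked to their raw views, a later verification step could inspect the stored frames rather than a compressed detector node, which is the property the summary credits for multi-instance disambiguation.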
If this is right
- Improved ability to disambiguate between multiple instances of similar objects.
- Fewer premature stops during navigation due to better verification.
- Enhanced zero-shot generalization to unseen environments.
- Scalable memory management that avoids excessive computational costs as the environment is explored.
- Accumulation of knowledge through priors that refine future navigation decisions.
Where Pith is reading between the lines
- This memory approach might be adaptable to other long-horizon tasks in robotics beyond navigation.
- The reflection mechanism could enable handling of dynamic changes in the environment over extended periods.
- Combining this with different vision-language models might yield further performance boosts in verification stages.
Load-bearing premise
The budgeted coarse-to-fine policy with reflection-driven write-back scales memory efficiently and produces performance gains without introducing errors or needing retraining.
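The budget part of this premise can be illustrated with a small sketch: a cheap score shortlists candidates, and an expensive verifier (a stand-in for a VLM call) runs at most `budget` times. The function name `coarse_to_fine`, its parameters, and the toy scores are assumptions made for this example, not the paper's interface.

```python
from typing import Callable, Optional, Sequence


def coarse_to_fine(
    candidates: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_verify: Callable[[str], bool],
    top_k: int,
    budget: int,
) -> tuple[Optional[str], int]:
    """Return (first verified candidate or None, number of verifier calls)."""
    # Coarse stage: rank every candidate with the cheap score, keep top_k.
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:top_k]
    calls = 0
    # Fine stage: spend at most `budget` expensive calls on the shortlist.
    for candidate in shortlist:
        if calls >= budget:
            break
        calls += 1
        if expensive_verify(candidate):
            return candidate, calls
    return None, calls


# Toy data: hypothetical region scores and a verifier that accepts one region.
scores = {"hall": 0.1, "kitchen": 0.9, "bedroom": 0.6, "bath": 0.3}
found, used = coarse_to_fine(
    list(scores), scores.__getitem__, lambda r: r == "bedroom", top_k=3, budget=2
)
print(found, used)  # bedroom 2
```

The number of expensive calls is bounded by `min(top_k, budget)` no matter how many candidates the memory holds, which is the sense in which a budgeted policy keeps verification cost from growing with memory size.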
What would settle it
Running the system on GOAT-Bench without the reflection write-back component and observing whether the reported gains in SR and SPL disappear or reverse.
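Such a comparison comes down to computing SR and SPL for each variant. The sketch below uses the commonly used definition of SPL (per-episode success multiplied by the ratio of shortest-path length to the larger of agent-path and shortest-path length, averaged over episodes). The `Episode` record and the numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    shortest_path: float  # geodesic start-to-goal distance, assumed > 0
    agent_path: float     # distance the agent actually travelled


def success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: list[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Synthetic episodes for a hypothetical run of one variant.
episodes = [
    Episode(True, 5.0, 10.0),   # success, path twice the optimum
    Episode(True, 4.0, 4.0),    # success, optimal path
    Episode(False, 3.0, 9.0),   # failure
    Episode(False, 2.0, 2.0),   # failure
]
print(success_rate(episodes), spl(episodes))  # 0.5 0.375
```

Running the same two functions over episodes from the full system and from the variant without write-back would give the pair of numbers this test asks for.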
Original abstract
Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.
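The abstract does not say what form the graph-attached priors take, so the following is only a sketch of one possible write-back scheme: success and failure counts per (node, goal) pair, read back as a Laplace-smoothed estimate. The class `PriorStore`, its methods, and the counting scheme are all assumptions for illustration and should not be read as the paper's mechanism.

```python
from collections import defaultdict


class PriorStore:
    """Per-(node, goal) outcome counts; a stand-in for graph-attached priors."""

    def __init__(self) -> None:
        # Maps (node, goal) to [successes, failures].
        self._counts = defaultdict(lambda: [0, 0])

    def write_back(self, node: str, goal: str, success: bool) -> None:
        """Reflection step: record the outcome of one finished subtask."""
        self._counts[(node, goal)][0 if success else 1] += 1

    def prior(self, node: str, goal: str) -> float:
        """Laplace-smoothed estimate that `goal` is found at `node`."""
        successes, failures = self._counts[(node, goal)]
        return (successes + 1) / (successes + failures + 2)


store = PriorStore()
store.write_back("kitchen", "mug", success=True)
store.write_back("kitchen", "mug", success=True)
store.write_back("bedroom", "mug", success=False)
print(store.prior("kitchen", "mug"))  # 0.75
```

No model weights change here: later queries simply read an updated number, which is the sense in which write-back can refine decisions without retraining.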
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EvoMemNav, a framework for zero-shot embodied navigation that builds a Visual-Semantic Memory Graph (VSMGraph) preserving raw views in a hierarchical structure, employs a budgeted coarse-to-fine retrieval policy, and uses reflection-driven write-back to evolve memory priors without retraining. It reports consistent improvements in success rate (SR) and success weighted by path length (SPL) on GOAT-Bench and HM3D benchmarks across object, text, and image goal modalities, attributing gains to better disambiguation and fewer premature stops.
Significance. If the experimental results hold under rigorous validation, the approach offers a promising balance between memory efficiency and preservation of fine-grained visual details for long-horizon planning in embodied agents. The self-evolving aspect without requiring retraining could have broad applicability in zero-shot settings, provided the scaling properties are confirmed.
major comments (2)
- [Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.
- [Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.
minor comments (1)
- [Abstract] The term 'VSMGraph' is introduced without prior definition or expansion in the abstract, which may reduce immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We clarify that the abstract is intentionally concise and that supporting details, experiments, and analyses appear in the full manuscript. We address each point below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] The abstract claims consistent gains in SR/SPL but provides no implementation details, baseline comparisons, error analysis, or statistical evidence. This prevents verification that the data supports the claims of better multi-instance disambiguation and stronger zero-shot generalization.
  Authors: The abstract summarizes results; full implementation details (VSMGraph construction, coarse-to-fine policy, reflection mechanism), baseline comparisons (against detector-centric graphs and 3D methods), error analysis (failure cases on multi-instance scenes and premature stops), and statistical evidence (consistent SR/SPL gains across object/text/image goals on GOAT-Bench and HM3D) are provided in Sections 3–5. Qualitative examples and quantitative tables demonstrate improved disambiguation and generalization. We will revise the abstract to briefly note the benchmarks and evaluation modalities for better verifiability.
  Revision: partial
- Referee: [Abstract] The description of the reflection-driven write-back and budgeted coarse-to-fine policy lacks any quantitative bound on write-back error rate, ablation study isolating the reflection step, or scaling curve versus episode length, which are necessary to substantiate that memory growth remains sub-linear and performance improvements are not artifacts of specific test environments.
  Authors: Section 5.3 contains an ablation isolating the reflection-driven write-back (with/without variants showing SR/SPL impact). Section 5.4 includes scaling curves of memory size versus episode length confirming sub-linear growth from the budgeted policy. No explicit numerical bound on write-back error rate is reported because it is VLM-dependent; we instead validate via end-to-end metrics across two distinct benchmarks to address environment-specific concerns. We will add a concise reference to these analyses in the revised abstract.
  Revision: partial
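One simple way to check a sub-linear growth claim from a memory-size-versus-episode-length curve is to fit a line in log-log space: a slope below 1 suggests sub-linear growth over the measured range. The sketch below is a generic least-squares fit, not the analysis from the paper, and the data points are synthetic.

```python
import math


def growth_exponent(steps: list[float], sizes: list[float]) -> float:
    """Least-squares slope of log(size) against log(steps)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(m) for m in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic curve where memory size grows like the square root of steps.
steps = [100.0, 400.0, 1600.0]
sizes = [10.0, 20.0, 40.0]
exponent = growth_exponent(steps, sizes)
print(round(exponent, 3))  # 0.5
```

A fit like this only describes the range that was measured; it says nothing about growth beyond the longest episode in the data.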