MVP-Nav: Multi-layer Value Map Planner Navigator

Bayram Bayramli; Guodong Zhang; Hongtao Lu; Qiuchang Li; Shaokai Wu; Wenyuan Xie; Xunchu Zhou; Yanbiao Ji; Yijin Zhou; Yue Ding

arxiv: 2606.31919 · v1 · pith:SDXXAAOWnew · submitted 2026-06-30 · 💻 cs.RO · cs.AI· cs.CV

MVP-Nav: Multi-layer Value Map Planner Navigator

Wenyuan Xie , Shaokai Wu , Yijin Zhou , Yanbiao Ji , Guodong Zhang , Bayram Bayramli , Qiuchang Li , Xunchu Zhou

show 2 more authors

Yue Ding Hongtao Lu

This is my paper

Pith reviewed 2026-07-01 05:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CV

keywords zero-shot object navigationRGB-only perceptionmulti-layer value map3D foundation modelsphysical occupancygeometric planningembodied agents

0 comments

The pith

MVP-Nav reconstructs 3D geometry from RGB images alone to enable safe zero-shot object navigation without depth sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that RGB-only agents can achieve reliable zero-shot object goal navigation by using 3D foundation models to convert 2D semantic detections into explicit 3D oriented bounding boxes and occupancy. These reconstructions feed a Multi-layer Value Map that places semantic priorities and physical geometry into one cost space for planning and control. The result is navigation that respects real-world spatial constraints rather than producing semantically plausible but physically unsafe paths. A sympathetic reader would care because the work shows how structured physical priors can replace active depth hardware while still grounding decisions in measurable 3D structure.

Core claim

MVP-Nav reconstructs explicit physical occupancy from monocular observations by leveraging 3D foundation models to project 2D semantic instances into 3D oriented bounding boxes, forming a global spatial semantic representation. To unify high-level semantic reasoning and low-level physical constraints, it introduces a Multi-layer Value Map that integrates semantic priorities and reconstructed geometry into a shared cost space, enabling physically grounded geometric planning. Experiments on zero-shot object navigation benchmarks show it significantly outperforms existing depth-free methods.

What carries the argument

The Multi-layer Value Map, which places semantic priorities and reconstructed 3D geometry into a single shared cost space for unified planning.

If this is right

Navigation policies become physically constrained without requiring depth sensors at runtime.
Semantic reasoning and geometric constraints operate inside the same planning representation.
State-of-the-art results on zero-shot object goal navigation benchmarks become attainable from RGB alone.
Structured physical priors can substitute for active depth hardware in embodied settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection-plus-value-map pattern could transfer to other monocular tasks such as manipulation or exploration.
Accuracy gains in future 3D foundation models would directly raise the reliability of the resulting maps.
The approach reconnects high-level vision-language reasoning with classical geometric planning techniques.
Real-world robot trials would expose any remaining gaps between model-generated boxes and actual surfaces.

Load-bearing premise

3D foundation models can reliably project 2D semantic instances into accurate 3D oriented bounding boxes from monocular observations to form a trustworthy global spatial semantic representation without major physical errors.

What would settle it

Controlled tests in which the 3D bounding boxes contain frequent geometric errors that produce collisions or failed paths would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.31919 by Bayram Bayramli, Guodong Zhang, Hongtao Lu, Qiuchang Li, Shaokai Wu, Wenyuan Xie, Xunchu Zhou, Yanbiao Ji, Yijin Zhou, Yue Ding.

**Figure 1.** Figure 1: The overview of our navigation system. ing, they often fail to accurately perceive physical obstacles, resulting in navigation that is semantically reasonable yet physically unsafe. Collectively, these approaches expose a fundamental dilemma: existing solutions either decouple physical perception from high-level planning or sacrifice geometric constraints at the level of low-level control. As a result, cu… view at source ↗

**Figure 2.** Figure 2: Overview of the MVP-Nav framework. Our system employs a recursive architecture that first transforms monocular RGB sequences into a 3D Global Spatial Semantic List (GSSL) using foundation models. A Vision-Language Model (VLM) then reasons over this list to assign semantic scores and determine the navigation mode. These insights are integrated into a Multi-layer Value Map (MVM) to identify an optimal midter… view at source ↗

**Figure 3.** Figure 3: An example of a part of GSSL. D. VLM-based High-level Reasoning The VLM reasoning module serves as the cognitive core, taking the GSSL and navigation goal G as inputs to output semantic importance scores αi for each entity and a navigation mode to guide spatial planning. To evaluate the exploration value, the Vision-Language Model (VLM, e.g., GPT-4o) leverages common-sense priors to assess the relevance o… view at source ↗

**Figure 4.** Figure 4: Construction of the Multi-layer Value Map (MVM). (a) The framework reconstructs 3D oriented bounding boxes (OBBs) from [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of the local goal gst generation. (a) The A ∗ path from start position to the midterm goal. (b) The silding window of current position. (c) The intersection of sliding window and the A ∗ path, which is the projected goal. (d) The System generates a shortterm goal (STG) in the local occupancy map with FMM Planner. IV. EXPERIMENTS A. Experimental Setup Benchmarks: We evaluate the performance o… view at source ↗

**Figure 6.** Figure 6: Examples of successful navigation episodes in the HM3D dataset. The trajectories illustrate how [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Real-world experimental results. We evaluated [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Real-world deployment on a wheeled robot platform. We [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Real-world deployment in another scenario. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

read the original abstract

Zero-shot Object Goal Navigation (ZSON) with RGB-only perception poses a fundamental challenge for embodied agents, as the absence of explicit depth information introduces severe physical uncertainty and semantic-physical misalignment. Existing approaches either rely on high-level semantic reasoning without geometric grounding or learn end-to-end policies that lack explicit physical constraints, often resulting in semantically plausible but physically unsafe behaviors. In this paper, we propose MVP-Nav, a physical-aware RGB-only navigation framework that aligns perception, planning, and control with the real 3D world. MVP-Nav reconstructs explicit physical occupancy from monocular observations by leveraging 3D foundation models to project 2D semantic instances into 3D oriented bounding boxes, forming a global spatial semantic representation. To unify high-level semantic reasoning and low-level physical constraints, we introduce a Multi-layer Value Map (MVM) that integrates semantic priorities and reconstructed geometry into a shared cost space, enabling physically grounded geometric planning. Extensive experiments on zero-shot object navigation benchmarks demonstrate that MVP-Nav significantly outperforms existing depth-free methods, achieving state-of-the-art performance and validating that structured physical priors can effectively compensate for the absence of active depth sensors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MVP-Nav sketches an RGB-only navigation pipeline with 3D foundation models for occupancy and a multi-layer value map, but the abstract supplies zero numbers or validation so the claims cannot be checked.

read the letter

The one thing to know is that this paper describes MVP-Nav as a way to do zero-shot object goal navigation from RGB alone by turning monocular images into explicit 3D occupancy via foundation-model projections of 2D instances into oriented bounding boxes, then planning over a Multi-layer Value Map that folds semantic priorities and reconstructed geometry into one cost space.

The framing is clear on the problem: existing depth-free methods either stay too high-level on semantics or produce unsafe behaviors without physical constraints. The proposed split between reconstruction and layered planning is a reasonable structure for adding geometric grounding without depth sensors.

The soft spots are large. The abstract states that the system significantly outperforms prior depth-free methods and reaches SOTA, yet it contains no tables, no metrics, no ablation results, and no error analysis. The load-bearing step is the monocular 3D projection itself; the stress-test note is correct that scale, depth, and orientation ambiguities remain, and nothing in the abstract shows any check against ground-truth depth or scans. Without that verification, the claim that structured priors compensate for missing depth stays untested. The full manuscript might contain the missing experiments, but they are not visible here.

This work targets researchers in embodied navigation who want explicit maps instead of pure end-to-end policies. A reader could borrow the MVM cost-space idea for their own planner, but the lack of evidence makes it hard to judge whether the approach delivers on the physical-safety promise.

I would not bring the paper to a reading group. I would not cite it. It does not deserve peer review until the experimental section and reconstruction validation are supplied.

Referee Report

2 major / 1 minor

Summary. The paper proposes MVP-Nav, a physical-aware RGB-only framework for zero-shot object goal navigation. It reconstructs explicit physical occupancy from monocular RGB observations by using 3D foundation models to project 2D semantic instances into 3D oriented bounding boxes, forming a global spatial semantic representation. A Multi-layer Value Map (MVM) is introduced to integrate semantic priorities and reconstructed geometry into a shared cost space for physically grounded planning. The manuscript claims that MVP-Nav significantly outperforms existing depth-free methods on ZSON benchmarks and validates that structured physical priors can compensate for the absence of active depth sensors.

Significance. If the reconstruction step holds, the work would show that explicit geometric priors derived from monocular observations can enable safe navigation without depth sensors, with potential impact on cost-effective embodied systems. The MVM concept provides a concrete mechanism for unifying semantic reasoning and physical constraints. Credit is due for the explicit attempt to ground planning in reconstructed 3D structure rather than end-to-end learning alone.

major comments (2)

[Abstract] Abstract: The central empirical claim that MVP-Nav 'significantly outperforms existing depth-free methods, achieving state-of-the-art performance' is asserted without any quantitative benchmark scores, tables, error analysis, or ablation results, preventing evaluation of the outperformance assertion.
[Method] Method section (reconstruction pipeline): The claim that 3D foundation models produce sufficiently accurate 3D oriented bounding boxes and occupancy from single RGB frames to form a 'trustworthy global spatial semantic representation' is load-bearing for the physical-compensation thesis, yet no quantitative fidelity metrics (e.g., IoU against ground-truth depth or 3D scans), error propagation analysis, or handling of monocular scale/orientation ambiguities are reported.

minor comments (1)

[Introduction] The acronym MVM is defined but its relationship to existing layered cost-map or value-map representations in the navigation literature is not discussed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the presentation of results and validation of the reconstruction pipeline.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim that MVP-Nav 'significantly outperforms existing depth-free methods, achieving state-of-the-art performance' is asserted without any quantitative benchmark scores, tables, error analysis, or ablation results, preventing evaluation of the outperformance assertion.

Authors: We agree that the abstract would benefit from explicit quantitative support. The experiments section of the manuscript contains the relevant tables, success rates, and comparisons; to address the concern directly, we will revise the abstract to include key numerical results (e.g., success rate deltas on the primary ZSON benchmarks) while preserving its concise nature. revision: yes
Referee: [Method] Method section (reconstruction pipeline): The claim that 3D foundation models produce sufficiently accurate 3D oriented bounding boxes and occupancy from single RGB frames to form a 'trustworthy global spatial semantic representation' is load-bearing for the physical-compensation thesis, yet no quantitative fidelity metrics (e.g., IoU against ground-truth depth or 3D scans), error propagation analysis, or handling of monocular scale/orientation ambiguities are reported.

Authors: The referee is correct that explicit fidelity metrics would strengthen the physical-compensation argument. The current manuscript emphasizes end-to-end navigation outcomes rather than isolated reconstruction benchmarks; we will add a dedicated analysis (main text or appendix) reporting IoU and scale-error statistics against available ground-truth depth/scan data from the evaluation environments, along with a description of how multi-view fusion and semantic consistency in the MVM mitigate monocular ambiguities. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents MVP-Nav as a framework that reconstructs 3D occupancy via 3D foundation models projecting 2D instances to oriented bounding boxes and then builds a Multi-layer Value Map integrating semantics and geometry. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to its own inputs by construction. The derivation chain consists of system design choices justified by external foundation models and empirical benchmarks rather than self-referential definitions or renamed fits. The accuracy of monocular 3D projections is an external assumption subject to verification, not a circular step internal to the paper's logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is based solely on the abstract; full details on parameters and assumptions are unavailable.

axioms (1)

domain assumption 3D foundation models can accurately project 2D semantic instances into reliable 3D oriented bounding boxes from monocular RGB observations
Directly invoked in the abstract to reconstruct physical occupancy.

invented entities (1)

Multi-layer Value Map (MVM) no independent evidence
purpose: Integrates semantic priorities and reconstructed geometry into a shared cost space for physically grounded planning
Introduced as the core unifying component of the framework

pith-pipeline@v0.9.1-grok · 5767 in / 1118 out tokens · 29644 ms · 2026-07-01T05:17:37.356712+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 19 canonical work pages · 10 internal anchors

[1]

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Mano- lis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents, 2018. URL https://arxiv.org/abs/1807. 06757

2018
[2]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Bridg- ing zero-shot object navigation and foundation models through pixel-guided navigation skill

Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridg- ing zero-shot object navigation and foundation models through pixel-guided navigation skill. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5228–5234. IEEE, 2024

2024
[4]

Cognav: Cognitive process modeling for object goal navigation with llms

Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navigation with llms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9550–9560, 2025

2025
[5]

Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

2017
[6]

Learning to explore using active neural slam

Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. InInternational Conference on Learning Representations (ICLR), 2020

2020
[7]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33: 4247–4258, 2020

2020
[8]

Geometrically-constrained agent for spatial reasoning

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng. Geometrically-constrained agent for spatial reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 38689–38699, 2026

2026
[9]

Robothor: An open simulation-to-real embodied ai platform

Matt Deitke, Winson Han, Alvaro Herrasti, Anirud- dha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, et al. Robothor: An open simulation-to-real embodied ai platform. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3164–3174, 2020

2020
[10]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

2024
[11]

The University of North Carolina at Chapel Hill, 2000

Stefan Aric Gottschalk.Collision queries using oriented bounding boxes. The University of North Carolina at Chapel Hill, 2000

2000
[12]

A formal basis for the heuristic determination of minimum cost paths.IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths.IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

1968
[13]

AstraNav-World: World Model for Foresight Control and Consistency

Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt- 4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Panonav: Mapless zero-shot object navigation with panoramic scene parsing and dynamic memory, 2025

Qunchao Jin, Yilin Wu, and Changhao Chen. Panonav: Mapless zero-shot object navigation with panoramic scene parsing and dynamic memory, 2025. URL https: //arxiv.org/abs/2511.06840

work page arXiv 2025
[16]

MapAnything: Univer- sal feed-forward metric 3D reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Univer- sal feed-forward metric 3D reconstructi...

2026
[17]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models.arXiv preprint arXiv:2402.10670, 2024

Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models.arXiv preprint arXiv:2402.10670, 2024

work page arXiv 2024
[19]

Artificial potential field based path planning for mobile robots using a virtual obstacle concept

Min Cheol Lee and Min Gyu Park. Artificial potential field based path planning for mobile robots using a virtual obstacle concept. InProceedings 2003 IEEE/ASME In- ternational Conference on Advanced Intelligent Mecha- tronics (AIM 2003), volume 2, pages 735–740 vol.2,

2003
[20]

doi: 10.1109/AIM.2003.1225434

work page doi:10.1109/aim.2003.1225434 2003
[21]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open- set object detection.arXiv:2303.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Stubborn: A strong baseline for indoor object navigation

Haokuan Luo, Albert Yue, Zhang-Wei Hong, and Pulkit Agrawal. Stubborn: A strong baseline for indoor object navigation. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3287– 3293, 2022. doi: 10.1109/IROS47612.2022.9981646

work page doi:10.1109/iros47612.2022.9981646 2022
[24]

Zson: Zero-shot object- goal navigation using multimodal goal embeddings.Ad- vances in Neural Information Processing Systems, 35: 32340–32352, 2022

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object- goal navigation using multimodal goal embeddings.Ad- vances in Neural Information Processing Systems, 35: 32340–32352, 2022

2022
[25]

Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. InThirty-fifth Conference on Neural Information...
[26]

URL https://arxiv.org/abs/2109.08238

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Poni: Potential functions for objectgoal navigation with interaction-free learning, 2022

Santhosh Kumar Ramakrishnan, Devendra Singh Chap- lot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning, 2022. URL https://arxiv.org/ abs/2201.10029

work page arXiv 2022
[28]

Habitat-web: Learning embodied object- search strategies from human demonstrations at scale

Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-web: Learning embodied object- search strategies from human demonstrations at scale. In CVPR, 2022

2022
[29]

Grounded sam: Assembling open-world models for di- verse visual tasks, 2024

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for di- verse visual tasks, 2024. URL https://arxiv.org/abs/2401. 14159

2024
[30]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InICCV, 2019

2019
[31]

A fast marching level set method for monotonically advancing fronts.proceedings of the National Academy of Sciences, 93(4):1591–1595, 1996

James A Sethian. A fast marching level set method for monotonically advancing fronts.proceedings of the National Academy of Sciences, 93(4):1591–1595, 1996

1996
[32]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalk- wyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025
[34]

Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning

Mitchell Wortsman, Kiana Ehsani, Mohammad Raste- gari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning, 2019. URL https://arxiv.org/abs/ 1812.00971

work page internal anchor Pith review Pith/arXiv arXiv 2019
[35]

V oronav: V oronoi- based zero-shot object navigation with large language model.arXiv preprint arXiv:2401.02695, 2024

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. V oronav: V oronoi- based zero-shot object navigation with large language model.arXiv preprint arXiv:2401.02695, 2024

work page arXiv 2024
[36]

Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real- world perception for embodied agents. InComputer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018

2018
[37]

Mobility vla: Multi- modal instruction navigation with long-context vlms and topological graphs

Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Ed- ward Lee, Wenhao Yu, Connor Schenck, David Rendle- man, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility vla: Multi- modal ins...

2025
[38]

Omninav: A unified framework for prospective exploration and visual-language navigation

Xinda Xue, Junjun Hu, Minghua Luo, Xie Shichao, Jin- tao Chen, Zixun Xie, Quan Kuichen, Guo Wei, Mu Xu, and Zedong Chu. Omninav: A unified framework for prospective exploration and visual-language navigation. arXiv preprint arXiv:2509.25687, 2025

work page arXiv 2025
[39]

Offline visual repre- sentation learning for embodied navigation

Karmesh Yadav, Ram Ramrakhya, Arjun Majumdar, Vincent-Pierre Berges, Sachit Kuhar, Dhruv Batra, Alexei Baevski, and Oleksandr Maksymets. Offline visual repre- sentation learning for embodied navigation. InWorkshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023

2023
[40]

Visual semantic navigation using scene priors, 2018

Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors, 2018. URL https://arxiv.org/abs/1810. 06543

2018
[41]

Sg-nav: Online 3d scene graph prompting for llm- based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm- based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

2024
[42]

Unigoal: Towards universal zero- shot goal-oriented navigation

Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Unigoal: Towards universal zero- shot goal-oriented navigation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19057–19066, 2025

2025
[43]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In International Conference on Robotics and Automation (ICRA), 2024

2024
[44]

L3mvn: Leveraging large language models for visual target nav- igation

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target nav- igation. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560,
[45]

doi: 10.1109/IROS55552.2023.10342512

work page doi:10.1109/iros55552.2023.10342512 2023
[46]

A theory of geometry representations for spatial navigation.Progress in Neurobiology, 211:102228, 2022

Taiping Zeng, Bailu Si, and Jianfeng Feng. A theory of geometry representations for spatial navigation.Progress in Neurobiology, 211:102228, 2022

2022
[47]

Embodied navigation foundation model, 2025

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Ji- ahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, Yuxin Fan, Wenjun Li, Zhibo Chen, Fei Gao, Qi Wu, Zhizheng Zhang, and He Wang. Embodied navigation foundation model, 2025. URL https://arxiv.org/abs/2509.12129

work page arXiv 2025
[48]

Imaginenav: Prompting vision-language models as embodied navigator through scene imagination

Xinxin Zhao, Wenzhe Cai, Likun Tang, and Teng Wang. Imaginenav: Prompting vision-language models as embodied navigator through scene imagination. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 94387–94401, 2025. URL https://proceedings.iclr.cc/paper files/paper/2025/file/ ...

2025
[49]

Esc: Exploration with soft commonsense constraints for zero- shot object navigation

Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. Esc: Exploration with soft commonsense constraints for zero- shot object navigation. InInternational Conference on Machine Learning, pages 42829–42842. PMLR, 2023

2023
[50]

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target- driven visual navigation in indoor scenes using deep reinforcement learning, 2016. URL https://arxiv.org/abs/ 1609.05143

work page internal anchor Pith review Pith/arXiv arXiv 2016

[1] [1]

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Mano- lis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents, 2018. URL https://arxiv.org/abs/1807. 06757

2018

[2] [2]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Bridg- ing zero-shot object navigation and foundation models through pixel-guided navigation skill

Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridg- ing zero-shot object navigation and foundation models through pixel-guided navigation skill. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5228–5234. IEEE, 2024

2024

[4] [4]

Cognav: Cognitive process modeling for object goal navigation with llms

Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navigation with llms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9550–9560, 2025

2025

[5] [5]

Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

2017

[6] [6]

Learning to explore using active neural slam

Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. InInternational Conference on Learning Representations (ICLR), 2020

2020

[7] [7]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33: 4247–4258, 2020

2020

[8] [8]

Geometrically-constrained agent for spatial reasoning

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng. Geometrically-constrained agent for spatial reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 38689–38699, 2026

2026

[9] [9]

Robothor: An open simulation-to-real embodied ai platform

Matt Deitke, Winson Han, Alvaro Herrasti, Anirud- dha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, et al. Robothor: An open simulation-to-real embodied ai platform. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3164–3174, 2020

2020

[10] [10]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

2024

[11] [11]

The University of North Carolina at Chapel Hill, 2000

Stefan Aric Gottschalk.Collision queries using oriented bounding boxes. The University of North Carolina at Chapel Hill, 2000

2000

[12] [12]

A formal basis for the heuristic determination of minimum cost paths.IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths.IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

1968

[13] [13]

AstraNav-World: World Model for Foresight Control and Consistency

Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt- 4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Panonav: Mapless zero-shot object navigation with panoramic scene parsing and dynamic memory, 2025

Qunchao Jin, Yilin Wu, and Changhao Chen. Panonav: Mapless zero-shot object navigation with panoramic scene parsing and dynamic memory, 2025. URL https: //arxiv.org/abs/2511.06840

work page arXiv 2025

[16] [16]

MapAnything: Univer- sal feed-forward metric 3D reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Univer- sal feed-forward metric 3D reconstructi...

2026

[17] [17]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models.arXiv preprint arXiv:2402.10670, 2024

Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models.arXiv preprint arXiv:2402.10670, 2024

work page arXiv 2024

[19] [19]

Artificial potential field based path planning for mobile robots using a virtual obstacle concept

Min Cheol Lee and Min Gyu Park. Artificial potential field based path planning for mobile robots using a virtual obstacle concept. InProceedings 2003 IEEE/ASME In- ternational Conference on Advanced Intelligent Mecha- tronics (AIM 2003), volume 2, pages 735–740 vol.2,

2003

[20] [20]

doi: 10.1109/AIM.2003.1225434

work page doi:10.1109/aim.2003.1225434 2003

[21] [21]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open- set object detection.arXiv:2303.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Stubborn: A strong baseline for indoor object navigation

Haokuan Luo, Albert Yue, Zhang-Wei Hong, and Pulkit Agrawal. Stubborn: A strong baseline for indoor object navigation. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3287– 3293, 2022. doi: 10.1109/IROS47612.2022.9981646

work page doi:10.1109/iros47612.2022.9981646 2022

[24] [24]

Zson: Zero-shot object- goal navigation using multimodal goal embeddings.Ad- vances in Neural Information Processing Systems, 35: 32340–32352, 2022

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object- goal navigation using multimodal goal embeddings.Ad- vances in Neural Information Processing Systems, 35: 32340–32352, 2022

2022

[25] [25]

Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. InThirty-fifth Conference on Neural Information...

[26] [26]

URL https://arxiv.org/abs/2109.08238

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Poni: Potential functions for objectgoal navigation with interaction-free learning, 2022

Santhosh Kumar Ramakrishnan, Devendra Singh Chap- lot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning, 2022. URL https://arxiv.org/ abs/2201.10029

work page arXiv 2022

[28] [28]

Habitat-web: Learning embodied object- search strategies from human demonstrations at scale

Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-web: Learning embodied object- search strategies from human demonstrations at scale. In CVPR, 2022

2022

[29] [29]

Grounded sam: Assembling open-world models for di- verse visual tasks, 2024

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for di- verse visual tasks, 2024. URL https://arxiv.org/abs/2401. 14159

2024

[30] [30]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InICCV, 2019

2019

[31] [31]

A fast marching level set method for monotonically advancing fronts.proceedings of the National Academy of Sciences, 93(4):1591–1595, 1996

James A Sethian. A fast marching level set method for monotonically advancing fronts.proceedings of the National Academy of Sciences, 93(4):1591–1595, 1996

1996

[32] [32]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalk- wyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025

[34] [34]

Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning

Mitchell Wortsman, Kiana Ehsani, Mohammad Raste- gari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning, 2019. URL https://arxiv.org/abs/ 1812.00971

work page internal anchor Pith review Pith/arXiv arXiv 2019

[35] [35]

V oronav: V oronoi- based zero-shot object navigation with large language model.arXiv preprint arXiv:2401.02695, 2024

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. V oronav: V oronoi- based zero-shot object navigation with large language model.arXiv preprint arXiv:2401.02695, 2024

work page arXiv 2024

[36] [36]

Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real- world perception for embodied agents. InComputer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018

2018

[37] [37]

Mobility vla: Multi- modal instruction navigation with long-context vlms and topological graphs

Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Ed- ward Lee, Wenhao Yu, Connor Schenck, David Rendle- man, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility vla: Multi- modal ins...

2025

[38] [38]

Omninav: A unified framework for prospective exploration and visual-language navigation

Xinda Xue, Junjun Hu, Minghua Luo, Xie Shichao, Jin- tao Chen, Zixun Xie, Quan Kuichen, Guo Wei, Mu Xu, and Zedong Chu. Omninav: A unified framework for prospective exploration and visual-language navigation. arXiv preprint arXiv:2509.25687, 2025

work page arXiv 2025

[39] [39]

Offline visual repre- sentation learning for embodied navigation

Karmesh Yadav, Ram Ramrakhya, Arjun Majumdar, Vincent-Pierre Berges, Sachit Kuhar, Dhruv Batra, Alexei Baevski, and Oleksandr Maksymets. Offline visual repre- sentation learning for embodied navigation. InWorkshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023

2023

[40] [40]

Visual semantic navigation using scene priors, 2018

Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors, 2018. URL https://arxiv.org/abs/1810. 06543

2018

[41] [41]

Sg-nav: Online 3d scene graph prompting for llm- based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm- based zero-shot object navigation.Advances in neural information processing systems, 37:5285–5307, 2024

2024

[42] [42]

Unigoal: Towards universal zero- shot goal-oriented navigation

Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Unigoal: Towards universal zero- shot goal-oriented navigation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19057–19066, 2025

2025

[43] [43]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In International Conference on Robotics and Automation (ICRA), 2024

2024

[44] [44]

L3mvn: Leveraging large language models for visual target nav- igation

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target nav- igation. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560,

[45] [45]

doi: 10.1109/IROS55552.2023.10342512

work page doi:10.1109/iros55552.2023.10342512 2023

[46] [46]

A theory of geometry representations for spatial navigation.Progress in Neurobiology, 211:102228, 2022

Taiping Zeng, Bailu Si, and Jianfeng Feng. A theory of geometry representations for spatial navigation.Progress in Neurobiology, 211:102228, 2022

2022

[47] [47]

Embodied navigation foundation model, 2025

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Ji- ahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, Yuxin Fan, Wenjun Li, Zhibo Chen, Fei Gao, Qi Wu, Zhizheng Zhang, and He Wang. Embodied navigation foundation model, 2025. URL https://arxiv.org/abs/2509.12129

work page arXiv 2025

[48] [48]

Imaginenav: Prompting vision-language models as embodied navigator through scene imagination

Xinxin Zhao, Wenzhe Cai, Likun Tang, and Teng Wang. Imaginenav: Prompting vision-language models as embodied navigator through scene imagination. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 94387–94401, 2025. URL https://proceedings.iclr.cc/paper files/paper/2025/file/ ...

2025

[49] [49]

Esc: Exploration with soft commonsense constraints for zero- shot object navigation

Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. Esc: Exploration with soft commonsense constraints for zero- shot object navigation. InInternational Conference on Machine Learning, pages 42829–42842. PMLR, 2023

2023

[50] [50]

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target- driven visual navigation in indoor scenes using deep reinforcement learning, 2016. URL https://arxiv.org/abs/ 1609.05143

work page internal anchor Pith review Pith/arXiv arXiv 2016