EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Benedict Jun Ma; Zecheng Yin

arxiv: 2606.18634 · v1 · pith:5HYE4WG7new · submitted 2026-06-17 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-26 21:01 UTC · model grok-4.3

keywords: object goal navigation, depth fusion, vision-language models, efficient exploration, robot navigation, unknown environments, success rate, path length

The pith

Fusing depth maps with vision-language outputs lets navigation agents choose next steps that cut revisits and redundant motion in unknown spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EffiNav, a framework for efficient object goal navigation that decides where to explore next in unseen environments. It fuses depth signals with vision-language model outputs to guide movement, targeting two weaknesses that can cause looping or backtracking: poor generalization in training-based models and inefficiency in training-free frameworks. Evaluations on the HM3D and OVON simulation benchmarks, plus real-robot tests, show it matches or exceeds recent baselines on success rate (SR) and success weighted by path length (SPL). The work also adapts the same approach to a memory-augmented variant on GOAT-BENCH with minimal changes. A reader would care because shorter, smarter paths leave more time for follow-on tasks in search-and-rescue or field robotics.

Core claim

EffiNav fuses depth and vision-language signals to produce exploration decisions that avoid excessive revisits or redundant back-and-forth motion, delivering success rates and success-weighted path lengths that match or surpass recent baselines on HM3D and OVON while validating on physical robots and extending to memory-augmented object goal navigation on GOAT-BENCH.

What carries the argument

Fusion of depth maps and vision-language model outputs to select the next exploration location in unknown environments.
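To make the idea of "fusing" the two signals concrete, here is a minimal sketch of one way a depth cue and a VLM relevance score could be combined to rank candidate regions. It is an illustration only, not the paper's architecture: the `Region` fields, the weights, the depth cap, and the revisit penalty are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Region:
    name: str
    mean_depth_m: float    # average depth inside the region mask (hypothetical feature)
    vlm_score: float       # assumed VLM relevance to the goal, in [0, 1]
    visited: bool = False  # whether the agent has already explored this region


def fused_score(r: Region, max_depth_m: float = 5.0, w_depth: float = 0.4,
                w_vlm: float = 0.6, revisit_penalty: float = 0.5) -> float:
    """Weighted sum of depth openness and VLM relevance (weights are assumptions)."""
    openness = min(r.mean_depth_m, max_depth_m) / max_depth_m
    score = w_depth * openness + w_vlm * r.vlm_score
    # Penalise regions that were already explored to discourage revisits.
    return score - revisit_penalty if r.visited else score


def pick_next(regions: Sequence[Region]) -> Region:
    """Return the candidate region with the highest fused score."""
    return max(regions, key=fused_score)


regions = [
    Region("hallway", mean_depth_m=4.0, vlm_score=0.3),
    Region("kitchen door", mean_depth_m=2.0, vlm_score=0.9),
    Region("corner", mean_depth_m=5.0, vlm_score=0.8, visited=True),
]
print(pick_next(regions).name)
```

In this toy setup the visited "corner" has the highest raw score but loses to "kitchen door" once the revisit penalty is applied, which is the kind of behaviour the page attributes to the method.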

If this is right

Agents reach targets with higher path efficiency as measured by SPL.
The same fusion works across two distinct simulation benchmarks without retraining.
Real-robot deployment requires only minimal changes from the simulated version.
The framework extends to memory-augmented object goal navigation with little modification.
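For readers unfamiliar with the two metrics, the sketch below computes SR and SPL using the standard definition from Anderson et al. (2018): SPL averages, over all episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The `Episode` class and the sample values are hypothetical and are not taken from the paper's results; the code assumes every episode has a positive shortest-path length.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool         # did the agent stop at the goal?
    shortest_path: float  # geodesic distance from start to goal, in metres
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length; failed episodes contribute zero."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Made-up episodes for illustration only.
episodes = [
    Episode(True, 5.0, 5.0),    # optimal path: contributes 1.0
    Episode(True, 5.0, 10.0),   # twice the optimal length: contributes 0.5
    Episode(False, 5.0, 20.0),  # failure: contributes 0.0
    Episode(True, 4.0, 8.0),    # twice the optimal length: contributes 0.5
]
print(success_rate(episodes), spl(episodes))  # 0.75 0.5
```

The example shows why the two numbers can diverge: three of four episodes succeed, but detours halve the credit for two of them.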

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fusion approach may generalize to environments containing dynamic obstacles if depth updates remain reliable.
Combining the method with additional sensors such as wheel odometry could further lower path length in cluttered spaces.
Failure analysis on large episode sets could identify specific scene types where language cues add the most value over depth alone.

Load-bearing premise

That depth and vision-language signals together yield exploration choices that reliably reduce revisits and redundant motion compared with prior methods.

What would settle it

A set of new simulation episodes in which EffiNav produces trajectories whose revisit rate or back-and-forth count matches or exceeds that of the worst-performing baselines.
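The page does not define "revisit rate" or "back-and-forth count", so the sketch below uses one plausible set of definitions as an assumption: positions are discretised into grid cells, the revisit rate is the fraction of cell transitions that enter an already-visited cell, and the back-and-forth count is the number of A → B → A patterns. The function names and the `cell_size` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def to_cells(path: Sequence[Tuple[float, float]], cell_size: float = 1.0) -> List[Cell]:
    """Discretise (x, y) positions into grid cells and collapse consecutive repeats."""
    cells: List[Cell] = []
    for x, y in path:
        c = (int(x // cell_size), int(y // cell_size))
        if not cells or cells[-1] != c:
            cells.append(c)
    return cells


def revisit_rate(cells: Sequence[Cell]) -> float:
    """Fraction of cell transitions that enter a cell visited earlier."""
    if len(cells) < 2:
        return 0.0
    seen = {cells[0]}
    revisits = 0
    for c in cells[1:]:
        if c in seen:
            revisits += 1
        seen.add(c)
    return revisits / (len(cells) - 1)


def back_and_forth_count(cells: Sequence[Cell]) -> int:
    """Number of A -> B -> A patterns in the cell sequence."""
    return sum(1 for i in range(2, len(cells)) if cells[i] == cells[i - 2])


# Toy trajectory that oscillates between two cells before moving on.
path = [(0.5, 0.5), (1.5, 0.5), (0.5, 0.5), (1.5, 0.5), (2.5, 0.5)]
cells = to_cells(path)
print(revisit_rate(cells), back_and_forth_count(cells))  # 0.5 2
```

Computing both quantities per episode for EffiNav and for each baseline would give the direct comparison this test calls for.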

Figures

Figures reproduced from arXiv: 2606.18634 by Benedict Jun Ma, Zecheng Yin.

**Figure 1.** Overview of the proposed EffiNav framework. The EffiNav processes RGB-D input to generate region masks, selects ego-centric regions via VLM …

**Figure 2.** Illustration of path planning. Goal: TV monitor.

**Figure 3.** Illustration of some “easy navigation” in OVON. Goal: fireplace, …

**Figure 4.** Distribution of goal geodesic distance in HM3D and OVON …

**Figure 5.** Distribution of goals in HM3D and OVON geodesic distances from the agent’s initial position to the goal, illustrated in …

**Figure 6.** Failure analysis results. Failure reason and item distribution in HM3D, …

**Figure 8.** EffiNav search trajectories toward refrigerator cabinet and bed as …

**Figure 9.** Trajectories on the same goal “refrigerator cabinet” with different …

**Figure 10.** Real world environment, position settings and trajectory overview.

**Figure 11.** Trajectory of real-world validation on Unitree robot.

**Figure 12.** Qwen2.5 VL detection prompt (green box) and VLM output (blue …

**Figure 13.** Qwen2.5 VL VQA prompt in global wise check in …

**Figure 14.** Qwen2.5 VL ego centric VQA prompt in module a …

Abstract

To locate a target object while exploring the unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of such task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on massive simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics--Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Recognizing the different emphases of the two datasets, the performances reveals this framework is more balanced and generalizable for efficient ObjNav.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EffiNav fuses depth and vision-language signals for better ObjNav exploration decisions and shows competitive results on sim benchmarks plus real robots, but the gains look incremental rather than foundational.

The core point here is that the authors combine depth sensing with vision-language model outputs to guide where the agent looks next in unknown spaces, and this produces SR and SPL numbers that match or beat recent baselines while cutting down on backtracking. They back it with tests on HM3D and OVON, real-robot runs, a failure analysis across many episodes, and a quick extension to the memory-augmented GOAT-BENCH setting.

What the work actually adds is a concrete fusion recipe that targets the generalization problems of trained models and the inefficiency of non-trained ones. The real-robot validation and the failure breakdown are useful; they show the method is not just a sim artifact. The stress-test note found no internal contradictions or unsupported leaps in the argument chain, so the central performance claim appears to rest on the reported comparisons.

The soft spots are modest. Without the exact fusion architecture or ablation numbers in the abstract, it is difficult to judge how much each component drives the improvement versus careful tuning or dataset specifics. The generalization story would be stronger with more varied real-world conditions or explicit comparisons on how much redundant motion is actually avoided. These are normal engineering-paper issues rather than fatal gaps.

This paper is aimed at robotics researchers who already work on object-goal or embodied navigation and want practical multimodal ideas that run on hardware. A reader in that niche would pick up implementation details and benchmark numbers worth checking. It is coherent on its own terms and shows honest engagement with the metrics that matter for the task.

I would send it to peer review. The multi-benchmark plus real-robot setup gives referees something concrete to evaluate, even if the novelty is more applied than theoretical.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces EffiNav, a framework fusing depth information with vision-language models to enable more efficient exploration decisions in Object Goal Navigation (ObjNav) tasks within unknown environments. It evaluates the approach on HM3D and OVON simulation benchmarks, extends it with minimal modification to a memory-augmented setting on GOAT-BENCH, and includes real-robot validation, while providing failure analysis across simulation episodes. The central claim is that EffiNav matches or outperforms recent baselines on Success Rate (SR) and Success weighted by Path Length (SPL), indicating improved efficiency, robustness, and generalizability.

Significance. If the reported performance on SR and SPL holds, the work provides a practical fusion-based approach that balances success and trajectory efficiency in ObjNav, with demonstrated adaptability to related tasks and real-world settings. This could support deployment in applications such as search-and-rescue where avoiding redundant motion is critical.

minor comments (3)

[Abstract] The final sentence contains a subject-verb agreement error ('the performances reveals' should be 'the performance reveals').
[Abstract, §4] While the text states that EffiNav 'matches or outperforms' baselines on SR and SPL, the abstract itself provides no numerical values, error bars, or dataset-specific breakdowns; ensure the results section supplies these with clear table references for verifiability.
[Real-robot validation] The description of real-robot validation would benefit from explicit mention of sensor setup, environment scale, and any domain-shift handling to strengthen the practical-applicability claim.
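As an illustration of the error bars the second comment asks for, the sketch below computes a percentile bootstrap confidence interval over per-episode success indicators. This is a generic statistical technique, not something the paper is known to use; the function name, the resample count, and the sample data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample episodes with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lower_idx], means[upper_idx]


# Made-up outcomes: 60 successes out of 100 episodes.
outcomes = [1.0] * 60 + [0.0] * 40
low, high = bootstrap_ci(outcomes)
print(f"SR = {sum(outcomes) / len(outcomes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The same function applies to SPL by passing each episode's SPL contribution instead of the 0/1 success indicator.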

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on EffiNav and the recommendation for minor revision. The referee's description accurately captures the framework's fusion of depth and vision-language inputs, its performance on HM3D and OVON, the real-robot validation, failure analysis, and the extension to GOAT-BENCH. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical navigation framework whose performance claims rest on direct comparisons against external baselines on HM3D, OVON, GOAT-BENCH, and real-robot trials. No equations, fitted parameters, or first-principles derivations appear in the provided text; success metrics (SR, SPL) are reported as measured outcomes rather than quantities defined in terms of themselves. No self-citation chains or ansatzes are invoked to close any derivation loop, so the argument chain remains open to external falsification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no mathematical model, no explicit parameters, and no postulated entities; the ledger is therefore empty.


[1] [1]

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Mano- lis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents, 2018. URL https://arxiv.org/abs/1807. 06757

2018

[2] [2]

Navdp: Learning sim-to-real navigation diffusion policy with privileged information guidance, 2025

Wenzhe Cai, Jiaqi Peng, Yuqiang Yang, Yujian Zhang, Meng Wei, Hanqing Wang, Yilun Chen, Tai Wang, and Jiangmiao Pang. Navdp: Learning sim-to-real navigation diffusion policy with privileged information guidance, 2025

2025

[3] [3]

Cognav: Cognitive process modeling for object goal navigation with llms

Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navigation with llms. InInternational Conference on Computer Vision (ICCV), 2025

2025

[4] [4]

Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments.International Conference on 3D Vision (3DV), 2017

2017

[5] [5]

Navila: Legged robot vision-language- action model for navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language- action model for navigation. InRSS, 2025

2025

[6] [6]

End-to-end navigation with vlms: Trans- forming spatial reasoning into question-answering

Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vlms: Trans- forming spatial reasoning into question-answering. In Workshop on Language and Robot Learning: Language as an Interface, 2024

2024

[7] [7]

Diffusion as reasoning: Enhancing object navigation via diffusion model condi- tioned on llm-based object-room knowledge, 2025

Yiming Ji, Kaijie Yun, Yang Liu, Zhengpu Wang, Boyu Ma, Zongwu Xie, and Hong Liu. Diffusion as reasoning: Enhancing object navigation via diffusion model condi- tioned on llm-based object-room knowledge, 2025. URL https://arxiv.org/abs/2410.21842

work page arXiv 2025

[8] [8]

Dynavlm: Zero- shot vision-language navigation system with dynamic viewpoints and self-refining graph memory, 2025

Zihe Ji, Huangxuan Lin, and Yue Gao. Dynavlm: Zero- shot vision-language navigation system with dynamic viewpoints and self-refining graph memory, 2025. URL https://arxiv.org/abs/2506.15096

work page arXiv 2025

[9] [9]

Goat-bench: A benchmark for multi- modal lifelong navigation, 2024

Mukul Khanna*, Ram Ramrakhya*, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi- modal lifelong navigation, 2024

2024

[10] [10]

Beyond the nav-graph: Vision- and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision- and-language navigation in continuous environments. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan- Michael Frahm, editors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23- 28, 2020, Proceedings, Part XXVIII, volume ...

work page doi:10.1007/978-3-030-58604-1 2020

[11] [11]

Ground-level viewpoint vision-and-language navigation in continuous environments

Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, and Qi Wu. Ground-level viewpoint vision-and-language navigation in continuous environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 5266–5273,

[12] [12]

doi: 10.1109/ICRA55743.2025.11127275

work page doi:10.1109/icra55743.2025.11127275 2025

[13] [13]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

PIVOT: iterative visual prompting elicits action- able knowledge for vlms

Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, and Brian Ichter. PIVOT: iterative visual prompting elicits ac...

2024

[15] [15]

Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

Xavi Puig, Eric Undersander, Andrew Szot, Mikael Dal- laire Cote, Ruslan Partsey, Jimmy Yang, Ruta Desai, Alexander William Clegg, Michal Hlavac, Tiffany Min, Theo Gervet, Vladim ´ır V ondrus, Vincent-Pierre Berges, John Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara R...

2023

[16] [16]

Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms

Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025

2025

[17] [17]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wij- mans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (hm3d): 1000 large- scale 3d environments for embodied ai, 2021. URL https://arxiv.org/abs/2109.08238

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19]

Pirlnav: Pretraining with imitation and rl finetuning for objectnav

Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. Pirlnav: Pretraining with imitation and rl finetuning for objectnav. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[20]

Vint: A foundation model for visual navigation

Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. Vint: A foundation model for visual navigation. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 711–733. PMLR, 06–09 Nov 2...

2023

[21]

NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration. In International Conference on Robotics and Automation (ICRA), 2024. URL https://arxiv.org/abs/2310.xxxx

2024

[22]

Frontiernet: Learning visual cues to explore

Boyang Sun, Hanzhi Chen, Stefan Leutenegger, Cesar Cadena, Marc Pollefeys, and Hermann Blum. Frontiernet: Learning visual cues to explore. IEEE Robotics and Automation Letters, 10(7):6576–6583, 2025. doi: 10.1109/LRA.2025.3569122

2025

[23]

Qwen2.5-vl

Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/

2025

[24]

Trackvla: Embodied visual tracking in the wild

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, and He Wang. Trackvla: Embodied visual tracking in the wild. arXiv preprint, 2025. URL http://arxiv.org/abs/2505.23189

2025

[25]

3d-mem: 3d scene memory for embodied exploration and reasoning

Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 17294–17303, June 2025

2025

[26]

Navigation with vlm framework: Go to any language

Zecheng Yin, Chonghao Cheng, and Lizhen. Navigation with vlm framework: Go to any language, 2024. URL https://arxiv.org/abs/2410.02787

2024

[27]

Hypernav: Hybrid perception for object-oriented navigation in unknown environment

Zecheng Yin, Hao Zhao, and Zhen Li. Hypernav: Hybrid perception for object-oriented navigation in unknown environment, 2025. URL https://arxiv.org/abs/2510.22917

2025

[28]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In International Conference on Robotics and Automation (ICRA), 2024

2024

[29]

Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation

Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

2024

[30]

Trajectory diffusion for objectgoal navigation

Xinyao Yu, Sixian Zhang, Xinhang Song, Xiaorong Qin, and Shuqiang Jiang. Trajectory diffusion for objectgoal navigation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2025. Curran Associates Inc. ISBN 9798331314385

2025

[31]

Gamap: zero-shot object goal navigation with multi-scale geometric-affordance guidance

Shuaihang Yuan, Hao Huang, Yu Hao, Congcong Wen, Anthony Tzes, and Yi Fang. Gamap: zero-shot object goal navigation with multi-scale geometric-affordance guidance. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2025. Curran Associates Inc. ISBN 9798331314385

2025

[32]

Navidiffusor: Cost-guided diffusion model for visual navigation

Yiming Zeng, Hao Ren, Shuhang Wang, Junlong Huang, and Hui Cheng. Navidiffusor: Cost-guided diffusion model for visual navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025

2025

[33]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023

2023

[34]

Embodied navigation foundation model

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129, 2025

2025

[35]

Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. Robotics: Science and Systems, 2025

2025

[36]

NavA^3: Understanding any instruction, navigating anywhere, finding anything

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Haoxiang Fu, Xinyu Zheng, Pengwei Wang, Zhongyuan Wang, Wenbo Ding, and Shanghang Zhang. NavA^3: Understanding any instruction, navigating anywhere, finding anything, 2025. URL https://arxiv.org/abs/2508.04598

2025

[37]

Robotron-nav: A unified framework for embodied navigation integrating perception, planning, and prediction

Yufeng Zhong, Chengjian Feng, Feng Yan, Fanfan Liu, Liming Zheng, and Lin Ma. Robotron-nav: A unified framework for embodied navigation integrating perception, planning, and prediction. In International Conference on Computer Vision (ICCV), 2025

2025

APPENDIX

Let N denote the total number of evaluation episodes. For each episode i ∈ {1, ..., N}, define: •I ...