pith. machine review for the scientific record.

arxiv: 2512.17435 · v3 · submitted 2025-12-19 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords: visual navigation · vision-language models · scene imagination · mapless navigation · embodied AI · object navigation · instance navigation · future view generation

The pith

Vision-language models navigate without maps by imagining future views from candidate positions and selecting the best one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard approaches to robot navigation either build explicit maps or rely on text-only reasoning from language models; text-only planning overlooks the spatial geometry and occupancy details essential for real-world movement. This work demonstrates that vision-language models can drive mapless navigation directly from RGB or RGB-D camera streams by first generating imagined future images of candidate robot viewpoints. A dedicated imagination module, trained on patterns distilled from effective human navigation choices, creates these views; the model then simply chooses the single most informative image as the next target. A memory mechanism maintains consistency across observations by integrating keyframes in a sparse-to-dense hierarchy. The result converts long-horizon object search into a sequence of simpler point-goal navigation steps, yielding top performance among mapless systems and outperforming many map-based baselines on standard benchmarks.
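To make the loop concrete, here is a minimal Python sketch of the imagine-then-select cycle as described above. Everything in it is an illustrative stand-in: `View`, `imagine_views`, and `select_best` are invented placeholders, and nothing here reproduces the paper's Where2Imagine module, VLM selector, or foveation memory.

```python
import random
from dataclasses import dataclass

# Toy stand-ins for the paper's components; for illustration only.

@dataclass
class View:
    pose: tuple            # (x, y, heading) of an imagined viewpoint
    score: float           # stand-in for VLM-judged informativeness
    contains_goal: bool    # whether the goal object appears in the view

def imagine_views(obs, n: int = 8) -> list:
    """Fabricate n candidate future views (placeholder for the
    future-view imagination module)."""
    return [View(pose=(random.uniform(-2, 2), random.uniform(-2, 2), 45.0 * i),
                 score=random.random(),
                 contains_goal=random.random() > 0.95)
            for i in range(n)]

def select_best(goal: str, views: list) -> View:
    """Placeholder for VLM best-view selection: take the top-scored view."""
    return max(views, key=lambda v: v.score)

def navigate(goal: str, max_steps: int = 50) -> bool:
    """Long-horizon object search as repeated point-goal hops."""
    keyframes = []                       # flat list; the paper keeps this sparse-to-dense
    for _ in range(max_steps):
        obs = None                       # stand-in for an onboard RGB/RGB-D capture
        candidates = imagine_views(obs)  # imagined images at candidate poses
        best = select_best(goal, candidates)
        keyframes.append(best)           # keyframe integration, heavily simplified
        if best.contains_goal:           # goal spotted: one final point-goal hop
            return True
        # otherwise: issue a point-goal command toward best.pose and repeat
    return False

if __name__ == "__main__":
    random.seed(0)
    print("found target:", navigate("armchair"))
```

The structural point is the loop body: every iteration ends in a point-goal hop, which is exactly how the long-horizon task decomposes.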

Core claim

The central claim is that prompting a vision-language model with imagined future observation images turns navigation planning into a tractable best-view selection task. Backed by a selective foveation memory that preserves spatial consistency, this yields state-of-the-art mapless performance on open-vocabulary object and instance navigation, surpassing most map-based alternatives.
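To ground "best-view selection" as a prompting problem, here is a sketch of how a multi-image visual prompt could be assembled, assuming a generic OpenAI-style multimodal chat format. `build_best_view_prompt` is a hypothetical helper; the paper's actual prompt input and decision output appear in its Figure 9.

```python
import base64

def build_best_view_prompt(goal: str, image_paths: list[str]) -> list[dict]:
    """Assemble a multi-image chat message asking a VLM to pick the most
    informative imagined view. Hypothetical helper in a generic
    OpenAI-style format; not the paper's actual prompt."""
    content = [{
        "type": "text",
        "text": (
            f"You are navigating indoors to find: {goal}. "
            f"The {len(image_paths)} images below are imagined views from "
            "candidate positions. Answer with the index (0-based) of the "
            "single view most likely to lead toward the goal."
        ),
    }]
    for path in image_paths:
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return [{"role": "user", "content": content}]
```

The returned structure would be passed as the messages argument of a chat-completion call; parsing the model's index reply back into a candidate pose is the remaining glue.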

What carries the argument

The future-view imagination module that generates semantically meaningful candidate viewpoints from human navigation preferences, paired with VLM-based selection of the most informative view and a hierarchical selective foveation memory for long-term spatial reasoning.
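The memory component is named but not unpacked here. Below is a toy sketch of one way a sparse-to-dense keyframe store could behave, keeping recent frames dense and thinning older ones; `KeyframeMemory` is a hypothetical simplification, not the paper's selective foveation mechanism.

```python
from collections import deque

class KeyframeMemory:
    """Toy sparse-to-dense store: the newest frames sit densely in a ring
    buffer; frames evicted from it survive only at every
    `sparse_stride`-th eviction. Hypothetical simplification, not the
    paper's selective foveation mechanism."""

    def __init__(self, dense_size: int = 8, sparse_stride: int = 4):
        self.dense = deque(maxlen=dense_size)   # recent frames, kept in full
        self.sparse = []                        # thinned older history
        self.stride = sparse_stride
        self._evicted = 0

    def add(self, frame) -> None:
        if len(self.dense) == self.dense.maxlen:
            oldest = self.dense[0]              # about to fall out of the ring
            if self._evicted % self.stride == 0:
                self.sparse.append(oldest)      # promote to sparse history
            self._evicted += 1
        self.dense.append(frame)

    def snapshot(self) -> list:
        """Frames handed to the planner: sparse history, then dense recency."""
        return self.sparse + list(self.dense)

mem = KeyframeMemory(dense_size=4, sparse_stride=3)
for t in range(20):
    mem.add(f"frame_{t}")
print(mem.snapshot())   # a thinned prefix plus the four newest frames
```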

If this is right

  • Navigation decisions reduce to repeated selection of the single most informative imagined image rather than complex geometric planning.
  • Long-horizon goal-directed tasks decompose into sequences of shorter point-goal navigation subtasks.
  • Scene imagination combined with selective memory enables effective spatial reasoning from onboard visual streams alone.
  • The approach achieves state-of-the-art results in mapless settings while exceeding many methods that rely on explicit maps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same imagination-plus-selection loop could be applied to other embodied tasks such as object manipulation by generating future gripper views.
  • If the memory mechanism scales, it may reduce reliance on full environment mapping in changing or partially observed indoor spaces.
  • Visual prompting of this kind suggests VLMs can be steered toward latent spatial understanding without additional training on navigation data.

Load-bearing premise

The future-view imagination module accurately distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential.

What would settle it

Replacing the imagination module with random or static viewpoint generation and measuring whether performance on the open-vocabulary object and instance navigation benchmarks falls below that of competing mapless or map-based methods.
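A sketch of what that ablation harness could look like, holding the VLM selector and memory fixed inside the evaluation loop while swapping only the candidate generator. `random_viewpoints`, `uniform_viewpoints`, and the fabricated `evaluate` stub are illustrative assumptions; the learned Where2Imagine generator is deliberately not reproduced.

```python
import math
import random

def random_viewpoints(n: int = 8, radius: float = 2.0) -> list:
    """Random-bearing baseline: n candidate positions at a fixed radius."""
    return [(radius * math.cos(a), radius * math.sin(a))
            for a in (random.uniform(0.0, 2.0 * math.pi) for _ in range(n))]

def uniform_viewpoints(n: int = 8, radius: float = 2.0) -> list:
    """Static baseline: evenly spaced bearings (cf. the paper's Figure 7)."""
    return [(radius * math.cos(2.0 * math.pi * i / n),
             radius * math.sin(2.0 * math.pi * i / n))
            for i in range(n)]

def run_ablation(episodes, generators, evaluate) -> dict:
    """Success rate per candidate-viewpoint generator; the selector and
    memory stay fixed inside evaluate()."""
    return {name: sum(evaluate(ep, gen) for ep in episodes) / len(episodes)
            for name, gen in generators.items()}

if __name__ == "__main__":
    # Fabricated evaluate(): returns a coin flip so the demo runs end to end.
    demo_eval = lambda ep, gen: random.random() < 0.5
    print(run_ablation(range(100),
                       {"random": random_viewpoints,
                        "uniform": uniform_viewpoints},
                       demo_eval))
```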

Figures

Figures reproduced from arXiv: 2512.17435 by Changyin Sun, Teng Wang, Wenzhe Cai, Xinxin Zhao.

Figure 1: The comparison between the conventional LLM-based navigation pipeline and our ImagineNav++ pipeline. The traditional LLM-based navigation …
Figure 2: The overall pipeline of our mapless, open-vocabulary navigation framework ImagineNav++. At each iteration, the agent captures a panoramic view of its …
Figure 3: Visualization of human demonstration trajectories on the MP3D …
Figure 4: Visualization of the synthesized image observations at future navigation …
Figure 5: Where2Imagine: the impact of different sampling intervals T on ObjectNav performance on HM3D dataset.
Figure 6: Comparison of trajectories at different sampling steps T. …
Figure 7: The visualization of the relative pose predicted by our Where2Imagine module and uniform sampling (radius: 2.0m; angular interval: …
Figure 8: Visualization of our hierarchical keyframe-based memory, depicting RGB observations and their corresponding top-down map positions for selected …
Figure 9: Complete prompt input and decision output of the Vision-Language model for exploration direction selection in the object goal navigation task.
Figure 10: Analysis of object-goal navigation trajectories: success vs. failure examples. The top and bottom rows compare top-down paths from successful and …
Original abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ImagineNav++, a framework for mapless embodied navigation that prompts VLMs with imagined future scene views from candidate positions. It features a future-view imagination module trained to distill human navigation preferences for generating informative viewpoints, a VLM-based selection of the best view, and a selective foveation memory for maintaining spatial consistency across observations. The approach converts long-horizon navigation into point-goal tasks and reports state-of-the-art results on open-vocabulary object and instance navigation benchmarks, outperforming most map-based methods.

Significance. Should the empirical claims hold under scrutiny, the work would be significant for the field of embodied AI and robotics. It offers a novel way to harness VLMs' visual reasoning capabilities for navigation without explicit mapping, which could lead to more efficient and scalable solutions for home-assistance robots performing long-horizon tasks. The emphasis on scene imagination as a bridge between textual planning limitations and spatial perception is a promising direction.

major comments (2)
  1. The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.
  2. While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.
minor comments (1)
  1. The phrase 'selective foveation memory mechanism' is introduced without a brief explanation or reference; a short clarification would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We are pleased that the referee recognizes the potential significance of our work in embodied AI. Below, we address each major comment in detail, providing clarifications and indicating revisions to be made in the updated manuscript.

Point-by-point responses
  1. Referee: The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.

    Authors: We agree that isolating the contribution of the future-view imagination module is important for validating our claims. While the current manuscript demonstrates the overall effectiveness through end-to-end navigation results and qualitative visualizations of imagined views, we acknowledge the absence of specific quantitative metrics for viewpoint quality and direct ablations against random or heuristic selections. In the revised manuscript, we will include additional ablation studies comparing the preference-distilled module to random view generation and heuristic baselines (e.g., based on depth or saliency). We will also report human preference alignment scores using the validation set from the distillation process. This will help attribute the performance gains specifically to the imagination component.
    revision: yes

  2. Referee: While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.

    Authors: We appreciate this feedback on strengthening the empirical evaluation. The manuscript does include comparisons against several mapless and map-based baselines on standard benchmarks, with results aggregated over multiple runs. However, to address the concerns, we will expand the experimental section with detailed per-task breakdowns (e.g., success rates for different object categories and scene types), additional ablation studies on the memory and VLM components, and full experimental protocols including hyperparameters and training details. We will also add statistical significance tests, such as paired t-tests or Wilcoxon tests (a minimal sketch follows below), to the performance comparisons. These additions will be included in the revised version to provide a more robust assessment of the SOTA claims.
    revision: yes
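(Editorial sketch, not part of the simulated rebuttal: a minimal example of the paired tests named above, run with SciPy on fabricated per-episode success indicators. Neither the numbers nor the choice between the two tests comes from the paper.)

```python
from scipy import stats

# Fabricated per-episode success indicators (1 = success) for two methods
# evaluated on the same episode set; pairing is per episode.
ours     = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
baseline = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Wilcoxon signed-rank test on paired differences. zero_method="zsplit"
# keeps tied episodes (both methods agree) in the ranking instead of
# discarding them.
stat, p = stats.wilcoxon(ours, baseline, zero_method="zsplit")
print(f"Wilcoxon: statistic={stat:.1f}, p={p:.3f}")

# Paired t-test over the same pairs, as the rebuttal's alternative.
t, p_t = stats.ttest_rel(ours, baseline)
print(f"paired t-test: t={t:.2f}, p={p_t:.3f}")
```

For strictly binary per-episode outcomes, an exact McNemar test would be a natural further alternative.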

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes an empirical navigation framework (ImagineNav++) that combines a trained future-view imagination module, VLM-based view selection, and a selective memory mechanism. All performance claims are grounded in external benchmark evaluations on open-vocabulary navigation tasks rather than any internal derivation that reduces to fitted parameters or self-defined quantities by construction. No equations or predictions are shown to be equivalent to their inputs; the method relies on pre-trained VLMs and reported SOTA results, consistent with independent external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the imagination module can produce useful views and that VLMs can perform spatial selection from them; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: VLMs can perform effective spatial reasoning and viewpoint selection from imagined future images
    Invoked when the framework translates navigation planning into best-view image selection.

pith-pipeline@v0.9.0 · 5568 in / 1143 out tokens · 78873 ms · 2026-05-16T21:05:39.672838+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 8 internal anchors
