pith. machine review for the scientific record.

arxiv: 2512.17435 · v3 · submitted 2025-12-19 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords: visual navigation · vision-language models · scene imagination · mapless navigation · embodied AI · object navigation · instance navigation · future view generation

The pith

Vision-language models navigate without maps by imagining future views from candidate positions and selecting the best one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard approaches to robot navigation either build explicit maps or rely on text-only reasoning from language models; text-only planning overlooks the spatial geometry and occupancy details essential for real-world movement. This work demonstrates that vision-language models can drive mapless navigation directly from RGB or RGB-D camera streams by first generating imagined future images of candidate robot viewpoints. A dedicated imagination module, trained on patterns distilled from effective human navigation choices, creates these views; the model then simply chooses the single most informative image as the next target. A memory mechanism maintains consistency across observations by integrating keyframes in a sparse-to-dense hierarchy. The result converts long-horizon object search into a sequence of simpler point-goal navigation steps, yielding top performance among mapless systems and outperforming many map-based baselines on standard benchmarks.
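To make the loop concrete, here is a minimal Python sketch of the imagine-then-select cycle as described above. Everything in it is an illustrative stand-in: `View`, `imagine_views`, and `select_best` are invented placeholders, and nothing here reproduces the paper's Where2Imagine module, VLM selector, or foveation memory.

```python
import random
from dataclasses import dataclass

# Toy stand-ins for the paper's components; for illustration only.

@dataclass
class View:
    pose: tuple            # (x, y, heading) of an imagined viewpoint
    score: float           # stand-in for VLM-judged informativeness
    contains_goal: bool    # whether the goal object appears in the view

def imagine_views(obs, n: int = 8) -> list:
    """Fabricate n candidate future views (placeholder for the
    future-view imagination module)."""
    return [View(pose=(random.uniform(-2, 2), random.uniform(-2, 2), 45.0 * i),
                 score=random.random(),
                 contains_goal=random.random() > 0.95)
            for i in range(n)]

def select_best(goal: str, views: list) -> View:
    """Placeholder for VLM best-view selection: take the top-scored view."""
    return max(views, key=lambda v: v.score)

def navigate(goal: str, max_steps: int = 50) -> bool:
    """Long-horizon object search as repeated point-goal hops."""
    keyframes = []                       # flat list; the paper keeps this sparse-to-dense
    for _ in range(max_steps):
        obs = None                       # stand-in for an onboard RGB/RGB-D capture
        candidates = imagine_views(obs)  # imagined images at candidate poses
        best = select_best(goal, candidates)
        keyframes.append(best)           # keyframe integration, heavily simplified
        if best.contains_goal:           # goal spotted: one final point-goal hop
            return True
        # otherwise: issue a point-goal command toward best.pose and repeat
    return False

if __name__ == "__main__":
    random.seed(0)
    print("found target:", navigate("armchair"))
```

The structural point is the loop body: every iteration ends in a point-goal hop, which is exactly how the long-horizon task decomposes.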

Core claim

The central claim is that prompting a vision-language model with imagined future observation images turns navigation planning into a tractable best-view selection task. Backed by a selective foveation memory that preserves spatial consistency, this yields state-of-the-art mapless performance on open-vocabulary object and instance navigation, surpassing most map-based alternatives.
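To ground "best-view selection" as a prompting problem, here is a sketch of how a multi-image visual prompt could be assembled, assuming a generic OpenAI-style multimodal chat format. `build_best_view_prompt` is a hypothetical helper; the paper's actual prompt input and decision output appear in its Figure 9.

```python
import base64

def build_best_view_prompt(goal: str, image_paths: list[str]) -> list[dict]:
    """Assemble a multi-image chat message asking a VLM to pick the most
    informative imagined view. Hypothetical helper in a generic
    OpenAI-style format; not the paper's actual prompt."""
    content = [{
        "type": "text",
        "text": (
            f"You are navigating indoors to find: {goal}. "
            f"The {len(image_paths)} images below are imagined views from "
            "candidate positions. Answer with the index (0-based) of the "
            "single view most likely to lead toward the goal."
        ),
    }]
    for path in image_paths:
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return [{"role": "user", "content": content}]
```

The returned structure would be passed as the messages argument of a chat-completion call; parsing the model's index reply back into a candidate pose is the remaining glue.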

What carries the argument

The future-view imagination module that generates semantically meaningful candidate viewpoints from human navigation preferences, paired with VLM-based selection of the most informative view and a hierarchical selective foveation memory for long-term spatial reasoning.
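The memory component is named but not unpacked here. Below is a toy sketch of one way a sparse-to-dense keyframe store could behave, keeping recent frames dense and thinning older ones; `KeyframeMemory` is a hypothetical simplification, not the paper's selective foveation mechanism.

```python
from collections import deque

class KeyframeMemory:
    """Toy sparse-to-dense store: the newest frames sit densely in a ring
    buffer; frames evicted from it survive only at every
    `sparse_stride`-th eviction. Hypothetical simplification, not the
    paper's selective foveation mechanism."""

    def __init__(self, dense_size: int = 8, sparse_stride: int = 4):
        self.dense = deque(maxlen=dense_size)   # recent frames, kept in full
        self.sparse = []                        # thinned older history
        self.stride = sparse_stride
        self._evicted = 0

    def add(self, frame) -> None:
        if len(self.dense) == self.dense.maxlen:
            oldest = self.dense[0]              # about to fall out of the ring
            if self._evicted % self.stride == 0:
                self.sparse.append(oldest)      # promote to sparse history
            self._evicted += 1
        self.dense.append(frame)

    def snapshot(self) -> list:
        """Frames handed to the planner: sparse history, then dense recency."""
        return self.sparse + list(self.dense)

mem = KeyframeMemory(dense_size=4, sparse_stride=3)
for t in range(20):
    mem.add(f"frame_{t}")
print(mem.snapshot())   # a thinned prefix plus the four newest frames
```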

If this is right

  • Navigation decisions reduce to repeated selection of the single most informative imagined image rather than complex geometric planning.
  • Long-horizon goal-directed tasks decompose into sequences of shorter point-goal navigation subtasks.
  • Scene imagination combined with selective memory enables effective spatial reasoning from onboard visual streams alone.
  • The approach achieves state-of-the-art results in mapless settings while exceeding many methods that rely on explicit maps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same imagination-plus-selection loop could be applied to other embodied tasks such as object manipulation by generating future gripper views.
  • If the memory mechanism scales, it may reduce reliance on full environment mapping in changing or partially observed indoor spaces.
  • Visual prompting of this kind suggests VLMs can be steered toward latent spatial understanding without additional training on navigation data.

Load-bearing premise

The future-view imagination module accurately distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential.

What would settle it

Replacing the imagination module with random or static viewpoint generation and measuring whether performance on the open-vocabulary object and instance navigation benchmarks falls below that of competing mapless or map-based methods.
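A sketch of what that ablation harness could look like, holding the VLM selector and memory fixed inside the evaluation loop while swapping only the candidate generator. `random_viewpoints`, `uniform_viewpoints`, and the fabricated `evaluate` stub are illustrative assumptions; the learned Where2Imagine generator is deliberately not reproduced.

```python
import math
import random

def random_viewpoints(n: int = 8, radius: float = 2.0) -> list:
    """Random-bearing baseline: n candidate positions at a fixed radius."""
    return [(radius * math.cos(a), radius * math.sin(a))
            for a in (random.uniform(0.0, 2.0 * math.pi) for _ in range(n))]

def uniform_viewpoints(n: int = 8, radius: float = 2.0) -> list:
    """Static baseline: evenly spaced bearings (cf. the paper's Figure 7)."""
    return [(radius * math.cos(2.0 * math.pi * i / n),
             radius * math.sin(2.0 * math.pi * i / n))
            for i in range(n)]

def run_ablation(episodes, generators, evaluate) -> dict:
    """Success rate per candidate-viewpoint generator; the selector and
    memory stay fixed inside evaluate()."""
    return {name: sum(evaluate(ep, gen) for ep in episodes) / len(episodes)
            for name, gen in generators.items()}

if __name__ == "__main__":
    # Fabricated evaluate(): returns a coin flip so the demo runs end to end.
    demo_eval = lambda ep, gen: random.random() < 0.5
    print(run_ablation(range(100),
                       {"random": random_viewpoints,
                        "uniform": uniform_viewpoints},
                       demo_eval))
```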

Figures

Figures reproduced from arXiv: 2512.17435 by Changyin Sun, Teng Wang, Wenzhe Cai, Xinxin Zhao.

Figure 1: The comparison between the conventional LLM-based navigation pipeline and our ImagineNav++ pipeline. The traditional LLM-based navigation …
Figure 2: The overall pipeline of our mapless, open-vocabulary navigation framework ImagineNav++. At each iteration, the agent captures a panoramic view of its …
Figure 3: Visualization of human demonstration trajectories on the MP3D …
Figure 4: Visualization of the synthesized image observations at future navigation …
Figure 5: Where2Imagine: the impact of different sampling intervals T on ObjectNav performance on HM3D dataset.
Figure 6: Comparison of trajectories at different sampling steps T. …
Figure 7: The visualization of the relative pose predicted by our Where2Imagine module and uniform sampling (radius: 2.0m; angular interval: …
Figure 8: Visualization of our hierarchical keyframe-based memory, depicting RGB observations and their corresponding top-down map positions for selected …
Figure 9: Complete prompt input and decision output of the Vision-Language model for exploration direction selection in the object goal navigation task.
Figure 10: Analysis of object-goal navigation trajectories: success vs. failure examples. The top and bottom rows compare top-down paths from successful and …
Original abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ImagineNav++, a framework for mapless embodied navigation that prompts VLMs with imagined future scene views from candidate positions. It features a future-view imagination module trained to distill human navigation preferences for generating informative viewpoints, a VLM-based selection of the best view, and a selective foveation memory for maintaining spatial consistency across observations. The approach converts long-horizon navigation into point-goal tasks and reports state-of-the-art results on open-vocabulary object and instance navigation benchmarks, outperforming most map-based methods.

Significance. Should the empirical claims hold under scrutiny, the work would be significant for the field of embodied AI and robotics. It offers a novel way to harness VLMs' visual reasoning capabilities for navigation without explicit mapping, which could lead to more efficient and scalable solutions for home-assistance robots performing long-horizon tasks. The emphasis on scene imagination as a bridge between textual planning limitations and spatial perception is a promising direction.

major comments (2)
  1. The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.
  2. While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.
minor comments (1)
  1. The phrase 'selective foveation memory mechanism' is introduced without a brief explanation or reference; a short clarification would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We are pleased that the referee recognizes the potential significance of our work in embodied AI. Below, we address each major comment in detail, providing clarifications and indicating revisions to be made in the updated manuscript.

Point-by-point responses
  1. Referee: The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.

    Authors: We agree that isolating the contribution of the future-view imagination module is important for validating our claims. While the current manuscript demonstrates the overall effectiveness through end-to-end navigation results and qualitative visualizations of imagined views, we acknowledge the absence of specific quantitative metrics for viewpoint quality and direct ablations against random or heuristic selections. In the revised manuscript, we will include additional ablation studies comparing the preference-distilled module to random view generation and heuristic baselines (e.g., based on depth or saliency). We will also report human preference alignment scores using the validation set from the distillation process. This will help attribute the performance gains specifically to the imagination component.
    revision: yes

  2. Referee: While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.

    Authors: We appreciate this feedback on strengthening the empirical evaluation. The manuscript does include comparisons against several mapless and map-based baselines on standard benchmarks, with results aggregated over multiple runs. However, to address the concerns, we will expand the experimental section with detailed per-task breakdowns (e.g., success rates for different object categories and scene types), additional ablation studies on the memory and VLM components, and full experimental protocols including hyperparameters and training details. We will also add statistical significance tests, such as paired t-tests or Wilcoxon tests (a minimal sketch follows below), to the performance comparisons. These additions will be included in the revised version to provide a more robust assessment of the SOTA claims.
    revision: yes
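(Editorial sketch, not part of the simulated rebuttal: a minimal example of the paired tests named above, run with SciPy on fabricated per-episode success indicators. Neither the numbers nor the choice between the two tests comes from the paper.)

```python
from scipy import stats

# Fabricated per-episode success indicators (1 = success) for two methods
# evaluated on the same episode set; pairing is per episode.
ours     = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
baseline = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Wilcoxon signed-rank test on paired differences. zero_method="zsplit"
# keeps tied episodes (both methods agree) in the ranking instead of
# discarding them.
stat, p = stats.wilcoxon(ours, baseline, zero_method="zsplit")
print(f"Wilcoxon: statistic={stat:.1f}, p={p:.3f}")

# Paired t-test over the same pairs, as the rebuttal's alternative.
t, p_t = stats.ttest_rel(ours, baseline)
print(f"paired t-test: t={t:.2f}, p={p_t:.3f}")
```

For strictly binary per-episode outcomes, an exact McNemar test would be a natural further alternative.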

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes an empirical navigation framework (ImagineNav++) that combines a trained future-view imagination module, VLM-based view selection, and a selective memory mechanism. All performance claims are grounded in external benchmark evaluations on open-vocabulary navigation tasks rather than any internal derivation that reduces to fitted parameters or self-defined quantities by construction. No equations or predictions are shown to be equivalent to their inputs; the method relies on pre-trained VLMs and reported SOTA results, consistent with independent external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the imagination module can produce useful views and that VLMs can perform spatial selection from them; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: VLMs can perform effective spatial reasoning and viewpoint selection from imagined future images
    Invoked when the framework translates navigation planning into best-view image selection.

pith-pipeline@v0.9.0 · 5568 in / 1143 out tokens · 78873 ms · 2026-05-16T21:05:39.672838+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 8 internal anchors
