Recognition: 2 theorem links
ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
Pith reviewed 2026-05-16 21:05 UTC · model grok-4.3
The pith
Vision-language models navigate without maps by imagining future views from candidate positions and selecting the best one.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompting a vision-language model with imagined future observation images turns navigation planning into a tractable best-view selection task. Combined with a selective foveation memory that preserves spatial consistency, this yields state-of-the-art mapless performance on open-vocabulary object and instance navigation, surpassing most map-based alternatives.
What carries the argument
The future-view imagination module that generates semantically meaningful candidate viewpoints from human navigation preferences, paired with VLM-based selection of the most informative view and a hierarchical selective foveation memory for long-term spatial reasoning.
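To make the loop concrete, here is a minimal sketch of one navigation step as described above: propose candidate poses, imagine a view for each, let the VLM pick the most informative one, then hand the winner to a point-goal controller. The helper names (generate_candidate_viewpoints, imagine_view, query_vlm_best_view, point_goal_controller) are placeholders for illustration, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading) in the robot's local frame


@dataclass
class Candidate:
    pose: Pose
    imagined_view: Any  # rendered RGB image imagined for this pose


def navigation_step(current_obs: Any,
                    goal_text: str,
                    generate_candidate_viewpoints: Callable[[Any], List[Pose]],
                    imagine_view: Callable[[Any, Pose], Any],
                    query_vlm_best_view: Callable[[str, List[Any]], int],
                    point_goal_controller: Any) -> Pose:
    """One iteration of the imagine-then-select loop (illustrative sketch only)."""
    # 1. Propose semantically promising candidate poses around the robot.
    poses = generate_candidate_viewpoints(current_obs)

    # 2. Imagine (render) a future observation for each candidate pose.
    candidates = [Candidate(p, imagine_view(current_obs, p)) for p in poses]

    # 3. Visual prompting: the VLM sees the goal text plus all imagined views
    #    and returns the index of the most informative one. No map is built.
    best = query_vlm_best_view(goal_text, [c.imagined_view for c in candidates])

    # 4. The long-horizon task reduces to a short point-goal subtask.
    point_goal_controller.navigate_to(candidates[best].pose)
    return candidates[best].pose
```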
If this is right
- Navigation decisions reduce to repeated selection of the single most informative imagined image rather than complex geometric planning.
- Long-horizon goal-directed tasks decompose into sequences of shorter point-goal navigation subtasks.
- Scene imagination combined with selective memory enables effective spatial reasoning from onboard visual streams alone.
- The approach achieves state-of-the-art results in mapless settings while exceeding many methods that rely on explicit maps.
Where Pith is reading between the lines
- The same imagination-plus-selection loop could be applied to other embodied tasks such as object manipulation by generating future gripper views.
- If the memory mechanism scales, it may reduce reliance on full environment mapping in changing or partially observed indoor spaces.
- Visual prompting of this kind suggests VLMs can be steered toward latent spatial understanding without additional training on navigation data.
Load-bearing premise
The future-view imagination module accurately distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential.
What would settle it
Replacing the imagination module with random or static viewpoint generation and measuring whether performance on the open-vocabulary object and instance navigation benchmarks falls below that of competing mapless or map-based methods.
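A sketch of how that ablation could be wired, assuming the agent constructor accepts a pluggable viewpoint generator and the benchmark harness exposes an evaluate(agent, episodes) call returning per-episode success flags; both assumptions are ours, not the paper's.

```python
import random
import statistics


def random_viewpoint_generator(observation, num_candidates=8, max_radius=2.0):
    """Ablation baseline: candidate poses at random ranges and bearings."""
    return [(random.uniform(0.5, max_radius), random.uniform(0.0, 360.0))
            for _ in range(num_candidates)]


def static_viewpoint_generator(observation, num_candidates=8, radius=1.5):
    """Ablation baseline: fixed, evenly spaced bearings with no learned preference."""
    return [(radius, i * 360.0 / num_candidates) for i in range(num_candidates)]


def ablate_viewpoint_generators(build_agent, evaluate, episodes, generators):
    """Swap only the viewpoint generator; keep the VLM selector and memory fixed."""
    results = {}
    for name, generator in generators.items():
        agent = build_agent(viewpoint_generator=generator)
        successes = evaluate(agent, episodes)       # e.g. [1, 0, 1, ...] per episode
        results[name] = statistics.mean(successes)  # success rate for this variant
    return results
```

If the learned, preference-distilled generator does not clearly beat these degraded variants on the same benchmarks, the reported gains would have to be credited to the VLM selector or the memory instead.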
Original abstract
Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry--critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
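The abstract only gestures at how the selective foveation memory works, so the following is a guess at the sparse-to-dense idea rather than the authors' design: keep a dense window of recent frames for local consistency plus a sparse set of visually novel keyframes for long-term context. The embedding inputs and the novelty threshold are assumptions.

```python
from collections import deque

import numpy as np


class KeyframeMemory:
    """Toy sparse-to-dense memory: dense recent frames plus sparse long-term keyframes."""

    def __init__(self, recent_size: int = 8, novelty_threshold: float = 0.8):
        self.recent = deque(maxlen=recent_size)   # dense: the last few observations
        self.keyframes = []                       # sparse: visually novel frames only
        self.novelty_threshold = novelty_threshold

    def update(self, embedding: np.ndarray) -> None:
        """Store a frame embedding; promote it to a keyframe if it looks novel."""
        self.recent.append(embedding)
        if not self.keyframes:
            self.keyframes.append(embedding)
            return
        sims = [float(np.dot(embedding, k)
                      / (np.linalg.norm(embedding) * np.linalg.norm(k) + 1e-8))
                for k in self.keyframes]
        if max(sims) < self.novelty_threshold:    # dissimilar to everything stored
            self.keyframes.append(embedding)

    def context(self) -> list:
        """Frames handed to the VLM prompt: sparse keyframes plus the dense recent window."""
        return list(self.keyframes) + list(self.recent)
```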
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ImagineNav++, a framework for mapless embodied navigation that prompts VLMs by imagining future scene views from candidate positions. It features a future-view imagination module trained to distill human navigation preferences for generating informative viewpoints, a VLM-based selection of the best view, and a selective foveation memory for maintaining spatial consistency across observations. The approach converts long-horizon navigation into point-goal tasks and reports state-of-the-art results on open-vocabulary object and instance navigation benchmarks, outperforming most map-based methods.
Significance. Should the empirical claims hold under scrutiny, the work would be significant for the field of embodied AI and robotics. It offers a novel way to harness VLMs' visual reasoning capabilities for navigation without explicit mapping, which could lead to more efficient and scalable solutions for home-assistance robots performing long-horizon tasks. The emphasis on scene imagination as a bridge between textual planning limitations and spatial perception is a promising direction.
major comments (2)
- The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.
- While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.
minor comments (1)
- The phrase 'selective foveation memory mechanism' is introduced without a brief explanation or reference; a short clarification would help readers.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We are pleased that the referee recognizes the potential significance of our work in embodied AI. Below, we address each major comment in detail, providing clarifications and indicating revisions to be made in the updated manuscript.
Point-by-point responses
Referee: The central innovation rests on the future-view imagination module accurately distilling human navigation preferences into semantically meaningful viewpoints with high exploration potential. However, the manuscript does not provide quantitative evidence, such as metrics for viewpoint quality, human preference alignment scores, or ablation studies that isolate the module's contribution by comparing against random or heuristic candidate views. This validation is load-bearing for attributing the reported SOTA performance to the imagination component rather than the VLM selector or memory alone.
Authors: We agree that isolating the contribution of the future-view imagination module is important for validating our claims. While the current manuscript demonstrates the overall effectiveness through end-to-end navigation results and qualitative visualizations of imagined views, we acknowledge the absence of specific quantitative metrics for viewpoint quality and direct ablations against random or heuristic selections. In the revised manuscript, we will include additional ablation studies comparing the preference-distilled module to random view generation and heuristic baselines (e.g., based on depth or saliency). We will also report human preference alignment scores using the validation set from the distillation process. This will help attribute the performance gains specifically to the imagination component. Revision: yes.
Referee: While the abstract claims SOTA performance in mapless settings surpassing most map-based methods, the lack of detailed baseline comparisons, ablation studies, and full experimental protocols makes it difficult to assess the robustness of these claims. Specific tables or figures showing per-task breakdowns and statistical significance would strengthen the evidence.
Authors: We appreciate this feedback on strengthening the empirical evaluation. The manuscript does include comparisons against several mapless and map-based baselines on standard benchmarks, with results aggregated over multiple runs. However, to address the concerns, we will expand the experimental section with detailed per-task breakdowns (e.g., success rates for different object categories and scene types), additional ablation studies on the memory and VLM components, and full experimental protocols including hyperparameters and training details. We will also add statistical significance tests, such as paired t-tests or Wilcoxon tests, to the performance comparisons. These additions will be included in the revised version to provide a more robust assessment of the SOTA claims. Revision: yes.
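For the promised significance tests, a paired design over the shared episode set is the natural fit. The sketch below assumes per-episode success indicators (1/0) for two methods evaluated on the same episodes and uses scipy.stats, which provides both the paired t-test and the Wilcoxon signed-rank test.

```python
import numpy as np
from scipy import stats


def paired_comparison(success_a, success_b):
    """Compare two methods on the same episodes (1 = success, 0 = failure)."""
    a = np.asarray(success_a, dtype=float)
    b = np.asarray(success_b, dtype=float)

    # Paired t-test on per-episode outcome differences.
    _, t_p = stats.ttest_rel(a, b)

    # Wilcoxon signed-rank test; with 0/1 outcomes many differences are zero,
    # so zero_method="zsplit" splits ties instead of discarding them.
    _, w_p = stats.wilcoxon(a, b, zero_method="zsplit")

    return {
        "success_rate_a": float(a.mean()),
        "success_rate_b": float(b.mean()),
        "paired_t_p_value": float(t_p),
        "wilcoxon_p_value": float(w_p),
    }
```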
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes an empirical navigation framework (ImagineNav++) that combines a trained future-view imagination module, VLM-based view selection, and a selective memory mechanism. All performance claims are grounded in external benchmark evaluations on open-vocabulary navigation tasks rather than any internal derivation that reduces to fitted parameters or self-defined quantities by construction. No equations or predictions are shown to be equivalent to their inputs; the method relies on pre-trained VLMs and reported SOTA results, consistent with independent external validation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLMs can perform effective spatial reasoning and viewpoint selection from imagined future images.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.