pith. machine review for the scientific record.

arxiv: 2402.15852 · v7 · submitted 2024-02-24 · 💻 cs.CV · cs.RO

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Pith reviewed 2026-05-18 04:50 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords vision-and-language navigation · video-based VLM · embodied AI · Sim2Real transfer · monocular RGB navigation · action planning · instruction following

The pith

A video-based vision-language model navigates unseen environments by outputting next actions from a raw monocular RGB video stream alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NaVid as a way to close the generalization gap in vision-and-language navigation, both between simulation and reality and across different scenes. It formulates the task so that a large vision-language model receives only a continuous stream of RGB images from one camera and directly predicts the next discrete action while following language instructions. This video-based design encodes past observations as spatio-temporal context and avoids the noise and domain gaps introduced by maps, depth sensors, or odometers. Training combines 510k navigation samples (action-planning and instruction-reasoning) with 763k web-scale examples, and experiments report state-of-the-art results in both simulated and physical environments along with strong cross-dataset and Sim2Real transfer. A sympathetic reader would care because the method simplifies the sensor stack required for reliable embodied agents.

Core claim

NaVid is a video-based large vision language model that takes an on-the-fly monocular RGB video stream and produces the next-step navigation action. It reaches state-of-the-art performance in simulation and real-world settings without maps, odometers or depth inputs, and it shows superior cross-dataset and Sim2Real transfer by using spatio-temporal context from historical frames to support instruction following.

What carries the argument

NaVid, the video-based VLM that directly maps a continuous RGB video stream to the next discrete action while treating past frames as spatio-temporal context for decision making.
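
The control loop this describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four-action set is the one commonly used in continuous-environment VLN benchmarks and is assumed here, and `Policy`, `dummy_policy`, and `run_episode` are hypothetical names standing in for the actual model and robot interface.

```python
from enum import Enum
from typing import Callable, List, Sequence

Frame = bytes  # placeholder for one RGB image; a real system would use an array


class Action(Enum):
    # Assumed action set (common in continuous VLN benchmarks), not taken from the paper.
    FORWARD = "forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


# Hypothetical policy signature: (instruction, all frames so far) -> next action.
Policy = Callable[[str, Sequence[Frame]], Action]


def run_episode(instruction: str, camera: Callable[[], Frame],
                policy: Policy, max_steps: int = 100) -> List[Action]:
    """Closed-loop navigation that uses only the RGB frame history as state."""
    history: List[Frame] = []
    actions: List[Action] = []
    for _ in range(max_steps):
        history.append(camera())               # newest monocular RGB frame
        action = policy(instruction, history)  # no map, depth, or odometry is passed in
        actions.append(action)
        if action is Action.STOP:
            break
    return actions


def dummy_policy(instruction: str, frames: Sequence[Frame]) -> Action:
    """Stand-in for the VLM: move forward for three frames, then stop."""
    return Action.FORWARD if len(frames) <= 3 else Action.STOP


trace = run_episode("walk down the hall and stop", lambda: b"", dummy_policy)
```

The point of the sketch is the interface: the only inputs to the policy are the instruction and the accumulated frames, so any spatial memory has to come from the video history itself.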

If this is right

  • Navigation agents can operate in continuous environments without maintaining explicit maps or relying on depth or odometer readings.
  • Historical video frames supply useful spatio-temporal context that improves both action planning and language instruction adherence.
  • Removing map and depth inputs reduces the Sim2Real gap caused by sensor noise or domain shift in those modalities.
  • The same video-based formulation supports stronger cross-dataset generalization than map-centric or depth-centric baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The video-only approach may allow navigation on simpler robot platforms that lack depth sensors or reliable odometry.
  • Combining navigation data with large web corpora could scale instruction understanding for longer or more abstract commands.
  • The same video-context mechanism might transfer to other embodied tasks such as manipulation where continuous visual history is available.

Load-bearing premise

A VLM trained on collected navigation trajectories and web-scale data can reliably generalize action planning to completely unseen real-world environments using only raw RGB video without auxiliary sensors or explicit mapping.

What would settle it

Deploy NaVid in a real-world indoor or outdoor space whose layout, lighting, and obstacles differ substantially from both the training trajectories and the web data, then measure whether success rate and instruction-following accuracy remain at the reported state-of-the-art level.

Original abstract

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NaVid, a video-based large vision-language model for vision-and-language navigation (VLN). It claims to achieve state-of-the-art navigation performance in both simulation and real-world settings by using only raw monocular RGB video streams to output next-step actions, without maps, odometers, or depth sensors. The model is trained on 510k navigation samples from continuous environments plus 763k web-scale data and is reported to show strong cross-dataset generalization and Sim2Real transfer.

Significance. If the empirical claims are substantiated, the work would be significant for embodied AI by showing that large VLMs can perform reliable spatio-temporal reasoning for navigation from video alone. This could reduce hardware complexity and address long-standing Sim2Real gaps in VLN. The video-based encoding of historical observations as context is a clear strength that aligns with human-like navigation.

major comments (2)
  1. [Abstract] The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.
  2. [Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.
minor comments (1)
  1. [Abstract] The final sentence ('We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field') uses overly broad language that could be revised to a more precise statement of contributions.
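
For readers unfamiliar with the metrics the first major comment asks for: success rate (SR) and Success weighted by Path Length (SPL) are conventionally computed per episode from a success flag, the shortest-path length to the goal, and the length of the path the agent actually travelled. The sketch below uses that standard definition. The `Episode` container and the sample numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the agent stop within the success radius of the goal?
    shortest_path: float   # shortest start-to-goal distance, in metres
    agent_path: float      # distance the agent actually travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative episodes only (not data from the paper).
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),  # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),  # success with a 2x detour
    Episode(success=False, shortest_path=8.0, agent_path=5.0),   # failure contributes 0
]
sr = success_rate(episodes)
spl_value = spl(episodes)
```

Because SPL discounts successful episodes by path efficiency, it can never exceed SR, which is why the two are usually reported together.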

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will incorporate in the revised version.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.

    Authors: We agree that including quantitative metrics in the abstract would allow readers to immediately gauge the scale of the reported improvements. In the revised manuscript, we will update the abstract to incorporate key results such as success rates, SPL values, and cross-dataset transfer metrics from our experiments, with explicit references to the relevant tables and figures. revision: yes

  2. Referee: [Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.

    Authors: We recognize the importance of explicitly demonstrating scene disjointness to support the Sim2Real and generalization claims. The current manuscript describes training on 510k samples from continuous environments and real-world testing in separate settings, but does not detail novelty controls. In the revision, we will add a subsection to the real-world evaluation that specifies the environment selection criteria, including any visual similarity metrics or out-of-distribution checks used to confirm that test scenes were disjoint from the training trajectories. revision: yes
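
One simple form such a novelty control could take is a nearest-neighbour check in an image-embedding space: each test-scene embedding is compared against all training embeddings, and scenes whose closest match exceeds a similarity threshold are flagged for inspection. The sketch below is an assumption about what such a check might look like, not a procedure described in the paper. The toy embeddings, the 0.9 threshold, and the function names are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_train_similarity(test_vec: Vector, train_vecs: Sequence[Vector]) -> float:
    """Similarity of a test embedding to its nearest training embedding."""
    return max(cosine(test_vec, t) for t in train_vecs)


def flag_overlapping(test_vecs: Sequence[Vector], train_vecs: Sequence[Vector],
                     threshold: float = 0.9) -> List[int]:
    """Indices of test embeddings that look too similar to some training embedding."""
    return [i for i, v in enumerate(test_vecs)
            if max_train_similarity(v, train_vecs) >= threshold]


# Toy 3-D embeddings (hypothetical); real ones would come from an image encoder.
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
test = [[0.99, 0.05, 0.0],   # nearly identical to the first training embedding
        [0.0, 0.0, 1.0]]     # orthogonal to everything in training
flagged = flag_overlapping(test, train)
```

A check of this kind only measures visual similarity under the chosen encoder. It would not by itself rule out overlap in instruction style, which the referee also raises.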

Circularity Check

0 steps flagged

No circularity: empirical VLM training on external data with benchmark evaluation

Full rationale

The paper presents NaVid as a video-based VLM trained on 510k navigation samples from continuous environments plus 763k web-scale data, then evaluated for next-step action prediction in VLN tasks. All performance claims rest on standard simulation benchmarks and real-world tests using raw RGB video input. No equations, derivations, or first-principles results are defined; there are no fitted parameters renamed as predictions, no self-definitional constructs, and no load-bearing self-citations that reduce the central claims to the authors' own prior unverified results. The work is self-contained against external datasets and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of scaling a standard VLM architecture with mixed navigation and web data; no new physical laws or mathematical derivations are introduced.

free parameters (1)
  • Training data composition
    510k navigation samples plus 763k web examples are selected to balance instruction following and visual grounding.
axioms (1)
  • domain assumption: A pre-trained VLM can be fine-tuned to map video sequences plus language to discrete navigation actions.
    Invoked when stating that the model outputs next-step actions directly from RGB video.
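
For a sense of scale, the stated data mixture works out as follows. The two counts are the rounded figures given in the abstract. The shares are simple proportions of the pooled data and say nothing about how the authors actually sample or weight the two sources during training, which the summary here does not state.

```python
NAV_SAMPLES = 510_000  # navigation samples (rounded figure from the abstract)
WEB_SAMPLES = 763_000  # web-scale samples (rounded figure from the abstract)

total = NAV_SAMPLES + WEB_SAMPLES
nav_share = NAV_SAMPLES / total
web_share = WEB_SAMPLES / total

print(f"total samples:    {total:,}")       # 1,273,000
print(f"navigation share: {nav_share:.1%}")  # 40.1%
print(f"web share:        {web_share:.1%}")  # 59.9%
```

By raw count, then, navigation data is the minority of the pool, at roughly two samples in five.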

pith-pipeline@v0.9.0 · 5825 in / 1210 out tokens · 29935 ms · 2026-05-18T04:50:43.377659+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

    cs.RO 2026-04 unverdicted novelty 7.0

    RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.

  2. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  3. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

    cs.RO 2026-03 conditional novelty 7.0

    VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

  4. Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

    cs.CV 2026-02 unverdicted novelty 7.0

    GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.

  5. PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.

  6. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  7. FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

    cs.RO 2026-04 unverdicted novelty 6.0

    FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

  8. FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

    cs.CV 2026-04 unverdicted novelty 6.0

    FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.

  9. Visually-grounded Humanoid Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  10. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  11. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  12. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  13. Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

    cs.RO 2025-11 unverdicted novelty 6.0

    Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.

  14. C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

    cs.RO 2025-10 unverdicted novelty 6.0

    C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...

  15. R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.

  16. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  17. LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

    cs.CV 2026-05 conditional novelty 5.0

    LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...

  18. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · cited by 18 Pith papers · 18 internal anchors

  1. [1]

    Are We Mak- ing Real Progress in Simulated Environments? Measur- ing the Sim2Real Gap in Embodied Visual Navigation

    Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Are We Mak- ing Real Progress in Simulated Environments? Measur- ing the Sim2Real Gap in Embodied Visual Navigation. In arXiv:1912.06321, 2019

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Bevbert: Topo-metric map pre-training for language-guided navigation

    Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Topo-metric map pre-training for language-guided navigation. arXiv preprint arXiv:2212.04385, 2022

  4. [4]

    Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments

    Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments. arXiv preprint arXiv:2304.03047, 2023

  5. [6]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 , 2018

  6. [7]

    Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. In Pro- ceedings of the IEEE conference on computer vision and pattern recognition , pages 3674–3683, 2018

  7. [9]

    Sim-to-real transfer for vision-and-language navigation

    Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Ste- fan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning , pages 671–681. PMLR, 2021

  8. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  9. [11]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  10. [12]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV) , pages 667–676. IEEE, 2017

  11. [13]

    Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments

    Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 12538–12547, 2019

  12. [14]

    Mapgpt: Map-guided prompting for unified vision-and-language navigation

    Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314 , 2024

  13. [15]

    Topological planning with transformers for vision-and-language navigation

    Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 11276– 11286, 2021

  14. [16]

    Weakly-supervised multi-granularity map learning for vision-and-language navigation

    Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. arXiv preprint arXiv:2210.07506, 2022

  15. [17]

    Action-aware zero-shot robot navigation by ex- ploiting vision-and-language ability of foundation mod- els

    Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. Action-aware zero-shot robot navigation by ex- ploiting vision-and-language ability of foundation mod- els. arXiv preprint arXiv:2308.07997 , 2023

  16. [18]

    History aware multimodal transformer for vision-and-language navigation

    Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in Neural Information Processing Systems , 34:5834–5847, 2021

  17. [19]

    Think global, act local: Dual-scale graph transformer for vision-and- language navigation

    Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and- language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 16537–16547, 2022

  18. [20]

    Uniter: Universal image-text represen- tation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text represen- tation learning. In European conference on computer vision, pages 104–120. Springer, 2020

  19. [21]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023

  20. [22]

    Toward next-generation learned robot manipulation

    Jinda Cui and Jeff Trinkle. Toward next-generation learned robot manipulation. Science robotics , 6(54): eabd9461, 2021

  21. [23]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500 , 2, 2023

  22. [24]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  23. [25]

    A survey on long text modeling with transform- ers

    Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. A survey on long text modeling with transform- ers. arXiv preprint arXiv:2302.14502 , 2023

  24. [26]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 , 2023

  25. [27]

    Speaker-follower models for vision-and- language navigation

    Daniel Fried, Ronghang Hu, V olkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and- language navigation. Advances in Neural Information Processing Systems, 31, 2018

  26. [28]

    Drive like a human: Rethinking autonomous driving with large language models

    Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages 910–919, 2024

  27. [29]

    Counterfactual vision-and-language navigation via adversarial path sampler

    Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision , pages 71–86. Springer, 2020

  28. [30]

    Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

    Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Il- harco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023

  29. [31]

    Cross-modal map learning for vi- sion and language navigation

    Georgios Georgakis, Karl Schmeckpeper, Karan Wan- choo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vi- sion and language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15460–15470, 2022

  30. [32]

    Vision-and-language navigation: A survey of tasks, methods, and future directions

    Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 7606–7623, 2022

  31. [33]

    Airbert: In-domain pretraining for vision-and-language navigation

    Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021

  32. [34]

    Towards learning a generic agent for vision-and-language navigation via pre-training

    Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13137–13146, 2020

  33. [35]

    Language and visual entity rela- tionship graph for agent navigation

    Yicong Hong, Cristian Rodriguez, Yuankai Qi, Qi Wu, and Stephen Gould. Language and visual entity rela- tionship graph for agent navigation. Advances in Neural Information Processing Systems , 33, 2020

  34. [36]

    A recurrent vision-and- language bert for navigation

    Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez- Opazo, and Stephen Gould. A recurrent vision-and- language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1653, June 2021

  35. [37]

    Bridging the gap between learning in discrete and continuous environments for vision-and-language navi- gation

    Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navi- gation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15439–15449, 2022

  36. [38]

    Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning

    Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023

  37. [39]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tomp- son, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning , pages 1769–1782. PMLR, 2023

  38. [40]

    Sasra: Semantically- aware spatio-temporal reasoning agent for vision-and- language navigation in continuous environments

    Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, and Rakesh Kumar. Sasra: Semantically- aware spatio-temporal reasoning agent for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2108.11945, 2021

  39. [41]

    A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning

    Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Ja- son Baldridge, and Zarana Parekh. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. arXiv preprint arXiv:2210.03112, 2022

  40. [42]

    Tactical rewind: Self-correction via backtracking in vision-and-language navigation

    Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6741–6749, 2019

  41. [43]

    Extending regular expressions with context operators and parse extraction

    Steven M Kearns. Extending regular expressions with context operators and parse extraction. Software: Practice and Experience, 21(8):787–804, 1991

  42. [44]

    Sim-2-sim transfer for vision-and-language navigation in continuous environments

    Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 588–603. Springer, 2022

  43. [45]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–

  44. [46]

    Beyond the nav-graph: Vision and language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer Vision (ECCV), 2020

  45. [47]

    Waypoint models for instruction-guided navigation in continuous environments

    Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021

  46. [48]

    Iterative vision-and-language navigation

    Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason. Iterative vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14921–14930, 2023

  47. [50]

    Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020

  48. [51]

    Openfm-nav: Towards open-set zero-shot object navigation via vision-language foundation models

    Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfm-nav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670, 2024

  49. [52]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  50. [53]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019

  51. [54]

    Vision-Language Foundation Models as Effective Robot Imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  52. [55]

    Robust navigation with language pretraining and stochastic sampling

    Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244, 2019

  53. [56]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020

  54. [57]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023

  55. [58]

    Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation

    Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, and Xiaodan Liang. Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation. arXiv preprint arXiv:2306.10322, 2023

  56. [59]

    The development of llms for embodied navigation

    Jinzhou Lin, Han Gao, Rongtao Xu, Changwei Wang, Li Guo, and Shibiao Xu. The development of llms for embodied navigation. arXiv preprint arXiv:2311.00530, 2023

  57. [60]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  58. [61]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  59. [62]

    Efficient and consistent bundle adjustment on lidar point clouds

    Zheng Liu, Xiyuan Liu, and Fu Zhang. Efficient and consistent bundle adjustment on lidar point clouds. IEEE Transactions on Robotics, 2023

  60. [63]

    Discuss before moving: Visual language navigation via multi-expert discussions

    Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. arXiv preprint arXiv:2309.11382, 2023

  61. [64]

    Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

    Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019

  62. [65]

    The marathon 2: A navigation system

    Steve Macenski, Francisco Martín, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2718–

  63. [66]

    The marathon 2: A navigation system

    Steven Macenski, Francisco Martin, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

  64. [67]

    Improving vision-and-language navigation with image-text pairs from the web

    Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020

  65. [68]

    Zson: Zero-shot object-goal navigation using multimodal goal embeddings

    Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. arXiv preprint arXiv:2206.12403, 2022

  66. [69]

    Langnav: Language as a perceptual representation for navigation

    Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889, 2023

  67. [70]

    Visual language navigation: A survey and open challenges

    Sang-Min Park and Young-Gab Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023

  68. [71]

    Object-and-action aware model for visual language navigation

    Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. Object-and-action aware model for visual language navigation. In European Conference on Computer Vision, pages 303–317. Springer, 2020

  69. [72]

    Reverie: Remote embodied visual referring expression in real indoor environments

    Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020

  70. [73]

    Hop: History-and-order aware pre-training for vision-and-language navigation

    Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. Hop: History-and-order aware pre-training for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15418–15427, 2022

  71. [74]

    March in chat: Interactive prompting for remote embodied referring expression

    Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, and Qi Wu. March in chat: Interactive prompting for remote embodied referring expression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15758–15767, 2023

  72. [75]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  73. [76]

    Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and...

  74. [77]

    Poni: Potential functions for objectgoal navigation with interaction-free learning

    Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

  75. [78]

    Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments

    Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel X Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2109.15207, 2021

  76. [79]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  77. [80]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  78. [81]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

  79. [82]

    Velma: Verbalization embodiment of llm agents for vision and language navigation in street view

    Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. arXiv preprint arXiv:2307.06082, 2023

  80. [83]

    Fast marching methods

    James A Sethian. Fast marching methods. SIAM review, 41(2):199–235, 1999

Showing first 80 references.