arxiv: 2402.15852 · v7 · submitted 2024-02-24 · 💻 cs.CV · cs.RO

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang

Pith reviewed 2026-05-18 04:50 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords: vision-and-language navigation, video-based VLM, embodied AI, Sim2Real transfer, monocular RGB navigation, action planning, instruction following

The pith

A video-based vision-language model navigates unseen environments by outputting next actions from a raw monocular RGB video stream alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NaVid as a way to close the generalization gap in vision-and-language navigation between simulation and reality or across different scenes. It formulates the task so that a large vision-language model receives only a continuous stream of RGB images from one camera and directly predicts the next discrete action while following language instructions. This video-based design encodes past observations as spatio-temporal context and avoids the noise and domain gaps introduced by maps, depth sensors, or odometers. Training combines 510k navigation samples with 763k web-scale examples, and experiments report state-of-the-art results in both simulated and physical environments along with strong cross-dataset and Sim2Real transfer. A sympathetic reader would care because the method simplifies the sensor stack required for reliable embodied agents.

Core claim

NaVid is a video-based large vision language model that takes an on-the-fly monocular RGB video stream and produces the next-step navigation action. It reaches state-of-the-art performance in simulation and real-world settings without maps, odometers or depth inputs, and it shows superior cross-dataset and Sim2Real transfer by using spatio-temporal context from historical frames to support instruction following.

What carries the argument

NaVid, the video-based VLM that directly maps a continuous RGB video stream to the next discrete action while treating past frames as spatio-temporal context for decision making.
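
As a rough illustration of this formulation, the Python sketch below shows the control loop: each new RGB frame is appended to a growing history, and a policy maps the instruction plus all frames seen so far to one discrete action. The action names, the `VideoNavigator` class, and `dummy_policy` are hypothetical placeholders, not NaVid's actual interface; a real system would call the VLM where the dummy policy sits.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical discrete action set, chosen only for this illustration.
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

Frame = bytes  # stand-in for one monocular RGB image
Policy = Callable[[str, Sequence[Frame]], str]


@dataclass
class VideoNavigator:
    """Keeps the frame history and asks the policy for the next action."""
    policy: Policy
    instruction: str
    history: List[Frame] = field(default_factory=list)

    def step(self, frame: Frame) -> str:
        # The only observation is the new RGB frame: no map, depth, or odometry.
        self.history.append(frame)
        action = self.policy(self.instruction, self.history)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        return action


def dummy_policy(instruction: str, frames: Sequence[Frame]) -> str:
    # Placeholder for the VLM: move forward for three frames, then stop.
    return "STOP" if len(frames) >= 4 else "FORWARD"


def run_episode(nav: VideoNavigator, frames: Sequence[Frame],
                max_steps: int = 50) -> List[str]:
    """Feeds frames one at a time until the policy emits STOP."""
    actions: List[str] = []
    for frame in frames[:max_steps]:
        action = nav.step(frame)
        actions.append(action)
        if action == "STOP":
            break
    return actions


nav = VideoNavigator(dummy_policy, "walk to the door")
actions = run_episode(nav, [b"frame"] * 10)
```

The design point the sketch tries to capture is that the history of raw frames is the agent's only state, so any memory of where it has been must come from the video context itself.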

If this is right

Navigation agents can operate in continuous environments without maintaining explicit maps or relying on depth or odometer readings.
Historical video frames supply useful spatio-temporal context that improves both action planning and language instruction adherence.
Removing map and depth inputs reduces the Sim2Real gap caused by sensor noise or domain shift in those modalities.
The same video-based formulation supports stronger cross-dataset generalization than map-centric or depth-centric baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The video-only approach may allow navigation on simpler robot platforms that lack depth sensors or reliable odometry.
Combining navigation data with large web corpora could scale instruction understanding for longer or more abstract commands.
The same video-context mechanism might transfer to other embodied tasks such as manipulation where continuous visual history is available.

Load-bearing premise

A VLM trained on collected navigation trajectories and web-scale data can reliably generalize action planning to completely unseen real-world environments using only raw RGB video without auxiliary sensors or explicit mapping.

What would settle it

Deploy NaVid in a real-world indoor or outdoor space whose layout, lighting, and obstacles differ substantially from both the training trajectories and the web data, then measure whether success rate and instruction-following accuracy remain at the reported state-of-the-art level.

Original abstract

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NaVid, a video-based large vision-language model for vision-and-language navigation (VLN). It claims to achieve state-of-the-art navigation performance in both simulation and real-world settings by using only raw monocular RGB video streams to output next-step actions, without maps, odometers, or depth sensors. The model is trained on 510k navigation samples from continuous environments plus 763k web-scale data and is reported to show strong cross-dataset generalization and Sim2Real transfer.

Significance. If the empirical claims are substantiated, the work would be significant for embodied AI by showing that large VLMs can perform reliable spatio-temporal reasoning for navigation from video alone. This could reduce hardware complexity and address long-standing Sim2Real gaps in VLN. The video-based encoding of historical observations as context is a clear strength that aligns with human-like navigation.

major comments (2)

[Abstract] Abstract: The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.
[Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.
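
The first major comment asks for success rates and SPL values. For readers unfamiliar with these metrics, the sketch below computes both, taking SPL (Success weighted by Path Length) in its standard form from "On Evaluation of Embodied Navigation Agents" (Anderson et al., 2018): the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The `Episode` fields and the example numbers are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    success: bool            # S_i: did the agent stop close enough to the goal?
    shortest_path_m: float   # l_i: shortest start-to-goal path length, in meters
    agent_path_m: float      # p_i: length of the path the agent travelled, in meters


def success_rate(episodes: Sequence[Episode]) -> float:
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.shortest_path_m <= 0:
            raise ValueError("shortest path length must be positive")
        if e.success:
            # Failed episodes contribute zero; detours shrink the score.
            total += e.shortest_path_m / max(e.agent_path_m, e.shortest_path_m)
    return total / len(episodes)


# Made-up episodes: one optimal success, one success with a 2x detour, one failure.
episodes = [
    Episode(True, 10.0, 10.0),
    Episode(True, 10.0, 20.0),
    Episode(False, 10.0, 5.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Because SPL penalizes detours, it can never exceed the success rate, which is why the two are usually reported together.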

minor comments (1)

[Abstract] Abstract: The final sentence ('We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field') uses overly broad language that could be revised to a more precise statement of contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will incorporate in the revised version.

Referee: [Abstract] Abstract: The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.

Authors: We agree that including quantitative metrics in the abstract would allow readers to immediately gauge the scale of the reported improvements. In the revised manuscript, we will update the abstract to incorporate key results such as success rates, SPL values, and cross-dataset transfer metrics from our experiments, with explicit references to the relevant tables and figures.
revision: yes
Referee: [Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.

Authors: We recognize the importance of explicitly demonstrating scene disjointness to support the Sim2Real and generalization claims. The current manuscript describes training on 510k samples from continuous environments and real-world testing in separate settings, but does not detail novelty controls. In the revision, we will add a subsection to the real-world evaluation that specifies the environment selection criteria, including any visual similarity metrics or out-of-distribution checks used to confirm that test scenes were disjoint from the training trajectories.
revision: yes
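
To make the "visual similarity metrics" in the second exchange concrete, here is one possible shape such a check could take. It is a sketch under stated assumptions, not the authors' procedure: each frame is assumed to have already been turned into an embedding vector by some image encoder (which encoder is left open), and the 0.95 threshold is an arbitrary illustrative value. For every test-scene embedding it finds the highest cosine similarity to any training embedding and flags the test frames that look like near-duplicates.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def nearest_train_similarity(test: Sequence[Vector],
                             train: Sequence[Vector]) -> List[float]:
    """For each test embedding, the max cosine similarity to any training one."""
    if not train:
        raise ValueError("training set is empty")
    return [max(cosine(t, r) for r in train) for t in test]


def flag_near_duplicates(test: Sequence[Vector], train: Sequence[Vector],
                         threshold: float = 0.95) -> List[int]:
    """Indices of test embeddings suspiciously close to training data.

    The threshold is a hypothetical choice and would need calibration.
    """
    sims = nearest_train_similarity(test, train)
    return [i for i, s in enumerate(sims) if s >= threshold]


# Toy 2-D embeddings: the first test vector nearly matches a training vector.
train_embeddings = [[1.0, 0.0], [0.0, 1.0]]
test_embeddings = [[1.0, 0.01], [1.0, 1.0]]
flagged = flag_near_duplicates(test_embeddings, train_embeddings)
```

A check like this only addresses overlap in visual statistics; overlap in instruction style, which the referee also raises, would need a separate text-side comparison.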

Circularity Check

0 steps flagged

No circularity: empirical VLM training on external data with benchmark evaluation

The paper presents NaVid as a video-based VLM trained on 510k navigation samples from continuous environments plus 763k web-scale data, then evaluated for next-step action prediction in VLN tasks. All performance claims rest on standard simulation benchmarks and real-world tests using raw RGB video input. No equations, derivations, or first-principles results are defined; there are no fitted parameters renamed as predictions, no self-definitional constructs, and no load-bearing self-citations that reduce the central claims to the authors' own prior unverified results. The work is self-contained against external datasets and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of scaling a standard VLM architecture with mixed navigation and web data; no new physical laws or mathematical derivations are introduced.

free parameters (1)

Training data composition
510k navigation samples plus 763k web examples are selected to balance instruction following and visual grounding.
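
For a sense of scale, the short calculation below works out the share of each source in the combined training set from the two counts quoted above. It assumes every sample is weighted equally; the source does not say how the two sources are actually sampled during training, so the shares describe the raw pool only.

```python
# Sample counts quoted in the paper's abstract, in thousands.
counts_k = {"navigation": 510, "web": 763}

total_k = sum(counts_k.values())

# Share of each source, assuming uniform weighting (an assumption, see above).
shares = {name: n / total_k for name, n in counts_k.items()}

for name, share in shares.items():
    print(f"{name}: {counts_k[name]}k of {total_k}k samples ({share:.1%})")
```

Under that assumption, navigation data makes up roughly two fifths of the pool and web data the remaining three fifths.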

axioms (1)

domain assumption: A pre-trained VLM can be fine-tuned to map video sequences plus language to discrete navigation actions.
Invoked when stating that the model outputs next-step actions directly from RGB video.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
cs.RO 2026-04 unverdicted novelty 7.0

RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.
Towards Generalizable Robotic Manipulation in Dynamic Environments
cs.CV 2026-03 unverdicted novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
cs.RO 2026-03 conditional novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
cs.CV 2026-02 unverdicted novelty 7.0

GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
cs.RO 2026-05 unverdicted novelty 6.0

PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 6.0

SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO 2026-04 unverdicted novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
cs.CV 2026-04 unverdicted novelty 6.0

FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.
Visually-grounded Humanoid Agents
cs.CV 2026-04 unverdicted novelty 6.0

A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
cs.AI 2026-04 unverdicted novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO 2026-04 unverdicted novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
cs.CV 2026-03 unverdicted novelty 6.0

Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
cs.RO 2025-11 unverdicted novelty 6.0

Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
cs.RO 2025-10 unverdicted novelty 6.0

C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
cs.RO 2025-10 unverdicted novelty 6.0

R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO 2025-05 unverdicted novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation
cs.CV 2026-05 conditional novelty 5.0

LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · cited by 18 Pith papers · 18 internal anchors

[1]

Are We Mak- ing Real Progress in Simulated Environments? Measur- ing the Sim2Real Gap in Embodied Visual Navigation

Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Are We Mak- ing Real Progress in Simulated Environments? Measur- ing the Sim2Real Gap in Embodied Visual Navigation. In arXiv:1912.06321, 2019

work page arXiv 1912
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Bevbert: Topo-metric map pre-training for language-guided navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Topo-metric map pre-training for language-guided navigation. arXiv preprint arXiv:2212.04385, 2022

work page arXiv 2022
[4]

Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments. arXiv preprint arXiv:2304.03047, 2023

work page arXiv 2023
[6]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. In Pro- ceedings of the IEEE conference on computer vision and pattern recognition , pages 3674–3683, 2018

work page 2018
[9]

Sim-to-real transfer for vision-and-language navigation

Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Ste- fan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning , pages 671–681. PMLR, 2021

work page 2021
[10]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Matterport3d: Learning from rgb-d data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV) , pages 667–676. IEEE, 2017

work page 2017
[13]

Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments

Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 12538–12547, 2019

work page 2019
[14]

Mapgpt: Map-guided prompting for unified vision-and-language navigation

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314 , 2024

work page arXiv 2024
[15]

Topological planning with transformers for vision-and-language navigation

Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 11276– 11286, 2021

work page 2021
[16]

Weakly-supervised multi-granularity map learning for vision-and-language navigation

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. arXiv preprint arXiv:2210.07506, 2022

work page arXiv 2022
[17]

Action-aware zero-shot robot navigation by ex- ploiting vision-and-language ability of foundation mod- els

Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. Action-aware zero-shot robot navigation by ex- ploiting vision-and-language ability of foundation mod- els. arXiv preprint arXiv:2308.07997 , 2023

work page arXiv 2023
[18]

History aware multimodal transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in Neural Information Processing Systems , 34:5834–5847, 2021

work page 2021
[19]

Think global, act local: Dual-scale graph transformer for vision-and- language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and- language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 16537–16547, 2022

work page 2022
[20]

Uniter: Universal image-text represen- tation learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text represen- tation learning. In European conference on computer vision, pages 104–120. Springer, 2020

work page 2020
[21]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023

work page 2023
[22]

Toward next-generation learned robot manipulation

Jinda Cui and Jeff Trinkle. Toward next-generation learned robot manipulation. Science robotics , 6(54): eabd9461, 2021

work page 2021
[23]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500 , 2, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

A survey on long text modeling with transform- ers

Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. A survey on long text modeling with transform- ers. arXiv preprint arXiv:2302.14502 , 2023

work page arXiv 2023
[26]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Speaker-follower models for vision-and- language navigation

Daniel Fried, Ronghang Hu, V olkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and- language navigation. Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[28]

Drive like a human: Rethinking autonomous driving with large language models

Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages 910–919, 2024

work page 2024
[29]

Counterfactual vision-and-language navigation via adversarial path sampler

Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision , pages 71–86. Springer, 2020

work page 2020
[30]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Il- harco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023

work page 2023
[31]

Cross-modal map learning for vi- sion and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wan- choo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vi- sion and language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15460–15470, 2022

work page 2022
[32]

Vision-and-language navigation: A survey of tasks, methods, and future directions

Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 7606–7623, 2022

work page 2022
[33]

Airbert: In-domain pretraining for vision-and-language navigation

Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021

work page 2021
[34]

Towards learning a generic agent for vision-and-language navigation via pre-training

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13137–13146, 2020

work page 2020
[35]

Language and visual entity rela- tionship graph for agent navigation

Yicong Hong, Cristian Rodriguez, Yuankai Qi, Qi Wu, and Stephen Gould. Language and visual entity rela- tionship graph for agent navigation. Advances in Neural Information Processing Systems , 33, 2020

work page 2020
[36]

A recurrent vision-and- language bert for navigation

Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez- Opazo, and Stephen Gould. A recurrent vision-and- language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1653, June 2021

work page 2021
[37]

Bridging the gap between learning in discrete and continuous environments for vision-and-language navi- gation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navi- gation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15439–15449, 2022

work page 2022
[38]

Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning

Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023

work page arXiv 2023
[39]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tomp- son, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning , pages 1769–1782. PMLR, 2023

work page 2023
[40]

Sasra: Semantically- aware spatio-temporal reasoning agent for vision-and- language navigation in continuous environments

Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, and Rakesh Kumar. Sasra: Semantically- aware spatio-temporal reasoning agent for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2108.11945, 2021

work page arXiv 2021
[41]

A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning

Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Ja- son Baldridge, and Zarana Parekh. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. arXiv preprint arXiv:2210.03112, 2022

work page arXiv 2022
[42]

Tactical rewind: Self-correction via backtracking in vision-and-language navigation

Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6741–6749, 2019

work page 2019
[43]

Extending regular expressions with context operators and parse extraction

Steven M Kearns. Extending regular expressions with context operators and parse extraction. Software: Practice and Experience, 21(8):787–804, 1991

work page 1991
[44]

Sim-2-sim transfer for vision-and-language navigation in continuous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 588–603. Springer, 2022

work page 2022
[45]

Beyond the nav-graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–

work page
[46]

Beyond the nav-graph: Vision and language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer Vision (ECCV), 2020

work page 2020
[47]

Waypoint models for instruction-guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021

work page 2021
[48]

Iterative vision-and-language navigation

Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason. Iterative vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14921–14930, 2023

work page 2023
[50]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020

work page 2020
[51]

Openfm-nav: Towards open-set zero-shot object navigation via vision-language foundation models

Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfm-nav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670, 2024

work page arXiv 2024
[52]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908
[54]

Vision-Language Foundation Models as Effective Robot Imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Robust navigation with language pretraining and stochastic sampling

Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244, 2019

work page arXiv 1909
[56]

Oscar: Object-semantics aligned pre-training for vision-language tasks

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020

work page 2020
[57]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023

work page arXiv 2023
[58]

Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation

Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, and Xiaodan Liang. Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation. arXiv preprint arXiv:2306.10322, 2023

work page arXiv 2023
[59]

The development of llms for embodied navigation

Jinzhou Lin, Han Gao, Rongtao Xu, Changwei Wang, Li Guo, and Shibiao Xu. The development of llms for embodied navigation. arXiv preprint arXiv:2311.00530, 2023

work page arXiv 2023
[60]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

work page 2023
[62]

Efficient and consistent bundle adjustment on lidar point clouds

Zheng Liu, Xiyuan Liu, and Fu Zhang. Efficient and consistent bundle adjustment on lidar point clouds. IEEE Transactions on Robotics, 2023

work page 2023
[63]

Discuss before moving: Visual language navigation via multi-expert discussions

Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. arXiv preprint arXiv:2309.11382, 2023

work page arXiv 2023
[64]

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[65]

The marathon 2: A navigation system

Steve Macenski, Francisco Martín, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2718–

work page 2020
[66]

The marathon 2: A navigation system

Steven Macenski, Francisco Martin, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

work page 2020
[67]

Improving vision-and-language navigation with image-text pairs from the web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020

work page 2020
[68]

Zson: Zero-shot object-goal navigation using multimodal goal embeddings

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. arXiv preprint arXiv:2206.12403, 2022

work page arXiv 2022
[69]

Langnav: Language as a perceptual representation for navigation

Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889, 2023

work page arXiv 2023
[70]

Visual language navigation: A survey and open challenges

Sang-Min Park and Young-Gab Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023

work page 2023
[71]

Object-and-action aware model for visual language navigation

Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. Object-and-action aware model for visual language navigation. In European Conference on Computer Vision, pages 303–317. Springer, 2020

work page 2020
[72]

Reverie: Remote embodied visual referring expression in real indoor environments

Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020

work page 2020
[73]

Hop: History-and-order aware pre-training for vision-and-language navigation

Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. Hop: History-and-order aware pre-training for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15418–15427, 2022

work page 2022
[74]

March in chat: Interactive prompting for remote embodied referring expression

Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, and Qi Wu. March in chat: Interactive prompting for remote embodied referring expression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15758–15767, 2023

work page 2023
[75]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

work page 2018
[76]

Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and...

work page 2021
[77]

Poni: Potential functions for objectgoal navigation with interaction-free learning

Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

work page 2022
[78]

Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel X Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2109.15207, 2021

work page arXiv 2021
[79]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[80]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[81]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

work page 2019
[82]

Velma: Verbalization embodiment of llm agents for vision and language navigation in street view

Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. arXiv preprint arXiv:2307.06082, 2023

work page arXiv 2023
[83]

Fast marching methods

James A Sethian. Fast marching methods. SIAM review, 41(2):199–235, 1999

work page 1999

Showing first 80 references.