NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Pith reviewed 2026-05-18 04:50 UTC · model grok-4.3
The pith
A video-based vision-language model navigates unseen environments by outputting next actions from a raw monocular RGB video stream alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NaVid is a video-based large vision language model that takes an on-the-fly monocular RGB video stream and produces the next-step navigation action. It reaches state-of-the-art performance in simulation and real-world settings without maps, odometers or depth inputs, and it shows superior cross-dataset and Sim2Real transfer by using spatio-temporal context from historical frames to support instruction following.
What carries the argument
NaVid, the video-based VLM that directly maps a continuous RGB video stream to the next discrete action while treating past frames as spatio-temporal context for decision making.
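The mapping described above, from a growing RGB frame history to one discrete action per step, can be sketched as a simple control loop. This is only an illustration under stated assumptions: the `Policy` interface, the `navigate` helper, the four-action set, and `toy_policy` are hypothetical stand-ins, not NaVid's actual API or action space, and the camera is modeled as a plain iterable even though on a real robot each action changes the next frame.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Sequence

# Assumed discrete action set; the summary above does not enumerate NaVid's actions.
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

Frame = bytes  # stand-in for one monocular RGB image
Policy = Callable[[str, Sequence[Frame]], str]  # hypothetical model interface


def navigate(instruction: str, camera: Iterable[Frame], policy: Policy,
             max_steps: int = 50, max_history: Optional[int] = None) -> List[str]:
    """Feed the instruction plus all frames seen so far to the policy, one step at a time."""
    history: Deque[Frame] = deque(maxlen=max_history)  # None keeps every past frame
    taken: List[str] = []
    for frame in camera:
        if len(taken) >= max_steps:
            break
        history.append(frame)
        # No map, depth, or odometry is passed in: only text and RGB history.
        action = policy(instruction, tuple(history))
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        taken.append(action)
        if action == "STOP":
            break
    return taken


def toy_policy(instruction: str, frames: Sequence[Frame]) -> str:
    """Placeholder for the VLM: stop once three frames of history are available."""
    return "STOP" if len(frames) >= 3 else "FORWARD"


print(navigate("walk to the door", [b"f0", b"f1", b"f2", b"f3"], toy_policy))
```

Passing `max_history` truncates the context to the most recent frames, which is one way to probe how much the historical context matters for a given policy.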
If this is right
- Navigation agents can operate in continuous environments without maintaining explicit maps or relying on depth or odometer readings.
- Historical video frames supply useful spatio-temporal context that improves both action planning and language instruction adherence.
- Removing map and depth inputs reduces the Sim2Real gap caused by sensor noise or domain shift in those modalities.
- The same video-based formulation supports stronger cross-dataset generalization than map-centric or depth-centric baselines.
Where Pith is reading between the lines
- The video-only approach may allow navigation on simpler robot platforms that lack depth sensors or reliable odometry.
- Combining navigation data with large web corpora could scale instruction understanding for longer or more abstract commands.
- The same video-context mechanism might transfer to other embodied tasks such as manipulation where continuous visual history is available.
Load-bearing premise
A VLM trained on collected navigation trajectories and web-scale data can reliably generalize action planning to completely unseen real-world environments using only raw RGB video without auxiliary sensors or explicit mapping.
What would settle it
Deploy NaVid in a real-world indoor or outdoor space whose layout, lighting, and obstacles differ substantially from both the training trajectories and the web data, then measure whether success rate and instruction-following accuracy remain at the reported state-of-the-art level.
Original abstract
Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NaVid, a video-based large vision-language model for vision-and-language navigation (VLN). It claims to achieve state-of-the-art navigation performance in both simulation and real-world settings by using only raw monocular RGB video streams to output next-step actions, without maps, odometers, or depth sensors. The model is trained on 510k navigation samples from continuous environments plus 763k web-scale data and is reported to show strong cross-dataset generalization and Sim2Real transfer.
Significance. If the empirical claims are substantiated, the work would be significant for embodied AI by showing that large VLMs can perform reliable spatio-temporal reasoning for navigation from video alone. This could reduce hardware complexity and address long-standing Sim2Real gaps in VLN. The video-based encoding of historical observations as context is a clear strength that aligns with human-like navigation.
major comments (2)
- [Abstract] The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.
- [Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.
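For readers unfamiliar with the metrics the first major comment asks for, the sketch below computes success rate (SR) and success weighted by path length (SPL). SPL is taken here in its standard form from "On Evaluation of Embodied Navigation Agents" (Anderson et al., 2018): the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The `Episode` record and the numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """Hypothetical per-episode log; field names are illustrative."""
    success: bool            # S_i: did the agent stop close enough to the goal?
    shortest_path_m: float   # l_i: geodesic distance from start to goal
    agent_path_m: float      # p_i: length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Mean of S_i * l_i / max(p_i, l_i) over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.shortest_path_m <= 0:
            raise ValueError("shortest path length must be positive")
        if e.success:
            total += e.shortest_path_m / max(e.agent_path_m, e.shortest_path_m)
    return total / len(episodes)


# Illustrative values only.
episodes = [
    Episode(success=True, shortest_path_m=10.0, agent_path_m=10.0),   # optimal path
    Episode(success=True, shortest_path_m=10.0, agent_path_m=20.0),   # success with a detour
    Episode(success=False, shortest_path_m=10.0, agent_path_m=5.0),   # failure contributes 0
]
print(f"SR={success_rate(episodes):.3f} SPL={spl(episodes):.3f}")
```

Because SPL discounts successful episodes that take detours, it can never exceed SR, which is why the two are usually reported together.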
minor comments (1)
- [Abstract] The final sentence ('We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field') uses overly broad language that could be revised to a more precise statement of contributions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will incorporate in the revised version.
Point-by-point responses
- Referee: [Abstract] The central claims of 'state-of-the-art performance in simulation environments and the real world' and 'superior cross-dataset and Sim2Real transfer' are stated without any quantitative metrics, success rates, SPL values, or references to specific experimental tables or figures. This makes it impossible to assess the magnitude or reliability of the reported gains from the provided summary alone.
  Authors: We agree that including quantitative metrics in the abstract would allow readers to immediately gauge the scale of the reported improvements. In the revised manuscript, we will update the abstract to incorporate key results such as success rates, SPL values, and cross-dataset transfer metrics from our experiments, with explicit references to the relevant tables and figures.
  Revision: yes
- Referee: [Real-world evaluation] Real-world and Sim2Real evaluation sections: No explicit controls, environment novelty scoring, visual similarity metrics, or out-of-distribution tests are described to verify that real-world test scenes are disjoint from the 510k collected navigation trajectories. This is load-bearing for the generalization claim, as overlap in visual statistics or instruction styles could confound video-based planning with implicit memorization rather than true transfer.
  Authors: We recognize the importance of explicitly demonstrating scene disjointness to support the Sim2Real and generalization claims. The current manuscript describes training on 510k samples from continuous environments and real-world testing in separate settings, but does not detail novelty controls. In the revision, we will add a subsection to the real-world evaluation that specifies the environment selection criteria, including any visual similarity metrics or out-of-distribution checks used to confirm that test scenes were disjoint from the training trajectories.
  Revision: yes
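Neither the referee nor the authors specify which visual similarity metric would be used. One possible check, sketched below under stated assumptions, is a nearest-neighbor test on frame embeddings: each test frame is compared against all training frames by cosine similarity, and test frames whose best match exceeds a threshold are flagged as potentially overlapping. The embeddings are assumed to come from some off-the-shelf image encoder that is not shown, and the function names, the toy vectors, and the 0.9 threshold are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_train_similarity(test: Sequence[Vector], train: Sequence[Vector]) -> List[float]:
    """For each test embedding, the highest similarity to any training embedding."""
    if not train:
        raise ValueError("need at least one training embedding")
    return [max(cosine(t, r) for r in train) for t in test]


def flag_overlap(test: Sequence[Vector], train: Sequence[Vector],
                 threshold: float = 0.9) -> List[int]:
    """Indices of test frames that look too similar to some training frame."""
    sims = nearest_train_similarity(test, train)
    return [i for i, s in enumerate(sims) if s >= threshold]


# Toy 2-D embeddings; real embeddings would be high-dimensional encoder outputs.
train_emb = [(1.0, 0.0), (0.0, 1.0)]
test_emb = [(1.0, 0.0), (1.0, 1.0), (-1.0, 0.0)]
print(nearest_train_similarity(test_emb, train_emb))
print(flag_overlap(test_emb, train_emb))
```

A low flag rate would not prove disjointness on its own, since the outcome depends on the encoder and threshold chosen, but reporting the distribution of nearest-neighbor similarities would give readers a concrete handle on the novelty of the test scenes.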
Circularity Check
No circularity: empirical VLM training on external data with benchmark evaluation
full rationale
The paper presents NaVid as a video-based VLM trained on 510k navigation samples from continuous environments plus 763k web-scale data, then evaluated for next-step action prediction in VLN tasks. All performance claims rest on standard simulation benchmarks and real-world tests using raw RGB video input. No equations, derivations, or first-principles results are defined; there are no fitted parameters renamed as predictions, no self-definitional constructs, and no load-bearing self-citations that reduce the central claims to the authors' own prior unverified results. The work is self-contained against external datasets and benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training data composition
axioms (1)
- Domain assumption: A pre-trained VLM can be fine-tuned to map video sequences plus language to discrete navigation actions.
Forward citations
Cited by 18 Pith papers
-
Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.
-
Towards Generalizable Robotic Manipulation in Dynamic Environments
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
-
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
-
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
-
PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
-
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
-
FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.
-
Visually-grounded Humanoid Agents
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
-
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
-
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
-
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...
-
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.
-
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
-
LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation
LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...
-
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
Reference graph
Works this paper leans on
-
[1]
Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Are We Mak- ing Real Progress in Simulated Environments? Measur- ing the Sim2Real Gap in Embodied Visual Navigation. In arXiv:1912.06321, 2019
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Bevbert: Topo-metric map pre-training for language-guided navigation
Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Topo-metric map pre-training for language-guided navigation. arXiv preprint arXiv:2212.04385, 2022
-
[4]
Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments
Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language nav- igation in continuous environments. arXiv preprint arXiv:2304.03047, 2023
-
[6]
On Evaluation of Embodied Navigation Agents
Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. In Pro- ceedings of the IEEE conference on computer vision and pattern recognition , pages 3674–3683, 2018
work page 2018
-
[9]
Sim-to-real transfer for vision-and-language navigation
Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Ste- fan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning , pages 671–681. PMLR, 2021
work page 2021
-
[10]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Matterport3d: Learning from rgb-d data in indoor environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV) , pages 667–676. IEEE, 2017
work page 2017
-
[13]
Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments
Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 12538–12547, 2019
work page 2019
-
[14]
Mapgpt: Map-guided prompting for unified vision-and-language navigation
Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314 , 2024
-
[15]
Topological planning with transformers for vision-and-language navigation
Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 11276– 11286, 2021
work page 2021
-
[16]
Weakly-supervised multi-granularity map learning for vision-and-language navigation
Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. arXiv preprint arXiv:2210.07506, 2022
-
[17]
Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. Action-aware zero-shot robot navigation by ex- ploiting vision-and-language ability of foundation mod- els. arXiv preprint arXiv:2308.07997 , 2023
-
[18]
History aware multimodal transformer for vision-and-language navigation
Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in Neural Information Processing Systems , 34:5834–5847, 2021
work page 2021
-
[19]
Think global, act local: Dual-scale graph transformer for vision-and- language navigation
Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and- language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 16537–16547, 2022
work page 2022
-
[20]
Uniter: Universal image-text represen- tation learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text represen- tation learning. In European conference on computer vision, pages 104–120. Springer, 2020
work page 2020
-
[21]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023
work page 2023
-
[22]
Toward next-generation learned robot manipulation
Jinda Cui and Jeff Trinkle. Toward next-generation learned robot manipulation. Science robotics , 6(54): eabd9461, 2021
work page 2021
-
[23]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500 , 2, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[25]
A survey on long text modeling with transform- ers
Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. A survey on long text modeling with transform- ers. arXiv preprint arXiv:2302.14502 , 2023
-
[26]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Speaker-follower models for vision-and- language navigation
Daniel Fried, Ronghang Hu, V olkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and- language navigation. Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[28]
Drive like a human: Rethinking autonomous driving with large language models
Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages 910–919, 2024
work page 2024
-
[29]
Counterfactual vision-and-language navigation via adversarial path sampler
Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision , pages 71–86. Springer, 2020
work page 2020
-
[30]
Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation
Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Il- harco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023
work page 2023
-
[31]
Cross-modal map learning for vi- sion and language navigation
Georgios Georgakis, Karl Schmeckpeper, Karan Wan- choo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vi- sion and language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15460–15470, 2022
work page 2022
-
[32]
Vision-and-language navigation: A survey of tasks, methods, and future directions
Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 7606–7623, 2022
work page 2022
-
[33]
Airbert: In-domain pretraining for vision-and-language navigation
Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021
work page 2021
-
[34]
Towards learning a generic agent for vision-and-language navigation via pre-training
Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13137–13146, 2020
work page 2020
-
[35]
Language and visual entity rela- tionship graph for agent navigation
Yicong Hong, Cristian Rodriguez, Yuankai Qi, Qi Wu, and Stephen Gould. Language and visual entity rela- tionship graph for agent navigation. Advances in Neural Information Processing Systems , 33, 2020
work page 2020
-
[36]
A recurrent vision-and- language bert for navigation
Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez- Opazo, and Stephen Gould. A recurrent vision-and- language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1653, June 2021
work page 2021
-
[37]
Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navi- gation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15439–15449, 2022
work page 2022
-
[38]
Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning
Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt- 4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023
-
[39]
Inner monologue: Embodied reasoning through planning with language models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tomp- son, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning , pages 1769–1782. PMLR, 2023
work page 2023
-
[40]
Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, and Rakesh Kumar. Sasra: Semantically- aware spatio-temporal reasoning agent for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2108.11945, 2021
-
[41]
Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Ja- son Baldridge, and Zarana Parekh. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. arXiv preprint arXiv:2210.03112, 2022
-
[42]
Tactical rewind: Self-correction via backtracking in vision-and-language navigation
Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6741–6749, 2019
work page 2019
-
[43]
Extending regular expressions with context operators and parse extraction
Steven M Kearns. Extending regular expressions with context operators and parse extraction. Software: Prac- tice and Experience , 21(8):787–804, 1991
work page 1991
-
[44]
Sim-2-sim transfer for vision-and-language navigation in continuous environ- ments
Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environ- ments. In European Conference on Computer Vision , pages 588–603. Springer, 2022
work page 2022
-
[45]
Beyond the nav-graph: Vision- and-language navigation in continuous environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision- and-language navigation in continuous environments. In European Conference on Computer Vision , pages 104–
-
[46]
Beyond the nav-graph: Vision and language navigation in continuous environments
Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer Vision (ECCV) , 2020
work page 2020
-
[47]
Waypoint models for instruction-guided navigation in continuous environ- ments
Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environ- ments. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 15162–15171, 2021
work page 2021
-
[48]
It- erative vision-and-language navigation
Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason. It- erative vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14921–14930, 2023
work page 2023
-
[50]
Room-across-room: Multilingual vision-and-language navigation with dense spatiotem- poral grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotem- poral grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020
work page 2020
-
[51]
Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models
Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfm- nav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670, 2024
-
[52]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[54]
Vision-Language Foundation Models as Effective Robot Imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Robust navigation with language pretraining and stochastic sampling
Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244 , 2019
-
[56]
Oscar: Object-semantics aligned pre-training for vision-language tasks
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Euro- pean Conference on Computer Vision , pages 121–137. Springer, 2020
-
[57]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023
-
[58]
Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation
Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, and Xiaodan Liang. Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation. arXiv preprint arXiv:2306.10322, 2023
-
[59]
The development of llms for embodied navigation
Jinzhou Lin, Han Gao, Rongtao Xu, Changwei Wang, Li Guo, and Shibiao Xu. The development of llms for embodied navigation. arXiv preprint arXiv:2311.00530, 2023
-
[60]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023
-
[61]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023
-
[62]
Efficient and consistent bundle adjustment on lidar point clouds
Zheng Liu, Xiyuan Liu, and Fu Zhang. Efficient and consistent bundle adjustment on lidar point clouds. IEEE Transactions on Robotics, 2023
-
[63]
Discuss before moving: Visual language navigation via multi-expert discussions
Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. arXiv preprint arXiv:2309.11382, 2023
-
[64]
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019
-
[65]
The marathon 2: A navigation system
Steve Macenski, Francisco Martín, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2718–
-
[66]
The marathon 2: A navigation system
Steven Macenski, Francisco Martin, Ruffin White, and Jonatan Ginés Clavero. The marathon 2: A navigation system. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
-
[67]
Improving vision-and-language navigation with image-text pairs from the web
Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020
-
[68]
Zson: Zero-shot object-goal navigation using multimodal goal embeddings
Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. arXiv preprint arXiv:2206.12403, 2022
-
[69]
Langnav: Language as a perceptual representation for navigation
Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889, 2023
-
[70]
Visual language navigation: A survey and open challenges
Sang-Min Park and Young-Gab Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023
-
[71]
Object-and-action aware model for visual language navigation
Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. Object-and-action aware model for visual language navigation. In European Conference on Computer Vision, pages 303–317. Springer, 2020
-
[72]
Reverie: Remote embodied visual referring expression in real indoor environments
Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020
-
[73]
Hop: History-and-order aware pre-training for vision-and-language navigation
Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. Hop: History-and-order aware pre-training for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15418–15427, 2022
-
[74]
March in chat: Interactive prompting for remote embodied referring expression
Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, and Qi Wu. March in chat: Interactive prompting for remote embodied referring expression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15758–15767, 2023
-
[75]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018
-
[76]
Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai
Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and...
work page 2021
-
[77]
Poni: Potential functions for objectgoal navigation with interaction-free learning
Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022
-
[78]
Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel X Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2109.15207, 2021
-
[79]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
-
[80]
A reduction of imitation learning and structured prediction to no-regret online learning
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
-
[81]
Habitat: A platform for embodied ai research
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019
-
[82]
Velma: Verbalization embodiment of llm agents for vision and language navigation in street view
Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. arXiv preprint arXiv:2307.06082, 2023
-
[83]
James A Sethian. Fast marching methods. SIAM review, 41(2):199–235, 1999