GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Jiahao Yang; Shuqiang Jiang; Xiangyang Li; Xing Zhu; Yinghao Xu; Yujun Shen; Zihan Wang

arxiv: 2605.22036 · v1 · pith:CFUQQNK3new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3

keywords: Vision-Language Navigation · Bird's-Eye-View Representation · RGB-D Projection · 3D Geometric Priors · Multimodal Large Language Models · Token Efficiency · Spatial Reasoning

The pith

Projecting RGB-D features into agent-centric BEV maps plus 3D priors lets vision-language navigation models reach state-of-the-art performance using only navigation data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that dense RGB video inputs create too many tokens and hide spatial structure in vision-language navigation. It replaces them with a compact bird's-eye-view layout built by lifting RGB-D features into 3D and collapsing them around the agent. A pretrained 3D foundation model adds structural priors to the same layout. The resulting representation keeps geometric relationships while cutting token count, so the model can reason about space more efficiently and still hit top accuracy without data-augmentation tricks or extra question-answering data.

Core claim

By projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric BEV layout, then enriching that layout with features from a pretrained 3D foundation model, the GA-BEV representation supplies both explicit depth geometry and implicit structural priors inside a compact token set that multimodal language models can use directly for navigation.
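
The construction summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx` and `cx`, per-patch features already resampled to the depth map's resolution, a square grid with the agent at its centre, and mean pooling inside each cell. The name `lift_to_bev` and every parameter are invented for the sketch.

```python
import numpy as np

def lift_to_bev(feats, depth, fx, cx, grid=16, cell=0.5):
    """Hypothetical sketch: lift per-patch features into an agent-centric BEV grid.

    feats: (H, W, C) features aligned with the depth map (assumption).
    depth: (H, W) metric depth along the optical axis, in metres.
    Returns (bev, counts): (grid, grid, C) mean-pooled features and
    (grid, grid) number of points that fell into each cell.
    """
    H, W, C = feats.shape
    # Pixel column index for every location (rows are not needed once height is dropped).
    u = np.broadcast_to(np.arange(W), (H, W))
    z = depth                      # forward distance
    x = (u - cx) * z / fx          # lateral offset from the pinhole model
    half = grid * cell / 2.0       # agent sits at the centre of the grid
    col = np.floor((x + half) / cell).astype(int)
    row = np.floor((z + half) / cell).astype(int)
    # Keep points with valid depth that land inside the grid.
    valid = (z > 0) & (col >= 0) & (col < grid) & (row >= 0) & (row < grid)

    bev = np.zeros((grid, grid, C), dtype=float)
    counts = np.zeros((grid, grid), dtype=int)
    # Unbuffered accumulation so repeated cell indices are all summed.
    np.add.at(bev, (row[valid], col[valid]), feats[valid])
    np.add.at(counts, (row[valid], col[valid]), 1)
    occupied = counts > 0
    bev[occupied] /= counts[occupied][:, None]   # mean pooling per cell
    return bev, counts
```

Under this reading, only the occupied cells would need to be handed to the language model as tokens, which is where the compactness would come from; how the paper actually selects and orders cells is not stated in the abstract.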

What carries the argument

Geometry-Aware BEV (GA-BEV) representation: the explicit 3D projection of RGB-D patches followed by agent-centric aggregation, augmented by implicit priors from a 3D foundation model.

If this is right

Navigation agents can process shorter token sequences and therefore run with lower compute and memory cost per step.
Explicit depth projection combined with 3D priors improves spatial reasoning accuracy without requiring mixed VQA training.
Models trained only on navigation trajectories reach state-of-the-art success rates, showing that extra augmentation data is not necessary.
The same BEV construction can be swapped into other MLLM-based embodied tasks that need compact spatial context.
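
To make the token argument concrete, here is a small back-of-the-envelope sketch. The numbers are illustrative assumptions, not figures from the paper: 196 patches per frame corresponds to a 14×14 patch grid, and a fixed 16×16 BEV grid is a hypothetical choice. Both function names are invented.

```python
def dense_token_count(num_frames: int, patches_per_frame: int) -> int:
    """Tokens when every frame contributes all of its image patches."""
    return num_frames * patches_per_frame

def bev_token_count(grid_side: int) -> int:
    """Upper bound on tokens for a fixed square BEV grid (one token per cell)."""
    return grid_side * grid_side

# Dense input grows with the number of history frames; a fixed grid does not.
for frames in (1, 4, 8):
    print(frames, dense_token_count(frames, 196), bev_token_count(16))
```

With these assumed values the dense stream overtakes the grid after the second frame and keeps growing linearly with history length, whereas the grid budget stays constant.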

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token reduction may allow real-time inference on robots with limited onboard compute.
The method could be tested on longer-horizon tasks where spatial consistency matters more than single-step accuracy.
Replacing the 3D foundation model with a smaller or domain-specific encoder might further improve efficiency while keeping most gains.

Load-bearing premise

Projecting RGB-D features into 3D and flattening them into an agent-centric BEV map keeps enough geometric detail for correct navigation decisions.

What would settle it

Run the same navigation model on a standard VLN benchmark once with the proposed BEV maps and once with the original dense RGB patches, using identical training data and no DAgger; success rate and token count would decide whether the geometric compression preserves or loses necessary spatial information.

Figures

Figures reproduced from arXiv: 2605.22036 by Jiahao Yang, Shuqiang Jiang, Xiangyang Li, Xing Zhu, Yinghao Xu, Yujun Shen, Zihan Wang.

**Figure 1.** Illustration of different representations for VLN. (A) Dense image-based representations contain heavy token redundancy and lack explicit spatial structure. (B) Our Geometry-Aware BEV (GA-BEV) representation combines explicit depth-projected features with implicit geometry priors from 3D foundation models, producing a highly compact yet spatially expressive representation tailored for VLN.

**Figure 2.** Overview of the proposed Geometry-Aware Vision-Language Navigation (GA-VLN) framework. Given RGB-D current and historical front views, our method constructs a Geometry-Aware BEV (GA-BEV) representation by combining explicit depth-guided projections with implicit geometry priors from a pretrained 3D foundation model. The projected features are aggregated into BEV grid cells to form compact and spatially exp…

**Figure 3.** An example of the GA-VLN real-world result.

**Figure 4.** Real-world VLN results of GA-VLN: Example #S1.

**Figure 5.** Real-world VLN results of GA-VLN: Example #S2.

**Figure 6.** Comparison of token usage across navigation steps for different configurations. The number shown in each legend corresponds …

Original abstract

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GA-VLN combines depth-based BEV projection with 3D foundation model priors to cut tokens in MLLM navigation while claiming SOTA from navigation data alone.

The main thing to know is that this paper builds an agent-centric BEV from RGB-D inputs by projecting features into 3D space, then adds structural priors from a pretrained 3D model to give the language model better spatial grounding with fewer tokens. The experiments report state-of-the-art navigation results using only standard navigation trajectories, skipping DAgger and mixed VQA data entirely, and the ablations separate the depth projection from the implicit priors to show where each helps.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes GA-VLN, a framework for Vision-Language Navigation that introduces a Geometry-Aware BEV (GA-BEV) representation. This is constructed by projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric layout to preserve geometric consistency while reducing token redundancy; pretrained 3D foundation model features are then injected to supply structural priors. The approach is integrated into MLLM-based navigation and claims state-of-the-art performance using only navigation data, without DAgger augmentation or mixed VQA training.

Significance. If the reported results hold, the work offers a concrete route to more efficient VLN by replacing dense RGB token streams with a compact, explicitly geometric BEV map augmented by 3D priors. The demonstrated data efficiency, with strong performance achieved from navigation data alone, is a clear strength that could reduce reliance on expensive augmentation pipelines. The explicit separation of depth-based projection and learned 3D priors also provides a useful ablation axis for future geometric VLN research.

minor comments (3)

§4 (Experiments): the main results table should report absolute success rate, SPL, and navigation error for all baselines and ablations in a single, easily comparable format rather than scattering key numbers across text and supplementary material.
§3.2 (BEV construction): the aggregation step that maps projected 3D points to the agent-centric grid is described only at a high level; adding the explicit binning or interpolation formula would improve reproducibility.
Figure 3 caption: the visualization of GA-BEV features would be clearer if it explicitly labeled which channels correspond to the depth-projected RGB features versus the injected 3D-foundation-model features.
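
For orientation, one form such an aggregation formula could take is written out here. It is an assumption made for illustration, not the paper's definition: points $i$ carry features $f_i$ and ground-plane coordinates $(x_i, z_i)$ in the agent frame, the grid has $G \times G$ cells of side $s$ with the agent at its centre, and features are splatted with bilinear weights.

```latex
% Continuous cell coordinates (cell centres sit at integer values):
\tilde{c}_i = \frac{x_i + Gs/2}{s} - \frac{1}{2}, \qquad
\tilde{r}_i = \frac{z_i + Gs/2}{s} - \frac{1}{2}

% Bilinear weight of point i for cell (r, c):
w_i(r, c) = \max\bigl(0,\, 1 - |\tilde{r}_i - r|\bigr)\,
            \max\bigl(0,\, 1 - |\tilde{c}_i - c|\bigr)

% Weighted average per cell, defined where the denominator is positive:
B_{r,c} = \frac{\sum_i w_i(r, c)\, f_i}{\sum_i w_i(r, c)}
```

Hard binning is the special case in which each point contributes with weight 1 only to the cell whose index is $\bigl(\lfloor (z_i + Gs/2)/s \rfloor,\ \lfloor (x_i + Gs/2)/s \rfloor\bigr)$.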

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. We appreciate the recognition of the data efficiency and geometric contributions of GA-VLN.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper constructs GA-BEV via explicit depth-based projection of RGB-D features into 3D space followed by agent-centric aggregation, plus injection of features from an external pretrained 3D foundation model. These steps rely on standard geometric projection pipelines and off-the-shelf priors rather than any self-referential fitting, parameter estimation from the target navigation outputs, or load-bearing self-citations. Ablations isolate each cue's contribution and demonstrate gains on navigation data alone without DAgger or VQA mixing. No equation reduces a claimed prediction to its own inputs by construction, and the central efficiency and SOTA claims rest on independent experimental controls.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract-only review limits visibility into exact parameters or background lemmas; listed items are inferred from stated construction steps.

axioms (2)

Domain assumption: RGB-D inputs supply reliable depth values that can be projected into consistent 3D space without significant sensor noise or calibration error.
Invoked when constructing BEV spatial maps from RGB-D inputs by projecting visual features into 3D space.
Domain assumption: Features from a pretrained 3D foundation model transfer useful structural priors to the navigation BEV space without domain-specific fine-tuning.
Invoked when incorporating features from a pretrained 3D foundation model into the BEV space.
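
The first assumption can be made concrete with a short worked relation. It assumes a standard pinhole camera with focal length $f_x$ and principal point $c_x$ (symbols introduced here, not taken from the paper), a pixel column $u$, measured depth $z$, and a BEV cell side $s$.

```latex
% Lateral position of a pixel under the pinhole model:
x = \frac{(u - c_x)\, z}{f_x}

% A depth error \delta z shifts the projected point by
\delta x = \frac{u - c_x}{f_x}\, \delta z
\quad \text{(lateral)}, \qquad
\delta z \quad \text{(forward)}
```

A feature is guaranteed to land in a wrong cell once $|\delta z| \ge s$, and can already do so for smaller errors when the point lies near a cell boundary. The lateral shift is largest for pixels far from the image centre.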

invented entities (1)

GA-BEV representation (no independent evidence)
Purpose: Compact agent-centric layout that integrates explicit depth cues and implicit 3D priors for MLLM navigation.
Newly proposed construct that reduces token redundancy while preserving geometric consistency.

pith-pipeline@v0.9.0 · 5755 in / 1430 out tokens · 35353 ms · 2026-05-22T06:41:43.156168+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 8 internal anchors

[1]

Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022. 2

work page arXiv 2022
[2]

Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 5

work page 2024
[3]

Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

work page
[4]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2, 5

work page 2022
[5]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Affordances-oriented planning using foundation models for continuous vision- language navigation

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiao- dan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision- language navigation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025. 6

work page 2025
[7]

Constraint-aware zero-shot vision-language navigation in continuous environ- ments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint-aware zero-shot vision-language navigation in continuous environ- ments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 6

work page 2025
[8]

Weakly- supervised multi-granularity map learning for vision-and- language navigation.Advances in Neural Information Pro- cessing Systems, 35:38149–38161, 2022

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas Li, Mingkui Tan, and Chuang Gan. Weakly- supervised multi-granularity map learning for vision-and- language navigation.Advances in Neural Information Pro- cessing Systems, 35:38149–38161, 2022. 6

work page 2022
[9]

Think global, act lo- cal: Dual-scale graph transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act lo- cal: Dual-scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537– 16547, 2022. 2, 5

work page 2022
[10]

NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 1, 3, 4, 5, 6

work page arXiv 2024
[11]

InternNav: InternRobotics’ open platform for building generalized navigation foundation models.https://github.com/InternRobotics/ InternNav, 2025

InternNav Contributors. InternNav: InternRobotics’ open platform for building generalized navigation foundation models.https://github.com/InternRobotics/ InternNav, 2025. 6

work page 2025
[12]

Cross-modal map learning for vision and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Dani- ilidis. Cross-modal map learning for vision and language navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460– 15470, 2022. 6

work page 2022
[13]

Fine-grained alignment supervision matters in vision-and- language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Keji He, Yan Huang, Ya Jing, Qi Wu, and Liang Wang. Fine-grained alignment supervision matters in vision-and- language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 2

work page 2026
[14]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):1–14, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):1–14, 2023. 2

work page 2023
[15]

Beyond the nav-graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. InEuropean Confer- ence on Computer Vision, pages 104–120. Springer, 2020. 1, 2, 5

work page 2020
[16]

Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing.arXiv preprint arXiv:2010.07954, 2020

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing.arXiv preprint arXiv:2010.07954, 2020. 1, 2, 5

work page arXiv 2010
[17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Bird’s-eye-view scene graph for vision-language navigation

Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023. 2

work page 2023
[19]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024. 6

work page arXiv 2024
[20]

monovln: Bridging the observation gap between monocular and panoramic vision and language navigation

Renjie Lu, Yu Zhou, Hao Cheng, Jingke Meng, and Wei- Shi Zheng. monovln: Bridging the observation gap between monocular and panoramic vision and language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9477–9486, 2025. 2

work page 2025
[21]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

work page 2021
[22]

Reverie: Remote embodied visual referring ex- pression in real indoor environments

Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring ex- pression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020. 1, 2

work page 2020
[23]

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Un- dersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021. 5

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments. InProceedings of the 2021 conference on em- pirical methods in natural language processing, pages 4018– 4028, 2021. 6

work page 2021
[25]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 5

work page 2019
[26]

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

Hao Tan, Licheng Yu, and Mohit Bansal. Learning to nav- igate unseen environments: Back translation with environ- mental dropout.arXiv preprint arXiv:1904.04195, 2019. 5

work page internal anchor Pith review Pith/arXiv arXiv 1904
[27]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 4, 5

work page 2025
[28]

Aux-think: Exploring reason- ing strategies for data-efficient vision-language navigation

Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reason- ing strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886, 2025. 1, 6

work page arXiv 2025
[29]

Monodream: Monocular vision-language navigation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025

Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, et al. Monodream: Monocular vision-language navigation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025. 1, 6

work page arXiv 2025
[30]

Dreamnav: A trajectory-based imaginative frame- work for zero-shot vision-and-language navigation.arXiv preprint arXiv:2509.11197, 2025

Yunheng Wang, Yuetong Fang, Taowen Wang, Yixiao Feng, Yawen Tan, Shuning Zhang, Peiran Liu, Yiding Ji, and Ren- jing Xu. Dreamnav: A trajectory-based imaginative frame- work for zero-shot vision-and-language navigation.arXiv preprint arXiv:2509.11197, 2025. 6

work page arXiv 2025
[31]

g3d-lf: Generalizable 3d- language feature fields for embodied tasks

Zihan Wang and Gim Hee Lee. g3d-lf: Generalizable 3d- language feature fields for embodied tasks. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14191–14202, 2025. 2, 5, 6

work page 2025
[32]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023. 5

work page 2023
[33]

Gridmm: Grid memory map for vision- and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision- and-language navigation. InProceedings of the IEEE/CVF International conference on computer vision, pages 15625– 15636, 2023. 2

work page 2023
[34]

Bootstrapping language-guided navigation learning with self-refining data flywheel.arXiv preprint arXiv:2412.08467, 2024

Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, et al. Bootstrapping language-guided navigation learning with self-refining data flywheel.arXiv preprint arXiv:2412.08467, 2024. 5

work page arXiv 2024
[35]

Lookahead exploration with neural radiance representation for continuous vision- language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13753–13762, 2024. 2

work page 2024
[36]

Sim-to-real transfer via 3d feature fields for vision-and-language navigation.arXiv preprint arXiv:2406.09798, 2024

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation.arXiv preprint arXiv:2406.09798, 2024. 5, 6

work page arXiv 2024
[37]

Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and- language navigation.arXiv preprint arXiv:2505.11383,

Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and- language navigation.arXiv preprint arXiv:2505.11383,

work page arXiv
[38]

Navrag: Generating user demand instructions for embodied navigation through retrieval-augmented llm.arXiv preprint arXiv:2502.11142, 2025

Zihan Wang, Yaohui Zhu, Gim Hee Lee, and Yachun Fan. Navrag: Generating user demand instructions for embodied navigation through retrieval-augmented llm.arXiv preprint arXiv:2502.11142, 2025. 1, 2, 5

work page arXiv 2025
[39]

Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025. 1, 2, 3, 4, 5, 6

work page arXiv 2025
[40]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4, 5

work page 2023
[41]

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024. 1, 3, 4, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.arXiv preprint arXiv:2402.15852, 2024. 3, 4, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Embodied navigation foundation model

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129, 2025. 1

work page arXiv 2025
[44]

MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

L Zhang, X Hao, Q Xu, Q Zhang, X Zhang, P Wang, J Zhang, Z Wang, S Zhang, and R MapNav Xu. A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation.arXiv preprint arXiv:2502.13451, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Hierarchical object-to-zone graph for object navigation

Sixian Zhang, Xinhang Song, Yubing Bai, Weijie Li, Yakui Chu, and Shuqiang Jiang. Hierarchical object-to-zone graph for object navigation. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 15130– 15140, 2021

work page 2021
[46]

Generative meta-adversarial network for unseen object navigation

Sixian Zhang, Weijie Li, Xinhang Song, Yubing Bai, and Shuqiang Jiang. Generative meta-adversarial network for unseen object navigation. InEuropean Conference on Com- puter Vision, pages 301–320. Springer, 2022

work page 2022
[47]

Imagine before go: Self-supervised generative map for object goal navigation

Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, and Shuqiang Jiang. Imagine before go: Self-supervised generative map for object goal navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 16414–16425, 2024

work page 2024
[48]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Zi- wei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Towards learning a generalist model for embod- ied navigation

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Li- wei Wang. Towards learning a generalist model for embod- ied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13624– 13634, 2024. 3

work page 2024
[50]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

work page arXiv
[51]

Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models. InEuropean Con- ference on Computer Vision, pages 260–278. Springer, 2024. 3 GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation Supplementary Material A. Real-World Robot E...

